<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Intrinsic Plagiarism Detection Using Character n-gram Profiles</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Efstathios Stamatatos</string-name>
          <email>stamatatos@aegean.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of the Aegean 83200 - Karlovassi</institution>
          ,
          <addr-line>Samos</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2009</year>
      </pub-date>
      <fpage>38</fpage>
      <lpage>46</lpage>
      <abstract>
        <p>The task of intrinsic plagiarism detection deals with cases where no reference corpus is available and it is exclusively based on stylistic changes or inconsistencies within a given document. In this paper a new method is presented that attempts to quantify the style variation within a document using character n-gram profiles and a style change function based on an appropriate dissimilarity measure originally proposed for author identification. In addition, we propose a set of heuristic rules that attempt to detect plagiarism-free documents and plagiarized passages, as well as to reduce the effect of irrelevant style changes within a document. The proposed approach is evaluated on the recently-available corpus of the 1st Int. Competition on Plagiarism Detection with promising results.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Textual plagiarism (the unacknowledged use of
the work of another author, either as an exact
copy or as a slightly modified version) is a major
problem in the modern world, affecting mainly
education and research. The rapid development of
the WWW has made billions of web pages easily
accessible to anyone, providing plenty of
potential sources for plagiarism. As a result,
automated plagiarism analysis and detection
receives increasing attention in both academia
and the software industry
        <xref ref-type="bibr" rid="ref7">(Maurer et al., 2006)</xref>
        .
      </p>
      <p>
        There are two basic tasks in plagiarism
analysis. In external plagiarism detection a
reference corpus is given and the task is to
identify pairs of identical or very similar
passages from a suspicious document and some
texts of the reference corpus. Most of the
research studies in plagiarism analysis deal with
this task
        <xref ref-type="bibr" rid="ref13 ref2">(Hoad and Zobel, 2003; Stein, 2005)</xref>
        .
      </p>
    </sec>
    <sec id="sec-2">
      <p>
        On the other hand, intrinsic plagiarism
detection is more ambitious since no reference
corpus is given
        <xref ref-type="bibr" rid="ref12 ref12 ref4 ref8 ref8">(Meyer zu Eissen et al., 2007;
Stein and Meyer zu Eissen, 2007)</xref>
        . This task
applies to cases where a representative reference
corpus is not available. In addition,
comparing a suspicious document with
all the texts of a very large corpus may be
impractical in terms of computational time.
Intrinsic detection can also serve as a preprocessing step for an
external plagiarism detection tool in order to
reduce the time cost.
      </p>
      <p>To handle the intrinsic plagiarism detection
task, one has to detect plagiarized passages of a
suspicious document based exclusively on
irregularities or inconsistencies within the
document. Such inconsistencies or anomalies
are mainly stylistic in nature.</p>
      <p>
        The attempts to quantify writing style, a line
of research known as ‘stylometry’, have a long
history
        <xref ref-type="bibr" rid="ref3">(Holmes, 1998)</xref>
        . A great variety of
measures that represent some kind of stylistic
information have been proposed especially in
the framework of authorship attribution
research. In a recent survey,
        <xref ref-type="bibr" rid="ref10">Stamatatos (2009)</xref>
        distinguishes the following types of stylometric
features: lexical features (word frequencies,
word n-grams, vocabulary richness, etc.),
character features (character types, character
n-grams), syntactic features (part-of-speech
frequencies, types of phrases, etc.), semantic
features (synonyms, semantic dependencies,
etc.), and application-specific features
(structural, content-specific, language-specific).
      </p>
      <p>
        Although the lexical features are still the
most popular, a number of independent recent
studies have demonstrated the effectiveness of
character n-grams for quantifying writing style
        <xref ref-type="bibr" rid="ref10 ref10 ref11 ref12 ref4 ref4 ref5 ref6">(Keselj et al., 2003; Stamatatos, 2006;
Stamatatos, 2007; Kanaris and Stamatatos,
2007; Koppel et al., 2009)</xref>
        . Features of this type
can be easily measured in any text and are
language and domain independent since they do
not require any text pre-processing. These
measures are also robust to noise. Note that in
plagiarism analysis the efforts of an author to
slightly modify a plagiarized passage may be
considered as noise insertion.
        <xref ref-type="bibr" rid="ref1">Graham et al.
(2005)</xref>
        were the first to use character n-grams to
detect stylistic inconsistencies in texts.
However, their results were poor. One reason
is that they used only character bigrams.
Another is that the distance measure
they used (cosine distance) was unreliable for
very short texts. Note also that
        <xref ref-type="bibr" rid="ref1">Graham et al.
(2005)</xref>
        relied on predefined text segments
(paragraphs), and their task was to identify
whether or not two consecutive paragraphs differ in
style.
      </p>
      <p>In this paper, we propose a method for
intrinsic plagiarism detection based on
character n-gram profiles (the set of normalized
character n-gram frequencies of a text) and an
appropriate dissimilarity measure originally
proposed for author identification. Our method
automatically segments documents according to
stylistic inconsistencies and decides whether or
not a document is plagiarism-free. A set of
heuristic rules is introduced that attempts to
detect plagiarism on either the document level
or the text passage level, as well as to reduce the
effect of irrelevant stylistic changes within a
document.</p>
      <p>The rest of the paper is organized as follows.
Section 2 describes the method of quantifying
stylistic changes within a document. Then,
Section 3 includes the plagiarism detection
heuristics while Section 4 describes the
evaluation procedure. Finally, Section 5
discusses the main points of this study and
proposes future work directions.</p>
      <sec id="sec-2-1">
        <title>The style change function</title>
        <p>The main idea of the proposed approach is
to define a sliding window over the text length
and compare the text in the window with the
whole document. Thus, we get a function that
quantifies the style changes within the
document. Then, we can use the anomalies of
that function to detect the plagiarized sections.
In particular, the peaks of that function
(corresponding to text sections of great
dissimilarity with the whole document) indicate
likely plagiarized sections. Therefore, what we
need is a means to compare two texts knowing
that one of the two (the text in the window) is
shorter or much shorter than the other (the
whole document).</p>
        <p>
          Following the practice of recent successful
methods in author identification
          <xref ref-type="bibr" rid="ref10 ref10 ref11 ref4 ref5 ref6">(Keselj et al.,
2003; Stamatatos, 2006; Stamatatos, 2007;
Koppel et al., 2009; Stamatatos, 2009)</xref>
          , each
text is considered as a bag of character
n-grams. That is, given a predefined n that
denotes the length of strings, we build a vector
of normalized frequencies (over text length) of
all the character n-grams appearing at least once
in the text. This vector is called the profile of
the text. Note that the size of the profile
depends on the text length (longer texts have
bigger profiles). An important question is the
value of n. A high n corresponds to long strings
and better captures intra-word and inter-word
information. On the other hand, a high n
considerably increases the dimensionality of the
profile. To keep dimensionality relatively low,
and based on preliminary experiments as well
as on previous work on author identification
          <xref ref-type="bibr" rid="ref10 ref4 ref6">(Stamatatos, 2007; Koppel et al., 2009)</xref>
          , we used
character 3-grams in this study. The complete
set of parameter settings for the proposed
method is given in Table 1. These settings were
estimated using a small part (~200 documents)
of the evaluation corpus (see Section 4).
        </p>
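        <p>As an illustration of the profile described above, the following minimal Python sketch builds a character 3-gram profile. The function name ngram_profile is hypothetical, and normalizing by the number of n-grams in the text is an assumption about what "over text length" means here.</p>
        <preformat>
```python
from collections import Counter


def ngram_profile(text, n=3):
    """Map each character n-gram of text to its normalized frequency."""
    # Slide a window of n characters over the text, one character at a time.
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    counts = Counter(grams)
    total = len(grams)
    # Assumption: frequencies are normalized by the number of n-grams.
    return {gram: count / total for gram, count in counts.items()}


example = ngram_profile("abcabc")
print(example)
```
        </preformat>
        <p>Only n-grams that occur at least once are stored, so longer texts yield bigger profiles, as noted above.</p>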
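        <table-wrap>
          <label>Table 1</label>
          <caption>
            <p>Parameter settings of the proposed method.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Description</th><th>Symbol</th><th>Value</th></tr>
            </thead>
            <tbody>
              <tr><td>Character n-gram length</td><td>n</td><td>3</td></tr>
              <tr><td>Sliding window length</td><td>l</td><td>1,000</td></tr>
              <tr><td>Sliding window step</td><td>s</td><td>200</td></tr>
              <tr><td>Threshold of plagiarism-free criterion</td><td>t1</td><td>0.02</td></tr>
              <tr><td>Real window length threshold</td><td>t2</td><td>1,500</td></tr>
              <tr><td>Sensitivity of plagiarism detection</td><td>a</td><td>2</td></tr>
            </tbody>
          </table>
        </table-wrap>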
        <p>
          Let P(A) and P(B) be the profiles of two
texts A and B, respectively.
          <xref ref-type="bibr" rid="ref10">Stamatatos (2007)</xref>
          studied the performance of various distance
measures that quantify the similarity between
two character n-gram profiles in the framework
of author identification experiments. The
following distance (or dissimilarity) measure
was found to be both accurate and robust
when the two texts differ significantly in length:
          <disp-formula><tex-math>d_1(A,B) = \sum_{g \in P(A)} \left( \frac{2\,(f_A(g) - f_B(g))}{f_A(g) + f_B(g)} \right)^2</tex-math></disp-formula>
where fA(g) and fB(g) are the frequencies of
occurrence (normalized over text length) of the
n-gram g in text A and text B, respectively.
Note that d1 is not a symmetric function
(typically, this means it cannot be called a
distance function), since only the n-grams of
the first text are taken into account in the sum.
This function is designed to handle cases where
text A is shorter than text B.
          <xref ref-type="bibr" rid="ref10">Stamatatos (2007)</xref>
          showed that d1 is quite stable even when text A
is much shorter than text B. This is exactly the
case in the proposed method for intrinsic
plagiarism detection, where we want to compare
a short text passage with the whole document,
which may be quite long. In this paper, we
modified this measure as follows:
          <disp-formula><tex-math>nd_1(A,B) = \frac{\sum_{g \in P(A)} \left( \frac{2\,(f_A(g) - f_B(g))}{f_A(g) + f_B(g)} \right)^2}{4\,|P(A)|}</tex-math></disp-formula>
where |P(A)| is the size of the profile of text A.
The denominator ensures that the values of the
dissimilarity function lie between 0 (highest
similarity) and 1. We call this measure
normalized d1 (or nd1).
        </p>
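        <p>The normalized measure can be sketched in a few lines of Python. This is an illustrative reading of the description above, not the original implementation: profiles are assumed to be plain dictionaries of normalized n-gram frequencies, and the function name nd1 is chosen here for convenience.</p>
        <preformat>
```python
def nd1(profile_a, profile_b):
    """Normalized d1 dissimilarity of profile_a with respect to profile_b."""
    if not profile_a:
        raise ValueError("profile_a must contain at least one n-gram")
    total = 0.0
    # Only the n-grams of the first (shorter) text enter the sum.
    for gram, f_a in profile_a.items():
        f_b = profile_b.get(gram, 0.0)  # 0 if the n-gram is absent from B
        total += (2.0 * (f_a - f_b) / (f_a + f_b)) ** 2
    # Each term is at most 4, so dividing by 4 * |P(A)| keeps the result in [0, 1].
    return total / (4.0 * len(profile_a))


short = {"abc": 1.0}
longer = {"abc": 0.5, "xyz": 0.5}
print(nd1(short, longer), nd1(longer, short))
```
        </preformat>
        <p>The two printed values differ, which illustrates why the measure is not symmetric.</p>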
        <p>Let w be a sliding window of length l (in
characters) and step s (in characters). That is,
each time the window is moved to the right by s
characters and the profile of the next l
characters is extracted. If l&gt;s the windows are
overlapping. Then, we can define the style
change function (sc) of a document D as
follows:</p>
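        <p>
          <disp-formula><tex-math>sc(i,D) = nd_1(w_i, D), \quad i = 1, \ldots, |w|</tex-math></disp-formula>
where |w| is the total number of windows (it
depends on text length). Given a text of x
characters, |w| is computed as follows:
          <disp-formula><tex-math>|w| = \left\lfloor 1 + \frac{x - l}{s} \right\rfloor</tex-math></disp-formula>
        </p>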
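        <p>A minimal sketch of the style change function follows, assuming plain character windows (the letter-based windows introduced later are not applied here). The helper names profile, nd1, and style_change are hypothetical, and the small default test values in the usage lines are for illustration only.</p>
        <preformat>
```python
from collections import Counter


def profile(text, n=3):
    """Normalized character n-gram frequencies of text."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.items()}


def nd1(p_a, p_b):
    """Normalized d1 dissimilarity; only the n-grams of p_a are summed."""
    total = sum((2 * (f - p_b.get(g, 0.0)) / (f + p_b.get(g, 0.0))) ** 2
                for g, f in p_a.items())
    return total / (4 * len(p_a))


def style_change(document, l=1000, s=200, n=3):
    """Return sc(i, D) for every sliding window of length l and step s."""
    doc_profile = profile(document, n)  # computed once for the whole document
    # |w| = floor(1 + (x - l) / s); no window fits if the text is shorter than l.
    n_windows = 1 + (len(document) - l) // s if len(document) >= l else 0
    return [nd1(profile(document[i * s:i * s + l], n), doc_profile)
            for i in range(n_windows)]


doc = "the quick brown fox jumps over the lazy dog. " * 50
sc = style_change(doc)
print(len(sc), [round(v, 3) for v in sc])
```
        </preformat>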
        <p>Examples of style change functions can be
seen in Figures 1, 2, 3, and 4.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Detecting plagiarism</title>
        <sec id="sec-2-2-1">
          <title>Plagiarism on the document level</title>
          <p>The first important question that must be
answered is whether or not a given document
contains any plagiarized passages. This is
crucial to keep the precision of our method
high. If we are unable to identify documents that
are plagiarism-free, the plagiarism detection
method is quite likely to report a
number of text passages as potentially
plagiarized for any given document.
Thus, the credibility of the method would be
very low.</p>
          <p>There are two options to decide whether or
not a document contains plagiarized sections:</p>
          <p>By pre-processing: A criterion must be
defined to indicate a plagiarism-free document.
If this is the case, there is no further detection
of plagiarized sections.</p>
          <p>By post-processing: The algorithm detects
any likely plagiarized sections and then a
decision is taken based on these results.</p>
          <p>
            Typically, the detected sections are
compared to other sections of the document to
decide whether there are significant differences
between them
            <xref ref-type="bibr" rid="ref12 ref4 ref8">(Stein and Meyer zu Eissen,
2007)</xref>
            .
          </p>
          <p>In this study we followed the former
approach. The criterion we used is based on the
variance of the style change function. If the
document is written by one author, we expect
the style change function to remain relatively
stable. On the other hand, if there are
plagiarized sections, the style change function
will be characterized by peaks that significantly
deviate from the average value. The existence
of such peaks is indicated by the standard
deviation. Let S denote the standard deviation
of the style change function. If S is lower than a
predefined threshold, then the document is
considered plagiarism-free.</p>
          <p>Plagiarism-free criterion: S&lt;t1</p>
          <p>The value of the threshold t1 was determined
empirically at 0.02. Recall that the dissimilarity
function we use is normalized. So, the
definition of such a common threshold for all
the documents is possible. However, the nd1
measure is not independent of text length. Very
short documents tend to have low style change
function values. Moreover, very long texts are
likely to contain stylistic changes made
intentionally by the author. In both these cases
this criterion will not be very accurate.</p>
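          <p>The plagiarism-free criterion amounts to a single comparison, sketched below. The function name is hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.</p>
          <preformat>
```python
import statistics


def is_plagiarism_free(sc_values, t1=0.02):
    """Apply the criterion: the document is plagiarism-free if S is below t1."""
    # Assumption: S is the population standard deviation of the sc values.
    spread = statistics.pstdev(sc_values)
    return t1 > spread


flat = [0.30, 0.31, 0.29, 0.30]    # stable style change function
peaky = [0.30, 0.30, 0.30, 0.60]   # one window deviates strongly
print(is_plagiarism_free(flat), is_plagiarism_free(peaky))
```
          </preformat>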
          <p>Figures 2 and 3 show the style change
function of documents 00017 and 00034 of
IPAT-DC (see Section 4) that fall under the
plagiarism-free criterion. The former is a
successful case where no plagiarism exists. On
the other hand, in the case of document 00034,
despite the presence of two plagiarized
passages, the style change function fails to
produce significant peaks that would increase
its standard deviation. Note also that 00017 is
longer than 00034 (more sliding windows on the
x-axis) and the average style change function of
00017 is higher than that of 00034.
Additionally, Figure 4 shows the style change
function of document 00022 of IPAT-DC. Although this
document is plagiarism-free, the standard
deviation of its style change function is greater than the
threshold used (false positive).</p>
          <p>[Figures 2, 3, and 4: style change function against sliding window position for documents 00017, 00034, and 00022 of IPAT-DC.]</p>
        </sec>
        <sec id="sec-2-2-2">
          <title>Identifying plagiarized passages</title>
          <p>Given the style change function of a document,
the task of plagiarism detection can be viewed
as detecting peaks of that function
corresponding to text sections that significantly
differ from the rest of the document. One big
problem in plagiarism detection is that it is not
possible to estimate the percentage of
plagiarized text beforehand. In intrinsic
plagiarism detection the problem is much
harder: if the plagiarized sections are too
long, the stylistic anomalies would correspond
to the style of the alleged author rather than to the
plagiarized sections. In this study we assume
that at least half of the text is not plagiarized, so
that the average of the style change function
indicates the style of the alleged author. However, the
calculation of the average sc value would
inevitably involve the plagiarized passages as
well.</p>
          <p>Let M and S denote the mean and standard
deviation of sc, respectively. To reduce this
problem we first remove from sc all the text
windows with value greater than M+S. These
text sections are highly likely to correspond to
plagiarized sections. Let sc(i′,D) denote the
style change function after the removal of these
sections. Let M′ and S′ be the mean and
standard deviation of sc(i′,D). Then, we define
the following criterion to detect plagiarism:</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Plagiarized passage criterion:</title>
      <p>sc(i′,D) &gt;M′+a*S′</p>
      <p>The parameter a determines the sensitivity
of the plagiarism detection method. The higher
the value of a, the fewer (and more likely
plagiarized) sections are detected. The value of
a was set empirically to 2.0 to attain a
good combination of precision and recall.
Figures 1 and 4 show the result of applying the
proposed criterion to two documents.</p>
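      <p>The two-pass thresholding can be sketched as follows. This is one possible reading of the criterion, not a definitive implementation: the threshold M′+a·S′ is estimated from the windows that survive the first pass and is then applied to every window, and the population standard deviation is assumed. The function name detect_windows is hypothetical.</p>
      <preformat>
```python
import statistics


def detect_windows(sc_values, a=2.0):
    """Return the indices of windows flagged by the plagiarized passage criterion."""
    m = statistics.mean(sc_values)
    s = statistics.pstdev(sc_values)
    # First pass: drop windows above M + S, which are likely plagiarized.
    kept = [v for v in sc_values if m + s >= v]
    m2 = statistics.mean(kept)
    s2 = statistics.pstdev(kept)
    # Second pass (assumption): compare every window with M' + a * S'.
    threshold = m2 + a * s2
    return [i for i, v in enumerate(sc_values) if v > threshold]


sc = [0.30, 0.31, 0.29, 0.30, 0.31, 0.30, 0.60, 0.31]
print(detect_windows(sc))
```
      </preformat>
      <p>Raising a makes the threshold stricter, so fewer windows are flagged.</p>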
      <sec id="sec-3-1">
        <title>Detecting irrelevant style changes</title>
        <p>An important factor that affects style under the
character n-gram representation is the
formatting of documents. A document written
in uppercase with many space characters and
punctuation symbols will have a quite different
character n-gram profile than the same
document in lowercase after the removal of any
extra space and punctuation characters. The
proposed method for the quantification of style
changes is very general and is sensitive to such
stylistic changes, which are irrelevant to
plagiarism. In fact, a very common technique to
disguise plagiarism is to change the formatting
of the text. So, any plagiarism detection tool should
attempt to reduce the formatting factor.</p>
        <p>To deal with this problem, we performed a
number of processes. First, each document is
transformed to lowercase. Although the
uppercase information is important for
representing adequately the style of an author, it
can be easily used to fool a plagiarism detection
tool. Then, we removed from the profile of a
text every character n-gram that contains no
letter characters (a-z, or any lowercase
character of foreign languages) at all. This way,
any character n-gram that contains only digit,
space, or punctuation characters, that is
irrelevant to the content of text, is excluded and
the formatting factor is reduced. Finally, the
sliding window parameters operate on letter
characters. That is, a window length of l
characters means that the window should
contain l letter characters. Note that all the other
characters (digits, spaces, punctuation, etc.) are
not removed. Therefore, if l=1,000, a window
may contain 1,200 characters (this is the real
window length) in total from which 1,000 are
letter characters. Moreover, a step of s
characters means that the window is moved to
the right by s letter characters. This procedure
ensures that all the text windows will have the
same number of letter (or content) characters
and the formatting of the text will not
significantly affect the style change function.</p>
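        <p>The letter-based windowing can be sketched as below. The function names are hypothetical, and two details are assumptions: letters are recognized with str.isalpha(), and a window is taken to span from its first to its last letter character, so that its real length is the size of that span.</p>
        <preformat>
```python
def letter_windows(text, l=1000, s=200):
    """Return (start, end) slices that each contain exactly l letter characters."""
    text = text.lower()
    # Positions of letter characters; isalpha() also accepts non-English letters.
    letter_pos = [i for i, ch in enumerate(text) if ch.isalpha()]
    windows = []
    # The step s is also counted in letter characters.
    for first in range(0, len(letter_pos) - l + 1, s):
        start = letter_pos[first]
        end = letter_pos[first + l - 1] + 1
        windows.append((start, end))  # end - start is the real window length
    return windows


def has_letter(gram):
    """True if the n-gram contains at least one letter and stays in the profile."""
    return any(ch.isalpha() for ch in gram)


sample = "AB, cd! ef"
for start, end in letter_windows(sample, l=4, s=2):
    print(repr(sample[start:end].lower()), "real length:", end - start)
```
        </preformat>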
        <p>Since there is no prior knowledge of the
genre of documents, a given document may be
composed of several sections, each one
belonging to a different genre (or sub-genre)
and therefore having different stylistic
characteristics. For example, a table of contents
has a different style than the main document. The
character n-gram representation is able to
capture both the style of the author and the style of
the genre, but it is hard to distinguish these factors.
To handle this problem, we make use of the real
window length as defined above. In more detail,
let l′ be the real window length (the total
number of characters included in a window that
contains l letter characters) of a text section.
The real window length is affected by some
genres. For example, the l′ of a table of contents
is higher than the l′ of the main document. This
is demonstrated in Figure 5, which shows the style
change function and the real window length of
the last part of document 00046 of IPAT-DC
(for l=1,000). This document ends with an
index. Note that the real window length of this
special section is much higher than that of the rest of
the document. The stylistic difference between
the index and the rest of the document is
captured by the style change function.
However, this difference has nothing to do with
plagiarism. To take such cases into account, an
additional criterion was used to detect
plagiarized passages:</p>
        <p>Special section criterion: l′&lt;t2</p>
        <p>This criterion is combined with the
plagiarized passage criterion. Based on
empirical evaluation, the value of the threshold
t2 was set to 1,500 (or 1.5l). Note that
this criterion excludes text sections with an overly
high real window length. However, one can take
advantage of this criterion and disguise
plagiarism by inserting many formatting
characters into a text section so that l′ is
considerably increased. Moreover, a plagiarized
section within a special section (e.g. a table of
contents) that resembles the style of that section
will not be detected.</p>
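        <p>Combining the two criteria is a simple conjunction, sketched below under the assumption that the style change values, the real window lengths, and the threshold M′+a·S′ have already been computed. The function name and argument layout are hypothetical.</p>
        <preformat>
```python
def plagiarized_windows(sc_values, real_lengths, threshold, t2=1500):
    """Indices of windows that satisfy both the passage and special section criteria."""
    flagged = []
    for i, (value, real_len) in enumerate(zip(sc_values, real_lengths)):
        # Passage criterion: sc above threshold; special section criterion: l' below t2.
        if value > threshold and t2 > real_len:
            flagged.append(i)
    return flagged


print(plagiarized_windows([0.3, 0.6, 0.7], [1200, 1250, 1900], threshold=0.5))
```
        </preformat>
        <p>In this toy input the third window exceeds the style threshold but is excluded because its real window length is above t2.</p>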
        <sec id="sec-3-1-1">
          <title>Evaluation</title>
          <p>
            In the framework of the 1st International
competition on plagiarism detection a large
corpus has been released for the Intrinsic
Plagiarism Analysis Task
            <xref ref-type="bibr" rid="ref9">(Potthast et al., 2009)</xref>
            .
This corpus is divided into a development
part (IPAT-DC) and a competition part
(IPAT-CC), each comprising 3,091 documents. An
artificial plagiarism tool was used to
automatically insert plagiarized passages within
the documents. The following evaluation results
are mainly based on IPAT-DC since this corpus
also provides ground truth data. IPAT-DC
comprises a wide variety of texts covering
many genres and topics. The text length varies
from (roughly) 3,000 characters to 2.5 million
characters. Interestingly, the plagiarized
passages begin at randomly selected positions,
covering arbitrary combinations of words,
sentences, and paragraphs.
          </p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>Evaluation on the document level</title>
        <p>First, we evaluate the plagiarism-free
criterion that operates on the document level.
Table 2 shows the confusion matrix of
IPAT-DC after the application of this criterion. It is
important that over 70% of the plagiarism-free
documents were correctly classified. This is
crucial to keep the overall precision at a
reasonable level. On the other hand, false
positives (see Figure 4) harm the precision
while false negatives (see Figure 3) harm the
recall.</p>
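        <table-wrap>
          <label>Table 2</label>
          <caption>
            <p>Confusion matrix of IPAT-DC after applying the plagiarism-free criterion. Percentages in parentheses refer to plagiarized passages.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Actual</th><th>Guess: plagiarism-free</th><th>Guess: plagiarized</th></tr>
            </thead>
            <tbody>
              <tr><td>Plagiarism-free</td><td>1102</td><td>443</td></tr>
              <tr><td>Plagiarized</td><td>545 (22%)</td><td>1001 (78%)</td></tr>
            </tbody>
          </table>
        </table-wrap>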
        <p>As can be seen, roughly 1/3 of the
plagiarized documents are considered
plagiarism-free. However, taking into account
the number of plagiarized passages within each
document (indicated in parentheses in the
table), we see that 22% of the plagiarized
passages are missed. So, the upper bound for the
recall on the passage level is 78%. A
closer look at the false negatives shows that
text length is a crucial factor. Figure 6 depicts
the distribution of false negatives over the
text length of documents. As can be seen, the
majority of false negatives are relatively short
documents (&lt;30K chars). Moreover, the shorter
a document, the more likely it is to be a false
negative.</p>
        <p>[Table 3 columns: Corpus, Recall, Precision, F-score, Granularity, Overall score.]</p>
        <p>To evaluate the plagiarism detection
method, we should first define appropriate
measures. In particular, we used the
performance measures defined in the
framework of the 1st Int. Competition on
Plagiarism Detection: recall, precision,
granularity, and overall score. Let r denote a
plagiarized passage and R be the set of all
plagiarized passages in the corpus. Moreover,
let p be a passage detected by the proposed
method, P be the set of all detected passages,
and R_P be the subset of R whose members overlap with at
least one member of P. Finally, let |r| and r̂
be the length of a plagiarized passage and the
number of its characters detected by the plagiarism
detection method, respectively. Similarly, |p|
and p̂ are the length of a detected passage and
the number of its characters that belong to any
plagiarized passage. Then, recall, precision,
granularity, and overall score can be defined as follows:
          <disp-formula><tex-math>\mathit{recall} = \frac{1}{|R|} \sum_{i=1}^{|R|} \frac{\hat{r}_i}{|r_i|}</tex-math></disp-formula>
          <disp-formula><tex-math>\mathit{precision} = \frac{1}{|P|} \sum_{i=1}^{|P|} \frac{\hat{p}_i}{|p_i|}</tex-math></disp-formula>
          <disp-formula><tex-math>\mathit{granularity} = \log_2 \left( 1 + \frac{1}{|R_P|} \sum_{i=1}^{|R_P|} |r_i \cap P| \right)</tex-math></disp-formula>
          <disp-formula><tex-math>\mathit{overall} = \frac{F}{\mathit{granularity}}</tex-math></disp-formula>
where |r_i ∩ P| denotes the number of different
detections that overlap with the plagiarized
passage r_i, and F is the harmonic mean of recall
and precision. Essentially, the granularity
measure indicates the fragmentation of the
detected passages. A granularity value of 1
means that at most one detected section
overlaps with a plagiarized passage.</p>
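        <p>These measures can be computed directly from character offsets, as in the following sketch. It assumes that passages are given as half-open (start, end) character ranges of a single document and that granularity defaults to 1 when no plagiarized passage is detected; both choices, and the function names, are assumptions made for illustration.</p>
        <preformat>
```python
import math


def covered(passage, others):
    """Number of characters of passage that fall inside any passage in others."""
    start, end = passage
    chars = set()
    for o_start, o_end in others:
        chars.update(range(max(start, o_start), min(end, o_end)))
    return len(chars)


def evaluate(actual, detected):
    """Return (recall, precision, granularity, overall) for one set of passages."""
    recall = sum(covered(r, detected) / (r[1] - r[0]) for r in actual) / len(actual)
    precision = (sum(covered(p, actual) / (p[1] - p[0]) for p in detected)
                 / len(detected)) if detected else 0.0
    # Number of detections overlapping each plagiarized passage; keep only R_P.
    hits = [sum(1 for p in detected if covered(r, [p]) > 0) for r in actual]
    hits = [h for h in hits if h > 0]
    granularity = math.log2(1 + sum(hits) / len(hits)) if hits else 1.0
    total = recall + precision
    f_score = 2 * recall * precision / total if total > 0 else 0.0
    return recall, precision, granularity, f_score / granularity


actual = [(0, 100), (200, 300)]
detected = [(0, 50), (60, 100), (400, 500)]
print(evaluate(actual, detected))
```
        </preformat>
        <p>In this toy input the first plagiarized passage is covered by two separate detections, so the granularity exceeds 1 and lowers the overall score.</p>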
        <p>The results of the evaluation of the
plagiarized passage criterion on the development
and competition corpora of the intrinsic
plagiarism analysis task are included in Table 3
(taken from the official results of the
competition). The parameter values shown in
Table 1 were used to produce these results.
As can be seen, the performance of the
proposed method remains stable for both
corpora. Actually, the performance on
IPAT-CC is better than on IPAT-DC, which was used for
estimating the parameter values. This
indicates that the proposed settings are quite
general and robust.</p>
        <p>Figure 7 provides a closer look at the
recall-precision results on IPAT-DC with respect to
the text length of documents. It is obvious that
recall is dramatically affected by decreasing
text length. The distribution of false negatives
shown in Figure 6 offers a reasonable
explanation for this. Precision is more stable.
However, it tends to decrease as text length
increases.</p>
        <sec id="sec-3-2-1">
          <title>Discussion</title>
          <p>In this paper a new method for intrinsic
plagiarism detection has been presented. The
proposed approach is based on character n-gram
profiles, a style change function using an
appropriate dissimilarity measure as well as a
set of heuristic rules to detect plagiarized
passages. The evaluation results demonstrate
that it is able to detect roughly half of the
plagiarized sections. On the other hand, the
precision remains low. An important factor for
improving precision is the development of more
sophisticated and accurate plagiarism-free
criteria on the document level. The precision
can also be improved by increasing the
sensitivity parameter a. However, this will
harm recall.</p>
          <p>The proposed method is easy to follow and
requires no language-dependent resources.
Moreover, it requires no text segmentation or
preprocessing. The proposed parameter settings
proved to be effective when the approach was
evaluated on IPAT-CC. Note that the
parameter values of Table 1 were not optimized
for IPAT-DC. However, the application of
machine learning algorithms could improve the
estimation of these parameters. In particular, the
definition of the window length is crucial since
it determines the shortest plagiarized passage
that can be detected. On the other hand, a very
short window would not adequately capture the
stylistic properties of the text.</p>
          <p>Another future work direction is to examine
different schemes for comparing a text window
with the whole document. The approach
followed in this paper is fast since it requires
the calculation of only one profile for the whole
document. Alternative approaches include the
comparison of the text window with the
window complement (the document without the
window) and the comparison of a text window
with all the other text windows.</p>
          <p>Finally, character n-grams of higher order
could be used. Preliminary experiments using
character 4-grams and 5-grams did not show
significant improvement in the performance of
the method. However, this remains to be
carefully examined.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Graham</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Hirst</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Marthi</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          (
          <year>2005</year>
          ).
          <article-title>Segmenting Documents by Stylistic Character</article-title>
          .
          <source>Natural Language Engineering</source>
          ,
          <volume>11</volume>
          (
          <issue>4</issue>
          ):
          <fpage>397</fpage>
          -
          <lpage>415</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Hoad</surname>
            ,
            <given-names>T.C.</given-names>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Zobel</surname>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>Methods for Identifying Versioned and Plagiarised Documents</article-title>
          .
          <source>Journal of the American Society for Information Science and Technology</source>
          ,
          <volume>54</volume>
          (
          <issue>3</issue>
          ):
          <fpage>203</fpage>
          -
          <lpage>215</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Holmes</surname>
            ,
            <given-names>D.I.</given-names>
          </string-name>
          <year>1998</year>
          .
          <article-title>The Evolution of Stylometry in Humanities Scholarship</article-title>
          .
          <source>Literary and Linguistic Computing</source>
          ,
          <volume>13</volume>
          (
          <issue>3</issue>
          ):
          <fpage>111</fpage>
          -
          <lpage>117</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Kanaris</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          and
          <string-name>
            <given-names>E.</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Webpage Genre Identification Using Variable-length Character n-grams</article-title>
          ,
          <source>In Proc. of the 19th IEEE Int. Conf. on Tools with Artificial Intelligence, v.2</source>
          , pp.
          <fpage>3</fpage>
          -
          <lpage>10</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Keselj</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Cercone</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Thomas</surname>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>N-gram-based Author Profiles for Authorship Attribution</article-title>
          .
          <source>In Proceedings of the Pacific Association for Computational Linguistics</source>
          , pp.
          <fpage>255</fpage>
          -
          <lpage>264</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Koppel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schler</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Argamon</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Computational Methods in Authorship Attribution</article-title>
          ,
          <source>Journal of the American Society for Information Science and Technology</source>
          ,
          <volume>60</volume>
          (
          <issue>1</issue>
          ):
          <fpage>9</fpage>
          -
          <lpage>26</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Maurer</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Kappe</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Zaka</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>Plagiarism - A Survey</article-title>
          .
          <source>Journal of Universal Computer Science</source>
          ,
          <volume>12</volume>
          (
          <issue>8</issue>
          ):
          <fpage>1050</fpage>
          -
          <lpage>1084</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Meyer zu Eissen</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Kulig</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Plagiarism Detection without Reference Collections</article-title>
          .
          <source>Advances in Data Analysis</source>
          , pp.
          <fpage>359</fpage>
          -
          <lpage>366</lpage>
          , Springer.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Eiselt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Barron</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Plagiarism Corpus PAN-PC-09</article-title>
          . Webis at Bauhaus-Universitaet Weimar and NLEL at Universidad Politecnica de Valencia. (http://www.webis.de/research/corpora)
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <year>2009</year>
          .
          <article-title>A Survey of Modern Authorship Attribution Methods</article-title>
          , Journal of
        </mixed-citation>
        <mixed-citation>
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <year>2007</year>
          .
          <article-title>Author Identification Using Imbalanced and Limited Training Texts</article-title>
          .
          <source>In Proceedings of the 4th International Workshop on Text-based Information Retrieval</source>
          , pp.
          <fpage>237</fpage>
          -
          <lpage>241</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <year>2006</year>
          .
          <article-title>Ensemble-based Author Identification Using Character N-grams</article-title>
          ,
          <source>In Proc. of the 3rd Int. Workshop on Text-based Information Retrieval</source>
          , pp.
          <fpage>41</fpage>
          -
          <lpage>46</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Meyer zu Eissen</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Intrinsic Plagiarism Analysis with Meta Learning</article-title>
          .
          <source>In Proceedings of the SIGIR Workshop on Plagiarism Analysis, Authorship Attribution, and Near-Duplicate Detection</source>
          , pp.
          <fpage>45</fpage>
          -
          <lpage>50</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <year>2005</year>
          .
          <article-title>Fuzzy-Fingerprints for Text-Based Information Retrieval</article-title>
          .
          <source>In Proceedings of the 5th International Conference on Knowledge Management, J.UCS</source>
          :
          <fpage>572</fpage>
          -
          <lpage>579</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>