<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Intrinsic Plagiarism Detection Using Character Trigram Distance Scores</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mike Kestemont</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kim Luyckx</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Walter Daelemans</string-name>
        </contrib>
        <aff id="aff0">
          <institution>CLiPS Computational Linguistics Group, University of Antwerp</institution>
          ,
          <country country="BE">Belgium</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2011</year>
      </pub-date>
      <abstract>
        <p>In this paper, we describe a novel approach to intrinsic plagiarism detection. Each suspicious document is divided into a series of consecutive, potentially overlapping 'windows' of equal size. These are represented by vectors containing the relative frequencies of a predetermined set of high-frequency character trigrams. Subsequently, a distance matrix is set up in which each of the document's windows is compared to every other window. The distance measure used is a symmetric adaptation of the normalized distance (nd1) proposed by Stamatatos [17]. Finally, an algorithm for outlier detection in multivariate data (based on Principal Components Analysis) is applied to the distance matrix in order to detect plagiarized sections. In the PAN-PC-2011 competition, this system finished second, achieving a competitive recall (.4279) but only a plagdet of .1679 due to a disappointing precision (.1075).</p>
      </abstract>
      <kwd-group>
        <kwd>intrinsic plagiarism detection</kwd>
        <kwd>character n-grams</kwd>
        <kwd>distance scores</kwd>
        <kwd>outlier detection</kwd>
        <kwd>stylometry</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Recently, it has been questioned whether this artificial assumption is in fact realistic
for real-world cases of plagiarism [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. On many occasions, the potential sources of
plagiarism are not known beforehand or might not even be freely available, let alone
searchable in a digital format in the public domain (e.g. world wide web). Moreover,
with the increasing amount of easy-access, online information sources nowadays, even a
semi-exhaustive search of an author’s potential sources becomes increasingly
demanding from a computational point of view. In ‘intrinsic plagiarism detection’, the question
is therefore raised whether plagiarized sections can be detected in a suspicious
document, in the absence of any external reference material [
        <xref ref-type="bibr" rid="ref10 ref11 ref18">11,10,18</xref>
        ]. The idea is that,
if an author plagiarized a specific section in a work, one would expect this section to
be stylistically deviant from the non-plagiarized sections in the same document that
were indeed originally written by the author himself. If this expectation holds true, it
should be possible to detect such contaminated sections without having access to an
external reference corpus. This kind of plagiarism research bears close similarities to
computational authorship studies in the field of stylometry, in which scholars study the
correlation between authorial identity and writing style [
        <xref ref-type="bibr" rid="ref16 ref5 ref7">16,5,7</xref>
        ].
      </p>
      <p>
        Obviously, the intrinsic variant of plagiarism detection is inherently more difficult than
plagiarism detection approaches that can depend, to a great extent, on external reference
material. Note that in its purest formulation, the intrinsic approach does not presuppose that
a system has access to external, genuine writings by the suspicious document’s main
author [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. The availability of such external training material for supervised learning – a
typical experimental set-up in stylometric authorship attribution [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] – would naturally
facilitate the task of singling out stylistically deviant passages in an unseen suspicious
document. Intrinsic plagiarism detection, however, is not only interesting from a
theoretical perspective but is also directly relevant for a number of practical plagiarism
scenarios. An interesting example from the world of academia would be a master
student hiring a ghost-writer to write one of the chapters of his or her master’s thesis.
An external plagiarism detector might fail to spot the deviant style of the ‘outsourced’
chapter, since the ghost writer himself need not have plagiarized. An intrinsic approach,
however, might more easily detect the authorial ruptures and alert the student’s
supervisor to the compromised integrity of the thesis.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Document representation</title>
      <p>
        By definition, the intrinsic plagiarism analysis of a suspicious document is limited to an
analysis of the suspicious document itself. The initial representation of the suspicious
document is therefore vital to an intrinsic approach, determining much of a system’s
subsequent procedures. The seminal work in this field has been characterized by a fairly
standard methodology in this respect (cf. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]). Typically, a suspicious document is
segmented into a series of consecutive (potentially partially overlapping but non-identical)
samples or ‘windows’ of equal size. Window sizes have typically been fixed – variable
window sizes have rarely been considered – but the optimal window size is still unclear.
On the one hand, stylistic authorship analyses typically require relatively large samples
of text (at least ca. 2500 words) while, on the other hand, plagiarized sections need not
be all that long (e.g. a single sentence) [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. It is therefore common to use segmentation
parameters that offer a trade-off between the issues of granularity and performance.
Subsequently, it is typical for many studies to compile a sort of feature vector
(‘document profile’) on the basis of the suspicious document as a whole [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. Next, each
(shorter) document window is compared to the (larger) document profile using some
sort of distance measure. Finally, stylistically deviant windows are identified using an
outlier detection algorithm.
      </p>
      <p>
        In this paper, we experiment with a novel methodology that departs from this standard
document representation and the associated window vs. document profile comparisons.
The latter procedure seems troublesome on a number of levels. First of all, from the
point of view of stylometry as well as linguistic theory, it seems strange that the
relatively smaller windows are compared to the larger profile of the overall document.
The frequency distributions of words as well as other style markers are known to be
affected by a text’s length [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Therefore the stylometric comparison of two samples of
so different a size (the single window vs. the entire document) is hard to justify from
a theoretical perspective. Naturally, distance measures can be used to normalize for the
effect of differences in document length (e.g. cosine distance) but even then, this
problem seems hard to overcome (cf. [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]). Moreover, the underlying assumption of
this approach is that the majority of the suspicious document was genuinely written by
a single author. Only in that case would the plagiarized sections be easily
distinguishable from the overall document’s profile as outliers. If a disproportionate share (e.g.
more than half) of the suspicious document was in fact plagiarized (possibly from
various source documents) it seems unlikely that the incoherent profile of such a document
would constitute a reliable touchstone for stylistic outlier detection.
      </p>
      <p>In designing our approach, we set out from the hypothesis that comparing a single
window to another single window might provide a more reliable methodology than
comparing a single window to a much larger entity. With regard to the base sampling
parameters, our system nevertheless has the same segmentation parameters (expressed
in absolute character counts) as previous approaches: a ‘window size’ (ws), the length
of each window, and a ‘step size’ (ss, with ws &gt;= ss &gt; 0) determining the number of
characters between the starting points of two consecutive windows. Note that the step size
should be larger than zero (to actually proceed through the document while segmenting it)
and should preferably be smaller than or equal to the window size (in order not to skip any text).
The n windows that result from this segmentation procedure are then used to create an
n x n distance table. For a document that has been segmented into n equal-sized windows
w1, w2, ..., wn−1, wn, this particular representation is illustrated in Table 1. The cells
in this table are subsequently filled out with the distance scores between windows
described in the next section.</p>
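The segmentation procedure just described can be sketched in a few lines; the function name is illustrative (not the authors' code), while the parameters mirror the ws and ss of the text:

```python
def segment(text, ws, ss):
    """Split a document into consecutive, equal-sized, possibly overlapping
    windows of ws characters, advancing ss characters at a time.
    ws >= ss > 0: ss > 0 proceeds through the document, ss <= ws skips no text."""
    assert ws >= ss > 0
    # A window starts at every ss-th character position while a full window
    # still fits; a document shorter than ws yields a single (short) window.
    return [text[i:i + ws] for i in range(0, max(len(text) - ws, 0) + 1, ss)]
```

For instance, a 10-character document with ws = 4 and ss = 2 yields four half-overlapping windows.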
      <p>
        Windows are represented in terms of character n-grams, as these have been proven
useful for style-based categorization (e.g. authorship attribution) [
        <xref ref-type="bibr" rid="ref3 ref6">3,6</xref>
        ]. They are also
able to reliably handle limited data, which is an asset considering the variable length of
plagiarized sections (between 50 and 5000 consecutive words).
...
The distance score we have adopted is an adapted version of the normalized distance
(nd1) proposed by Stamatatos [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. This distance operates on the level of character n-grams,
whereby a text is divided into a series of overlapping character groups of length
n. Under a third order character n-gram model (n=3), the word ‘plagiarism’ would for
instance include the following character trigrams (with whitespace represented as an
underscore): {‘_pl’, ‘pla’, ‘lag’, ‘agi’, ‘gia’, ‘iar’, ‘ari’, ‘ris’, ‘ism’, ‘sm_’}. When
calculating the original nd1 between windows wx and wy, a list is created of all
n-grams found in wx (but not necessarily in wy). This collection is called the ‘profile’
P(wx), with |P(wx)| denoting its size. This profile is used to
calculate the normalized distance between two windows using the following formula,
where fwx(g) represents the frequency of trigram g in wx:
nd1(wx, wy) = [ Σ_{g ∈ P(wx)} ( 2 · (fwx(g) − fwy(g)) / (fwx(g) + fwy(g)) )² ] / (4 · |P(wx)|)
      </p>
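The padded trigram extraction illustrated above with the word ‘plagiarism’ can be reproduced as follows (a minimal sketch; `char_ngrams` is an illustrative name):

```python
def char_ngrams(word, n=3, pad="_"):
    """Overlapping character n-grams of a word, with whitespace/word
    boundaries represented by a padding character (underscore here)."""
    s = pad + word + pad
    return [s[i:i + n] for i in range(len(s) - n + 1)]
```

Under a third-order model, `char_ngrams("plagiarism")` returns exactly the ten trigrams listed in the text, from ‘_pl’ through ‘sm_’.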
      <p>
        The denominator ensures that the real number resulting from this dissimilarity
function will lie between 0 (extreme similarity) and 1 (extreme dissimilarity). Note that this
approach is computationally expensive when dealing with large text collections, since
each comparison is based on a different set of n-grams that will have to be
re-established for every comparison. Moreover, this measure is not symmetric: in
calculating nd1(wx, wy≠x), only the trigrams encountered in P(wx) are considered, so that
nd1(wx, wy≠x) ≠ nd1(wy≠x, wx). We therefore propose a novel use of nd1, whereby the
original formula is not applied to every g ∈ P(wx) but, instead, to every n-gram in a
predefined set of high-frequency n-grams that is the same for all suspicious documents
used in an experiment. This suggestion is inspired by work on authorship attribution
in stylometry which has shown that high-frequency linguistic items (and in particular
n-grams) show excellent performance in comparison to other, more difficult to extract
features [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. This adaptation can be justified in terms of efficiency as well as efficacy.
Note that this adaptation of Stamatatos’ normalised distance [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] is symmetric, meaning that the adapted distance between wx and wy equals that
between wy and wx, while the distance from a window to itself is 0 by definition. As such,
each row in e.g. Table 1 can be considered a vector that describes one window’s
behavior in terms of its distance to all other windows in the same document. Note that
such a representation is reminiscent of the kind of distance matrices used in statistical
clustering.
      </p>
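The symmetric adaptation over a fixed feature set, and the resulting n x n distance table, might be sketched as follows. All names are illustrative; since the windows are equal-sized, raw trigram counts are used in place of relative frequencies, to which they are proportional:

```python
from collections import Counter

def sym_nd1(win_x, win_y, feature_set):
    """Symmetric adaptation of Stamatatos' nd1: the sum runs over a fixed,
    predefined set of high-frequency trigrams instead of the profile P(wx),
    so the score no longer depends on the order of the two windows."""
    fx = Counter(win_x[i:i + 3] for i in range(len(win_x) - 2))
    fy = Counter(win_y[i:i + 3] for i in range(len(win_y) - 2))
    total = 0.0
    for g in feature_set:
        denom = fx[g] + fy[g]
        if denom:  # trigrams absent from both windows contribute 0
            total += (2.0 * (fx[g] - fy[g]) / denom) ** 2
    # Each term is at most 4, so dividing by 4 * |feature_set| bounds
    # the distance to [0, 1].
    return total / (4.0 * len(feature_set))

def distance_matrix(windows, feature_set):
    """n x n table of pairwise window distances (cf. Table 1)."""
    return [[sym_nd1(wi, wj, feature_set) for wj in windows] for wi in windows]
```

Each row of the resulting matrix is the vector describing one window's distance behavior, as discussed above.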
    </sec>
    <sec id="sec-3">
      <title>Outlier detection</title>
      <p>
        Based on the distance matrix representing a suspicious document, one can now try to
detect stylistically deviant sections. Given our particular text representation, this task
becomes similar to outlier identification in multivariate data sets. In the case of long
documents and small window sizes, our representation can be high-dimensional. In our
system, we have therefore used a technique for outlier identification in high dimensions,
proposed by Filzmoser, Maronna and Werner [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] as implemented in the mvoutlier
package for the R statistical software package [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. This technique has been optimized for
data sets in which the number of dimensions is very high or even severely
outnumbers the number of observations (note that in our case both are equal). The technique
is furthermore computationally efficient because it reduces the size of the data set by
applying a Principal Components Analysis (PCA), a standard technique for
dimensionality reduction, commonly applied in stylometry [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Subsequently, a robust version of
the Mahalanobis distance (commonly used for outlier identification) is used to detect
outliers in the most informative dimensions resulting from the PCA. Interestingly, the
algorithm assigns different weights to different components, because outliers will tend
to be extremely clear in one component while remaining relatively inconspicuous in others. The
software will eventually output a boolean decision for each observation, indicating which
windows can be considered stylistic ‘outliers’ and thus potentially plagiarized. In our
system the characters of all windows returned as ‘outliers’ by the mvoutlier package
were assumed to be plagiarized. Adjacent and overlapping windows, however, were
concatenated into a single plagiarism instance in order to ensure a good granularity of
our approach.
      </p>
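The actual system relies on the pcout algorithm of Filzmoser et al. [4] as implemented in R's mvoutlier package; the following is only a rough Python analogue of the general idea (PCA reduction followed by robust per-component scoring), with illustrative names and thresholds:

```python
import numpy as np

def flag_outlier_windows(dist_matrix, n_components=2, z_thresh=2.5):
    """Flag windows whose rows in the distance matrix are multivariate
    outliers: project the (centered) matrix onto its top principal
    components, robustly standardize each component via median/MAD,
    and flag any window that is extreme in at least one component."""
    X = np.asarray(dist_matrix, dtype=float)
    X = X - X.mean(axis=0)                       # center the columns
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    scores = X @ Vt[:n_components].T             # PCA scores per window
    med = np.median(scores, axis=0)
    mad = np.median(np.abs(scores - med), axis=0) + 1e-12
    z = 0.6745 * np.abs(scores - med) / mad      # robust z-scores
    return (z > z_thresh).any(axis=1)            # True = potential plagiarism
```

As in the system described above, adjacent or overlapping flagged windows would then be merged into a single plagiarism instance.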
    </sec>
    <sec id="sec-4">
      <title>Experimental Results and Discussion</title>
      <p>In this section, we will report some of our experimental results on the training and test
corpora used in the PAN-PC-2011 competition. All alphabetic characters in the
suspicious documents were lowercased, but apart from this, no other preprocessing steps
(e.g. reduction of multiple whitespaces) were taken. This decision ensured close
comparability with the character indices used to denote plagiarized fragments. Our distance
metric is dependent on a set of high-frequency character n-grams: we extracted a list
of all n-grams from the entire corpus of suspicious documents from the 2010 competition
and ranked them according to their cumulative, absolute frequency. For each
experiment, we selected the n most frequent n-grams from the top of this list (e.g. n=1,000).
Due to a lack of time, we only experimented with character trigrams and mainly focused
on the effect of the segmentation parameters on the document representation.</p>
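Assuming the corpus is available as a list of document strings, the feature-set construction described here could look as follows (an illustrative sketch, not the authors' code):

```python
from collections import Counter

def top_ngrams(corpus_docs, n_top=1000, n=3):
    """Rank every character n-gram in the corpus by cumulative absolute
    frequency and keep the n_top most frequent ones as the feature set."""
    counts = Counter()
    for doc in corpus_docs:
        doc = doc.lower()  # lowercasing is the only preprocessing step applied
        counts.update(doc[i:i + n] for i in range(len(doc) - n + 1))
    return [g for g, _ in counts.most_common(n_top)]
```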
      <p>
        Table 2 presents the results of a series of experiments, exploring the effect of the
segmentation parameters ws and ss on the overall performance of the system on the
training corpus for the PAN-PC-2011 competition. We report on the figures
concerning precision, recall, granularity as well as the overall plagdet score [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. The figures
were calculated using the reference implementation available from the competition’s
website. From this table it is clear that the system can more easily reach a higher
recall (max. 45.09) than a higher precision (max. 29.79). Smaller window sizes seem to
yield better scores and the same seems true for smaller step sizes (e.g. ss &lt;= ws/2),
while the granularity is only slightly worse with these settings. Because of the large
difference between precision and recall, we have subsequently tried to tune the outbound
parameter of the pcout function, ‘a numeric value between 0 and 1 indicating the
outlier boundary for defining values as final outliers (default to 0.25)’ [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Table 3 covers a
number of fairly random experiments that show that a lower outbound value tended to
boost the recall even more, while higher outbound values slightly pushed the system’s
precision (cf. Table 2).
      </p>
      <p>We submitted a test run for the competition with the following settings: ws =
5,000; ss = 2,500; n = 2,500; outbound = .20, which on the training corpus reached
a plagdet of 28.60, a recall of 36.37, a precision of 26.7 and a granularity of 1.11. We
chose these parameters, because the average document length in the test corpus was
smaller than in the training corpus and we were speculating that these settings (e.g.
higher n) were better suited to handle the fine-grained analysis of such shorter
documents. An outbound of .20 was selected to ensure high recall. The results of these
settings on the test corpus were a plagdet of 16.79, a recall of 42.79, a precision of
10.75 and a granularity of 1.03, resulting in a second place. Whereas we find a more
than competitive recall, precision for the test corpus is surprisingly low, as compared to
results from the development phase.</p>
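The plagdet score [13] combines the F1 of precision and recall with a penalty for poor granularity, plagdet = F1 / log2(1 + gran); the reported test-run figures can be checked against this definition:

```python
import math

def plagdet(precision, recall, granularity):
    """plagdet as defined in the PAN evaluation framework [13]:
    F1 of precision and recall, divided by log2(1 + granularity)."""
    f1 = 2 * precision * recall / (precision + recall)
    return f1 / math.log2(1 + granularity)

# Plugging in the test-corpus figures (prec .1075, rec .4279, gran 1.03)
# gives roughly .168, in line with the reported plagdet of .1679.
score = plagdet(0.1075, 0.4279, 1.03)
```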
      <p>The system identified 18,691 cases of plagiarism (with low precision), where only
11,443 needed to be detected. While a lot of short (&lt; 1,000 characters) plagiarized
passages were not detected, keeping window size and step size relatively high did result
in reasonable scores for longer passages. A rough error analysis shows that the system
did not detect any plagiarism in the majority of cases with manually obfuscated text. A
smaller step size might increase the system’s precision, but would be more
computationally expensive.
</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>
        In this paper, we introduced a novel type of document representation for the intrinsic
plagiarism detection task. While the standard methodology – comparing a profile of the
full document to every smaller window in the text [
        <xref ref-type="bibr" rid="ref12 ref17">12,17</xref>
        ] – assumes the majority of the
suspicious document was written by a single author, our approach is not hindered by
this premise. By comparing a single window to all equal-sized windows from the
document and applying outlier detection, we can detect stylistic outliers. In order to keep
computational cost within bounds, we rely on a predetermined set of high-frequency
character trigrams and consequently apply a symmetric adaptation of the normalized
distance (nd1) proposed by Stamatatos [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
      </p>
      <p>During the development phase, we experimented with a number of variables, such
as window size, step size, and the outbound parameter of the outlier detection algorithm.
Although our specific selection of parameters returned high recall and reasonable
precision, the actual test run scored competitively in recall but disappointed in terms of
precision. Short and medium-length plagiarized sections seem to be particularly
challenging for our approach.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgements</title>
      <p>Mike Kestemont is a Ph.D. fellow of the Research Foundation – Flanders (FWO). The
research of Luyckx and Daelemans is partially funded through the IWT project
AMiCA: Automatic Monitoring for Cyberspace Applications.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Baayen</surname>
            ,
            <given-names>R.H.</given-names>
          </string-name>
          :
          <article-title>Word Frequency Distributions</article-title>
          , Text,
          <source>Speech and Language Technology</source>
          , vol.
          <volume>18</volume>
          .
          <string-name>
            <surname>Kluwer</surname>
          </string-name>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Binongo</surname>
            ,
            <given-names>J.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smith</surname>
            ,
            <given-names>W.:</given-names>
          </string-name>
          <article-title>The application of principal components analysis to stylometry</article-title>
          .
          <source>Literary and Linguistic Computing</source>
          <volume>14</volume>
          (
          <issue>4</issue>
          ),
          <fpage>445</fpage>
          -
          <lpage>466</lpage>
          (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Clement</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sharp</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Ngram and Bayesian classification of documents for topic and authorship</article-title>
          .
          <source>Literary and Linguistic Computing</source>
          <volume>18</volume>
          (
          <issue>4</issue>
          ),
          <fpage>423</fpage>
          -
          <lpage>447</lpage>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Filzmoser</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maronna</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Werner</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Outlier identification in high dimensions</article-title>
          .
          <source>Computational Statistics and Data Analysis</source>
          <volume>52</volume>
          (
          <issue>3</issue>
          ),
          <fpage>1694</fpage>
          -
          <lpage>1711</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Juola</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Authorship attribution</article-title>
          .
          <source>Found. Trends Inf. Retr</source>
          .
          <volume>1</volume>
          ,
          <fpage>233</fpage>
          -
          <lpage>334</lpage>
          (
          <year>2006</year>
          ), http://portal.acm.org/citation.cfm?id=1373450.1373451
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Keselj</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peng</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cercone</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thomas</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>N-gram-based author profiles for authorship attribution</article-title>
          .
          <source>In: Proceedings of the 6th Conference of the Pacific Association for Computational Linguistics</source>
          . pp.
          <fpage>255</fpage>
          -
          <lpage>264</lpage>
          . Pacific Association for Computational Linguistics, Halifax, Canada (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Koppel</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schler</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Argamon</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Computational methods in authorship attribution</article-title>
          .
          <source>J. Am. Soc. Inf. Sci. Technol</source>
          .
          <volume>60</volume>
          ,
          <fpage>9</fpage>
          -
          <lpage>26</lpage>
          (
          <year>2009</year>
          ), http://portal.acm.org/citation.cfm?id=1484611.1484627
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Luyckx</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Daelemans</surname>
            ,
            <given-names>W.:</given-names>
          </string-name>
          <article-title>The effect of author set size and data size in authorship attribution</article-title>
          .
          <source>Literary and Linguistic Computing</source>
          <volume>26</volume>
          (
          <issue>1</issue>
          ),
          <fpage>35</fpage>
          -
          <lpage>55</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Maurer</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kappe</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zaka</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Plagiarism - A Survey</article-title>
          .
          <source>Journal of Universal Computer Science</source>
          <volume>12</volume>
          (
          <issue>8</issue>
          ),
          <fpage>1050</fpage>
          -
          <lpage>1084</lpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Meyer zu Eißen</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Intrinsic Plagiarism Detection</article-title>
          . In:
          <string-name>
            <surname>Lalmas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>MacFarlane</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rüger</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tombros</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tsikrika</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yavlinsky</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (eds.)
          <source>Advances in Information Retrieval: Proceedings of the 28th European Conference on IR Research (ECIR 06). Lecture Notes in Computer Science</source>
          , vol.
          <volume>3936</volume>
          LNCS, pp.
          <fpage>565</fpage>
          -
          <lpage>569</lpage>
          . Springer, Berlin Heidelberg New York (
          <year>2006</year>
          ), http://www.springerlink.com/content/x7x483u1k3970863/
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Meyer zu Eißen</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kulig</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Plagiarism Detection without Reference Collections</article-title>
          . In:
          <string-name>
            <surname>Decker</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lenz</surname>
            ,
            <given-names>H.J.</given-names>
          </string-name>
          (eds.)
          <article-title>Advances in Data Analysis</article-title>
          .
          <source>Selected Papers from the 30th Annual Conference of the German Classification Society (GfKl)</source>
          . pp.
          <fpage>359</fpage>
          -
          <lpage>366</lpage>
          . Springer (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barrón-Cedeño</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eiselt</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Overview of the 2nd International Competition on Plagiarism Detection</article-title>
          . In:
          <source>Notebook Papers of CLEF 2010 LABs and Workshops</source>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barrón-Cedeño</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>An Evaluation Framework for Plagiarism Detection</article-title>
          . In:
          <source>Proceedings of the 23rd International Conference on Computational Linguistics (COLING 2010)</source>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eiselt</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barrón-Cedeño</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Overview of the 1st International Competition on Plagiarism Detection</article-title>
          . In:
          <source>Proceedings of the 3rd Workshop on Uncovering Plagiarism, Authorship and Social Software Misuse</source>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <collab>R Development Core Team</collab>
          :
          <source>R: A Language and Environment for Statistical Computing</source>
          . R Foundation for Statistical Computing, Vienna, Austria (
          <year>2011</year>
          ), http://www.R-project.org, ISBN 3-900051-07-0
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>A Survey of Modern Authorship Attribution Methods</article-title>
          .
          <source>Journal of the American Society for Information Science and Technology</source>
          <volume>60</volume>
          (
          <issue>3</issue>
          ),
          <fpage>538</fpage>
          -
          <lpage>556</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Intrinsic Plagiarism Detection Using Character N-gram Profiles</article-title>
          . In:
          <source>Proceedings of the 3rd International Workshop on Uncovering Plagiarism, Authorship, and Social Software Misuse</source>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lipka</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prettenhofer</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Intrinsic Plagiarism Analysis</article-title>
          .
          <source>Language Resources and Evaluation</source>
          <volume>45</volume>
          (
          <issue>1</issue>
          ),
          <fpage>63</fpage>
          -
          <lpage>82</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>