<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Intrinsic Plagiarism Detection using Complexity Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Leanne Seaward</string-name>
          <email>seaward@yahoo.ca</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stan Matwin</string-name>
          <email>stan@site.uottawa.ca</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Ottawa</institution>
          <addr-line>2096 Madrid Avenue, Ottawa, ON, K2J 0K4</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2009</year>
      </pub-date>
      <fpage>56</fpage>
      <lpage>61</lpage>
      <abstract>
        <p>We introduce Kolmogorov complexity measures as a way of extracting structural information from texts for intrinsic plagiarism detection. Kolmogorov complexity measures have been used as features in a variety of machine learning tasks including image recognition, radar signal classification, EEG classification, DNA analysis, speech recognition and some text classification tasks (Chi and Kong, 1998; Zhang, Hu, and Jin, 2003; Bhattacharya, 2000; Menconi, Benci, and Buiatti, 2008; Frank, Chui, and Witten, 2000; Dalkilic et al., 2006; Seaward and Saxton, 2007; Seaward, Inkpen, and Nayak, 2008). Intrinsic plagiarism detection uses no external corpus for document comparison, so plagiarism must be detected solely on the basis of style shifts within the text to be analyzed. Given the small amount of text to be analyzed, feature extraction is of particular importance. We give a theoretical background as to why complexity measures are meaningful and we present experimental results on the PAN'09 Intrinsic Plagiarism Corpus. We show that complexity features based on the Lempel-Ziv compression algorithm slightly increase performance over features based on normalized counts. Furthermore, we believe that more sophisticated compression algorithms which are suited to compressing the English language show great promise for feature extraction in various text classification problems.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Intrinsic plagiarism analysis involves
analyzing a document for style changes which would
suggest that certain passages have been
written by a different author and are therefore
plagiarized. It is closely related to
authorship attribution and stylometry
        <xref ref-type="bibr" rid="ref10 ref11 ref5 ref9">(Stamatatos,
Fakotakis, and Kokkinakis, 2000; Stein and
Meyer zu Eissen, 2007)</xref>
        . Intrinsic
plagiarism analysis is a very challenging problem
because one has a small amount of text for
global analysis and one must locally analyse
very small portions or chunks of that text
for style shifts. Authorship attribution
normally uses several documents for author
fingerprinting and tests possible authorship on
an entire text document.
      </p>
      <p>
        Because of the limited data available for
this task and the difficulty of the problem,
feature extraction is very important.
Plagiarism analysis tools and authorship
attribution models attempt to fingerprint an
author’s individual writing style using
normalized counts of lexical and vocabulary
richness features such as nouns, verbs, stop
words and syllables per word
        <xref ref-type="bibr" rid="ref10 ref11 ref5 ref9">(Stamatatos, Fakotakis, and Kokkinakis, 2000;
Stein and Meyer zu Eissen, 2007)</xref>
        . In
addition one may analyze a document for topic or
cohesion words. One may also use
readability indexes to determine if the level of writing
shifts
        <xref ref-type="bibr" rid="ref11 ref9">(Stein and Meyer zu Eissen, 2007)</xref>
        .
      </p>
      <p>Features are extracted globally (for the
entire document) and then locally (per sentence
or paragraph chunk). With the exception of
n-gram methods, the text is generally viewed
as a bag-of-words and structure is ignored.
We introduce a method of using compression
to extract Kolmogorov complexity features
which contain information about the
structure of style features within the text.
Extracting such features is scalable, and
complexity features can be used in
state-of-the-art machine learning algorithms such as
Support Vector Machines, Neural Networks and
Bayesian Classifiers. The small text
sample makes complexity analysis more difficult
than for the authorship attribution problem.
However, this method still shows promise and
given the difficulty of the problem, a modest
improvement is still important.</p>
    </sec>
    <sec id="sec-3">
      <title>Introduction to Plagiarism Detection</title>
      <p>There are two main types of plagiarism
analysis - Intrinsic and Extrinsic. Extrinsic
plagiarism analysis compares the document of
interest to a corpus of reference documents
(web pages, text books etc.) and tries to
find passages which were copied from the
reference collection. In contrast, intrinsic
plagiarism detection uses no reference collection
and tries to determine plagiarized passages
by analyzing style changes within the
document. Intrinsic plagiarism detection is closely
related to author fingerprinting or
stylometry.</p>
      <p>Most research in plagiarism analysis
focuses on extrinsic plagiarism analysis. If one
assumes that the reference collection is
complete, then extrinsic plagiarism analysis is a
somewhat easier problem due to the fact that
one must simply find the match between the
plagiarized passage and the corresponding
passage in the reference collection. The
difficulty lies in reducing the computation time
and detecting obfuscation attempts.</p>
      <p>Obtaining a reference collection of all
possible sources of plagiarism is impossible. Not
all books are in electronic format and
indexing all books for inclusion in such a corpus is
a formidable task. There is always the
possibility that a student has plagiarized from
a document which is not available for
indexing such as a paper from another student at
another university.</p>
      <p>One imagines that a robust plagiarism
analysis tool would use both intrinsic and
extrinsic plagiarism analysis. This is similar to
the way a human expert such as a teacher
or professor would analyze student papers for
plagiarism. One may also use intrinsic
plagiarism analysis to pre-select suspicious passages
which can then be passed to an extrinsic
plagiarism detector. It is always more desirable
to have access to the plagiarized document
as this removes all doubt as to the suspected
plagiarism.</p>
      <p>Intrinsic plagiarism analysis is related to
authorship attribution and generally uses
stylometry features which may consist of
normalized counts of lexical features such as
nouns and verbs as well as measures such
as average sentence length and average word
length. Intrinsic plagiarism detection may
also use readability indexes as well as measures
which compute the divergence of the
distribution of lexical elements from the expected
probability distribution. With the exception of
readability indexes, features are extracted as
if each chunk in the text is a bag-of-words.
Humans do not read or write bags of words,
so this approach is counterintuitive and
loses information.</p>
      <p>The need arises for a meaningful measure of
the structure of a text which
can be used as a feature in style analysis. Such
a measure must also be computable
in an efficient and scalable manner.</p>
      <p>The structure which we are measuring
must be meaningful for the classification task
at hand. We propose Kolmogorov
Complexity measures as a way of measuring structural
complexity of lexical elements in order to
fingerprint author style.</p>
    </sec>
    <sec id="sec-5">
      <title>Kolmogorov Complexity Measures</title>
      <p>This paper introduces Kolmogorov
complexity measures as style features in intrinsic
plagiarism analysis. The basic idea is that each
segment of text has a distribution with
respect to a set of word classes. For example,
with respect to the word class noun, the
text has a distribution of noun words and
non-noun words. This can be thought of as
a binary string which has a 1 for each noun
word and a 0 for each non-noun word. This
binary string represents the distribution of
noun words in the text.</p>
      <p>For example, suppose we have the string:
“Billy walked the dog yesterday.” The nouns
are “Billy” and “dog”, so the noun distribution
is ‘10010’. Likewise the only verb is “walked”,
so the verb distribution is ‘01000’. Similarly
if we look at short words (those with one
syllable) vs. long words, marking short words with a 1,
the distribution is ‘01110’. There is a different distribution for
any possible class of word type.</p>
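      <p>The noun and verb distributions of the example sentence can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the experiments: the part-of-speech tags are labelled by hand (a real system would obtain them from a tagger), and the names <monospace>TAGGED_SENTENCE</monospace> and <monospace>class_distribution</monospace> are hypothetical.</p>
      <preformat>
```python
# Hand-labelled part-of-speech tags for the example sentence (an assumption;
# a real system would obtain these from a POS tagger).
TAGGED_SENTENCE = [
    ("Billy", "NOUN"),
    ("walked", "VERB"),
    ("the", "DET"),
    ("dog", "NOUN"),
    ("yesterday", "ADV"),
]


def class_distribution(tagged_words, word_class):
    """Return a string with '1' for each word in word_class and '0' otherwise."""
    return "".join("1" if tag == word_class else "0" for _, tag in tagged_words)


noun_distribution = class_distribution(TAGGED_SENTENCE, "NOUN")
verb_distribution = class_distribution(TAGGED_SENTENCE, "VERB")
print(noun_distribution)  # 10010
print(verb_distribution)  # 01000
```
</preformat>
      <p>Any other word class can be handled the same way by supplying a different membership test, so one tagged text yields one binary string per class.</p>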
      <p>In Figure 1 we see how a text can be
decomposed into a representation for each word
class. Once we have this decomposition we
would then like to quantify the structure for
use in machine learning algorithms.</p>
      <p>Two sentences may have the same ratio
for a particular feature but the distribution
could be different. Suppose two sentences
have the following structure for short words
vs. long words.
Both representations have the same
number of long vs. short words (0 vs. 1) but the
first representation is more random and
complex than the second. It is desirable to
quantify this degree of randomness or complexity.
One way of doing so is to use Kolmogorov
complexity measures.</p>
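      <p>To make the point concrete, the sketch below compares two hypothetical ten-word representations (they are illustrative strings, not taken from any corpus). Both have the same normalized count, yet they differ in how often the text switches between short and long words. The run count used here is only a simple stand-in for structure, assumed for illustration; it is not the complexity measure proposed in this paper.</p>
      <preformat>
```python
def ones_ratio(bits):
    """Fraction of positions marked '1' (the normalized count)."""
    return bits.count("1") / len(bits)


def count_runs(bits):
    """Number of maximal runs of identical symbols; more runs means more switching."""
    return 1 + sum(1 for a, b in zip(bits, bits[1:]) if a != b)


mixed = "0110100110"    # hypothetical: short and long words interleaved
grouped = "0000011111"  # hypothetical: long words first, then short words

print(ones_ratio(mixed), ones_ratio(grouped))  # 0.5 0.5
print(count_runs(mixed), count_runs(grouped))  # 7 2
```
</preformat>
      <p>A feature based only on the ratio cannot tell these two strings apart, whereas any measure that is sensitive to ordering can.</p>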
    </sec>
    <sec id="sec-6">
      <title>Kolmogorov Complexity</title>
      <p>
        Kolmogorov complexity, also known as
algorithmic entropy, stochastic complexity,
descriptive complexity, Kolmogorov-Chaitin
complexity and program-size complexity, is
used to describe the complexity or
degree of randomness of a binary string. It
was independently developed by Andrey N.
Kolmogorov, Ray Solomonoff and Gregory
Chaitin in the 1960s
        <xref ref-type="bibr" rid="ref6">(Li and Vitanyi,
1997)</xref>
        .
      </p>
      <p>
        In computer science, all objects can be
viewed as binary strings. Thus we will
refer to objects and strings interchangeably in
this discussion. The Kolmogorov complexity
of a binary string is the length of the
shortest program which can output the string on
a universal Turing machine and then stop
        <xref ref-type="bibr" rid="ref6">(Li
and Vitanyi, 1997)</xref>
        .
      </p>
      <p>
        Kolmogorov complexity is not computable:
no algorithm can output K(x) for every binary string x.
However, methods have been developed to
approximate it. The Kolmogorov complexity
of a string x, denoted as K(x), can be
approximated using any lossless compression
algorithm
        <xref ref-type="bibr" rid="ref6">(Li and Vitanyi, 1997)</xref>
        . A compression
algorithm is one which transforms a string A,
to another shorter string, B. The associated
decompression algorithm transforms B back
into A or a string very close to A. A lossless
compression algorithm is one in which the
decompression algorithm exactly computes A
from B and a lossy compression algorithm is
one in which A can be approximated given
B. When Kolmogorov Complexity, or K(x),
is approximated, this approximation
corresponds to an upper-bound of K(x)
        <xref ref-type="bibr" rid="ref6">(Li and
Vitanyi, 1997)</xref>
        . Let C be any compression
algorithm and let C(x) be the result of
compressing x using C. The approximate
Kolmogorov complexity of x, using C as a
compression algorithm, denoted Kc(x), can be
defined as follows:
      </p>
      <disp-formula>
        <tex-math>K_c(x) = \frac{\mathrm{Length}(C(x))}{\mathrm{Length}(x)} + q</tex-math>
      </disp-formula>
      <p>where q is the length in bits of the program
which implements C. In practice, q is
usually ignored as it is not useful in comparing
complexity approximations and it varies
according to which programming language
implements C. If C was able to compress x a
great deal then Kc(x) is low and thus x has
low complexity. Likewise if C could not
compress x very much then Kc(x) is high and x
has high complexity.</p>
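      <p>The ratio of compressed length to original length can be approximated with an off-the-shelf compressor. The sketch below uses Python's standard <monospace>zlib</monospace> module and ignores q, as described above. Storing each binary symbol as one ASCII byte is an assumption made for simplicity, and the function name <monospace>approx_complexity</monospace> is hypothetical.</p>
      <preformat>
```python
import random
import zlib


def approx_complexity(bits):
    """Approximate Kc(x) as compressed length over original length (q ignored)."""
    data = bits.encode("ascii")
    return len(zlib.compress(data, 9)) / len(data)


regular = "01" * 500  # highly structured string of 1000 symbols

rng = random.Random(0)  # fixed seed so the sketch is repeatable
symbols = list("0" * 500 + "1" * 500)
rng.shuffle(symbols)
shuffled = "".join(symbols)  # same symbol counts, no obvious structure

print(approx_complexity(regular))
print(approx_complexity(shuffled))
```
</preformat>
      <p>The structured string should yield a much smaller ratio than the shuffled one even though both contain exactly 500 ones, which is the behaviour the definition is meant to capture.</p>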
    </sec>
    <sec id="sec-9">
      <title>Compression Algorithms and Kolmogorov Complexity Analysis</title>
      <p>Kolmogorov complexity can be approximated
using any lossless compression algorithm. Once
the text to be analyzed has been converted
into a binary form related to a particular
word class distribution, one simply applies a
compression algorithm to determine the
degree to which it is compressed. If it
compresses a great deal then complexity is low
and vice versa.</p>
      <p>Our previous research has used generic
compression algorithms such as run-length
encoding and Lempel-Ziv (Zlib) compression.
For this intrinsic plagiarism detection task,
Zlib compression was used. It may be
especially interesting to investigate
compression algorithms which assume prior
knowledge about the probabilities of lexical
features or which are designed to maximize
compression for language texts in a particular
language.</p>
      <p>
        <xref ref-type="bibr" rid="ref5">Frank et al. (2000)</xref>
        investigated text
categorization using statistical data compression
techniques. They use a corpus of two classes
of documents and train a statistical
compression tool (prediction by partial matching
or PPM) using each corpus. They attempt
to classify documents by determining which
compression model compresses it the most.
They conclude that data compression
techniques perform well but are inferior to
state-of-the-art machine learning techniques such as
SVM or Neural Nets. No attempt was made
to merge compression features with machine
learning algorithms.
      </p>
      <p>Thus we have three possibilities for
compression algorithms:</p>
      <list list-type="order">
        <list-item>
          <p>An algorithm which assumes no prior knowledge and which can be used for any compression task, text or otherwise.</p>
        </list-item>
        <list-item>
          <p>An algorithm which has some knowledge or prior probabilities and is trained for a specific compression task (such as compressing English text).</p>
        </list-item>
        <list-item>
          <p>An algorithm which is specifically trained with respect to a corpus which corresponds to a class which we want to predict. This is very closely related to Kolmogorov Similarity Metrics.</p>
        </list-item>
      </list>
      <p>The question arises as to whether all or
any of these compression algorithms yield
meaningful features and, if so, why.
Compression analysis for machine learning has
been done in a wide variety of fields. In
fact, much of this research has been done by
researchers outside of machine learning, who
may not realize that they are performing
machine learning, who appear to have done
little research into compression and complexity
analysis, and who report that their method
works without explaining why.</p>
      <p>
        Compression/complexity analysis has
been used in many classification tasks such
as image recognition, radar signal
classification, EEG classification, DNA analysis,
speech recognition and some text
classification tasks
        <xref ref-type="bibr" rid="ref1 ref10 ref11 ref12 ref2 ref3 ref5 ref7 ref7 ref8 ref8 ref9">(Chi and Kong, 1998; Zhang, Hu,
and Jin, 2003; Bhattacharya, 2000; Menconi,
Benci, and Buiatti, 2008; Frank, Chui, and
Witten, 2000; Dalkilic et al., 2006; Seaward
and Saxton, 2007; Seaward, Inkpen, and
Nayak, 2008)</xref>
        .
      </p>
      <p>The method proposed here is different
from those which use Kolmogorov
complexity measures to compute the distance
between the object to be classified and a corpus
of training data. As this is intrinsic
plagiarism analysis, there is no set of documents
against which we can compute a similarity metric. We can
only compare local text to the global
document. We can use a statistical compression
algorithm, and this is somewhat related to
similarity metrics, but it is not the same. We
are not explicitly using the concept that “like
compresses with like”. Moreover, such
compression measures can be used in a variety
of machine learning algorithms such as
support vector machines, neural networks and
decision trees. We can also use boosting and
meta algorithms such as bagging and
AdaBoost. What we are doing is finding a
measure for each different distribution as to how
well its complexity can be described by the
compression algorithm.</p>
    </sec>
    <sec id="sec-11">
      <title>Using Compression to Estimate Complexity</title>
      <p>Suppose we have a statistical compression
model which has been trained on a variety of
English text and we compress two text
samples and find that one compresses much more
than the other. This means that the sample which
compressed more is much more similar to general
English text than the other.</p>
      <p>Now suppose we extract the noun
representation of both of those texts and
compress them using a statistical compression
algorithm which has been trained on noun
representations of English text. The one which
compresses the most is closer to the normal
noun distributions of English text.</p>
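      <p>The effect of prior knowledge on compressibility can be imitated, very roughly, with the preset dictionary supported by Python's <monospace>zlib.compressobj</monospace> (the <monospace>zdict</monospace> argument). This is only a crude stand-in for a trained statistical model such as PPM and is not the method used in this paper. The "training" text, the two samples and the helper name <monospace>compressed_size</monospace> are all hypothetical.</p>
      <preformat>
```python
import zlib


def compressed_size(data, prior=None):
    """Size in bytes of data after DEFLATE, optionally primed with a preset dictionary."""
    if prior is None:
        compressor = zlib.compressobj(level=9)
    else:
        compressor = zlib.compressobj(level=9, zdict=prior)
    return len(compressor.compress(data) + compressor.flush())


# Hypothetical "training" text standing in for a corpus of general English.
prior_text = b"the dog walked to the park and the cat sat on the mat in the sun "

typical = b"the cat walked to the park and the dog sat on the mat"
atypical = b"zxq vbnm qwrtp lkjh gfds mnbv cxzq plok ijuh ygtf rdes"

# Bytes saved by priming the compressor with the prior text.
saved_typical = compressed_size(typical) - compressed_size(typical, prior_text)
saved_atypical = compressed_size(atypical) - compressed_size(atypical, prior_text)
print("typical sample, bytes saved by the prior:", saved_typical)
print("atypical sample, bytes saved by the prior:", saved_atypical)
```
</preformat>
      <p>The sample that resembles the prior text benefits from the dictionary while the other does not, which mirrors the argument that greater compression under a trained model indicates greater similarity to the training material.</p>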
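      <p>What if we use a compression algorithm
that has no prior training such as
Lempel-Ziv? Is it still meaningful? The answer is
yes, because we still have an idea of the
complexity of the distribution of nouns. While it
does not directly relate to the norms of the
English language, it is still a meaningful
measure of the complexity of that distribution.
It relates the noun distribution to some
distribution which can be most efficiently
compressed by that compression algorithm (even
if we do not know the distribution).</p>
      <table-wrap>
        <label>Table 1</label>
        <table>
          <thead>
            <tr><th>Classifier</th><th>Complexity features</th><th>Plagiarism</th></tr>
          </thead>
          <tbody>
            <tr><td>SVM</td><td>no</td><td>yes</td></tr>
            <tr><td>SVM</td><td>no</td><td>no</td></tr>
            <tr><td>SVM</td><td>yes</td><td>yes</td></tr>
            <tr><td>SVM</td><td>yes</td><td>no</td></tr>
            <tr><td>Neural network</td><td>no</td><td>yes</td></tr>
            <tr><td>Neural network</td><td>no</td><td>no</td></tr>
            <tr><td>Neural network</td><td>yes</td><td>yes</td></tr>
            <tr><td>Neural network</td><td>yes</td><td>no</td></tr>
          </tbody>
        </table>
      </table-wrap>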
      <p>
        Research has shown this holds true
        <xref ref-type="bibr" rid="ref11 ref7 ref8 ref9">(Seaward and Saxton, 2007; Seaward, Inkpen,
and Nayak, 2008)</xref>
        .
        <xref ref-type="bibr" rid="ref3">Dalkilic et al. (2006)</xref>
        have shown that Lempel-Ziv compression of
text can be used to distinguish authentic text
from non-authentic or computer-generated
text. They show that the compressibility of
real texts differs from that of computer-generated
“nonsense” texts due to topic
adherence. The idea is that when one writes a
coherent text, ideas and words are repeated
to increase readability.
      </p>
      <p>
        With respect to text and compression, de
Marcken theorizes that language learning is
essentially a compression problem
        <xref ref-type="bibr" rid="ref4">(De
Marcken, 1996)</xref>
        . If one has a great deal of
knowledge about a language then one can build a
model which maximizes the compressibility
of text written in that language. Thus the
compressibility of a text is a measure of how
closely related the compression algorithm is
to the text representation.
      </p>
    </sec>
    <sec id="sec-12">
      <title>Experimental Results</title>
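      <p>The PAN'09 intrinsic plagiarism
competition corpus consisted of 3091 annotated texts
for training and 3091 texts for testing
purposes (initially released unannotated). We
extracted normalized counts and complexity
measures for the following word classes:</p>
      <list list-type="bullet">
        <list-item><p>Nouns</p></list-item>
        <list-item><p>Verbs</p></list-item>
        <list-item><p>Pronouns</p></list-item>
        <list-item><p>Adjectives</p></list-item>
        <list-item><p>Adverbs</p></list-item>
        <list-item><p>Prepositions</p></list-item>
        <list-item><p>Stopwords</p></list-item>
        <list-item><p>Topic words</p></list-item>
        <list-item><p>Common words</p></list-item>
        <list-item><p>Passive words</p></list-item>
        <list-item><p>Active words</p></list-item>
        <list-item><p>Word length</p></list-item>
      </list>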
      <p>Features were extracted locally and
globally and the standard deviation amongst the
local features was also computed. Zlib was
used for compression.</p>
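      <p>A minimal sketch of the global, local and standard-deviation features for a single word class follows. It assumes the text has already been converted to a string of 0s and 1s and that local chunks are fixed-size windows of words; sentence or paragraph chunks could be substituted. The function names and the example document are hypothetical.</p>
      <preformat>
```python
import statistics
import zlib


def complexity(bits):
    """Compressed length divided by original length for a '0'/'1' string."""
    data = bits.encode("ascii")
    return len(zlib.compress(data, 9)) / len(data)


def normalized_count(bits):
    """Fraction of words that belong to the word class."""
    return bits.count("1") / len(bits)


def extract_features(bits, chunk_size):
    """Global features, per-chunk (local) features, and the spread of the local ones."""
    chunks = [bits[i:i + chunk_size] for i in range(0, len(bits), chunk_size)]
    local = [(normalized_count(c), complexity(c)) for c in chunks]
    return {
        "global_count": normalized_count(bits),
        "global_complexity": complexity(bits),
        "local": local,
        "count_sd": statistics.pstdev([count for count, _ in local]),
        "complexity_sd": statistics.pstdev([value for _, value in local]),
    }


# Hypothetical 250-word document whose last 50 words use nouns differently.
noun_bits = "10010" * 40 + "11110" * 10
features = extract_features(noun_bits, chunk_size=50)
print(features["global_count"])
print(features["local"])
```
</preformat>
      <p>In this toy document the final chunk has a normalized count that differs from the other four, so it stands out against the global value. A classifier would receive such local values alongside the global ones and their standard deviation.</p>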
      <p>A 50/50 training/test split was used on
the training set to analyze the performance
gained from adding complexity measures.
The experiment was repeated for 10 random
splits and the results were averaged. Two classifiers were used:
a Support Vector Machine (SVM) and a
Neural Network. Recall and precision are
calculated per text chunk, not per character (see
Table 1).</p>
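      <p>Per-chunk scoring can be sketched as follows. Each chunk carries a true label and a predicted label (plagiarized or not), and precision, recall and F-measure are computed for the plagiarized class. The balanced F-measure (harmonic mean of precision and recall) is assumed here, and the labels in the example are invented for illustration.</p>
      <preformat>
```python
def chunk_scores(true_labels, predicted_labels):
    """Precision, recall and F-measure for the plagiarized class, counted per chunk."""
    pairs = list(zip(true_labels, predicted_labels))
    true_pos = sum(1 for t, p in pairs if t and p)
    false_pos = sum(1 for t, p in pairs if not t and p)
    false_neg = sum(1 for t, p in pairs if t and not p)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


# Hypothetical labels for six chunks (True means plagiarized).
truth = [True, True, False, False, True, False]
predicted = [True, False, False, True, True, False]

precision, recall, f_measure = chunk_scores(truth, predicted)
print(precision, recall, f_measure)
```
</preformat>
      <p>Counting per chunk treats every chunk equally regardless of its length, which is the distinction drawn above from per-character scoring.</p>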
      <p>For many of the classifiers tested, such as
regression trees and support vector machines, the
F-measure performance gained by using
complexity features was less than 2%. The
neural network showed the most improvement
with complexity measures, as F-measure
increased by 3.7-4.4%.</p>
      <p>Previous classification tasks such as
authorship attribution and spam filtering
showed better results. The problem, we
discovered, was the high degree of
granularity required by the task. Complexity analysis
does not do well with short text.</p>
      <p>Using various feature selection tools it was
found that complexity features and
normalized count features appeared in equal
numbers among the highest ranked features. For
example, the top 10 features as determined by
a Chi-squared feature evaluator are shown
in Table 2.</p>
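      <p>As one can see in Table 2, 6 out of the
10 top ranked features are complexity
features. This indicates that complexity
features are able to discriminate plagiarized vs.
non-plagiarized passages as well as or better
than normalized count features.</p>
      <table-wrap>
        <label>Table 2</label>
        <table>
          <thead>
            <tr><th>Rank</th><th>Feature</th></tr>
          </thead>
          <tbody>
            <tr><td>1</td><td>Adjective complexity (l)</td></tr>
            <tr><td>2</td><td>Adjective count (gsd)</td></tr>
            <tr><td>3</td><td>Topic word complexity (g)</td></tr>
            <tr><td>4</td><td>Verb word complexity (g)</td></tr>
            <tr><td>5</td><td>Passive word complexity (g)</td></tr>
            <tr><td>6</td><td>Active word complexity (g)</td></tr>
            <tr><td>7</td><td>Preposition count (g)</td></tr>
            <tr><td>8</td><td>Stop word count (gsd)</td></tr>
            <tr><td>9</td><td>Avg. word length per sentence (gsd)</td></tr>
            <tr><td>10</td><td>Topic word complexity (l)</td></tr>
          </tbody>
        </table>
      </table-wrap>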
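      <p>A Chi-squared ranking of features can be illustrated with a 2x2 contingency table between a thresholded feature and the class label. The sketch below is an assumption about how such an evaluator might work in its simplest form; the evaluator and any discretization actually used are not specified here, and the feature names and values are hypothetical.</p>
      <preformat>
```python
def chi_squared(feature_flags, labels):
    """Chi-squared statistic of a 2x2 table: binary feature versus binary class."""
    n = len(labels)
    observed = {(f, c): 0 for f in (False, True) for c in (False, True)}
    for f, c in zip(feature_flags, labels):
        observed[(bool(f), bool(c))] += 1
    statistic = 0.0
    for f in (False, True):
        for c in (False, True):
            row = observed[(f, False)] + observed[(f, True)]
            col = observed[(False, c)] + observed[(True, c)]
            expected = row * col / n
            if expected:
                statistic += (observed[(f, c)] - expected) ** 2 / expected
    return statistic


# Hypothetical data: four chunks, two of them plagiarized.
labels = [True, True, False, False]
features = {
    "feature_a": [True, True, False, False],  # tracks the class exactly
    "feature_b": [True, False, True, False],  # unrelated to the class
}

ranking = sorted(features, key=lambda name: chi_squared(features[name], labels), reverse=True)
print(ranking)  # ['feature_a', 'feature_b']
```
</preformat>
      <p>Features whose values are distributed very differently across plagiarized and non-plagiarized chunks receive a high statistic and therefore rank near the top.</p>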
    </sec>
    <sec id="sec-13">
      <title>Conclusion</title>
      <p>We introduce the use of compression to extract
features based on Kolmogorov complexity
measures. We show why compression of text and
word distributions results in meaningful
features. Results from using complexity analysis in
intrinsic plagiarism detection are promising.
Performance is increased by a small amount
and it seems as though complexity is not
contributing to overfitting. More research
needs to be done in using compression models
which have prior knowledge of the language
to be analyzed and/or the prior probabilities
of word classes. This would result in more
meaningful complexity features which would
likely aid in the difficult task of intrinsic
plagiarism detection.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Bhattacharya</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2000</year>
          .
          <article-title>Complexity analysis of spontaneous EEG</article-title>
          .
          <source>Acta Neurobiologiae Experimentalis</source>
          ,
          <volume>60</volume>
          (
          <issue>4</issue>
          ):
          <fpage>495</fpage>
          -
          <lpage>501</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Chi</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Kong</surname>
          </string-name>
          .
          <year>1998</year>
          .
          <article-title>Image content classification using a block Kolmogorov complexity measure</article-title>
          .
          <source>In Proceedings of the Fourth International Conference on Signal Processing ICSP</source>
          <year>1998</year>
          , Beijing, China.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Dalkilic</surname>
            ,
            <given-names>M. M.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>W. T.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Costello</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Radivojac</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>Compression to Identify Classes of Inauthentic Texts</article-title>
          .
          <source>In Proceedings of the SIAM International Conference on Data Mining SDM</source>
          <year>2006</year>
          , Bethesda, MD.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>De Marcken</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <year>1996</year>
          .
          <article-title>Unsupervised language acquisition</article-title>
          .
          <source>PhD thesis</source>
          , Massachusetts Institute of Technology. http://www.demarcken.org/carl/papers/PhD.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Frank</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chui</surname>
          </string-name>
          , and
          <string-name>
            <given-names>I. H.</given-names>
            <surname>Witten</surname>
          </string-name>
          .
          <year>2000</year>
          .
          <article-title>Text categorization using compression models</article-title>
          .
          <source>In Proceedings of DCC00, IEEE Data Compression Conference</source>
          , pages
          <fpage>200</fpage>
          -
          <lpage>209</lpage>
          , Snowbird, USA. IEEE Computer Society Press.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Vitanyi</surname>
          </string-name>
          .
          <year>1997</year>
          .
          <article-title>An Introduction to Kolmogorov Complexity and its Applications</article-title>
          . Springer Verlag, Berlin, second edition.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Menconi</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Benci</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Buiatti</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Data compression and genomes: a two-dimensional life domain map</article-title>
          .
          <source>Journal of Theoretical Biology</source>
          ,
          <volume>253</volume>
          (
          <issue>2</issue>
          ):
          <fpage>281</fpage>
          -
          <lpage>288</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Seaward</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Inkpen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Nayak</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Using the Complexity of the Distribution of Lexical Elements as a Feature in Authorship Attribution</article-title>
          .
          <source>In Proceedings of the 6th International Conference on Language Resources and Evaluation LREC</source>
          , Marrakech, Morocco.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Seaward</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          and
          <string-name>
            <given-names>L. V.</given-names>
            <surname>Saxton</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Filtering spam using Kolmogorov complexity measures</article-title>
          .
          <source>In Proceedings of the 2007 IEEE International Symposium on Data Mining and Information Retrieval (DMIR-07)</source>
          , Niagara Falls.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Stamatatos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Fakotakis</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Kokkinakis</surname>
          </string-name>
          .
          <year>2000</year>
          .
          <article-title>Automatic Text Categorization in Terms of Genre and Author</article-title>
          .
          <source>Computational Linguistics</source>
          ,
          <volume>26</volume>
          (
          <issue>4</issue>
          ):
          <fpage>461</fpage>
          -
          <lpage>485</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Meyer zu Eissen</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Intrinsic plagiarism analysis with metalearning</article-title>
          .
          <source>In Proceedings of the SIGIR 2007 International Workshop on Plagiarism Analysis, Authorship Identification, and Near-Duplicate Detection, PAN 2007</source>
          , Amsterdam, Netherlands.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>Jin</surname>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>Complexity feature extraction of radar emitter signals</article-title>
          .
          <source>In Environmental Electromagnetics</source>
          ,
          <year>2003</year>
          .
          <source>Proceedings of the Asia-Pacific Conference on Environmental Electromagnetics CEEM</source>
          <year>2003</year>
          , pages
          <fpage>495</fpage>
          -
          <lpage>498</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>