<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>External and Intrinsic Plagiarism Detection Using Vector Space Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mario Zechner, Markus Muhr, Roman Kern</string-name>
          <email>{mzechner, mmuhr, rkern}@know-center.at</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael Granitzer</string-name>
          <email>mgranitzer@tugraz.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Know-Center Graz, Graz University of Technology</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Know-Center</institution>
          ,
          <addr-line>8010 Graz</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2009</year>
      </pub-date>
      <fpage>47</fpage>
      <lpage>55</lpage>
      <abstract>
        <p>Plagiarism detection can be divided into external and intrinsic methods. Naive external plagiarism analysis suffers from computationally demanding full nearest neighbor searches within a reference corpus. We present a conceptually simple space partitioning approach that achieves search times sublinear in the number of reference documents, trading precision for speed. We focus on full duplicate searches while achieving acceptable results in the near duplicate case. Intrinsic plagiarism analysis tries to find plagiarized passages within a document without any external knowledge. We use several topic independent stylometric features from which a vector space model for each sentence of a suspicious document is constructed. Plagiarized passages are detected by an outlier analysis relative to the document mean vector. Our system was created for the first PAN competition on plagiarism detection in 2009. The evaluation was performed on the challenge's development and competition corpora, for which we report our results.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Plagiarism, defined as the theft of
intellectual property
        <xref ref-type="bibr" rid="ref11 ref9">(Maurer, Kappe, and Zaka,
2006)</xref>
        , has been a problem for centuries not
only in academic circles. There exist
different forms of plagiarism, ranging from simply
copying and pasting original passages to more
elaborate paraphrased and translated
plagiarism. Anecdotal evidence and studies such
as
        <xref ref-type="bibr" rid="ref13">(Sheard et al., 2002)</xref>
        strengthen the
suspicion that plagiarism is on the rise, facilitated
by new media such as the World Wide Web.
Growing information sources ease plagiarism
while plagiarism prevention and detection
become harder.
      </p>
      <p>To combat these problems the first
competition on plagiarism detection was held at
the third PAN workshop in 2009 in which we
participated. The competition was split into
two tasks: external plagiarism detection and
intrinsic plagiarism detection. External
plagiarism detection deals with the problem of
finding plagiarized passages in a suspicious
document based on a reference corpus.
Intrinsic plagiarism detection does not use
external knowledge and tries to identify
discrepancies in style within a suspicious
document. For both tasks extensive, machine
generated training corpora were provided upon
which we developed and evaluated our
solutions. Our contribution is as follows:
• We present a robust external plagiarism
detection method based on indexing by
balanced clustering in a high
dimensional term vector space with a focus on
full and very near duplicate detection.
• We present an intrinsic plagiarism
detection method based on a combined set
of stylometric features spanning a
vector space, using outlier analysis to
determine plagiarized passages without a
reference corpus or any other external
knowledge source.
• We report our results for both problems
on the PAN 09 development and
competition corpora.</p>
      <p>This paper is outlined as follows: in
section 2 we present related work. Section 3
describes our system for the external plagiarism
task, section 4 gives insight on our intrinsic
plagiarism detection system. In section 5 we
describe the PAN 09 dataset and report our
results for the external and intrinsic
plagiarism tasks. We conclude and give an outlook
on future work in section 6.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>External plagiarism detection relies on a
reference corpus composed of documents from
which passages might have been plagiarized.
A passage could be made up of paragraphs,
a fixed size block of words, a block of
sentences and so on. A suspicious document is
checked for plagiarism by searching for
passages that are duplicates or near duplicates
of passages in documents within the reference
corpus. An external plagiarism system then
reports these findings to a human controller
who decides whether the detected passages
are plagiarized or not.</p>
      <p>A naive solution to this problem is to
compare each passage in a suspicious document
to every passage of each document in the
reference corpus. This is obviously prohibitive.
The reference corpus has to be large in order
to find as many plagiarized passages as
possible. This fact directly translates to very high
runtimes when using the naive approach.</p>
      <p>
        External plagiarism detection is similar
to textual information retrieval (IR)
        <xref ref-type="bibr" rid="ref1 ref6">(Baeza-Yates and Ribeiro-Neto, 1999)</xref>
        . Given a set
of query terms an IR system returns a ranked
set of documents from a corpus that best
matches the query terms. The most common
structure for answering such queries is an
inverted index. An external plagiarism
detection system using an inverted index indexes
passages of the reference corpus’ documents.
For each passage in a suspicious document a
query is sent to the system and the returned
ranked list of reference passages is analyzed.
Such a system was presented in
        <xref ref-type="bibr" rid="ref8">(Hoad and
Zobel, 2003)</xref>
        for finding duplicate or near
duplicate documents.
      </p>
      <p>
        Another method for finding duplicates
and near duplicates is based on hashing or
fingerprinting. Such methods produce one
or more fingerprints that describe the
content of a document or passage. A
suspicious document’s passages are compared to
the reference corpus based on their hashes
or fingerprints. Duplicate and near duplicate
passages are assumed to have similar
fingerprints. One of the first systems for plagiarism
detection using this schema was presented in
        <xref ref-type="bibr" rid="ref3">(Brin, Davis, and Garcia-Molina, 1995)</xref>
        .
      </p>
      <p>
        External plagiarism detection can also be
viewed as a nearest neighbor problem in a
vector space R^d. Passages are represented as
vectors within this vector space. If
passages from the reference corpus are ”near
enough” to a passage of the suspicious
document in this vector space, the passage is
marked as potentially plagiarized. The
dimensions of the vector space are usually
defined by features extracted from the
passages such as terms (usually called the ”bag
of words” vector space model). This often
yields high dimensional vector spaces.
Neighbors are defined either by a distance
metric D : R^d × R^d → R or by a similarity
function S : R^d × R^d → R. Smaller values
for D signal nearness while high values for S
signal similarity. Nearest neighbor searches
in a vector space with a distance metric can
be made sublinear by the use of space
partitioning techniques that rely on the triangle
inequality guaranteed by a metric. Famous partitioning
schemes are the Kd-Tree
        <xref ref-type="bibr" rid="ref2">(Bentley, 1975)</xref>
        or
the metric tree
        <xref ref-type="bibr" rid="ref5">(Ciaccia, Patella, and Zezula,
1997)</xref>
        . However, these techniques degrade to
linear searches for high dimensional vector
spaces due to the curse of dimensionality:
average inter point distances become more
similar. This can be somewhat overcome by
assuming that the data indeed lies on a lower
dimensional manifold which can be captured
by dimensionality reduction. In the reduced
space partitioning schemas might be
applicable again while the original neighborhoods
are preserved by the reduction. Common
dimensionality reduction approaches are
Principal Component Analysis
        <xref ref-type="bibr" rid="ref10">(M.E., 2003)</xref>
        for
linear reduction or Isomap
        <xref ref-type="bibr" rid="ref14">(Tenenbaum, de
Silva, and Langford, 2000)</xref>
        for non linear
reduction. These methods are generally very
costly so other methods for nearest neighbor
searches in high dimensional vector spaces
have been devised. Locality sensitive hashing
(LSH)
        <xref ref-type="bibr" rid="ref1 ref6">(Gionis, Indyk, and Motwani, 1999)</xref>
        received a lot of attention in recent years due
to its simplicity and theoretical guarantees.
LSH is the equivalent of the aforementioned
fingerprinting schemes applied to high
dimensional vector spaces. The nearest neighbors
returned by LSH are only approximate in
nature. Another approximate nearest neighbor
schema was introduced in
        <xref ref-type="bibr" rid="ref4">(Chierichetti et al.,
2007)</xref>
        that uses hierarchical cluster trees for
space partitioning.
      </p>
      <p>
        Intrinsic plagiarism detection only
recently received attention from the
scientific community. It was first introduced in
        <xref ref-type="bibr" rid="ref11 ref9">(Meyer zu Eissen and Stein, 2006)</xref>
        and
defined as detecting plagiarized passages in a
suspicious document without a reference
collection or any other external knowledge. A
suspicious document is first decomposed into
passages. For each passage a feature vector is
constructed. Features are derived from
stylometric measures like the average sentence
length or the average word length known
from the field of authorship analysis. These
features have to be topic independent so as
to capture the style of an author and not the
domain she writes about. Next a difference
vector is constructed for each passage that
captures the passage’s deviation from the
document mean vector. Meyer zu Eissen and
Stein (2006) assume that a ground truth is
given, marking passages actually from the
author of the suspicious document. A model is
then trained based on one-class classification,
using the ground truth as the training set.
The model is then used to determine which
passages are plagiarized. However, it is not
clear how the ground truth is derived from
a suspicious document when no information
about the document is known beforehand.
      </p>
      <p>3 External Plagiarism Detection</p>
      <p>We treat external plagiarism detection as a
nearest neighbor search in a high dimensional
term vector space. This is motivated by the
extensive literature that exists for the nearest
neighbor search problem as well as its
conceptual simplicity. Our system consists of three
stages:
• Vectorization of the passages of each
document in the reference corpus and
partitioning of the reference corpus
vector space.
• Vectorization of the passages of a
suspicious document and finding each
passage’s nearest neighbor(s) in the
reference corpus vector space. Detection of
plagiarism for each suspicious document
is based on its nearest neighbor list via
similarity thresholding.
• Post processing of the detected
plagiarized passages, merging subsequent
plagiarized passages to a single block.
</p>
      <sec id="sec-2-2">
        <title>3.1 Reference Corpus Vectorization &amp; Partitioning</title>
        <p>
          We adopt the vector space model for textual
data as given in
          <xref ref-type="bibr" rid="ref12">(Salton, Wong, and Yang,
1975)</xref>
          . Each unique term in the reference
corpus is represented as a dimension in the
vector space R^d, where d is the number of
unique terms. Instead of creating a single
vector for a complete document we create
vectors for each sentence in a document as we
want to detect plagiarism on a per sentence
level.
        </p>
        <p>We use the OpenNLP framework
(http://opennlp.sourceforge.net/) for
tokenization and sentence splitting. For each
sentence in every reference corpus document
a term frequency vector based on the
sentence’s lower cased tokens is constructed,
excluding stop words based on a stop word list.
We did not apply any stemming or
lemmatization. The resulting vectors are then
normalized to unit length to overcome differences
in sentence length. We use the standard
cosine similarity to assess the similarity
between sentences, given by
cosine_similarity(x, y) = ⟨x, y⟩ / (‖x‖ ‖y‖),
where x, y ∈ R^d. The denominator of the
above equation can be dropped as all vectors
are scaled to unit length.</p>
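        <p>As a minimal sketch of this vectorization and similarity computation (plain Python with a tiny hypothetical stop word list, not the authors' OpenNLP-based pipeline; helper names are ours):</p>
        <p>
```python
import math
from collections import Counter

STOP_WORDS = {"the", "on", "a", "of", "in"}  # tiny illustrative stop list

def vectorize(sentence):
    """Lower-cased term-frequency vector without stop words, unit length."""
    tokens = [t for t in sentence.lower().split() if t not in STOP_WORDS]
    tf = Counter(tokens)
    norm = math.sqrt(sum(v * v for v in tf.values()))
    return {t: v / norm for t, v in tf.items()}

def cosine(x, y):
    """For unit-length vectors the cosine reduces to a dot product."""
    return sum(w * y.get(t, 0.0) for t, w in x.items())

a = vectorize("The cat sat on the mat")
print(round(cosine(a, a), 6))  # 1.0: identical sentences are full duplicates
```
        </p>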
        <p>
To achieve sublinear nearest neighbor
searches we implement a variation of
cluster pruning
          <xref ref-type="bibr" rid="ref4">(Chierichetti et al., 2007)</xref>
          . The
set of reference corpus sentence vectors is
first clustered into l partitions using a
balanced online spherical k-means
implementation
          <xref ref-type="bibr" rid="ref15">(Zhong, 2005)</xref>
          . Balancing is crucial as it
provides more equal runtimes when
searching for nearest neighbors. The balancing is
achieved by introducing a penalty to clusters
that have more samples than others during
the clustering process. Each sentence
vector is associated with a single partition, each
partition has a representative centroid vector.
Our approach deviates from cluster pruning
as presented in
          <xref ref-type="bibr" rid="ref4">(Chierichetti et al., 2007)</xref>
          by
associating each sentence vector only with
the nearest cluster. Additionally we store
a sorted list of similarities for each cluster,
holding the similarities between the centroid
of the cluster and the sentence vectors
associated with that cluster. The resulting
structure serves as an index. It allows searching
the approximate nearest sentences for a given
query sentence.
        </p>
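        <p>A simplified sketch of building such an index (plain Python; a crude size-penalty heuristic stands in for the balanced online spherical k-means of Zhong (2005), and the function names are ours):</p>
        <p>
```python
import math, random

def unit(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def build_index(vectors, l, iters=10, penalty=0.01, seed=0):
    """Cluster unit vectors into l partitions, penalizing crowded clusters
    to keep sizes roughly equal, then store each cluster's members sorted
    by similarity to its centroid (the similarity list of the paper)."""
    rng = random.Random(seed)
    centroids = [vectors[i][:] for i in rng.sample(range(len(vectors)), l)]
    assign = [0] * len(vectors)
    for _ in range(iters):
        counts = [0] * l
        for i, v in enumerate(vectors):
            # size penalty encourages balanced partitions
            best = max(range(l), key=lambda c: dot(v, centroids[c]) - penalty * counts[c])
            assign[i] = best
            counts[best] += 1
        for c in range(l):
            members = [vectors[i] for i in range(len(vectors)) if assign[i] == c]
            if members:
                centroids[c] = unit([sum(col) for col in zip(*members)])
    index = {c: sorted(((dot(vectors[i], centroids[c]), i)
                        for i in range(len(vectors)) if assign[i] == c), reverse=True)
             for c in range(l)}
    return centroids, index
```
        </p>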
      </sec>
      <sec id="sec-2-5">
        <title>3.2 Suspicious Document Vectorization &amp; Plagiarism Detection</title>
        <p>We vectorize a suspicious document in the
same way we vectorize reference corpus
documents. For each sentence vector of a
suspicious document we search the reference
corpus for the k most similar sentences as
follows:
• Determine the nearest cluster to the
query sentence based on the cosine
similarity between the centroids and the
query sentence.
• Find the position in the sorted similarity
list the query sentence would be inserted
at based on its similarity to the cluster
centroid. Gather k/2 sentence vectors to
the left and right of that position in the
list.
• Perform the same search in the
second nearest cluster, returning k/2 potential
nearest neighbors, and merge the two sets
of candidate reference corpus sentences.</p>
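        <p>The search steps above can be sketched as follows (a minimal Python illustration under the assumption that the index maps each cluster to a descending list of (similarity-to-centroid, sentence id) pairs, as described for the previous stage; helper names are ours):</p>
        <p>
```python
import bisect

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def query(q, centroids, index, k):
    """Gather candidates from the two nearest clusters: bisect into the
    sorted similarity list at the query's own centroid similarity and
    take up to k/2 entries on each side."""
    ranked = sorted(range(len(centroids)),
                    key=lambda c: dot(q, centroids[c]), reverse=True)
    candidates = []
    for c in ranked[:2]:
        sims = [-s for s, _ in index[c]]               # ascending for bisect
        pos = bisect.bisect_left(sims, -dot(q, centroids[c]))
        lo = max(0, pos - k // 2)
        candidates.extend(sid for _, sid in index[c][lo:pos + k // 2])
    return candidates
```
        </p>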
        <p>We justify this procedure as follows: The
cluster a vector belongs to is likely to also
contain the vector’s nearest neighbors. To
increase the probability of catching most true
nearest neighbors we also use the second
nearest cluster. An original sentence in the
reference corpus and a full duplicate in a
suspicious document will belong to the same
cluster as they have the same vector
representation. Consequently both vectors have the
same similarity with the centroid of that
cluster. The search in a cluster’s similarity list
should thus return the duplicated sentence.
This can fail if there are more than k
vectors in the cluster having the same similarity
with the centroid as the suspicious sentence
vector. The outcome is dependent on the
quality of the sentence splitter as this type of
search for full duplicates relies on correct
sentence boundaries. The schema will also work
in case a sentence was plagiarized with small
modifications, albeit with a much lower
probability of finding the correct nearest
neighbor. We thus expect our system to work well
for detecting full duplicates, acceptable in the
case of slightly modified sentences and poorly
for highly obfuscated sentences. Figure 1
illustrates the function of the similarity list of
a cluster.</p>
        <p>With the presented schema we can
reduce the search for the nearest neighbors of a
sentence from being linear in the number of
sentences in the reference corpus to roughly
O(l + k + k/2) cosine similarity evaluations
per sentence, where l is the number of
centroids or clusters, k is the number of vectors
taken from the nearest cluster’s similarity list
and k/2 is the number of vectors taken from the
second nearest cluster’s similarity list.</p>
        <p>A sentence of a suspicious document is
marked as plagiarized if its cosine
similarity to the most similar candidate sentence
from the reference corpus exceeds a
threshold α. This parameter allows controlling the
sensitivity of the system. The outcome of
this stage is a set of sentences from the
suspicious document that are plagiarized with
high probability. The information about the
plagiarized sentences’ position in the original
documents is also retained.</p>
      </sec>
      <sec id="sec-2-6">
        <title>3.3 Post Processing</title>
        <p>The final stage assembles the sentences
marked as plagiarized in the second stage to
continuous blocks. This is accomplished by
simply checking whether sentences marked
as plagiarized are in sequence in the
suspicious document. If this is the case they are
merged. This is repeated until no more
merging is possible. To further increase the
recall, we compare a plagiarized sentence’s left
and right neighbor sentences to the neighbors
of the original sentence. These might have
been missed in the nearest neighbor search
and have now a chance to be detected. The
neighborhood sentences are again compared
via the cosine similarity and marked as
plagiarized if the similarity to an original
sentence is above some threshold β.
</p>
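        <p>The merging step can be sketched as follows (a toy Python illustration over per-sentence flags; the actual system also tracks character offsets and original documents):</p>
        <p>
```python
def merge_blocks(flags):
    """Merge runs of consecutive sentences marked as plagiarized (True)
    into (start, end) blocks, end exclusive."""
    blocks, start = [], None
    for i, f in enumerate(flags):
        if f and start is None:
            start = i
        if not f and start is not None:
            blocks.append((start, i))
            start = None
    if start is not None:
        blocks.append((start, len(flags)))
    return blocks

print(merge_blocks([False, True, True, False, True]))  # [(1, 3), (4, 5)]
```
        </p>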
        <p>4 Intrinsic Plagiarism Detection</p>
        <p>
          Our intrinsic plagiarism detection system is
based on the ideas presented in
          <xref ref-type="bibr" rid="ref11 ref9">(Meyer zu
Eissen and Stein, 2006)</xref>
          and
          <xref ref-type="bibr" rid="ref7">(Grieve, 2007)</xref>
          .
Meyer zu Eissen and Stein were the first to
define the problem of intrinsic plagiarism
detection: determine whether passages in a
suspicious document are plagiarized based only
on changes in style within the document. An
author’s style is also of importance in the field
of authorship classification. Both problems
rely on so called stylometric features. These
features should be topic and genre
independent and reflect an author’s style of writing.
Changes of style within a document can be
detected by various methods. We choose a
simple outlier detection scheme based on a
vector space spanned by various stylometric
features. The system is composed of 3 stages:
• Vectorization of each sentence in the
suspicious document.
• Determination of outlier sentences based
on the document’s mean vector.
• Post processing of the detected outlier
sentences.</p>
      </sec>
      <sec id="sec-2-8">
        <title>4.1 Suspicious Document Vectorization</title>
        <p>
          Plagiarism is determined on a per sentence
level in our intrinsic plagiarism detection
system. However, we chose to use a window of
k sentences around a given suspicious
sentence. The window is composed of k/2
sentences to the left and right of the sentence.
We adopt various stylometric features as
presented in
          <xref ref-type="bibr" rid="ref11 ref9">(Meyer zu Eissen and Stein, 2006)</xref>
          and
          <xref ref-type="bibr" rid="ref7">(Grieve, 2007)</xref>
          that together form a
single vector space:
• Average word frequency class: Each
token w in a sentence window is
assigned to an average word frequency
class. The class is calculated as
log(freq_w* / freq_w), where freq_w* is the
absolute number of occurrences of the
most frequent word in a huge corpus and
freq_w is the number of occurrences of the
token in that same corpus. We derived
our frequency table from tokenizing the
English Wikipedia, resulting in
approximately 6 million unique terms. Each
average word frequency class is represented
by a single dimension in the final
vector space. The values in these dimensions
specify the number of tokens belonging
to that class.
• Punctuation: For each sentence
window the number of occurrences of a
certain punctuation character is measured
(we used the characters . , ? ! : ; ( ) in
our experiments). Each punctuation
character is represented by a single
dimension in the final vector space. The values
in these dimensions reflect the frequencies of
punctuation characters in the sentence window.
• Part of speech tags: Each token w in
a sentence window is assigned a part of
speech tag from the Penn Treebank part
of speech tag set. Each part of speech
tag is represented by a dimension in the
final vector space. The values in these
dimensions reflect the frequencies of part
of speech tags in the sentence window.
• Pronouns: For each sentence window
the number of occurrences of a certain
pronoun is measured (we used the personal,
possessive and reflexive pronouns: i, you,
he, she, it, we, they, me, him, her, us,
them, myself, yourself, himself, herself,
itself, ourselves, yourselves, themselves,
mine, yours, his, hers, its, ours, theirs,
my, your, our, their). Each pronoun is
represented as a single dimension in the
final vector space. The values reflect the
frequencies with which each pronoun
occurred in a sentence window.
• Closed class words: For each sentence
window the number of occurrences of a
certain stop word is measured. Each
stop word is represented by a single
dimension in the final vector space. The
values reflect the frequencies of stop
words in a sentence window. We used
the stop word list from the Snowball
stemming framework
(http://snowball.tartarus.org/).
        </p>
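        <p>The average word frequency class can be illustrated as follows (toy corpus counts; we assume a base-2 logarithm and a default count of 1 for unseen tokens, neither of which the text states):</p>
        <p>
```python
import math

# toy corpus counts; assumption: "the" is the most frequent word
FREQ = {"the": 1000000, "cat": 5000, "sesquipedalian": 4}

def word_frequency_class(token, freq=FREQ):
    """Class of a token: log of the ratio between the most frequent word's
    count and the token's count; rare words get high classes."""
    top = max(freq.values())
    return int(math.log(top / freq.get(token, 1), 2))

print(word_frequency_class("the"))  # 0
print(word_frequency_class("cat"))  # 7
```
        </p>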
        <p>For each sentence we construct a vector
for each stylometric feature space based on
the sentence’s window. Each vector is
normalized to unit length. The vectors of a
sentence are then combined to a single vector by
concatenation. The resulting vector is again
normalized. Based on the final vectors of all
sentences we construct a document mean
vector.</p>
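        <p>The concatenation and renormalization described above can be sketched as (a minimal Python illustration; helper names are ours):</p>
        <p>
```python
import math

def unit(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)

def combine(feature_vectors):
    """Normalize each per-feature vector, concatenate them, and
    normalize the combined vector again."""
    combined = []
    for fv in feature_vectors:
        combined.extend(unit(fv))
    return unit(combined)

v = combine([[3.0, 4.0], [1.0, 0.0, 0.0]])
print(round(sum(x * x for x in v), 6))  # 1.0: the combined vector is unit length
```
        </p>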
      </sec>
      <sec id="sec-2-9">
        <title>4.2 Outlier Detection</title>
        <p>The outlier detection tries to determine
which sentences deviate from the document’s
mean. We use a simple detection scheme: we
measure the cosine similarity from the mean
vector to each sentence vector. We smooth
the list of similarities by an average window
smoothing procedure, where the size of the
window l is a parameter. We determine the
mean cosine similarity as well as the standard
deviation from this list of similarities as:
mean = (1/n) ∑_{i=1}^{n} cos(v_i, m)
stddev = sqrt((1/n) ∑_{i=1}^{n} (cos(v_i, m) − mean)^2)
where n is the number of sentences, v_i is
the ith sentence vector in the document, m
is the document mean vector and cos is the
cosine similarity. The jth sentence is marked
as an outlier if the following inequality holds:
cos(v_j, m) &lt; mean − ε ∗ stddev
where ε is some small constant ≥ 1, and
v_j is the sentence’s vector. Marked sentences
form the input to the last stage of the
system. Figure 2 presents the similarity curve
obtained for suspicious document 5 in the
development corpus of the PAN 09 challenge
after smoothing.</p>
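        <p>The smoothing and outlier criterion can be sketched as follows (a minimal Python illustration with ε written as eps; comparisons are written flipped, and the function names are ours):</p>
        <p>
```python
import math

def smooth(sims, l):
    """Average-window smoothing of the similarity list; l is the window size."""
    out = []
    for i in range(len(sims)):
        lo = max(0, i - l // 2)
        hi = min(len(sims), i + l // 2 + 1)
        out.append(sum(sims[lo:hi]) / (hi - lo))
    return out

def mark_outliers(sims, eps=1.0):
    """Mark sentence j as an outlier when its similarity to the document
    mean falls below mean - eps * stddev."""
    n = len(sims)
    mean = sum(sims) / n
    stddev = math.sqrt(sum((s - mean) ** 2 for s in sims) / n)
    return [mean - eps * stddev > s for s in sims]

flags = mark_outliers(smooth([0.9, 0.9, 0.9, 0.2, 0.9, 0.9], 1))
print(flags)  # only the dip at position 3 is marked
```
        </p>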
      </sec>
      <sec id="sec-2-10">
        <title>4.3 Post Processing</title>
        <p>Based on the sentences that deviate too much
from the mean we derive the final blocks of
plagiarized passages. As in the case of
external plagiarism detection we simply merge all
marked sentences that are neighbors until no
further merging is possible.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Evaluation</title>
      <p>We evaluated our system on the development
corpora of the PAN 09 plagiarism detection
competition. We give a description of the
dataset as well as precision, recall,
granularity and f1-measure results achieved by our
system for various parameter settings. The
measures are defined as follows:
precision = (1/|S|) ∑_{i=1}^{|S|} (#detected chars of s_i) / |s_i|
recall = (1/|R|) ∑_{i=1}^{|R|} (#plagiarized chars of r_i) / |r_i|
granularity = (1/|S_R|) ∑_{i=1}^{|S_R|} log2(1 + #detections of s_i in R)
f1 = (2 ∗ precision ∗ recall) / (precision + recall)
where S is the set of detected passages and
s_i is a single detected passage, R is the set
of real plagiarized passages and r_i is a single
plagiarized passage, and S_R is the set of real
plagiarized passages for which at least one
detection exists. |S|, |R| and |S_R| denote
the number of passages in a set; |s_i| and |r_i|
denote the number of characters in a passage.</p>
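      <p>The measures can be illustrated on character spans as follows (a sketch under the simplifying assumption that passages within each set do not overlap; helper names are ours):</p>
      <p>
```python
import math

def evaluate(detected, real):
    """Character-level precision, recall, granularity and f1 over
    (start, end) passage spans, end exclusive."""
    def overlap(a, b):
        return max(0, min(a[1], b[1]) - max(a[0], b[0]))
    precision = sum(sum(overlap(s, r) for r in real) / (s[1] - s[0])
                    for s in detected) / len(detected)
    recall = sum(sum(overlap(s, r) for s in detected) / (r[1] - r[0])
                 for r in real) / len(real)
    found = [r for r in real if any(overlap(s, r) > 0 for s in detected)]
    granularity = sum(math.log2(1 + sum(1 for s in detected if overlap(s, r) > 0))
                      for r in found) / len(found)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, granularity, f1

print(evaluate([(0, 10)], [(0, 10)]))  # perfect detection: (1.0, 1.0, 1.0, 1.0)
```
      </p>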
      <sec id="sec-3-1">
        <title>5.1 PAN 09 Development Corpora</title>
        <p>For each of the two tasks of the PAN 09
competition a development corpus was available.</p>
        <p>The external plagiarism detection dataset
consisted of 7214 suspicious documents and
as many reference documents. Artificial
plagiarized passages were added to the
suspicious documents via an artificial plagiarist.
For each document the plagiarist decided
whether or not to plagiarize, from which
reference documents to plagiarize, how many
passages to plagiarize, which type of
plagiarism to use for each passage and how long
each passage would be. The type of
plagiarism could either be obfuscated or
translated plagiarism. Obfuscation was achieved
by shuffling and deleting words, inserting
words from an external source and replacing
words with synonyms, antonyms, hypernyms
or hyponyms. Obfuscation levels ranged from
none to low to high.</p>
        <p>For the intrinsic plagiarism detection task
a corpus of 3091 suspicious documents was
available. Plagiarized passages were
generated similarly to the external task.</p>
      </sec>
      <sec id="sec-3-3">
        <title>5.2 External Plagiarism Detection Results</title>
        <p>
          We performed a parameter study, evaluating
precision, recall and f1-measure, for 500
randomly drawn suspicious documents from the
external plagiarism development corpus. We
studied the effect of the following parameters:
• l, the number of clusters for the index.
• k, the number of candidates taken from
the similarity lists.
• α, the threshold that the similarity
between a suspicious and a reference
sentence must exceed for the suspicious
sentence to be marked as plagiarized.
• β, the threshold that the similarity
between the neighbors of a marked
sentence and the neighbors of the
original sentence must exceed in order to be
marked as plagiarized.
• k, the size of the sentence window.
• l, the size of the smoothing window by
which the similarity list is smoothed.
• ε, the constant by which the standard
deviation of the similarity list is
multiplied.
(A table in the original lists the evaluated
parameter combinations l–k: 50–2, 50–20,
50–200, 100–2, 200–2, 500–2000, and the
competition setting.)
We can also reproduce the results by
          <xref ref-type="bibr" rid="ref7">Grieve (2007)</xref>
          for the punctuation feature
which performed acceptably as well.
        </p>
        <p>Based on the results for the separate
feature spaces we took the three best
performing spaces, part of speech tags, pronouns
and punctuation as the basis for the
combined vector space. This combined vector
space was then again evaluated on the
complete development corpus with varying
parameters, showing significant improvements
in all measures compared to the separate
feature spaces. We used the best parameter
settings found in this evaluation for the
competition corpus.
6 Conclusion and Future Work
We presented our methods and results for the
intrinsic and external plagiarism task of the
PAN 09 competition. Our systems performed
acceptably, taking 5th out of 13 places in
the external task, 3rd out of 4 places in
the intrinsic task, and 5th place overall
out of 13 in the competition.</p>
        <p>We plan on extending our external
plagiarism detection system by incorporating term
expansion via synonyms, hyponyms and
hypernyms in order to cope with highly
obfuscated plagiarism. We also plan to use a more
sophisticated cluster pruning scheme that is
hierarchical in nature, using Hebbian learning
to construct a topology of the search space
to further increase the probability that the
true nearest neighbor of a vector can be
determined.</p>
        <p>For future work for the intrinsic plagiarism
problem, we aim at a better outlier
detection method. We will also try to analyze and
incorporate more stylometric features,
combining them with the best performing
features found in this competition. Dynamically
adapting the parameters k and l for each
document as well as for each feature space is also
planned.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Baeza-Yates</surname>
          </string-name>
          ,
          <source>Ricardo and Berthier Ribeiro-Neto</source>
          .
          <year>1999</year>
          .
          <article-title>Modern Information Retrieval</article-title>
          . Addison Wesley, May.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Bentley</surname>
            ,
            <given-names>Jon</given-names>
          </string-name>
          <string-name>
            <surname>Louis</surname>
          </string-name>
          .
          <year>1975</year>
          .
          <article-title>Multidimensional binary search trees used for associative searching</article-title>
          .
          <source>Commun. ACM</source>
          ,
          <volume>18</volume>
          (
          <issue>9</issue>
          ):
          <fpage>509</fpage>
          -
          <lpage>517</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Brin</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Davis</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Garcia-Molina</surname>
          </string-name>
          .
          <year>1995</year>
          .
          <article-title>Copy detection mechanisms for digital documents</article-title>
          .
          <source>In ACM International Conference on Management of Data (SIGMOD</source>
          <year>1995</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Chierichetti</surname>
            ,
            <given-names>Flavio</given-names>
          </string-name>
          , Alessandro Panconesi, Prabhakar Raghavan, Mauro Sozio, Alessandro Tiberi, and
          <string-name>
            <given-names>Eli</given-names>
            <surname>Upfal</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Finding near neighbors through cluster pruning</article-title>
          .
          In
          <source>PODS '07: Proceedings of the twenty-sixth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems</source>
          , pages
          <fpage>103</fpage>
          -
          <lpage>112</lpage>
          , New York, NY, USA. ACM.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Ciaccia</surname>
            ,
            <given-names>Paolo</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Marco</given-names>
            <surname>Patella</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Pavel</given-names>
            <surname>Zezula</surname>
          </string-name>
          .
          <year>1997</year>
          .
          <article-title>M-tree: An efficient access method for similarity search in metric spaces</article-title>
          . In
          <string-name>
            <given-names>Matthias</given-names>
            <surname>Jarke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Michael J.</given-names>
            <surname>Carey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Klaus R.</given-names>
            <surname>Dittrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Frederick H.</given-names>
            <surname>Lochovsky</surname>
          </string-name>
          , Pericles Loucopoulos, and Manfred A. Jeusfeld, editors,
          <source>VLDB</source>
          , pages
          <fpage>426</fpage>
          -
          <lpage>435</lpage>
          . Morgan Kaufmann.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Gionis</surname>
            ,
            <given-names>Aristides</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Piotr</given-names>
            <surname>Indyk</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Rajeev</given-names>
            <surname>Motwani</surname>
          </string-name>
          .
          <year>1999</year>
          .
          <article-title>Similarity search in high dimensions via hashing</article-title>
          .
          In
          <source>VLDB '99: Proceedings of the 25th International Conference on Very Large Data Bases</source>
          , pages
          <fpage>518</fpage>
          -
          <lpage>529</lpage>
          , San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Grieve</surname>
            ,
            <given-names>Jack</given-names>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Quantitative authorship attribution: An evaluation of techniques</article-title>
          .
          <source>Lit Linguist Computing</source>
          ,
          <volume>22</volume>
          (
          <issue>3</issue>
          ):
          <fpage>251</fpage>
          -
          <lpage>270</lpage>
          , September.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Hoad</surname>
            ,
            <given-names>Timothy C.</given-names>
          </string-name>
          and
          <string-name>
            <given-names>Justin</given-names>
            <surname>Zobel</surname>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>Methods for identifying versioned and plagiarized documents</article-title>
          .
          <source>J. Am. Soc. Inf. Sci. Technol.</source>
          ,
          <volume>54</volume>
          (
          <issue>3</issue>
          ):
          <fpage>203</fpage>
          -
          <lpage>215</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Maurer</surname>
            ,
            <given-names>Hermann</given-names>
          </string-name>
          , Frank Kappe, and
          <string-name>
            <given-names>Bilal</given-names>
            <surname>Zaka</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>Plagiarism - a survey</article-title>
          .
          <source>Journal of Universal Computer Science</source>
          ,
          <volume>12</volume>
          (
          <issue>8</issue>
          ):
          <fpage>1050</fpage>
          -
          <lpage>1084</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Timmerman</surname>
            ,
            <given-names>M. E.</given-names>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>Principal component analysis (2nd ed.), I. T. Jolliffe</article-title>
          .
          <source>Journal of the American Statistical Association</source>
          ,
          <volume>98</volume>
          :
          <fpage>1082</fpage>
          -
          <lpage>1083</lpage>
          , January.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Meyer zu Eissen</surname>
            ,
            <given-names>Sven</given-names>
          </string-name>
          and
          <string-name>
            <given-names>Benno</given-names>
            <surname>Stein</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>Intrinsic plagiarism detection</article-title>
          . In Mounia Lalmas,
          <string-name>
            <given-names>Andy</given-names>
            <surname>MacFarlane</surname>
          </string-name>
          , Stefan M. Rüger, Anastasios Tombros, Theodora Tsikrika, and Alexei Yavlinsky, editors,
          <source>ECIR</source>
          , volume
          <volume>3936</volume>
          of Lecture Notes in Computer Science, pages
          <fpage>565</fpage>
          -
          <lpage>569</lpage>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>Salton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Wong</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Yang</surname>
          </string-name>
          .
          <year>1975</year>
          .
          <article-title>A vector space model for automatic indexing</article-title>
          .
          <source>Commun. ACM</source>
          ,
          <volume>18</volume>
          (
          <issue>11</issue>
          ):
          <fpage>613</fpage>
          -
          <lpage>620</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <surname>Sheard</surname>
            ,
            <given-names>Judy</given-names>
          </string-name>
          , Martin Dick, Selby Markham, Ian Macdonald, and
          <string-name>
            <given-names>Meaghan</given-names>
            <surname>Walsh</surname>
          </string-name>
          .
          <year>2002</year>
          .
          <article-title>Cheating and plagiarism: perceptions and practices of first year IT students</article-title>
          .
          <source>SIGCSE Bull.</source>
          ,
          <volume>34</volume>
          (
          <issue>3</issue>
          ):
          <fpage>183</fpage>
          -
          <lpage>187</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>Tenenbaum</surname>
            ,
            <given-names>J. B.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>de Silva</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Langford</surname>
          </string-name>
          .
          <year>2000</year>
          .
          <article-title>A global geometric framework for nonlinear dimensionality reduction</article-title>
          .
          <source>Science</source>
          ,
          <volume>290</volume>
          (
          <issue>5500</issue>
          ):
          <fpage>2319</fpage>
          -
          <lpage>2323</lpage>
          , December.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <surname>Zhong</surname>
            ,
            <given-names>Shi</given-names>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Efficient online spherical k-means clustering</article-title>
          . In
          <source>Proceedings of the 2005 IEEE International Joint Conference on Neural Networks (IJCNN '05)</source>
          , volume
          <volume>5</volume>
          , pages
          <fpage>3180</fpage>
          -
          <lpage>3185</lpage>
          , July-4 Aug.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>