=Paper= {{Paper |id=Vol-502/paper-9 |storemode=property |title=External and Intrinsic Plagiarism Detection using Vector Space Models |pdfUrl=https://ceur-ws.org/Vol-502/paper9.pdf |volume=Vol-502 }}
                      External and Intrinsic Plagiarism Detection
                             Using Vector Space Models
    Mario Zechner, Markus Muhr, Roman Kern          Michael Granitzer
                   Know-Center                       Know-Center Graz
                    8010 Graz                  Graz University of Technology
      {mzechner, mmuhr, rkern}@know-center.at      mgranitzer@tugraz.at

        Abstract: Plagiarism detection can be divided into external and intrinsic methods.
        Naive external plagiarism analysis suffers from computationally demanding full
        nearest neighbor searches within a reference corpus. We present a conceptually
        simple space partitioning approach that achieves search times sublinear in the
        number of reference documents, trading precision for speed. We focus on full
        duplicate searches while achieving acceptable results in the near duplicate case.
        Intrinsic plagiarism analysis tries to find plagiarized passages within a document
        without any external knowledge. We use several topic independent stylometric
        features from which a vector space model for each sentence of a suspicious
        document is constructed. Plagiarized passages are detected by an outlier analysis
        relative to the document mean vector. Our system was created for the first PAN
        competition on plagiarism detection in 2009. The evaluation was performed on the
        challenge's development and competition corpora, for which we report our results.
        Keywords: Plagiarism Detection, Nearest Neighbor Search, Stylometry, Outlier
        Detection
1     Introduction

Plagiarism, defined as the theft of intellectual property (Maurer, Kappe, and Zaka, 2006), has been a problem for centuries, not only in academic circles. There exist different forms of plagiarism, ranging from simply copying and pasting original passages to more elaborate paraphrased and translated plagiarism. Anecdotal evidence and studies such as (Sheard et al., 2002) strengthen the suspicion that plagiarism is on the rise, facilitated by new media such as the World Wide Web. Growing information sources ease plagiarism while plagiarism prevention and detection become harder.

To combat these problems the first competition on plagiarism detection was held at the third PAN workshop in 2009, in which we participated. The competition was split into two tasks: external plagiarism detection and intrinsic plagiarism detection. External plagiarism detection deals with the problem of finding plagiarized passages in a suspicious document based on a reference corpus. Intrinsic plagiarism detection does not use external knowledge and tries to identify discrepancies in style within a suspicious document. For both tasks extensive, machine generated training corpora were provided upon which we developed and evaluated our solutions. Our contribution is as follows:

• We present a robust external plagiarism detection method based on indexing by balanced clustering in a high dimensional term vector space, with a focus on full and very near duplicate detection.

• We present an intrinsic plagiarism detection method based on a combined set of stylometric features spanning a vector space, using outlier analysis to determine plagiarized passages without a reference corpus or any other external knowledge source.

• We report our results for both problems on the PAN 09 development and competition corpora.

This paper is outlined as follows: in section 2 we present related work. Section 3 describes our system for the external plagiarism task, section 4 gives insight into our intrinsic plagiarism detection system. In section 5 we describe the PAN 09 dataset and report our results for the external and intrinsic plagiarism tasks. We conclude and give an outlook on future work in section 6.

2     Related Work

External plagiarism detection relies on a reference corpus composed of documents from which passages might have been plagiarized.

Stein, Rosso, Stamatatos, Koppel, Agirre (Eds.): PAN'09, pp. 47-55, 2009.
A passage could be made up of paragraphs, a fixed size block of words, a block of sentences and so on. A suspicious document is checked for plagiarism by searching for passages that are duplicates or near duplicates of passages in documents within the reference corpus. An external plagiarism system then reports these findings to a human controller who decides whether the detected passages are plagiarized or not.

A naive solution to this problem is to compare each passage in a suspicious document to every passage of each document in the reference corpus. This is obviously prohibitive: the reference corpus has to be large in order to find as many plagiarized passages as possible, and this directly translates to very high runtimes for the naive approach.

External plagiarism detection is similar to textual information retrieval (IR) (Baeza-Yates and Ribeiro-Neto, 1999). Given a set of query terms an IR system returns a ranked set of documents from a corpus that best matches the query terms. The most common structure for answering such queries is an inverted index. An external plagiarism detection system using an inverted index indexes passages of the reference corpus' documents. For each passage in a suspicious document a query is sent to the system and the returned ranked list of reference passages is analyzed. Such a system was presented in (Hoad and Zobel, 2003) for finding duplicate or near duplicate documents.

Another method for finding duplicates and near duplicates is based on hashing or fingerprinting. Such methods produce one or more fingerprints that describe the content of a document or passage. A suspicious document's passages are compared to the reference corpus based on their hashes or fingerprints. Duplicate and near duplicate passages are assumed to have similar fingerprints. One of the first systems for plagiarism detection using this schema was presented in (Brin, Davis, and Garcia-Molina, 1995).

External plagiarism detection can also be viewed as a nearest neighbor problem in a vector space R^d. Passages are represented as vectors within this vector space. If passages from the reference corpus are "near enough" to a passage of the suspicious document in this vector space, the passage is marked as potentially plagiarized. The dimensions of the vector space are usually defined by features extracted from the passages, such as terms (usually called the "bag of words" vector space model). This often yields high dimensional vector spaces. Neighbors are defined either by a distance metric D : R^d × R^d → R or by a similarity function S : R^d × R^d → R. Smaller values for D signal nearness while high values for S signal similarity. Nearest neighbor searches in a vector space with a distance metric can be made sublinear by the use of space partitioning techniques that rely on the triangle inequality guaranteed by a metric. Famous partitioning schemes are the Kd-Tree (Bentley, 1975) or the metric tree (Ciaccia, Patella, and Zezula, 1997). However, these techniques degrade to linear searches for high dimensional vector spaces due to the curse of dimensionality: average inter point distances become more similar. This can be somewhat overcome by assuming that the data indeed lies on a lower dimensional manifold which can be captured by dimensionality reduction. In the reduced space partitioning schemas might be applicable again while the original neighborhoods are preserved by the reduction. Common dimensionality reduction approaches are Principal Component Analysis (M.E., 2003) for linear reduction or Isomap (Tenenbaum, de Silva, and Langford, 2000) for non linear reduction. These methods are generally very costly, so other methods for nearest neighbor searches in high dimensional vector spaces have been devised. Locality sensitive hashing (LSH) (Gionis, Indyk, and Motwani, 1999) received a lot of attention in recent years due to its simplicity and theoretical guarantees. LSH is the equivalent of the aforementioned fingerprinting schemes applied to high dimensional vector spaces. The nearest neighbors returned by LSH are only approximate in nature. Another approximate nearest neighbor schema was introduced in (Chierichetti et al., 2007) that uses hierarchical cluster trees for space partitioning.

Intrinsic plagiarism detection only recently received attention from the scientific community. It was first introduced in (Meyer zu Eissen and Stein, 2006) and defined as detecting plagiarized passages in a suspicious document without a reference collection or any other external knowledge. A suspicious document is first decomposed into passages. For each passage a feature vector is constructed. Features are derived from stylometric measures like the average sentence length or the average word length known from the field of authorship analysis. These features have to be topic independent so as to capture the style of an author and not the domain she writes about. Next a difference vector is constructed for each passage that captures the passage's deviation from the document mean vector. Meyer zu Eissen and Stein (2006) assume that a ground truth is given, marking passages actually from the author of the suspicious document. A model is then trained based on one-class classification, using the ground truth as the training set. The model is then used to determine which passages are plagiarized. However, it is not clear how the ground truth is derived from a suspicious document when no information about the document is known beforehand.

3     External Plagiarism Detection

We treat external plagiarism detection as a nearest neighbor search in a high dimensional term vector space. This is motivated by the extensive literature that exists for the nearest neighbor search problem as well as its conceptual simplicity. Our system consists of three stages:

• Vectorization of the passages of each document in the reference corpus and partitioning of the reference corpus vector space.

• Vectorization of the passages of a suspicious document and finding each passage's nearest neighbor(s) in the reference corpus vector space. Detection of plagiarism for each suspicious document is based on its nearest neighbor list via similarity thresholding.

• Post processing of the detected plagiarized passages, merging subsequent plagiarized passages to a single block.

3.1     Reference Corpus Vectorization & Partitioning

We adopt the vector space model for textual data as given in (Salton, Wong, and Yang, 1975). Each unique term in the reference corpus is represented as a dimension in the vector space R^d, where d is the number of unique terms. Instead of creating a single vector for a complete document we create vectors for each sentence in a document, as we want to detect plagiarism on a per sentence level.

We use the OpenNLP Framework¹ for tokenization and sentence splitting. For each sentence in every reference corpus document a term frequency vector based on the sentence's lower cased tokens is constructed, excluding stop words based on a stop word list. We did not apply any stemming or lemmatization. The resulting vectors are then normalized to unit length to overcome differences in sentence length. We use the standard cosine similarity to assess the similarity between sentences, given by:

    cosine similarity(x, y) = ⟨x, y⟩ / (‖x‖ ‖y‖)

where x, y ∈ R^d. The denominator of the above equation can be dropped as all vectors are scaled to unit length.

To achieve sublinear nearest neighbor searches we implement a variation of cluster pruning (Chierichetti et al., 2007). The set of reference corpus sentence vectors is first clustered into l partitions using a balanced online spherical k-means implementation (Zhong, 2005). Balancing is crucial as it provides more equal runtimes when searching for nearest neighbors. The balancing is achieved by introducing a penalty to clusters that have more samples than others during the clustering process. Each sentence vector is associated with a single partition; each partition has a representative centroid vector. Our approach deviates from cluster pruning as presented in (Chierichetti et al., 2007) by associating each sentence vector only with the nearest cluster. Additionally we store a sorted list of similarities for each cluster, holding the similarities between the centroid of the cluster and the sentence vectors associated with that cluster. The resulting structure serves as an index. It allows searching the approximate nearest sentences for a given query sentence.

¹ http://opennlp.sourceforge.net/

3.2     Suspicious Document Vectorization & Plagiarism Detection

We vectorize a suspicious document in the same way we vectorize reference corpus documents. For each sentence vector of a suspicious document we search the reference corpus for the k most similar sentences as follows:

• Determine the nearest cluster to the query sentence based on the cosine similarity between the centroids and the query sentence.

• Find the position in the sorted similarity list the query sentence would be inserted at, based on its similarity to the cluster centroid. Gather k/2 sentence vectors to the left and right of that position in the list.

• Perform the same search in the second nearest cluster, returning k/2 potential nearest neighbors, and merge the two sets of candidate reference corpus sentences.

We justify this procedure as follows: the cluster a vector belongs to is likely to also contain the vector's nearest neighbors. To increase the probability of catching most true nearest neighbors we also use the second nearest cluster. An original sentence in the reference corpus and a full duplicate in a suspicious document will belong to the same cluster, as they have the same vector representation. Consequently both vectors have the same similarity with the centroid of that cluster. The search in a cluster's similarity list should thus return the duplicated sentence. This can fail if there are more than k vectors in the cluster having the same similarity with the centroid as the suspicious sentence vector. The outcome is dependent on the quality of the sentence splitter, as this type of search for full duplicates relies on correct sentence boundaries. The schema will also work in case a sentence was plagiarized with small modifications, albeit with a much lower probability of finding the correct nearest neighbor. We thus expect our system to work well for detecting full duplicates, acceptably for slightly modified sentences, and poorly for highly obfuscated sentences. Figure 1 illustrates the function of the similarity list of a cluster.

With the presented schema we reduce the search for the nearest neighbors of a sentence from being linear in the number of sentences in the reference corpus to roughly O(l + k + k/2) cosine similarity evaluations per sentence, where l is the number of centroids or clusters, k is the number of vectors taken from the nearest cluster's similarity list and k/2 is the number of vectors taken from the second nearest cluster's similarity list.

A sentence of a suspicious document is marked as plagiarized if its cosine similarity to the most similar candidate sentence from the reference corpus exceeds a threshold α. This parameter allows controlling the sensitivity of the system. The outcome of this stage is a set of sentences from the suspicious document that are plagiarized with high probability. The information about the plagiarized sentences' position in the original documents is also retained.

3.3     Post Processing

The final stage assembles the sentences marked as plagiarized in the second stage into continuous blocks. This is accomplished by simply checking whether sentences marked as plagiarized are in sequence in the suspicious document. If this is the case they are merged. This is repeated until no more merging is possible. To further increase the recall, we compare a plagiarized sentence's left and right neighbor sentences to the neighbors of the original sentence. These might have been missed in the nearest neighbor search and now have a chance to be detected. The neighborhood sentences are again compared via the cosine similarity and marked as plagiarized if the similarity to an original sentence is above some threshold β.

4     Intrinsic Plagiarism Detection

Our intrinsic plagiarism detection system is based on the ideas presented in (Meyer zu Eissen and Stein, 2006) and (Grieve, 2007). Meyer zu Eissen and Stein were the first to define the problem of intrinsic plagiarism detection: determine whether passages in a suspicious document are plagiarized based only on changes in style within the document. An author's style is also of importance in the field of authorship classification. Both problems rely on so called stylometric features. These features should be topic and genre independent and reflect an author's style of writing. Changes of style within a document can be detected by various methods. We choose a simple outlier detection scheme based on a vector space spanned by various stylometric features. The system is composed of three stages:

• Vectorization of each sentence in the suspicious document.
Figure 1: A cluster with its centroid, ten associated sentence vectors as well as a query sentence
vector. The similarity sorting induces concentric rings of near equal similarity around the
centroid. The search of the query sentence in the sorted similarity list with k = 4 is illustrated
to the left.
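The sorted-similarity-list lookup that Figure 1 illustrates can be sketched as follows. This is a minimal illustration in Python; the function and variable names are ours, not from the paper, and the per-cluster similarities are assumed to be kept in ascending order:

```python
import bisect

def search_similarity_list(cluster_sims, query_sim, k):
    """Gather candidate positions from one cluster's sorted similarity list.

    cluster_sims: centroid-to-sentence cosine similarities of the cluster,
                  sorted in ascending order.
    query_sim:    cosine similarity between the query sentence and the
                  same cluster centroid.
    k:            number of candidates to gather (k/2 per side).
    """
    # Position at which the query similarity would be inserted.
    pos = bisect.bisect_left(cluster_sims, query_sim)
    # Take k/2 entries to the left and right of that position.
    lo = max(0, pos - k // 2)
    hi = min(len(cluster_sims), pos + k // 2)
    return list(range(lo, hi))
```

The positions returned for the nearest and the second nearest cluster are then mapped back to their sentence vectors, merged, and ranked by exact cosine similarity to the query sentence.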

• Determination of outlier sentences based on the document's mean vector.

• Post processing of the detected outlier sentences.

4.1     Suspicious Document Vectorization

Plagiarism is determined on a per sentence level in our intrinsic plagiarism detection system. However, we chose to use a window of k sentences around a given suspicious sentence. The window is composed of k/2 sentences to the left and right of the sentence. We adopt various stylometric features as presented in (Meyer zu Eissen and Stein, 2006) and (Grieve, 2007) that together form a single vector space:

• Average word frequency class: Each token w in a sentence window is assigned to an average word frequency class. The class is calculated by log(freq_w∗ / freq_w), where freq_w∗ is the absolute number of occurrences of the most frequent word in a huge corpus and freq_w is the number of occurrences of the token in that same corpus. We derived our frequency table from tokenizing the English Wikipedia, resulting in approximately 6 million unique terms. Each average word frequency class is represented by a single dimension in the final vector space. The values in these dimensions specify the number of tokens belonging to that class.

• Punctuation: For each sentence window the number of occurrences of a certain punctuation character is measured. Each punctuation character is represented by a single dimension in the final vector space. The values in these dimensions reflect the frequencies of punctuation characters in the sentence window.²

² We used the following punctuation characters in our experiments: . , ? ! : ; ( ) -

• Part of speech tags: Each token w in a sentence window is assigned a part of speech tag from the Penn Treebank part of speech tag set. Each part of speech tag is represented by a dimension in the final vector space. The values in these dimensions reflect the frequencies of part of speech tags in the sentence window.

• Pronouns: For each sentence window the number of occurrences of a certain pronoun is measured. Each pronoun is represented as a single dimension in the final vector space. The values reflect the
frequencies with which each pronoun occurred in a sentence window.³

• Closed class words: For each sentence window the number of occurrences of a certain stop word is measured. Each stop word is represented by a single dimension in the final vector space. The values reflect the frequencies of stop words in a sentence window. We used the stop word list from the Snowball stemming framework⁴.

³ We used the following pronouns: i, you, he, she, it, we, you, they, me, you, him, her, it, us, you, them, myself, yourself, himself, herself, itself, ourselves, yourselves, themselves, mine, yours, his, hers, its, ours, yours, theirs, my, your, his, her, its, our, your, their.
⁴ http://snowball.tartarus.org/

For each sentence we construct a vector for each stylometric feature space based on the sentence's window. Each vector is normalized to unit length. The vectors of a sentence are then combined to a single vector by concatenation. The resulting vector is again normalized. Based on the final vectors of all sentences we construct a document mean vector.

4.2     Outlier Detection

The outlier detection tries to determine which sentences deviate from the document's mean. We use a simple detection scheme: we measure the cosine similarity from the mean vector to each sentence vector. We smooth the list of similarities by an average window smoothing procedure, where the size of the window l is a parameter. We determine the mean cosine similarity as well as the standard deviation from this list of similarities as:

    mean = (1/n) Σ_{i=1}^{n} cos(v_i, m)

    stddev = sqrt( (1/n) Σ_{i=1}^{n} (cos(v_i, m) − mean)² )

where n is the number of sentences, v_i is the i-th sentence vector in the document, m is the document mean vector and cos is the cosine similarity. The j-th sentence is marked as an outlier if the following inequality holds:

    cos(v_j, m) < mean − ε · stddev

where ε is some small constant ≥ 1, and v_j is the sentence's vector. Marked sentences form the input to the last stage of the system. Figure 2 presents the similarity curve obtained for the suspicious document 5 in the development corpus of the PAN 09 challenge after smoothing.

Figure 2: Similarity curve for a document. The x-axis shows the sentences as they appear in the document, the y-axis shows the similarity of the sentence with the document's mean vector. Filled areas indicate the location of actually plagiarized sentences.

4.3     Post Processing

Based on the sentences that deviate too much from the mean we derive the final blocks of plagiarized passages. As in the case of external plagiarism detection we simply merge all marked sentences that are neighbors until no further merging is possible.

5     Evaluation

We evaluated our system on the development corpora of the PAN 09 plagiarism detection competition. We describe the dataset as well as the precision, recall, granularity and f1-measure results achieved by our system for various parameter settings. The measures are defined as follows:

    precision = (1/|S|) Σ_{i=1}^{|S|} (#detected chars of s_i) / |s_i|

    recall = (1/|R|) Σ_{i=1}^{|R|} (#plagiarized chars of r_i) / |r_i|
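As an illustration of these two character-level measures, the following sketch computes macro-averaged precision and recall with passages simplified to (start, end) character offsets. This simplification is ours, not the paper's, and it assumes the ranges being overlapped against do not overlap each other:

```python
def overlap_chars(passage, others):
    """Number of characters of `passage` covered by the ranges in `others`.

    Passages are (start, end) character offsets, end exclusive; `others`
    is assumed to contain mutually non-overlapping ranges.
    """
    start, end = passage
    return sum(max(0, min(end, e) - max(start, s)) for s, e in others)

def precision(detected, real):
    # Macro average over detected passages s_i: the fraction of each
    # detection that actually lies inside a real plagiarized passage.
    return sum(overlap_chars(s, real) / (s[1] - s[0]) for s in detected) / len(detected)

def recall(detected, real):
    # Macro average over real passages r_i: the fraction of each real
    # passage that is covered by some detection.
    return sum(overlap_chars(r, detected) / (r[1] - r[0]) for r in real) / len(real)
```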
    f1 = (2 · precision · recall) / (precision + recall)

    granularity = log_2 ( 1 + (1/|S_R|) Σ_{i=1}^{|S_R|} #detections of s_i in R )

where S is the set of detected passages and s_i is a single detected passage, R is the set of real plagiarized passages and r_i is a single real plagiarized passage, and S_R is the set of real plagiarized passages for which at least one detection exists. |S|, |R| and |S_R| denote the number of passages in a set; |s_i| and |r_i| denote the number of characters in a passage.

5.1     PAN 09 Development Corpora

For each of the two tasks of the PAN 09 competition a development corpus was available.

The external plagiarism detection dataset consisted of 7214 suspicious documents and as many reference documents. Artificial plagiarized passages were added to the suspicious documents via an artificial plagiarist. For each document the plagiarist decided

• β, the threshold above which the similarity between the neighbors of a marked sentence and the neighbors of the original sentence has to be in order for them to be marked as plagiarized.

Table 5.2 gives the results for various parameter settings. The threshold α was set dynamically depending on the length of a sentence. Very small sentences of five words, which are most likely sentence splitting errors or headlines, are completely ignored. For smaller sentences with fewer than fifteen words, the threshold is set to 0.7. For sentences with up to thirty five words the threshold is set to 0.5; for longer sentences the threshold is set to 0.4. The threshold β for expanding detected blocks is set to 0.4, with the reasoning that it is very unlikely that nearby sentences will have high similarity values by chance without being actually copied. We arrived at these settings via manual trial and error on a small subset of suspi-
whether or not to plagiarize, from which ref-            cious documents.
erence documents to plagiarize, how many                     The results show that the influence of the
passages to plagiarize, which type of plagia-            number of centroids l as well as the num-
rism to use for each passage and how long                ber of candidates k is marginal. In fact only
each passage would be. The type of pla-                  a considerable change of the parameters to
giarism could either be obfuscated or trans-             l = 500 and k = 2000 can affect the recall
lated plagiarism. Obfuscation was achieved               significantly. However, using such high set-
by shuffling and deleting words, inserting                 tings degrades the nearest neighbor search for
words from an external source and replacing              a sentence according to the previously stated
words with synonyms, antonyms, hypernyms                 complexity of O(l + k + k2 ) similarity mea-
or hyponyms. Obfuscation levels ranged from              surements. We believe that the marginal im-
none to low to high.                                     provement of the precision and recall due to
    For the intrinsic plagiarism detection task          higher values of l and k do not compensate
a corpus of 3091 suspicious documents was                the much higher runtime. For larger corpora
available. Plagiarized passages were gener-              l and k would have to be set even higher, fur-
ated similar to the external task.                       ther increasing the runtime. We thus suggest
5.2 External Plagiarism Detection                        using moderate values for l and k and instead
                                                         focus on tuning α and β. Especially β is a
        Results                                          good candidate to improve the overall recall.
We performed a parameter study, evaluating
precision, recall and f1-measure, for 500 ran-           5.3 Intrinsic Plagiarism Detection
domly drawn suspicious documents from the                        Results
external plagiarism development corpus. We
studies the effect of the following parameters:           We evaluated precision, recall and f1-measure
                                                         for all suspicious documents of the internal
   • l, the number of clusters for the index.            plagiarism development corpus for various
                                                         parameter settings. The parameters of the
   • k, the number of candidates taken from
                                                         system are as follows:
     the similarity lists.
  • α, the threshold above which the simi-                  • k, the size of the sentence window
    larity between a suspicious and reference
    sentence has to be so that the suspicious               • l, the size of the smoothing window by
    sentence is marked as plagiarized.                        which the similarity list is smoothed.

     l-k             Precision     Recall     F1-Measure        Granularity    Recall None     Recall Low
     50 - 2           0.9616       0.4045       0.5695            1.9817         0.7044          0.4937
     50 - 20          0.9523       0.4119       0.5750            1.9774         0.7053          0.4983
     50 - 200         0.9411       0.4210       0.5818            1.9738         0.7053          0.5075
     100 - 2          0.9597       0.4101       0.5746            1.9767         0.7044          0.4954
     200 - 2          0.9419       0.4132       0.5745            1.9739         0.7050          0.4988
     500 - 2000       0.8149       0.4782       0.6027            1.8497         0.7027          0.5534
     Competition      0.6051       0.3714       0.4603            2.4424             -              -


Table 1: Results for the extrinsic plagiarism detection system on a split of the development
corpus. The split contains 500 plagiarized documents plus the original documents from which
the plagiarized passages were taken. The last row shows the results on the competition corpus.
The postprocessing was always the same. The first column contains the number of centroids l
and the number of evaluated neighbors k in the similarity list. Each row holds the precision,
recall, f1-measure and granularity over all obfuscation levels. The column Recall None presents
the recall on non-obfuscated plagiarized passages, and Recall Low contains the recall on non-
and low-obfuscated plagiarized passages. The discrepancies between the measures on the
development corpus split and the competition corpus are due to β being too low for the
competition corpus.
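As a concrete reading of the measure definitions in Section 5, the following sketch computes precision, recall, f1-measure and granularity for passages represented as sets of character offsets. The set representation and the edge-case handling are our assumptions for illustration, not the official PAN evaluator; it assumes non-empty passage lists and at least one overlapping detection.

```python
from math import log2

def pan_measures(detected, real):
    """Character-level measures from Section 5 (an illustrative sketch).

    detected, real: non-empty lists of passages, each passage given as a
    set of character offsets.
    """
    det_chars = set().union(*detected)
    real_chars = set().union(*real)
    # precision: average fraction of each detected passage that is plagiarized
    precision = sum(len(s & real_chars) / len(s) for s in detected) / len(detected)
    # recall: average fraction of each plagiarized passage that was detected
    recall = sum(len(r & det_chars) / len(r) for r in real) / len(real)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # granularity: average number of separate detections per detected real
    # passage (assumes S_R, the detected real passages, is non-empty)
    s_r = [r for r in real if r & det_chars]
    avg = sum(sum(1 for s in detected if r & s) for r in s_r) / len(s_r)
    return precision, recall, f1, log2(1 + avg)
```

For example, a single detection covering offsets 0–9 against a real passage covering offsets 5–14 yields precision = recall = f1 = 0.5 and granularity = log2(2) = 1, the ideal granularity.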
      Feature Space (k-l)                          Precision      Recall      F1-Measure     Granularity
      Word Freq. Class (6-3)                        0.2215        0.0934        0.1314            -
      Punctuation (12-9)                            0.1675        0.1908        0.1784            -
      Part of Speech Tags (6-6)                     0.1797        0.1791        0.1794            -
      Pronouns (12-9)                               0.1370        0.3587        0.1983            -
      Closed Class Words (12-9)                     0.1192        0.1467        0.1316            -
      Combined Feature Space (12-6)                 0.1827        0.2637        0.2159            -
      Competition Corpus                            0.1968        0.2724        0.2286         1.2942


Table 2: Results for the intrinsic plagiarism detection system on the development corpus. Each
row holds the precision, recall and f1-measure for a given feature space and parameter setting.
We only present the top performing parameter settings due to space constraints. We did not
evaluate the granularity on the development corpus.
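The outlier criterion of Section 4.2, parameterized by the smoothing window l and the deviation constant ε used in these experiments, might be sketched as follows. The centred smoothing window and the one-sided deviation test are our assumptions about details the text leaves open.

```python
import math

def intrinsic_outliers(sent_vecs, l=3, eps=1.1):
    """Sketch of the outlier detection from Section 4.2.

    sent_vecs: one dict (term -> weight) per sentence. The centred
    smoothing window of half-width l and the one-sided test below the
    mean are assumptions.
    """
    n = len(sent_vecs)
    # mean vector of the document
    mean = {}
    for v in sent_vecs:
        for t, w in v.items():
            mean[t] = mean.get(t, 0.0) + w / n

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        na = math.sqrt(sum(w * w for w in a.values()))
        nb = math.sqrt(sum(w * w for w in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    # similarity of every sentence to the document mean
    sims = [cosine(v, mean) for v in sent_vecs]
    # smooth the similarity list with a centred window of half-width l
    smoothed = []
    for i in range(n):
        window = sims[max(0, i - l):i + l + 1]
        smoothed.append(sum(window) / len(window))
    mu = sum(smoothed) / n
    sd = math.sqrt(sum((x - mu) ** 2 for x in smoothed) / n)
    # sentences deviating by more than eps standard deviations are outliers
    return [i for i, x in enumerate(smoothed) if x < mu - eps * sd]
```

On a toy document of eleven one-term sentences where only the sixth uses a different term, the sixth sentence (and, because of smoothing, possibly its immediate neighbors) is flagged.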

   • ε, the constant by which the standard deviation of the similarity list
     is multiplied.

    Table 2 gives the results for various feature spaces and settings of
the parameters k and l. Parameter ε was fixed at 1.1 for all experiments.
We varied k from 6 to 12 in steps of 2 and l from 3 to 12 in steps of 3.
These ranges arose from preliminary experiments in which they gave the best
results.
    We performed the experiment for each separate feature space in order to
determine which features have high discriminative power. Surprisingly,
pronouns performed very well when compared to more sophisticated features
like the average word frequency class. It should be noted that pronouns do
not fulfill the constraint of genre independence: a play by Shakespeare is
likely to contain many more pronouns than a manual. We can also reproduce
the results of Grieve (2007) for the punctuation feature, which performed
acceptably as well.
    Based on the results for the separate feature spaces, we took the three
best performing spaces, part of speech tags, pronouns and punctuation, as
the basis for the combined vector space. This combined vector space was
then again evaluated on the complete development corpus with varying
parameters, showing significant improvements in all measures compared to
the separate feature spaces. We used the best parameter settings found in
this evaluation for the competition corpus.

6    Conclusion and Future Work
We presented our methods and results for the intrinsic and external
plagiarism tasks of the PAN 09 competition. Our systems performed
acceptably, taking the 5th out of 13 places in the external task, the 3rd
out of 4 places in the intrinsic task, and the 5th overall place out of 13
in the competition.
    We plan on extending our external plagiarism detection system by
incorporating term expansion via synonyms, hyponyms and hypernyms in order
to cope with highly obfuscated plagiarism. We also plan to use a more
sophisticated cluster pruning scheme that is hierarchical in nature, using
Hebbian learning to construct a topology of the search space to further
increase the probability that the true nearest neighbor of a vector can be
determined.
    For future work on the intrinsic plagiarism problem, we aim at a better
outlier detection method. We will also try to analyze and incorporate more
stylometric features, combining them with the best performing features
found in this competition. Dynamically adapting the parameters k and l for
each document as well as for each feature space is also planned.

References
Baeza-Yates, Ricardo and Berthier Ribeiro-Neto. 1999. Modern Information
   Retrieval. Addison Wesley, May.

Bentley, Jon Louis. 1975. Multidimensional binary search trees used for
   associative searching. Commun. ACM, 18(9):509–517.

Brin, S., J. Davis, and H. Garcia-Molina. 1995. Copy detection mechanisms
   for digital documents. In ACM International Conference on Management of
   Data (SIGMOD 1995).

Chierichetti, Flavio, Alessandro Panconesi, Prabhakar Raghavan, Mauro
   Sozio, Alessandro Tiberi, and Eli Upfal. 2007. Finding near neighbors
   through cluster pruning. In PODS ’07: Proceedings of the twenty-sixth
   ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems,
   pages 103–112, New York, NY, USA. ACM.

Ciaccia, Paolo, Marco Patella, and Pavel Zezula. 1997. M-tree: An efficient
   access method for similarity search in metric spaces. In Matthias
   Jarke, Michael J. Carey, Klaus R. Dittrich, Frederick H. Lochovsky,
   Pericles Loucopoulos, and Manfred A. Jeusfeld, editors, VLDB, pages
   426–435. Morgan Kaufmann.

Gionis, Aristides, Piotr Indyk, and Rajeev Motwani. 1999. Similarity
   search in high dimensions via hashing. In VLDB ’99: Proceedings of the
   25th International Conference on Very Large Data Bases, pages 518–529,
   San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.

Grieve, Jack. 2007. Quantitative authorship attribution: An evaluation of
   techniques. Lit Linguist Computing, 22(3):251–270, September.

Hoad, Timothy C. and Justin Zobel. 2003. Methods for identifying versioned
   and plagiarized documents. J. Am. Soc. Inf. Sci. Technol.,
   54(3):203–215.

Maurer, Hermann, Frank Kappe, and Bilal Zaka. 2006. Plagiarism - a survey.
   Journal of Universal Computer Science, 12(8):1050–1084.

Meyer zu Eissen, Sven and Benno Stein. 2006. Intrinsic plagiarism
   detection. In Mounia Lalmas, Andy MacFarlane, Stefan M. Rüger,
   Anastasios Tombros, Theodora Tsikrika, and Alexei Yavlinsky, editors,
   ECIR, volume 3936 of Lecture Notes in Computer Science, pages 565–569.
   Springer.

Salton, G., A. Wong, and C. S. Yang. 1975. A vector space model for
   automatic indexing. Commun. ACM, 18(11):613–620.

Sheard, Judy, Martin Dick, Selby Markham, Ian Macdonald, and Meaghan
   Walsh. 2002. Cheating and plagiarism: perceptions and practices of
   first year IT students. SIGCSE Bull., 34(3):183–187.

Tenenbaum, J. B., V. de Silva, and J. C. Langford. 2000. A global
   geometric framework for nonlinear dimensionality reduction. Science,
   290(5500):2319–2323, December.

Timmerman, M.E. 2003. Principal component analysis (2nd ed.), I. T.
   Jolliffe. Journal of the American Statistical Association,
   98:1082–1083, January.

Zhong, Shi. 2005. Efficient online spherical k-means clustering. In Neural
   Networks, 2005. IJCNN ’05. Proceedings. 2005 IEEE International Joint
   Conference on, volume 5, pages 3180–3185, July–4 Aug.