<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>External and Intrinsic Plagiarism Detection Using Vector Space Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mario Zechner, Markus Muhr, Roman Kern</string-name>
          <email>{mzechner, mmuhr, rkern}@know-center.at</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael Granitzer</string-name>
          <email>mgranitzer@tugraz.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Know-Center Graz, Graz University of Technology</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Know-Center</institution>
          ,
          <addr-line>8010 Graz</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2009</year>
      </pub-date>
      <fpage>47</fpage>
      <lpage>55</lpage>
      <abstract>
        <p>Plagiarism detection can be divided into external and intrinsic methods. Naive external plagiarism analysis suffers from computationally demanding full nearest neighbor searches within a reference corpus. We present a conceptually simple space partitioning approach that achieves search times sublinear in the number of reference documents, trading precision for speed. We focus on full duplicate searches while achieving acceptable results in the near duplicate case. Intrinsic plagiarism analysis tries to find plagiarized passages within a document without any external knowledge. We use several topic independent stylometric features from which a vector space model for each sentence of a suspicious document is constructed. Plagiarized passages are detected by an outlier analysis relative to the document mean vector. Our system was created for the first PAN competition on plagiarism detection in 2009. The evaluation was performed on the challenge's development and competition corpora, for which we report our results.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Plagiarism, defined as the theft of
intellectual property
        <xref ref-type="bibr" rid="ref11 ref9">(Maurer, Kappe, and Zaka,
2006)</xref>
        , has been a problem for centuries not
only in academic circles. There exist
different forms of plagiarism, ranging from simply
copying and pasting original passages to more
elaborate paraphrased and translated
plagiarism. Anecdotal evidence and studies such
as
        <xref ref-type="bibr" rid="ref13">(Sheard et al., 2002)</xref>
        strengthen the
suspicion that plagiarism is on the rise, facilitated
by new media such as the World Wide Web.
Growing information sources ease plagiarism
while plagiarism prevention and detection
become harder.
      </p>
      <p>To combat these problems the first
competition on plagiarism detection was held at
the third PAN workshop in 2009 in which we
participated. The competition was split into
two tasks: external plagiarism detection and
intrinsic plagiarism detection. External
plagiarism detection deals with the problem of
finding plagiarized passages in a suspicious
document based on a reference corpus.
Intrinsic plagiarism detection does not use
external knowledge and tries to identify
discrepancies in style within a suspicious
document. For both tasks extensive, machine
generated training corpora were provided upon
which we developed and evaluated our
solutions. Our contribution is as follows:
• We present a robust external plagiarism
detection method based on indexing by
balanced clustering in a high
dimensional term vector space with a focus on
full and very near duplicate detection.
• We present an intrinsic plagiarism
detection method based on a combined set
of stylometric features spanning a
vector space, using outlier analysis to
determine plagiarized passages without a
reference corpus or any other external
knowledge source.
• We report our results for both problems
on the PAN 09 development and
competition corpora.</p>
      <p>This paper is outlined as follows: in
section 2 we present related work. Section 3
describes our system for the external plagiarism
task, section 4 gives insight on our intrinsic
plagiarism detection system. In section 5 we
describe the PAN 09 dataset and report our
results for the external and intrinsic
plagiarism tasks. We conclude and give an outlook
on future work in section 6.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>External plagiarism detection relies on a
reference corpus composed of documents from
which passages might have been plagiarized.
A passage could be made up of paragraphs,
a fixed size block of words, a block of
sentences and so on. A suspicious document is
checked for plagiarism by searching for
passages that are duplicates or near duplicates
of passages in documents within the reference
corpus. An external plagiarism system then
reports these findings to a human controller
who decides whether the detected passages
are plagiarized or not.</p>
      <p>A naive solution to this problem is to
compare each passage in a suspicious document
to every passage of each document in the
reference corpus. This is obviously prohibitive.
The reference corpus has to be large in order
to find as many plagiarized passages as
possible. This fact directly translates to very high
runtimes when using the naive approach.</p>
      <p>
        External plagiarism detection is similar
to textual information retrieval (IR)
        <xref ref-type="bibr" rid="ref1 ref6">(Baeza-Yates and Ribeiro-Neto, 1999)</xref>
        . Given a set
of query terms an IR system returns a ranked
set of documents from a corpus that best
matches the query terms. The most common
structure for answering such queries is an
inverted index. An external plagiarism
detection system using an inverted index indexes
passages of the reference corpus’ documents.
For each passage in a suspicious document a
query is sent to the system and the returned
ranked list of reference passages is analyzed.
Such a system was presented in
        <xref ref-type="bibr" rid="ref8">(Hoad and
Zobel, 2003)</xref>
        for finding duplicate or near
duplicate documents.
      </p>
      <p>
        Another method for finding duplicates
and near duplicates is based on hashing or
fingerprinting. Such methods produce one
or more fingerprints that describe the
content of a document or passage. A
suspicious document’s passages are compared to
the reference corpus based on their hashes
or fingerprints. Duplicate and near duplicate
passages are assumed to have similar
fingerprints. One of the first systems for plagiarism
detection using this schema was presented in
        <xref ref-type="bibr" rid="ref3">(Brin, Davis, and Garcia-Molina, 1995)</xref>
        .
      </p>
      <p>
        External plagiarism detection can also be
viewed as a nearest neighbor problem in a
vector space R^d. Passages are represented as
vectors within this vector space. If
passages from the reference corpus are ”near
enough” to a passage of the suspicious
document in this vector space, the passage is
marked as potentially plagiarized. The
dimensions of the vector space are usually
defined by features extracted from the
passages such as terms (usually called the ”bag
of words” vector space model). This often
yields high dimensional vector spaces.
Neighbors are defined either by a distance
metric D : R^d × R^d → R or by a similarity
function S : R^d × R^d → R. Smaller values
for D signal nearness while high values for S
signal similarity. Nearest neighbor searches
in a vector space with a distance metric can
be made sublinear by the use of space
partitioning techniques that rely on the triangle
inequality guaranteed by a metric. Famous partitioning
schemes are the Kd-Tree
        <xref ref-type="bibr" rid="ref2">(Bentley, 1975)</xref>
        or
the metric tree
        <xref ref-type="bibr" rid="ref5">(Ciaccia, Patella, and Zezula,
1997)</xref>
        . However, these techniques degrade to
linear searches for high dimensional vector
spaces due to the curse of dimensionality:
average inter point distances become more
similar. This can be somewhat overcome by
assuming that the data indeed lies on a lower
dimensional manifold which can be captured
by dimensionality reduction. In the reduced
space partitioning schemas might be
applicable again while the original neighborhoods
are preserved by the reduction. Common
dimensionality reduction approaches are
Principal Component Analysis
        <xref ref-type="bibr" rid="ref10">(M.E., 2003)</xref>
        for
linear reduction or Isomap
        <xref ref-type="bibr" rid="ref14">(Tenenbaum, de
Silva, and Langford, 2000)</xref>
        for non linear
reduction. These methods are generally very
costly so other methods for nearest neighbor
searches in high dimensional vector spaces
have been devised. Locality sensitive hashing
(LSH)
        <xref ref-type="bibr" rid="ref1 ref6">(Gionis, Indyk, and Motwani, 1999)</xref>
        received a lot of attention in recent years due
to its simplicity and theoretical guarantees.
LSH is the equivalent of the aforementioned
fingerprinting schemes applied to high
dimensional vector spaces. The nearest neighbors
returned by LSH are only approximate in
nature. Another approximate nearest neighbor
schema was introduced in
        <xref ref-type="bibr" rid="ref4">(Chierichetti et al.,
2007)</xref>
        that uses hierarchical cluster trees for
space partitioning.
      </p>
      <p>
        Intrinsic plagiarism detection only
recently received attention from the
scientific community. It was first introduced in
        <xref ref-type="bibr" rid="ref11 ref9">(Meyer zu Eissen and Stein, 2006)</xref>
        and
defined as detecting plagiarized passages in a
suspicious document without a reference
collection or any other external knowledge. A
suspicious document is first decomposed into
passages. For each passage a feature vector is
constructed. Features are derived from
stylometric measures like the average sentence
length or the average word length known
from the field of authorship analysis. These
features have to be topic independent so as
to capture the style of an author and not the
domain she writes about. Next a difference
vector is constructed for each passage that
captures the passage’s deviation from the
document mean vector. Meyer zu Eissen and
Stein (2006) assume that a ground truth is
given, marking passages actually from the
author of the suspicious document. A model is
then trained based on one-class classification,
using the ground truth as the training set.
The model is then used to determine which
passages are plagiarized. However, it is not
clear how the ground truth is derived from
a suspicious document when no information
about the document is known beforehand.
      </p>
      <p>3 External Plagiarism Detection</p>
      <p>We treat external plagiarism detection as a
nearest neighbor search in a high dimensional
term vector space. This is motivated by the
extensive literature that exists for the nearest
neighbor search problem as well as its
conceptual simplicity. Our system consists of three
stages:
• Vectorization of the passages of each
document in the reference corpus and
partitioning of the reference corpus
vector space.
• Vectorization of the passages of a
suspicious document and finding each
passage’s nearest neighbor(s) in the
reference corpus vector space. Detection of
plagiarism for each suspicious document
is based on its nearest neighbor list via
similarity thresholding.
• Post processing of the detected
plagiarized passages, merging subsequent
plagiarized passages to a single block.
</p>
      <sec id="sec-2-2">
        <title>3.1 Reference Corpus Vectorization &amp; Partitioning</title>
        <p>
          We adopt the vector space model for textual
data as given in
          <xref ref-type="bibr" rid="ref12">(Salton, Wong, and Yang,
1975)</xref>
          . Each unique term in the reference
corpus is represented as a dimension in the
vector space R^d, where d is the number of
unique terms. Instead of creating a single
vector for a complete document we create
vectors for each sentence in a document as we
want to detect plagiarism on a per sentence
level.
        </p>
        <p>We use the OpenNLP framework
(http://opennlp.sourceforge.net/) for
tokenization and sentence splitting. For each
sentence in every reference corpus document
a term frequency vector based on the
sentence’s lower cased tokens is constructed,
excluding stop words based on a stop word list.
We did not apply any stemming or
lemmatization. The resulting vectors are then
normalized to unit length to overcome differences
in sentence length. We use the standard
cosine similarity to assess the similarity
between sentences, given by
cosine_similarity(x, y) = ⟨x, y⟩ / (‖x‖ ‖y‖),
where x, y ∈ R^d. The denominator of the
above equation can be dropped as all vectors
are scaled to unit length.</p>
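        <p>As a minimal sketch of this vectorization and similarity computation (plain Python with a tiny hypothetical stop word list, not the authors' OpenNLP-based pipeline; helper names are ours):</p>
        <p>
```python
import math
from collections import Counter

STOP_WORDS = {"the", "on", "a", "of", "in"}  # tiny illustrative stop list

def vectorize(sentence):
    """Lower-cased term-frequency vector without stop words, unit length."""
    tokens = [t for t in sentence.lower().split() if t not in STOP_WORDS]
    tf = Counter(tokens)
    norm = math.sqrt(sum(v * v for v in tf.values()))
    return {t: v / norm for t, v in tf.items()}

def cosine(x, y):
    """For unit-length vectors the cosine reduces to a dot product."""
    return sum(w * y.get(t, 0.0) for t, w in x.items())

a = vectorize("The cat sat on the mat")
print(round(cosine(a, a), 6))  # 1.0: identical sentences are full duplicates
```
        </p>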
        <p>
To achieve sublinear nearest neighbor
searches we implement a variation of
cluster pruning
          <xref ref-type="bibr" rid="ref4">(Chierichetti et al., 2007)</xref>
          . The
set of reference corpus sentence vectors is
first clustered into l partitions using a
balanced online spherical k-means
implementation
          <xref ref-type="bibr" rid="ref15">(Zhong, 2005)</xref>
          . Balancing is crucial as it
provides more equal runtimes when
searching for nearest neighbors. The balancing is
achieved by introducing a penalty to clusters
that have more samples than others during
the clustering process. Each sentence
vector is associated with a single partition, each
partition has a representative centroid vector.
Our approach deviates from cluster pruning
as presented in
          <xref ref-type="bibr" rid="ref4">(Chierichetti et al., 2007)</xref>
          by
associating each sentence vector only with
the nearest cluster. Additionally we store
a sorted list of similarities for each cluster,
holding the similarities between the centroid
of the cluster and the sentence vectors
associated with that cluster. The resulting
structure serves as an index. It allows searching
the approximate nearest sentences for a given
query sentence.
        </p>
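        <p>A simplified sketch of building such an index (plain Python; a crude size-penalty heuristic stands in for the balanced online spherical k-means of Zhong (2005), and the function names are ours):</p>
        <p>
```python
import math, random

def unit(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def build_index(vectors, l, iters=10, penalty=0.01, seed=0):
    """Cluster unit vectors into l partitions, penalizing crowded clusters
    to keep sizes roughly equal, then store each cluster's members sorted
    by similarity to its centroid (the similarity list of the paper)."""
    rng = random.Random(seed)
    centroids = [vectors[i][:] for i in rng.sample(range(len(vectors)), l)]
    assign = [0] * len(vectors)
    for _ in range(iters):
        counts = [0] * l
        for i, v in enumerate(vectors):
            # size penalty encourages balanced partitions
            best = max(range(l), key=lambda c: dot(v, centroids[c]) - penalty * counts[c])
            assign[i] = best
            counts[best] += 1
        for c in range(l):
            members = [vectors[i] for i in range(len(vectors)) if assign[i] == c]
            if members:
                centroids[c] = unit([sum(col) for col in zip(*members)])
    index = {c: sorted(((dot(vectors[i], centroids[c]), i)
                        for i in range(len(vectors)) if assign[i] == c), reverse=True)
             for c in range(l)}
    return centroids, index
```
        </p>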
      </sec>
      <sec id="sec-2-5">
        <title>3.2 Suspicious Document Vectorization &amp; Plagiarism Detection</title>
        <p>We vectorize a suspicious document in the
same way we vectorize reference corpus
documents. For each sentence vector of a
suspicious document we search the reference
corpus for the k most similar sentences as
follows:
• Determine the nearest cluster to the
query sentence based on the cosine
similarity between the centroids and the
query sentence.
• Find the position in the sorted similarity
list the query sentence would be inserted
at based on its similarity to the cluster
centroid. Gather k/2 sentence vectors to
the left and right of that position in the
list.
• Perform the same search in the
second nearest cluster, returning k/2 potential
nearest neighbors, and merge the two sets
of candidate reference corpus sentences.</p>
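        <p>The search steps above can be sketched as follows (a minimal Python illustration under the assumption that the index maps each cluster to a descending list of (similarity-to-centroid, sentence id) pairs, as described for the previous stage; helper names are ours):</p>
        <p>
```python
import bisect

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def query(q, centroids, index, k):
    """Gather candidates from the two nearest clusters: bisect into the
    sorted similarity list at the query's own centroid similarity and
    take up to k/2 entries on each side."""
    ranked = sorted(range(len(centroids)),
                    key=lambda c: dot(q, centroids[c]), reverse=True)
    candidates = []
    for c in ranked[:2]:
        sims = [-s for s, _ in index[c]]               # ascending for bisect
        pos = bisect.bisect_left(sims, -dot(q, centroids[c]))
        lo = max(0, pos - k // 2)
        candidates.extend(sid for _, sid in index[c][lo:pos + k // 2])
    return candidates
```
        </p>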
        <p>We justify this procedure as follows: The
cluster a vector belongs to is likely to also
contain the vector’s nearest neighbors. To
increase the probability of catching most true
nearest neighbors we also use the second
nearest cluster. An original sentence in the
reference corpus and a full duplicate in a
suspicious document will belong to the same
cluster as they have the same vector
representation. Consequently both vectors have the
same similarity with the centroid of that
cluster. The search in a cluster’s similarity list
should thus return the duplicated sentence.
This can fail if there are more than k
vectors in the cluster having the same similarity
with the centroid as the suspicious sentence
vector. The outcome is dependent on the
quality of the sentence splitter as this type of
search for full duplicates relies on correct
sentence boundaries. The schema will also work
in case a sentence was plagiarized with small
modifications, albeit with a much lower
probability of finding the correct nearest
neighbor. We thus expect our system to work well
for detecting full duplicates, acceptable in the
case of slightly modified sentences and poorly
for highly obfuscated sentences. Figure 1
illustrates the function of the similarity list of
a cluster.</p>
        <p>With the presented schema we can
reduce the search for the nearest neighbors of a
sentence from being linear in the number of
sentences in the reference corpus to roughly
O(l + k + k/2) cosine similarity evaluations
per sentence, where l is the number of
centroids or clusters, k is the number of vectors
taken from the nearest cluster’s similarity list
and k/2 is the number of vectors taken from the
second nearest cluster’s similarity list.</p>
        <p>A sentence of a suspicious document is
marked as plagiarized if its cosine
similarity to the most similar candidate sentence
from the reference corpus exceeds a
threshold α. This parameter allows controlling the
sensitivity of the system. The outcome of
this stage is a set of sentences from the
suspicious document that are plagiarized with
high probability. The information about the
plagiarized sentences’ position in the original
documents is also retained.</p>
      </sec>
      <sec id="sec-2-6">
        <title>3.3 Post Processing</title>
        <p>The final stage assembles the sentences
marked as plagiarized in the second stage to
continuous blocks. This is accomplished by
simply checking whether sentences marked
as plagiarized are in sequence in the
suspicious document. If this is the case they are
merged. This is repeated until no more
merging is possible. To further increase the
recall, we compare a plagiarized sentence’s left
and right neighbor sentences to the neighbors
of the original sentence. These might have
been missed in the nearest neighbor search
and have now a chance to be detected. The
neighborhood sentences are again compared
via the cosine similarity and marked as
plagiarized if the similarity to an original
sentence is above some threshold β.
</p>
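        <p>The merging step can be sketched as follows (a toy Python illustration over per-sentence flags; the actual system also tracks character offsets and original documents):</p>
        <p>
```python
def merge_blocks(flags):
    """Merge runs of consecutive sentences marked as plagiarized (True)
    into (start, end) blocks, end exclusive."""
    blocks, start = [], None
    for i, f in enumerate(flags):
        if f and start is None:
            start = i
        if not f and start is not None:
            blocks.append((start, i))
            start = None
    if start is not None:
        blocks.append((start, len(flags)))
    return blocks

print(merge_blocks([False, True, True, False, True]))  # [(1, 3), (4, 5)]
```
        </p>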
        <p>4 Intrinsic Plagiarism Detection</p>
        <p>
          Our intrinsic plagiarism detection system is
based on the ideas presented in
          <xref ref-type="bibr" rid="ref11 ref9">(Meyer zu
Eissen and Stein, 2006)</xref>
          and
          <xref ref-type="bibr" rid="ref7">(Grieve, 2007)</xref>
          .
Meyer zu Eissen and Stein were the first to
define the problem of intrinsic plagiarism
detection: determine whether passages in a
suspicious document are plagiarized based only
on changes in style within the document. An
author’s style is also of importance in the field
of authorship classification. Both problems
rely on so called stylometric features. These
features should be topic and genre
independent and reflect an author’s style of writing.
Changes of style within a document can be
detected by various methods. We choose a
simple outlier detection scheme based on a
vector space spanned by various stylometric
features. The system is composed of 3 stages:
• Vectorization of each sentence in the
suspicious document.
• Determination of outlier sentences based
on the document’s mean vector.
• Post processing of the detected outlier
sentences.</p>
      </sec>
      <sec id="sec-2-8">
        <title>4.1 Suspicious Document Vectorization</title>
        <p>
          Plagiarism is determined on a per sentence
level in our intrinsic plagiarism detection
system. However, we chose to use a window of
k sentences around a given suspicious
sentence. The window is composed of k/2
sentences to the left and right of the sentence.
We adopt various stylometric features as
presented in
          <xref ref-type="bibr" rid="ref11 ref9">(Meyer zu Eissen and Stein, 2006)</xref>
          and
          <xref ref-type="bibr" rid="ref7">(Grieve, 2007)</xref>
          that together form a
single vector space:
• Average word frequency class: Each
token w in a sentence window is
assigned to an average word frequency
class. The class is calculated as
log(freq_w* / freq_w), where freq_w* is the
absolute number of occurrences of the
most frequent word in a huge corpus and
freq_w is the number of occurrences of the
token in that same corpus. We derived
our frequency table from tokenizing the
English Wikipedia, resulting in
approximately 6 million unique terms. Each
average word frequency class is represented
by a single dimension in the final
vector space. The values in these dimensions
specify the number of tokens belonging
to that class.
• Punctuation: For each sentence
window the number of occurrences of a
certain punctuation character is measured
(we used the characters . , ? ! : ; ( ) in
our experiments). Each punctuation
character is represented by a single
dimension in the final vector space. The values
in these dimensions reflect the frequencies of
punctuation characters in the sentence window.
• Part of speech tags: Each token w in
a sentence window is assigned a part of
speech tag from the Penn Treebank part
of speech tag set. Each part of speech
tag is represented by a dimension in the
final vector space. The values in these
dimensions reflect the frequencies of part
of speech tags in the sentence window.
• Pronouns: For each sentence window
the number of occurrences of a certain
pronoun is measured (we used the personal,
possessive and reflexive pronouns: i, you,
he, she, it, we, they, me, him, her, us,
them, myself, yourself, himself, herself,
itself, ourselves, yourselves, themselves,
mine, yours, his, hers, its, ours, theirs,
my, your, our, their). Each pronoun is
represented as a single dimension in the
final vector space. The values reflect the
frequencies with which each pronoun
occurred in a sentence window.
• Closed class words: For each sentence
window the number of occurrences of a
certain stop word is measured. Each
stop word is represented by a single
dimension in the final vector space. The
values reflect the frequencies of stop
words in a sentence window. We used
the stop word list from the Snowball
stemming framework
(http://snowball.tartarus.org/).
        </p>
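        <p>The average word frequency class can be illustrated as follows (toy corpus counts; we assume a base-2 logarithm and a default count of 1 for unseen tokens, neither of which the text states):</p>
        <p>
```python
import math

# toy corpus counts; assumption: "the" is the most frequent word
FREQ = {"the": 1000000, "cat": 5000, "sesquipedalian": 4}

def word_frequency_class(token, freq=FREQ):
    """Class of a token: log of the ratio between the most frequent word's
    count and the token's count; rare words get high classes."""
    top = max(freq.values())
    return int(math.log(top / freq.get(token, 1), 2))

print(word_frequency_class("the"))  # 0
print(word_frequency_class("cat"))  # 7
```
        </p>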
        <p>For each sentence we construct a vector
for each stylometric feature space based on
the sentence’s window. Each vector is
normalized to unit length. The vectors of a
sentence are then combined to a single vector by
concatenation. The resulting vector is again
normalized. Based on the final vectors of all
sentences we construct a document mean
vector.</p>
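        <p>The concatenation and renormalization described above can be sketched as (a minimal Python illustration; helper names are ours):</p>
        <p>
```python
import math

def unit(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)

def combine(feature_vectors):
    """Normalize each per-feature vector, concatenate them, and
    normalize the combined vector again."""
    combined = []
    for fv in feature_vectors:
        combined.extend(unit(fv))
    return unit(combined)

v = combine([[3.0, 4.0], [1.0, 0.0, 0.0]])
print(round(sum(x * x for x in v), 6))  # 1.0: the combined vector is unit length
```
        </p>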
      </sec>
      <sec id="sec-2-9">
        <title>4.2 Outlier Detection</title>
        <p>The outlier detection tries to determine
which sentences deviate from the document’s
mean. We use a simple detection scheme: we
measure the cosine similarity from the mean
vector to each sentence vector. We smooth
the list of similarities by an average window
smoothing procedure, where the size of the
window l is a parameter. We determine the
mean cosine similarity as well as the standard
deviation from this list of similarities as:
mean = (1/n) ∑_{i=1}^{n} cos(v_i, m)
stddev = sqrt((1/n) ∑_{i=1}^{n} (cos(v_i, m) − mean)^2)
where n is the number of sentences, v_i is
the ith sentence vector in the document, m
is the document mean vector and cos is the
cosine similarity. The jth sentence is marked
as an outlier if the following inequality holds:
cos(v_j, m) &lt; mean − ε ∗ stddev
where ε is some small constant ≥ 1, and
v_j is the sentence’s vector. Marked sentences
form the input to the last stage of the
system. Figure 2 presents the similarity curve
obtained for suspicious document 5 in the
development corpus of the PAN 09 challenge
after smoothing.</p>
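        <p>The smoothing and outlier criterion can be sketched as follows (a minimal Python illustration with ε written as eps; comparisons are written flipped, and the function names are ours):</p>
        <p>
```python
import math

def smooth(sims, l):
    """Average-window smoothing of the similarity list; l is the window size."""
    out = []
    for i in range(len(sims)):
        lo = max(0, i - l // 2)
        hi = min(len(sims), i + l // 2 + 1)
        out.append(sum(sims[lo:hi]) / (hi - lo))
    return out

def mark_outliers(sims, eps=1.0):
    """Mark sentence j as an outlier when its similarity to the document
    mean falls below mean - eps * stddev."""
    n = len(sims)
    mean = sum(sims) / n
    stddev = math.sqrt(sum((s - mean) ** 2 for s in sims) / n)
    return [mean - eps * stddev > s for s in sims]

flags = mark_outliers(smooth([0.9, 0.9, 0.9, 0.2, 0.9, 0.9], 1))
print(flags)  # only the dip at position 3 is marked
```
        </p>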
      </sec>
      <sec id="sec-2-10">
        <title>4.3 Post Processing</title>
        <p>Based on the sentences that deviate too much
from the mean we derive the final blocks of
plagiarized passages. As in the case of
external plagiarism detection we simply merge all
marked sentences that are neighbors until no
further merging is possible.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Evaluation</title>
      <p>We evaluated our system on the development
corpora of the PAN 09 plagiarism detection
competition. We give a description of the
dataset as well as precision, recall,
granularity and f1-measure results achieved by our
system for various parameter settings. The
measures are defined as follows:
precision = (1/|S|) ∑_{i=1}^{|S|} (#detected chars of s_i) / |s_i|
recall = (1/|R|) ∑_{i=1}^{|R|} (#plagiarized chars of r_i) / |r_i|
granularity = (1/|S_R|) ∑_{i=1}^{|S_R|} log2(1 + #detections of s_i in R)
f1 = (2 ∗ precision ∗ recall) / (precision + recall)
where S is the set of detected passages and
s_i is a single detected passage, R is the set
of real plagiarized passages and r_i is a single
plagiarized passage, and S_R is the set of real
plagiarized passages for which at least one
detection exists. |S|, |R| and |S_R| denote
the number of passages in a set; |s_i| and |r_i|
denote the number of characters in a passage.</p>
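      <p>The measures can be illustrated on character spans as follows (a sketch under the simplifying assumption that passages within each set do not overlap; helper names are ours):</p>
      <p>
```python
import math

def evaluate(detected, real):
    """Character-level precision, recall, granularity and f1 over
    (start, end) passage spans, end exclusive."""
    def overlap(a, b):
        return max(0, min(a[1], b[1]) - max(a[0], b[0]))
    precision = sum(sum(overlap(s, r) for r in real) / (s[1] - s[0])
                    for s in detected) / len(detected)
    recall = sum(sum(overlap(s, r) for s in detected) / (r[1] - r[0])
                 for r in real) / len(real)
    found = [r for r in real if any(overlap(s, r) > 0 for s in detected)]
    granularity = sum(math.log2(1 + sum(1 for s in detected if overlap(s, r) > 0))
                      for r in found) / len(found)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, granularity, f1

print(evaluate([(0, 10)], [(0, 10)]))  # perfect detection: (1.0, 1.0, 1.0, 1.0)
```
      </p>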
      <sec id="sec-3-1">
        <title>5.1 PAN 09 Development Corpora</title>
        <p>For each of the two tasks of the PAN 09
competition a development corpus was available.</p>
        <p>The external plagiarism detection dataset
consisted of 7214 suspicious documents and
as many reference documents. Artificial
plagiarized passages were added to the
suspicious documents via an artificial plagiarist.
For each document the plagiarist decided
whether or not to plagiarize, from which
reference documents to plagiarize, how many
passages to plagiarize, which type of
plagiarism to use for each passage and how long
each passage would be. The type of
plagiarism could either be obfuscated or
translated plagiarism. Obfuscation was achieved
by shuffling and deleting words, inserting
words from an external source and replacing
words with synonyms, antonyms, hypernyms
or hyponyms. Obfuscation levels ranged from
none to low to high.</p>
        <p>For the intrinsic plagiarism detection task
a corpus of 3091 suspicious documents was
available. Plagiarized passages were
generated similarly to the external task.</p>
      </sec>
      <sec id="sec-3-3">
        <title>5.2 External Plagiarism Detection Results</title>
        <p>
          We performed a parameter study, evaluating
precision, recall and f1-measure, for 500
randomly drawn suspicious documents from the
external plagiarism development corpus. We
studied the effect of the following parameters:
• l, the number of clusters for the index.
• k, the number of candidates taken from
the similarity lists.
• α, the threshold that the similarity
between a suspicious and a reference
sentence must exceed for the suspicious
sentence to be marked as plagiarized.
• β, the threshold that the similarity
between the neighbors of a marked
sentence and the neighbors of the
original sentence must exceed in order to be
marked as plagiarized.
• k, the size of the sentence window.
• l, the size of the smoothing window by
which the similarity list is smoothed.
• ε, the constant by which the standard
deviation of the similarity list is
multiplied.
(A table in the original lists the evaluated
parameter combinations l–k: 50–2, 50–20,
50–200, 100–2, 200–2, 500–2000, and the
competition setting.)
We can also reproduce the results by
          <xref ref-type="bibr" rid="ref7">Grieve (2007)</xref>
          for the punctuation feature
which performed acceptably as well.
        </p>
        <p>Based on the results for the separate
feature spaces we took the three best
performing spaces, part of speech tags, pronouns
and punctuation as the basis for the
combined vector space. This combined vector
space was then again evaluated on the
complete development corpus with varying
parameters, showing significant improvements
in all measures compared to the separate
feature spaces. We used the best parameter
settings found in this evaluation for the
competition corpus.
6 Conclusion and Future Work
We presented our methods and results for the
intrinsic and external plagiarism task of the
PAN 09 competition. Our systems performed
acceptably, taking 5th out of 13 places in
the external task, 3rd out of 4 places in
the intrinsic task, and 5th place overall
out of 13 in the competition.</p>
        <p>We plan on extending our external
plagiarism detection system by incorporating term
expansion via synonyms, hyponyms and
hypernyms in order to cope with highly
obfuscated plagiarism. We also plan to use a more
sophisticated cluster pruning scheme that is
hierarchical in nature, using Hebbian learning
to construct a topology of the search space
to further increase the probability that the
true nearest neighbor of a vector can be
determined.</p>
        <p>For future work for the intrinsic plagiarism
problem, we aim at a better outlier
detection method. We will also try to analyze and
incorporate more stylometric features,
combining them with the best performing
features found in this competition. Dynamically
adapting the parameters k and l for each
document as well as for each feature space is also
planned.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Baeza-Yates</surname>
          </string-name>
          ,
          <source>Ricardo and Berthier Ribeiro-Neto</source>
          .
          <year>1999</year>
          .
          <article-title>Modern Information Retrieval</article-title>
          . Addison Wesley, May.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Bentley</surname>
            ,
            <given-names>Jon</given-names>
          </string-name>
          <string-name>
            <surname>Louis</surname>
          </string-name>
          .
          <year>1975</year>
          .
          <article-title>Multidimensional binary search trees used for associative searching</article-title>
          .
          <source>Commun. ACM</source>
          ,
          <volume>18</volume>
          (
          <issue>9</issue>
          ):
          <fpage>509</fpage>
          -
          <lpage>517</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Brin</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Davis</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Garcia-Molina</surname>
          </string-name>
          .
          <year>1995</year>
          .
          <article-title>Copy detection mechanisms for digital documents</article-title>
          .
          <source>In ACM International Conference on Management of Data (SIGMOD</source>
          <year>1995</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Chierichetti</surname>
            ,
            <given-names>Flavio</given-names>
          </string-name>
          , Alessandro Panconesi, Prabhakar Raghavan, Mauro Sozio, Alessandro Tiberi, and
          <string-name>
            <given-names>Eli</given-names>
            <surname>Upfal</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Finding near neighbors through cluster pruning</article-title>
          .
          In
          <source>PODS '07: Proceedings of the twenty-sixth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems</source>
          , pages
          <fpage>103</fpage>
          -
          <lpage>112</lpage>
          , New York, NY, USA. ACM.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Ciaccia</surname>
            ,
            <given-names>Paolo</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Marco</given-names>
            <surname>Patella</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Pavel</given-names>
            <surname>Zezula</surname>
          </string-name>
          .
          <year>1997</year>
          .
          <article-title>M-tree: An efficient access method for similarity search in metric spaces</article-title>
          . In
          <string-name>
            <given-names>Matthias</given-names>
            <surname>Jarke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Michael J.</given-names>
            <surname>Carey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Klaus R.</given-names>
            <surname>Dittrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Frederick H.</given-names>
            <surname>Lochovsky</surname>
          </string-name>
          , Pericles Loucopoulos, and Manfred A. Jeusfeld, editors,
          <source>VLDB</source>
          , pages
          <fpage>426</fpage>
          -
          <lpage>435</lpage>
          . Morgan Kaufmann.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Gionis</surname>
            ,
            <given-names>Aristides</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Piotr</given-names>
            <surname>Indyk</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Rajeev</given-names>
            <surname>Motwani</surname>
          </string-name>
          .
          <year>1999</year>
          .
          <article-title>Similarity search in high dimensions via hashing</article-title>
          .
          In
          <source>VLDB '99: Proceedings of the 25th International Conference on Very Large Data Bases</source>
          , pages
          <fpage>518</fpage>
          -
          <lpage>529</lpage>
          , San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Grieve</surname>
            ,
            <given-names>Jack</given-names>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Quantitative authorship attribution: An evaluation of techniques</article-title>
          .
          <source>Lit Linguist Computing</source>
          ,
          <volume>22</volume>
          (
          <issue>3</issue>
          ):
          <fpage>251</fpage>
          -
          <lpage>270</lpage>
          , September.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Hoad</surname>
            ,
            <given-names>Timothy C.</given-names>
          </string-name>
          and
          <string-name>
            <given-names>Justin</given-names>
            <surname>Zobel</surname>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>Methods for identifying versioned and plagiarized documents</article-title>
          .
          <source>J. Am. Soc. Inf. Sci. Technol.</source>
          ,
          <volume>54</volume>
          (
          <issue>3</issue>
          ):
          <fpage>203</fpage>
          -
          <lpage>215</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Maurer</surname>
            ,
            <given-names>Hermann</given-names>
          </string-name>
          , Frank Kappe, and
          <string-name>
            <given-names>Bilal</given-names>
            <surname>Zaka</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>Plagiarism - a survey</article-title>
          .
          <source>Journal of Universal Computer Science</source>
          ,
          <volume>12</volume>
          (
          <issue>8</issue>
          ):
          <fpage>1050</fpage>
          -
          <lpage>1084</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Timmerman</surname>
            ,
            <given-names>M. E.</given-names>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>Principal component analysis (2nd ed.), I. T. Jolliffe</article-title>
          .
          <source>Journal of the American Statistical Association</source>
          ,
          <volume>98</volume>
          :
          <fpage>1082</fpage>
          -
          <lpage>1083</lpage>
          , January.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Meyer zu Eissen</surname>
            ,
            <given-names>Sven</given-names>
          </string-name>
          and
          <string-name>
            <given-names>Benno</given-names>
            <surname>Stein</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>Intrinsic plagiarism detection</article-title>
          . In Mounia Lalmas,
          <string-name>
            <given-names>Andy</given-names>
            <surname>MacFarlane</surname>
          </string-name>
          , Stefan M. Rüger, Anastasios Tombros, Theodora Tsikrika, and Alexei Yavlinsky, editors,
          <source>ECIR</source>
          , volume
          <volume>3936</volume>
          of Lecture Notes in Computer Science, pages
          <fpage>565</fpage>
          -
          <lpage>569</lpage>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>Salton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Wong</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Yang</surname>
          </string-name>
          .
          <year>1975</year>
          .
          <article-title>A vector space model for automatic indexing</article-title>
          .
          <source>Commun. ACM</source>
          ,
          <volume>18</volume>
          (
          <issue>11</issue>
          ):
          <fpage>613</fpage>
          -
          <lpage>620</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <surname>Sheard</surname>
            ,
            <given-names>Judy</given-names>
          </string-name>
          , Martin Dick, Selby Markham, Ian Macdonald, and
          <string-name>
            <given-names>Meaghan</given-names>
            <surname>Walsh</surname>
          </string-name>
          .
          <year>2002</year>
          .
          <article-title>Cheating and plagiarism: perceptions and practices of first year IT students</article-title>
          .
          <source>SIGCSE Bull.</source>
          ,
          <volume>34</volume>
          (
          <issue>3</issue>
          ):
          <fpage>183</fpage>
          -
          <lpage>187</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>Tenenbaum</surname>
            ,
            <given-names>J. B.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>de Silva</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Langford</surname>
          </string-name>
          .
          <year>2000</year>
          .
          <article-title>A global geometric framework for nonlinear dimensionality reduction</article-title>
          .
          <source>Science</source>
          ,
          <volume>290</volume>
          (
          <issue>5500</issue>
          ):
          <fpage>2319</fpage>
          -
          <lpage>2323</lpage>
          , December.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <surname>Zhong</surname>
            ,
            <given-names>Shi</given-names>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Efficient online spherical k-means clustering</article-title>
          . In
          <source>Proceedings of the 2005 IEEE International Joint Conference on Neural Networks (IJCNN '05)</source>
          , volume
          <volume>5</volume>
          , pages
          <fpage>3180</fpage>
          -
          <lpage>3185</lpage>
          , July-4 Aug.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>