=Paper=
{{Paper
|id=Vol-1988/LPKM2017_paper_3
|storemode=property
|title=Intrinsic Detection of Plagiarism based on Writing Style Grouping
|pdfUrl=https://ceur-ws.org/Vol-1988/LPKM2017_paper_3.pdf
|volume=Vol-1988
|authors=Maryam Elamine,Seifeddine Mechti,Lamia Hadrich Belguith
|dblpUrl=https://dblp.org/rec/conf/lpkm/ElamineMB17
}}
==Intrinsic Detection of Plagiarism based on Writing Style Grouping==
Maryam Elamine<sup>1</sup>, SeifEddine Mechti<sup>2</sup>, Lamia Hadrich Belguith<sup>3</sup>

<sup>1</sup> ANLP Group, FSEGS, University of Sfax, mary.elamine@gmail.com

<sup>2</sup> LARODEC Laboratory, ISG of Tunis, University of Tunis, mechtiseif@gmail.com

<sup>3</sup> ANLP Group, MIRACL Laboratory, FSEGS, University of Sfax, l.belguith@fsegs.rnu.tn

Abstract. In this paper, we tackle the task of intrinsic plagiarism detection, also referred to as author diarization. This task deals with identifying segments within a document written by multiple authors [2]. The main goal is to discover deviations in the writing style, looking for parts of the document that could potentially be written by another person [4]. We present a hybrid approach that constructs a style function from stylometric features and detects the outliers. The proposed approach has been evaluated on two publicly available corpora. The obtained results outperform those of the best state-of-the-art methods.

Keywords: author diarization, plagiarism, intrinsic plagiarism detection, outlier detection.

===1 Introduction===
Plagiarism detection is the task of identifying text reuse in a document or collection of documents [2]. It can be divided into two main tracks: 1) intrinsic plagiarism detection and 2) extrinsic plagiarism detection. The former is the process of verifying the unity of a document against itself using local analysis. It focuses on finding whether the document was written by a single author or whether some parts were written by others. The latter is the process of evaluating a document and verifying whether some parts have been copied from external sources [8]; the suspected document is thus compared with a collection of source documents [3].

The term “author diarization” came from the domain of speaker diarization, which is concerned with clustering and identifying various speakers from a single audio speech signal.
The frequency range of the speakers' voices is then analyzed (e.g. in a class discussion on a particular topic). Likewise, the task of author diarization deals with a written document instead of audio conversations [5]. Its objective is to identify and cluster the different authors within a single document [2]. The task of author diarization is extended and generalized by the introduction of within-document author clustering problems [9]. Author diarization consists of the following three subtasks [1]:

* Traditional intrinsic plagiarism detection: there exists one main author who wrote at least 70% of the considered document.
* Diarization with a given number of authors: the document is written by a known number of authors.
* Unrestricted diarization: the number of collaborating authors is unknown.

In this paper, we present our approach for intrinsic plagiarism detection by automatic grouping of writing styles, which we evaluate on two corpora: PAN 16<sup>1</sup> and PAN 17<sup>2</sup>. We first group authors based on their writing style, using a writing style function. Our hypothesis is that, within the generated clusters, there is a high probability that an author has plagiarized from another. Following this hypothesis, we split each document into segments of 500 characters and attribute a style function to each segment; if some segments exhibit a style that differs from the rest of the document, they are marked as plagiarized. Our approach uses a hybridization of stylistic features that include lexical features (mean sentence length, type-token ratio, punctuation marks and letter count) and syntactic features (POS tags and function words ratio). The obtained results outperform those of the best state-of-the-art methods that have exploited the same corpora (i.e. the PAN16 corpus and the PAN17 corpus).

===2 Related Work===
Plagiarism is rising as a serious problem in the academic and educational domains [3].
With the explosive growth of content found throughout the Web, people can find nearly everything they need for their written work. Thus, detection of such cases can become a monotonous task [4].

<sup>1</sup> http://pan.webis.de/clef16/pan16-web/author-identification.html

<sup>2</sup> http://pan.webis.de/clef17/pan17-web/author-identification.html

In their study, M. Kuznetsov et al. [1] used stylometric features such as character n-grams, word n-grams, punctuation marks and pronoun counts. The authors used the PAN 11 corpus for experimenting and the PAN 16 corpus for evaluating. They formulated the task of intrinsic plagiarism detection as text segment classification. They exploited a per-sentence approach [12]. This approach constructs disjoint segments of different lengths and detects plagiarism at the sentence level. Sentences are labeled following this rule: if more than half of the characters in a sentence s are plagiarized, then it is labeled as plagiarized; otherwise it is labeled as non-plagiarized. For the features, they first exploited word frequencies, which are based on analyzing occurrences of text words, i.e. the lowercased sequences of characters with the exception of stopwords. Second, they counted n-gram frequencies. Experiments showed that it is best to exploit 1-grams, 3-grams and 4-grams jointly. The resulting n-gram feature returns three statistics for each of the considered n-grams. Finally, for each sentence, the authors counted the number of occurrences of the most common punctuation marks (!,.?-;) and of the universal POS tags (VERB, NOUN, ADJ, ADV, PRON, CONJ, ADP, DET, PRT, NUM). For each sentence, they also computed its length in characters and the mean length of its words. After constructing their features, the authors move on to detect the outliers.
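The per-sentence labeling rule and the simplest of the per-sentence counts described above can be sketched in Python as follows. This is only an illustrative sketch, not the code of [1]: the function names, the representation of plagiarized passages as character offsets, and the regular expression used to find words are assumptions, and the POS tag counts are left out because they require a tagger.

```python
import re

# Punctuation marks listed in the description above.
PUNCTUATION = "!,.?-;"


def label_sentence(start, end, plagiarized_spans):
    """Return True if more than half of the characters in [start, end) are plagiarized.

    plagiarized_spans is a list of non-overlapping (start, end) character
    offsets; this input format is an assumption made for the sketch.
    """
    length = end - start
    if length <= 0:
        return False
    covered = 0
    for p_start, p_end in plagiarized_spans:
        overlap = min(end, p_end) - max(start, p_start)
        if overlap > 0:
            covered += overlap
    return covered > length / 2


def sentence_features(sentence):
    """Count punctuation marks, sentence length in characters and mean word length."""
    words = re.findall(r"[A-Za-z']+", sentence)
    features = {mark: sentence.count(mark) for mark in PUNCTUATION}
    features["length"] = len(sentence)
    features["mean_word_length"] = (
        sum(len(w) for w in words) / len(words) if words else 0.0
    )
    return features
```

For example, a sentence spanning characters 0 to 10 with a plagiarized passage covering characters 0 to 6 is labeled as plagiarized, whereas a passage covering exactly half of the sentence is not, because the rule requires strictly more than half.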
For the author diarization task, they adapted the intrinsic plagiarism approach. The algorithm functions the same way as previously described, but instead of the outlier detection phase, it segments the series using a Hidden Markov Model (HMM) with Gaussian emissions [13]. For the task of author diarization with an unknown number of authors, the authors estimated the number by computing an averaged t-statistic over all pairs of author segments. They iterated over the candidate numbers n (n ∈ [2..20]) and computed the time series segmentation for each n. For each segmentation, the measure of cluster discrepancy is computed [1]:

<math>Q_{i,j} = \frac{|m(c_i) - m(c_j)|}{\sqrt{\frac{\sigma(c_i)^2}{l(c_i)} + \frac{\sigma(c_j)^2}{l(c_j)}}} \qquad (1)</math>

where m(c_i) is the mean of the elements in cluster c_i, σ(c_i) is the mean deviation, and l(c_i) is the cluster size. The final estimate of n is the value that maximizes Q(n), the discrepancy averaged over all pairs of clusters. After obtaining the estimate, the algorithm performs a diarization with a known number of authors n. The model obtained an F-score of 0.2 for intrinsic plagiarism detection, a BCubed F-score<sup>3</sup> of 0.54 for author diarization with a known number of authors and a BCubed F-score of 0.5 for unrestricted diarization.

A. Sittar et al. [2] conducted their experiments on the same corpus (i.e. PAN 16) as [1]. They exploited stylistic features, which include lexical attributes, to uniquely identify an author's writing style in a given document. The authors segmented each text document into sentences, and for each sentence, they exploited a total of 15 lexical features (character n-grams, digit count, space count, word count, etc.). Their approach consists of 6 steps. The first is the “Read Raw Input Text” step, in which the authors read all the documents as they are. The second is the “Break Down Text into Sentences” step, in which the authors segmented the documents into sentences.
The third is the “Lexical Features Computation” step, in which they computed the ratio of each feature in each sentence. The fourth is the “Distance Calculation” step, in which the authors computed the distance for each sentence. The fifth is the “ClustDist<sup>4</sup> Computation” step [15], which is also carried out for each sentence. The final step is “Generating Clusters”, in which, on the basis of the scores obtained in the fifth step, the authors applied the K-Means algorithm to cluster their data.

In their experiments, they created a matrix V of order n x p, in which each row is the feature vector of one sentence. In the training phase, their approach performed well with sentences of length 7 for intrinsic plagiarism detection, whereas sentences of length 5 demonstrated better results for the author diarization with a known number of authors subtask and for the unrestricted diarization subtask. Following the results obtained in the training phase, the authors used only the sentence lengths that demonstrated the highest results for each subtask: 7 for the first subtask and 5 for the second and third subtasks.

<sup>3</sup> The BCubed F-measure is a measure defined for non-overlapping clustering. It is like the regular F-score, but the BCubed algorithm calculates the precision and recall numbers for each entity in the document.

<sup>4</sup> ClustDist is a straightforward technique to compute the average distance from one portion of text (i.e. a sentence) to all other pieces of text.

As mentioned in [10], these are the results obtained by Kuznetsov et al. [1] and Sittar et al. [2] in the PAN 16 competition for intrinsic plagiarism detection:

Table 1. Intrinsic Plagiarism Detection Results [10]

{| class="wikitable"
! rowspan="2" | Rank !! rowspan="2" | Team !! colspan="3" | Micro !! colspan="3" | Macro
|-
! Recall !! Precision !! F !! Recall !! Precision !! F
|-
| 1 || Kuznetsov et al. || 0.19 || 0.29 || 0.22 || 0.15 || 0.28 || 0.17
|-
| 2 || Sittar et al. || 0.07 || 0.14 || 0.08 || 0.10 || 0.14 || 0.10
|}

N. Akiva [7] treated the problem of intrinsic plagiarism detection, which was the focus of the PAN 2011 competition.
The author's approach consisted of two phases: chunk clustering and cluster property detection. In the first phase, the author divided a given document into chunks of 1000 characters. Then, he identified the 100 rarest words that appear in at least 5% of the chunks. Afterwards, the author represented each chunk as a numerical vector of length 100, whose entries correspond to the presence or absence of each rare word in the chunk. The similarity between pairs of chunks is then measured using the cosine metric. For the clustering, the author used a spectral clustering method called n-cut [14], and the document was clustered into two parts only (true text and plagiarized text). In the second phase, whose purpose is to identify the clusters that comprise plagiarized parts, the author ran the clustering algorithm on the training corpus and measured a variety of properties, including the relative and absolute size of each cluster and the similarity of each chunk to its own cluster, to the other clusters and to the whole document. Afterwards, the author represented each chunk in the training set as a numerical vector. Then, he used a supervised learning algorithm to learn decision trees<sup>5</sup> that distinguish plagiarized segments from non-plagiarized segments. The author used ten-fold cross-validation to optimize parameter settings and to estimate accuracy. For efficiency reasons, the author did not exploit the full training set: he ignored all documents with a percentage of plagiarism greater than 40%, and then randomly selected chunks from the remaining documents. On the PAN 11 evaluation set, the author achieved a precision of 12.7% and a recall of 6.6%.

S. Rao et al. [6] conducted their experiments on the same corpus (i.e. PAN 11) as [7].
They focused on features that model the author's style (character n-grams, word frequencies, mean sentence length, stem suffix frequency, closed-class word frequency and frequency of discourse markers<sup>6</sup>). The authors combined their features to obtain better results in identifying the author's style. Their approach calculates the distance between two normalized feature vectors: the first is computed from the whole document, whereas the second represents partially overlapping sections of the document, obtained with a window of 2000 characters and a step size of 200. The distance between these normalized stylometric feature vectors is measured by a style change function. All the sections for which the style change value is greater than 2.0 are marked as plagiarized. Consecutive plagiarized sections that are 500 characters apart are merged to form a single plagiarized case in order to maintain a proper granularity value. The corpus used has a total of 4753 documents for the intrinsic setting. The obtained results were mediocre, due to the low recall values and the large number of false positive detections. Nevertheless, the features based on discourse markers, along with the other features exploited by the authors, were successful in detecting intrinsic plagiarism.

<sup>5</sup> A decision tree consists of nodes and branches to partition a set of samples into a set of decisions. The starting node is also known as the root of the tree. In each node, a single test or decision is made to obtain a partition. In the terminal nodes or leaves, a decision is made.

G. Oberreuter and J. D. Velásquez [4] also treated the problem of plagiarism detection by detecting deviations in the writing style. They first preprocessed the document by removing all characters except those that belong to the a-z group, with all characters considered in lowercase.
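The preprocessing step just described can be illustrated with a minimal Python sketch. It is not the code of [4]: the function name is hypothetical, and replacing the removed characters by single spaces (so that word boundaries survive) is an assumption, since the description does not say how whitespace is handled.

```python
import re


def preprocess(text):
    """Lowercase the text and keep only the letters a-z.

    Assumption: every run of other characters (digits, punctuation,
    whitespace, accented letters) is replaced by one space so that the
    remaining words stay separated.
    """
    lowered = text.lower()
    letters_only = re.sub(r"[^a-z]+", " ", lowered)
    return letters_only.strip()


if __name__ == "__main__":
    print(preprocess("Hello, World! 42"))
```

Note that with this strict a-z filter, accented letters are dropped as well, so a word such as “Velásquez” is split in two.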
Then, they explored word unigrams, considering all the words including the stopwords. Afterwards, they applied a word-frequency-based algorithm to test the self-similarity of a given document. Next, they built a frequency vector (which is not normalized) for all the words in the document and then clustered the document into groups. The segments are created by moving a sliding window of length m over the whole document. For each segment, a new frequency vector is computed; this vector is used in further steps to determine whether the segment deviates from the complete document. All segments are classified according to their distance from the document's style. The authors evaluated their approach on the publicly available PAN corpora, using the standard information retrieval metrics (precision, recall and f-score). The obtained results show the unreliable nature of their approach because the precision is very low (0.3). Their experiments were conducted on documents written in English; however, their approach is not language-dependent.

<sup>6</sup> Discourse markers are words that do not change the meaning of the text. They are either used as filler elements in the text or out of the author's habit. They are used frequently and most likely twice every 2 or 3 sentences. Examples of discourse markers are “well”, “actually”, “then”, etc.

===3 Our Proposed Approach===
In this study, we address the intrinsic plagiarism detection problem. In order to identify plagiarism in textual documents, we focused on the stylometric features that best describe the writing style, and we introduced a hybrid aspect (hybridization of lexical and syntactic features). Our proposed approach, as shown in Figure 1, comprises five steps. First, we have the clustering step, in which we grouped documents by writing style.
Second, we split each document from the obtained clusters into segments of 500 characters so that our features could easily be applied in the following phase. Next, after creating a vector of features, we constructed a style function by which we determine the style of each segment. Finally, we have the outlier detection phase: each segment whose style function deviates from that of the other segments in the same document is detected as an outlier.

Fig. 1. Main Steps of our Proposed Approach

====3.1 Clustering====
In this step, we grouped together documents that are, with high probability, written by the same author, based on their writing style. We explored a variety of features in this phase, such as POS tags, mean sentence length, function words, type-token ratio, punctuation marks, etc. We also used various classification algorithms such as KNN<sup>7</sup>, SVM<sup>8</sup> and decision trees using Weka<sup>9</sup>.

<sup>7</sup> In k-nearest neighbor (KNN), the nearest neighbor is calculated on the basis of the value of k, which specifies how many nearest neighbors are to be considered to define the class of a sample data point.

<sup>8</sup> SVM is a learning machine for two-group classification problems. It defines a linear decision as an optimal hyperplane with maximal margin between the vectors of two classes.

<sup>9</sup> http://www.cs.waikato.ac.nz/ml/weka/

====3.2 Tokenize====
In this step, we parsed the documents and split them into segments of 500 characters each. Since our approach is inspired by works proposed in the PAN@CLEF competition, we followed the format required by that competition.

====3.3 Composing Features====
In this step, we vectorized the text sentences and constructed the feature description. We first used our features separately, then explored their hybridization. From the obtained results, we constructed a vector of features.

====3.4 Constructing a Style Function====
In this phase, after creating our vector, we constructed the style function.
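As an illustration of the segmentation and feature composition steps (3.2 and 3.3) on which the style function is built, the following Python sketch splits a text into 500-character segments and computes the four lexical features named in the introduction. It is a sketch under stated assumptions rather than the implementation used in our experiments: the function names are hypothetical, segments are assumed to be consecutive and non-overlapping, sentence length is assumed to be measured in words, and the syntactic features (POS tags, function words ratio) are omitted because they require a tagger.

```python
import re
import string

SEGMENT_SIZE = 500  # characters per segment, as stated in the text


def split_into_segments(text, size=SEGMENT_SIZE):
    """Split text into consecutive, non-overlapping segments of `size` characters.

    The last segment may be shorter (an assumption; the text does not say
    how a trailing remainder is handled).
    """
    return [text[i:i + size] for i in range(0, len(text), size)]


def lexical_features(segment):
    """Return [mean sentence length, type-token ratio, punctuation count, letter count]."""
    # Sentences are approximated by splitting on ., ! and ? (assumption).
    sentences = [s for s in re.split(r"[.!?]+", segment) if s.strip()]
    words = re.findall(r"[a-z']+", segment.lower())
    mean_sentence_length = len(words) / len(sentences) if sentences else 0.0
    type_token_ratio = len(set(words)) / len(words) if words else 0.0
    punctuation_count = sum(1 for ch in segment if ch in string.punctuation)
    letter_count = sum(1 for ch in segment if ch.isalpha())
    return [mean_sentence_length, type_token_ratio, punctuation_count, letter_count]


if __name__ == "__main__":
    document = "The cat sat. The dog ran! " * 50
    vectors = [lexical_features(s) for s in split_into_segments(document)]
    print(len(vectors), vectors[0])
```

Each segment thus yields one feature vector, and it is on such vectors that a classifier can be trained to produce the style function.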
An author style function is generated as the output of a classifier trained on basic features [1]. In this step, we attribute a style function to each segment generated in the second step.

====3.5 Detecting Outliers====
In the final phase, we detect outliers. In each document, we inspected each segment created in phase 2; each segment whose style appears to be different from the other styles within the same document is marked as plagiarized. We used the KNN algorithm in this step: given a document, the KNN algorithm segments the document into fragments based on the writing style.

===4 Experiments===
We tested our approach on the PAN 16 and the PAN 17 corpora for the tasks of author diarization and style breach detection. We conducted several individual experiments, testing our features separately and combined.

====4.1 Dataset Description====
The original problem of intrinsic plagiarism detection is related to the question of whether an author has misused parts of a text from others without proper references, and if so, which parts are plagiarized. Thus, in a given document, the writing style needs to be analyzed to identify the authors [10].

The PAN 16 Corpus: The task at PAN 16 focuses on identifying authorships within a single document. The task is therefore not only focused on searching for plagiarism, but also on identifying the contributions of different authors in a given document. The former is the case where it can be assumed that the main text is written by one author and only some fragments are by other writers. The latter is the case where multiple authors exist in a single document. Such documents may be the result of collaborative work (e.g. a combined master's thesis written by two students, or scientific papers written by a known number of cooperating researchers), and identifying the individual contributions is known as “author diarization”.
Author diarization consists of three subtasks: intrinsic plagiarism detection, diarization with a known number of authors and unrestricted diarization. For all three subtasks, distinct training and test datasets have been provided, which are based on the Webis-TRC-12 dataset [11]. That dataset covers 150 topics from the TREC Web Tracks from 2009-2011, for which professional writers were hired to compose a single document on a given topic. From the written documents, the datasets for the three subtasks have been generated by varying several configurations, such as the number of authors in a text and their respective contributions, whether the authors are uniformly distributed, and whether switches are permitted within a sentence, at the end of a sentence, or only between paragraphs. Since the training set has been partly published, the test documents are created only from unpublished documents. Overall, the numbers of training/test documents for the respective subtasks are 71/29 for traditional intrinsic plagiarism detection, 55/31 for diarization with a given number of authors, and 54/29 for unrestricted diarization [10].

The PAN 17 Corpus: The PAN 17 corpus provides documents written in English. The PAN 17 task focuses on detecting style breaches within documents, i.e. locating the borders where authorship changes. It therefore deals with the task of text segmentation; however, it does not focus on detecting switches in the topic. Thus, given a document, the task is to identify whether the document is multi-authored, and if so, to determine the borders where authors switch. The documents provided in this corpus may contain from zero up to arbitrarily many style breaches. Switches of authorship may only occur at the end of sentences and not within them<sup>10</sup>.

====4.2 Results====
Our approach achieved good results with both corpora; Table 2 compares the results of our approach obtained with the PAN 16 and PAN 17 corpora.
As evaluation measures, we used precision, recall and f-measure. We ran several benchmarking tests using the PAN@CLEF 2016 and PAN@CLEF 2017 corpora. Figure 2 illustrates the performance of our individual features compared to that of their hybridization. The hybrid approach clearly gives better results than using the features separately.

Table 2. Dataset Results

{| class="wikitable"
! Corpus !! Precision !! Recall !! F-score
|-
| PAN16 || 0.748 || 0.635 || 0.686
|-
| PAN17 || 0.701 || 0.6 || 0.646
|}

<sup>10</sup> For more information visit the site of PAN17: http://pan.webis.de/clef17/pan17-web/author-identification.html

[Figure 2: precision (y-axis, 0 to 1) per feature (x-axis) for the PAN16 and PAN17 corpora]

Fig. 2. Performance of Features

Figure 2 shows the performance of the features exploited in our experiments based on their precision. Our features performed better with the PAN 16 corpus than with the PAN 17 corpus, and the hybridization performed better than the features used separately.

===5 Conclusion===
In this paper, we proposed our approach for the task of intrinsic plagiarism detection. We explored a hybrid approach to optimize the performance: we combined various features (lexical and syntactic attributes) for the construction of a style function. The experiments were performed on two corpora comprising documents in English. We explored different aspects in our work, such as the use of the KNN classifier to detect the outliers in a given document and feature hybridization. We obtained good results that outperform those of the best state-of-the-art methods; the method achieved an f-score of 0.686 for the PAN 16 corpus and an f-score of 0.646 for the PAN 17 corpus.

As future work, we intend to experiment with other features such as n-grams. Moreover, our approach only considered texts written in English; we would therefore like to improve it so that it becomes language independent.

===References===
1. Kuznetsov, M., Motrenko, A., Kuznetsova, R. and Strijov, V.: Methods for Intrinsic Plagiarism Detection and Author Diarization (2016)

2. Sittar, A., Iqbal, H. R. and Nawab, R. M. A.: Author Diarization using Cluster-Distance Approach (2016)

3. K, V. and Gupta, D.: Detection of Idea Plagiarism using Syntax-Semantic Concept Extractions with Genetic Algorithm (2016)

4. Oberreuter, G. and Velásquez, J. D.: Text Mining Applied to Plagiarism Detection: The use of Words for Detecting Deviations in the Writing Style (2013)

5. Miro, X. A., Bozonnet, S., Evans, N., Fredouille, C., Friedland, G. and Vinyals, O.: Speaker diarization: A review of recent research (2012)

6. Rao, S., Gupta, P., Singhal, K. and Majumder, P.: External & Intrinsic Plagiarism Detection: VSM & Discourse Markers based Approach, Notebook for PAN at CLEF 2011 (2011)

7. Akiva, N.: Using Clustering to Identify Outlier Chunks of Text, Notebook for PAN at CLEF 2011 (2011)

8. Magooda, A., Mahgoub, A. Y., Rashwan, M., Fayek, M. B. and Raafat, H.: RDI System for Extrinsic Plagiarism Detection (RDI_RED), Working Notes for PAN-AraPlagDet at FIRE 2015 (2015)

9. Stamatatos, E., Tschuggnall, M., Verhoeven, B., Daelemans, W., Specht, G., Stein, B. and Potthast, M.: Clustering by authorship within and across documents (2016)

10. Rosso, P., Rangel, F., Potthast, M., Stamatatos, E., Tschuggnall, M. and Stein, B.: Overview of PAN’16: New Challenges for Authorship Analysis: Cross-genre Profiling, Clustering, Diarization, and Obfuscation (2016)

11. Potthast, M., Hagen, M., Völske, M. and Stein, B.: Crowdsourcing Interaction Logs to Understand Text Reuse from the Web. In: Proceedings of ACL 13. ACL (2013)

12. Zechner, M., Muhr, M., Kern, R. and Granitzer, M.: External and Intrinsic Plagiarism Detection Using Vector Space Models (2009)

13. Keogh, E., Chu, S., Hart, D. and Pazzani, M.: Segmenting Time Series: A Survey and Novel Approach (2004)

14. Dhillon, I. S., Guan, Y. and Kulis, B.: Kernel k-means, Spectral Clustering and Normalized Cuts (2004)

15. Guthrie, D.: Unsupervised Detection of Anomalous Text; Ph.D. thesis (2008)