=Paper=
{{Paper
|id=Vol-1176/CLEF2010wn-PAN-AlzahraniEt2010
|storemode=property
|title=Fuzzy Semantic-Based String Similarity for Extrinsic Plagiarism Detection - Lab Report for PAN at CLEF 2010
|pdfUrl=https://ceur-ws.org/Vol-1176/CLEF2010wn-PAN-AlzahraniEt2010.pdf
|volume=Vol-1176
}}
==Fuzzy Semantic-Based String Similarity for Extrinsic Plagiarism Detection - Lab Report for PAN at CLEF 2010==
Fuzzy Semantic-Based String Similarity for Extrinsic Plagiarism Detection Lab Report for PAN at CLEF 2010 Salha Alzahrani1, Naomie Salim2 1 Faculty of computer Science and Information Systems, Taif Univeristy, Taif, Saudi Arabia 2 FSKSM, Universiti Teknologi Malaysia, Johor Bahru, Malaysia s.zahrani@tu.edu.sa, naomie@utm.my Abstract. This report explains our plagiarism detection method using fuzzy semantic-based string similarity approach. The algorithm was developed through four main stages. First is pre-processing which includes tokenisation, stemming and stop words removing. Second is retrieving a list of candidate documents for each suspicious document using shingling and Jaccard coefficient. Suspicious documents are then compared sentence-wise with the associated candidate documents. This stage entails the computation of fuzzy degree of similarity that ranges between two edges: 0 for completely different sentences and 1 for exactly identical sentences. Two sentences are marked as similar (i.e. plagiarised) if they gain a fuzzy similarity score above a certain threshold. The last step is post-processing whereby consecutive sentences are joined to form single paragraphs/sections. Our performance measures on PAN’09 training corpus for external plagiarism detection task (recall=0.3097, precision=0.5424, granularity=7.8867) indicates that about 54% of our detections are correct while we detect only 30% of the plagiarism cases. The performance measures on PAN’10 test collection is less (recall= 0.1259, precision= 0.5761, granularity= 3.5828), due to the fact that our algorithm handles external plagiarism detection but neither intrinsic nor cross-lingual. Although our fuzzy semantic-based method can detect some means of obfuscation, it might not work at all levels. Our future work is to improve it for more detection efficiency and less time complexity. In particular, we need to advance the post-processing stage to gain more ideal granularity. 1 Introduction Plagiarism could be more fuzzy than clear, more complex than trivial copy and paste. Methods for plagiarism detection mostly track verbatim plagiarism; however, detecting excessive paraphrasing is a difficult task. Many current techniques rely on exactly matched substrings or some kinds of textual fingerprinting. But that may not be sufficient as cases of rephrasing and rewording the content treated as different (i.e. not plagiarised). Therefore, this work considers the problem of finding the suspected fragments that have the same semantics with the same/different syntax. In this regard, matching fragments of text becomes approximate or vague and can be implemented as a spectrum of values between 1 (i.e. exactly matched) and 0 (entirely different). The scale can be defined in a way similar to a human’s judgement as in Figure 1. Our work is similar to the work by Yerra and Ng (2005). In their paper, a copy detection approach for web documents was developed using fuzzy information retrieval (IR) model. The fundamental concept in fuzzy IR shows that words in a document have certain degree with a fuzzy set that contains words with related meaning and two documents are considered similar although their semantic content may be different if they gain high similarity degree with the fuzzy set. Thus, fuzzy IR has proved to work well for partially related semantic content in web retrieval. Subsequent to the previous work, Koberstein and Ng (2006) developed a reliable tool using fuzzy IR for determining the degree of similarity between two web documents and clustering the collection based on words. In addition, Alzahrani and Salim (2009) adapted the fuzzy IR model for use with Arabic scripts. By using Arabic plagiarism corpus of 4477 source statements and 303 query/suspicious statements, the similarity score of two documents is the averaged similarity among statements treated as plagiarised even if they are restructured or reworded. Experimental results showed that fuzzy IR can find to what extent two Arabic statements are similar or dissimilar. On the other hand, semantic similarity between short passages can be obtained by using the information extracted from a structured lexical database and corpus statistics (Li et al., 2006). The similarity of two sentences is derived from two similarities: semantic and order. The semantic vectors for two pairs of sentences are obtained by using unique terms in both sentences along with their synonyms from WordNet besides term weighting in the corpus. The order similarity defines that different words order may convey different meaning and should be count into total string similarity. Our work combines the fuzzy similarity model (Yerra and Ng, 2005) and semantic similarity model derived from a lexical database (Li et al., 2006). Instead of constructing a fuzzy thesaurus derived from word-to-word correlation factors as in Yerra and Ng’s (2005) model, we choose to work with synonyms extracted from WordNet lexical database as in Li et al. (2006). threshold Figure1. Fuzzy Similarity Indication with Vague Boundaries 2 Problem Statement In this report, we have considered the problem stated as follows. Problem: Given a suspicious document dataset Dq and a large source collection D, find all suspicious parts sq from dq:dqϵDq that are similar to parts sx from dx:dxϵDx based on fuzzy semantic-based similarity approach as will be described in section 3. Requirements: First, extract a set of features for each dqϵDq and dϵD. Second, find a list of most promising documents Dx where DxϲD based on shingling and Jaccard similarity coefficient known in IR. Third, perform sentence-wise in-depth analysis using fuzzy semantic-based approach. Last, perform post-processing operations to merge subsequent similar statements into passages or paragraphs. Limitations: Neither intrinsic (i.e. variations in writing styles) nor cross-language plagiarism detection is handled by this algorithm. That is, the languages of both suspicious and candidate documents are considered homogeneous. 3 External Plagiarism Detection 3.1 Operational Framework We implement an algorithm that tackles monolingual external plagiarism detection. In particular, it is designed to detect different degrees of obfuscation by using a fuzzy semantic-based approach. The operational framework is shown in Figure 2. The process starts with a set of source collection D and suspicious/query documents Dq. Then new representatives are generated d′ and d′q for each cleansed and tokenised dq and d documents respectively. d′ and d′q are then used for shingling and computing the similarity between the shingles. The list of most similar documents for each dqϵDq is called candidate set Dx whereby DxϲD. After generating Dx, more sentence-wise detailed analysis is performed to obtain similar sentences (sq, sx) where sq∈dq, sx∈dx. The similarity score is gained by implementing a fuzzy semantic-based similarity measure between words in both sentences as will be seen shortly. Finally, the system performs post-processing in order to merge similar sentences into passages (pq, px) such that pq∈dq, px∈dx. Subsequent sections detail each stage. Suspicious Source Documents Documents dq d Phase I Pre- d′,d′q Creating d′′,d′′q Computing List of Generating processin Shingles Jaccard Jaccards Candidates For each dq Ǝ Dx : dx ϵ D Phase II Sentence-wise (pq, px): Post- (sq, sx): Results WordNet Fuzzy Semantic- pq∈dq, processing sq∈dq, Report Lexical DB Based Detailed px∈dx sx∈dx Analysis Figure2. Operational Framework of Our External Plagiarism Detection Algorithm 3.2 Phase I: Retrieval of Similar Documents Near duplicate detection methods can be used to bring similar sources and discard dissimilar ones. We use shingling and Jaccard coefficient approach (Manning, Raghavan, & Schütze, 2008). The k-shingle (or word-k-gram) referred to a sequence of consecutive words of size k. The value for k is typically 3 or 4. Intuitively, two documents A and B are similar if they share enough k-shingles. By performing union and intersection operations between the k-shingles, we can find the Jaccard similarity coefficient between A and B as stated in equation (1). J(A,B)=|shingles of A ∩ shingles of B| / |shingles of A ∪ shingles of B| (1) Therefore for each suspicious document dq, documents of Jaccard coefficient above a threshold value are taken to form the set of candidate documents Dx. We set the threshold of Jaccard ≥ 0.1 because we found that this value derives about 1 to 30 candidate documents for each dq. It was found that when we compare documents of Jaccard similarity less than 0.1, either none or about 1-2 plagiarised statements are detected. Also by using this method, we find that some suspicious documents do not have any candidates. That means that they might contain intrinsic plagiarism or do not contain plagiarism at all. More interesting, using this approach assumes the number of candidates for each suspicious document dynamic and may be small. That saves the computation time in contrast to having a fixed number of candidates for each suspicious document. 3.3 Phase II: In-Depth Detailed Analysis of (dq , dx) pairs At this stage, a sentence-wise detailed analysis between each suspicious document dq and its candidate document dx∈Dx is performed. At first, dq and dx are segmented into sentences Sq and Sx respectively using end-of-sentence delimiters. To obtain the degree of similarity between two sentences (sq, sx), a term-to-sentence correlation factor for each term wq in sq and the sentence sx is computed as: µq,x= 1−∏ wk ∈Sx(1 – Fq,k) (2) where wk are words in sx and Fq,k is a fuzzy similarity between wq and wk that we defined as follows: 1 if wk and wq are identical Fq,k = 0 .5 if wk is in the synset of wq 0 otherwise (3) The synset of wq is extracted by querying the WordNet lexical database (Miller, 1995). For example, the sentences S1=“this car consumes a lot of oil” and S2=“this car consumes a lot of petrol” are almost identical except the word oil replaced with petrol. Since the word petrol is found in the synonym set (synset) of oil, both sentences convey the same meaning and the degree of similarity between (sq, sx) is expressed as a fuzzy number between 0 and 1 using the equation Sim(sq, sx) = ( µ1,x + µ2,x+ . . . + µq,x+. . . + µn,x ) / n (4) where n is the total number of words in sq. Thus, we can calculate the degree of similarity between the S1 and S2 as shown in Figure 3 (a) taking into account that stop words were removed and non-stop words were stemmed. Another example is the sentences S1=“the teacher gives each student a text that he authored” and S2=“a textbook authored by the instructor is given to his pupils” where the later was paraphrased from the first. Since the following word pairs (teach, instruct), (student, pupil), (text, textbook) are synonyms, and the rest of words are identical, these sentences gain high similarity score of 0.7 as shown in Figure 3 (b). This indicates that sentences are semantically alike. It is noticeable that stemming the words can handle some means of obfuscation. Both of the previous examples have sentences with equal number of words. An example of a sentence pair with different lengths and semantics is S1= “this car consumes a lot of oil” and S2=“the engine of this car is of poor quality and consumes a lot of petrol”. In this case, Sim(S1,S2)≠ Sim(S2,S1). Thus to judge two sentences as equal (i.e. plagiarised), the minimum similarity score should be above a threshold value (α >0.65) as in (5). According to this, the last pair shown in Figure 3 (c) is considered dissimilar because the minimum similarity is less than α. EQ(sq, sx) = 1 if MIN(Sim(sq, sx), Sim(sq, sx)) ≥ α 0 otherwise (5) Finally, the output of this algorithm is a list of sentence pairs (sq, sx): sq∈dq, sx∈dx, dx∈D marked as similar/plagiarised. Because of using sentences as comparison scheme, post-processing is required to merge subsequent sentences marked as plagiarised into passages. Also, we consider small distances under the predicate less than or equal to 100 characters to merge subsequent plagiarised sentences into passages pairs (pq, px) : pq∈dq, px∈dx, dx∈Dx. car consume oil teach give student text author car consume petrol textbook author instruct give pupil Sim(S1,S2) = (1+ 1+ 0.5)/3 = 0.83 Sim(S1,S2) = (0.5+ 1+ 0.5+ 0.5+ 1)/5=0.7 (a) the use of synonyms (b) the use of synonyms and different structure car consume oil (c) the use of different words and semantics ... engine car poor quality consume petrol Sim(S1,S2) = (1+1+0.5)/3= 0.83 Sim(S2,S1) = (0+1+0+0+1+ 0.5)/6= 0.42 Figure 3. Examples of different sentence pairs. 4 Experimental Setup 4.1 Instrumentation Our algorithm has been built using C#.NET 2008. By using different libraries such as Linq, we perform sets operation to compute Jaccard similarity. We used a server with 4-core processors, 2.8 GHz. To utilise all cores, we have migrated our code to work on Visual Studio.NET 2010 which has introduced the concept of parallel computing1. 4.2 Code Configuration Below is a list of some parameters and settings that we have configured in our code. For pre-processing, stop words removal and porter stemmer (Porter, 1980) algorithms were used. For generating k-shingles, the k was set to 3 (i.e. word-3-grams). For computing Jaccard similarity and finding candidates, a threshold value of Jaccard=0.1 was set to filter out non-candidate documents. For semantic-based analysis, WorldNet v3.0 using MySQL2 was used to query the Synset table and extract synonyms of the words. For fuzzy similarity between two sentences, the equations (2)-(5) was employed, and the threshold in (5) was set to α= 0.65 which was found to be the most suitable based on our experimental trials. 5 Evaluation and Discussion on PAN’09 and PAN’10 In the PAN’09 extrinsic part, the recall was 0.3097 and the precision was 0.5424. In PAN’10, we submitted partial results of about 56.25% of the suspicious documents at first. Our full results on PAN’10 are presented in this paper (recall= 0.1259, precision= 0.5761, granularity= 3.5828). Our results are of both PAN’09 and PAN’10 are comparable as shown in Table 1. The results in PAN’10 showed that we detected about 12% of the plagiarism cases and about 57% of the detections were correct. The low recall might be for the reasons: (i) the algorithm was designed for extrinsic plagiarism task and did not tackle intrinsic nor cross-lingual plagiarism, (ii) we used stems instead of lemmas in pre-processing; however, WordNet needs lemmas which needs to be corrected in the future model, and (iii) the candidates compared were not enough to find more plagiarism cases. The precision of the algorithm shows that 57% of the detections were correct. Two words may be synonyms but with different senses and hence different meaning that make sentences not plagiarised which may lead to more false positives by our algorithm. Moreover, statements of short lengths might get a fuzzy similarity score of more than 0.65 easily; another reason for false positives. The ability of detecting each plagiarism case at once was bigger than 1 1 http://channel9.msdn.com/learn/courses/VS2010/Parallel/ 2 http://wordnet.princeton.edu/wordnet/related-projects/#SQL because the algorithm enabled the merging process of sentences if and only if they are subsequent or with few characters in between. Table 1. Results of our fuzzy semantic-based approach in PAN’09 and PAN’10. Dataset Recall Precision Granularity Plag. Score PAN’09 PAN’09 Extrinsic Part 0.3097 0.5424 7.8867 0.1251 PAN’10 PAN’10 Extrinsic Part 0.1548 0.5758 3.5919 0.1109 PAN’10 All (partial) 0.0464 0.3460 17.3057 0.0195 PAN’10 All (complete) 0.1259 0.5761 3.5828 0.0941 Future Work Our future work is to improve it for more detection efficiency and less time complexity. We will consider the following work: (i) using word-k-grams instead of sentences, (ii) using a Lemmatiser instead of the stemmer to get better results from WordNet, and (iii) modifying the post-processing stage to gain more ideal granularity. Acknowledgements We thank the organizers of PAN’10 evaluation lab, Martin Potthast in particular, for his generous support and instantaneous answer through the PAN’s mailing list. References Alzahrani, S. M., & Salim, N. (2009). On the Use of Fuzzy Information Retrieval for Gauging Similarity of Arabic Documents. Paper presented at the Second International Conference on the Applications of Digital Information and Web Technologies (ICADIWT 2009), London Metropolitan University, UK. Koberstein, J., & Ng, Y.-K. (2006). Using Word Clusters to Detect Similar Web Documents. In Knowledge Science, Engineering and Management (pp. 215- 228). Li, Y., McLean, D., Bandar, Z. A., O'Shea, J. D., & Crockett, K. (2006). Sentence similarity based on semantic nets and corpus statistics. IEEE Transactions on Knowledge and Data Engineering, 18(8), 1138-1150. Manning, C. D., Raghavan, P., & Schütze, H. (2008). Web search basics: Near- duplicates and shingling. In Introduction to Information Retrieval (pp. 437- 442): Cambridge University Press. Miller, G. A. (1995). WordNet: A Lexical Database for English. Communications of the ACM, 38(11), 39-41. Porter, M. F. (1980). An algorithm for suffix stripping. Program, 14(3), 130−137. Yerra, R., & Ng, Y.-K. (2005). A Sentence-Based Copy Detection Approach for Web Documents. In Fuzzy Systems and Knowledge Discovery (pp. 557-570).