Approaches for Candidate Document Retrieval and Detailed Comparison of Plagiarism Detection Notebook for PAN at CLEF 2012 Kong Leilei1, Qi Haoliang1, Wang Shuai1, Du Cuixia2,Wang Suhong2 and Han Yong1 1Heilongjiang Institute of Technology, China 2Harbin Engineering University, China sevenkll@hotmail.com Abstract. In this paper we report on our plagiarism detection system which is used to process the PAN plagiarism corpus for the tasks of Candidate Document Retrieval and Detailed Comparison. To retrieve the plagiarism candidate document by using ChatNoir API, a method based on tf*idf to extract the keywords of suspicious documents as queries is proposed. An Lucene ranking method is used for plagiarism candidate document reduction. And a detailed comparison algorithm to get the web pages that are actually sources for plagiarized passages is applied. To extract all plagiarism passages from the suspicious document and their corresponding source passages from the source document, a plagiarism detection method combined with semantic similarity and structure similarity is proposed. Semantic similarity is calculated by Vector Space Model while structure similarity is calculated by our own method. We use information retrieval to get candidate pairs of sentences from suspicious document and potential source document. A method which is called Bilateral Alternating Sorting is applied to merge pairs of sentences. Those plagiarism candidate result pairs are screened in post-processing. Keywords: plagiarism detection, tf*idf, semantic similarity, structure similarity, Vector Space Model 1 Introduction The rapid development of network technology, including large numbers of search engines, document repositories, translation software systems, not only provides people with the various knowledge acquisition channel, but also opens the door for text plagiarism. Plagiarism generally refers to the illegitimate use of someone else’s information, text, ideas, etc. without proper reference to the original source of these borrowings [1]. Plagiarism and its automatic retrieval have attracted considerable attention from research and industry: various papers have been published on the topic, and many commercial software systems are being developed [2]. It becomes more important to determine the originality of the text. The research on copying text recognition has become an urgent need to address the problem. In recent years, many well-known organizations carried out evaluation, international competitions and conferences focused on the plagiarism detection. And PAN[3] is one of them. PAN@CLEF [3]offers a controlled evaluation environment to evaluation the algorithm or system for plagiarism detection. This year, we focused on the plagiarism detection evaluation of PAN@CLEF2012, which included two sub-tasks: Candidate Document Retrieval and Detailed Comparison. We spent six months to research the problems and submitted our results of the two sub-tasks. Our team obtained the first place for the Detailed Comparison sub-task. In this paper, we introduce a method for Candidate Document Retrieval and Detailed Comparison sub-tasks. In the Candidate Document Retrieval sub-task, a method based on tf*idf to extract the keywords of suspicious documents as queries was proposed. A scoring method was used to plagiarism candidate document ranking. In the Detailed Comparison sub-task, a plagiarism detection method based on Vector Space Model(VSM) and Overlapping Measure Model at the sentence level was presented. Bilateral Alternating Sorting was designed to merge the pairs of plagiarism sentences, and those plagiarism candidate result pairs were screened in post-processing. The rest of the paper is organized as follows. Section 2 describes the overview of related work. Then, Section 3 and Section 4 describe the method for Candidate Document Retrieval sub-task and Detailed Comparison sub-task while Section 5 includes the evaluation results. Finally, Section 6 discusses the main points of this study and proposes future work directions. 2 Overview of Related Work Most often, copy is an open copying, and plagiarism mainly refers to plagiarize other's language, charts, formulas or research ideas and then edited, pieced together, modified and added to their own papers, writings, project applications, data files, computer code material, and so on, and set him/herself up as the author. The research of formal language text plagiarism starts earlier for its strict formal syntax, clear semantics of expression, easy analysis and processing. Since Ottenstein [4] put forward attribute counting to detect the program copying, there emerged a lot of the formalization text copying recognition system. Natural language text has no formal syntax constraints and its semantics has the ambiguity, so it is more difficult to carry out plagiarism identification. The research on natural language text copy detection began in the 1990s, and has made great progress since Richard used keyword matching algorithm to develop the WordCheck[5]. The core problem of text plagiarism detection is to determine whether the plagiarism exists and how to measure the similarity degree. For text similarity problem, many researchers have put forth some effective detection methods. It mainly includes (1) Similarity calculation method based on statistics, such as [6] and [7]. It needs the support of large-scale corpus, the long training process and has some limitations. (2) Similarity calculation method based on semantic comprehension: it neither needs the support of large-scale corpus nor the long training process. It has a high precision but mostly is limited in the scope of words or sentences. Specific methods includes similarity calculation with Wordnet [8], similarity calculation with TongYiCi CiLin [9], similarity calculation with semantic sequence kernel [10], etc. Finger Printing and Word Frequency are the mainly method to recognize the plagiarism. Finger Printing is fast, simple and effective, suitable for large-scale computing. Word Frequency method first statistics the number of each word in the document to constitute the feature vector of the document, then use vector dot product, cosine law, the correlative frequency model, etc. to measure the similarity of two documents. Word Frequency statistical method has a high precision but its speed in not fast than Finger Printing technology. Although a lot of plagiarism detection systems are better to complete the simple text copy detection, detection for English has also made some achievements, there are also some questions and works which have not received much attention yet. First, plagiarism detection filed has not effective techniques to filter the plagiarism source in massive data corpus. The corpus of reference documentation is limited to the data of several G in the existing plagiarism detection system. However, over time, reference documentation set of documents to be detected grows increasingly large and source document set is not limited to be a few G magnitude data. Massive data processing and the growing amount of data increase the difficulty of plagiarism detection. Existing methods are powerless in the face of dealing with in the data of several T. Timely and effective in a limited time to find suspicious from the source document to be detected is the key to the establishment of effective plagiarism detection system. Second, the performance of existing systems which are based on matching and statistics techniques is not satisfactory for the practical application. They still have the wrong check, leakage check, especially non-straightforward copy which has a low distinction degree, and has difficult to achieve accurate identification of plagiarism. And they can not deal with the complex text plagiarism, especially for two articles which have the same meaning and converting the writing method. 3 Candidate Document Retrieval Given a suspicious document and a web search engine, the task is to retrieve a set of candidate source documents that may have served as an original to plagiarize from[3]. To retrieve the plagiarism candidate document by using ChatNoir API, we apply a method based on tf*idf to extract the keywords of suspicious documents as queries. When using ChatNoir API gets copy source of suspicious document, we use an improved Lucene[23] scoring method to reduce the plagiarism candidate document. Finally, a detailed comparison algorithm to get the web pages that are actually sources for plagiarized passages is applied. The detailed method is described as follows. 3.1 Getting Query First, each suspicious document s is preprocessed, including stemming, removing stop words and replacing figures. The queries for every s are coming from the top queryGroup*queryLength terms which are the top n sorted by tf*idf values from high to low in each paragraph of s, where queryGroup is the group number of queries and queryLength is the term number of each query group. In testing phase, the queries we used is 2*5. 3.2 Retrieving ChatNoir API is applied to retrieve the plagiarism candidate document for each query group. Then, put retrieved top n results in the result set of plagiarism candidate document and use ChatNoir API to download them, where n=10. 3.3 Getting Sources for Plagiarized Passages The plagiarism candidate document is preprocessed into non-overlapping plaintext passages. We index them and use query of 2*5 for each passage of each suspicious document to retrieve by using an Lucene scoring method for filtering the plagiarism candidate in [11]. And we take top n of retrieving results of each query as results. Last, we use an algorithm to get the web pages that are actually sources for plagiarized passages which will be described in following PAN Detailed Comparison Task. 4 Detailed Comparisons Given a pair of suspicious document and potential source document, the task is to extract all plagiarized passages from the suspicious document and their corresponding source passages from the source document [3]. Firstly, the suspicious documents and plagiarism candidate source documents are pre-processed. We apply a plagiarism detection method combined with semantic similarity and structure similarity to extract all plagiarism passages from the suspicious document and their corresponding source passages from the source document. Semantic similarity is calculated by Vector Space Model while structure similarity is calculated by an Overlapping Measure Model which will be described as follows. We use a method of information retrieval to get candidate pairs of sentences from suspicious document and potential source document and a merge algorithm which is called Bilateral Alternating Sorting is applied to merge pairs of sentences. Finally, those plagiarism candidate result pairs are screened in post-processing. This method is described in detail in the following parts. 4.1 Pre-processing In the pre-processing part, the suspicious documents and plagiarism candidate source documents will be processed in some ways, including removal of special characters and whitespace, case transformation, removal of stopwords and stemming. 4.2 Detailed Comparison Since the passage is the smallest unit that an author expresses an independent and complete view and the sentence is the basic structure of one passage, we choose sentences as the chunks. The following is the steps of processing the candidate plagiarism passages detection. Step1: Suspicious documents and source documents are divided according to the sentence. After that we index all the sentences in the source documents. Each sentence in the suspicious documents will be retrieved in the index as a query. This kind of process is called sentence similarity retrieval. We regard suspicious passage S and reference passage R in source document as pairs of plagiarism candidate sentence which their cosine distance is greater than t1 to get semantic similarity, as shown in formula 1: n w w Sk Rk Sim(S,R)  cos  k 1  t1 n n ( wSk 2 )( wRk 2 ) k 1 k 1(1) where Sim(S,R) is the similarity degree of S and R, Θ is document vector angel, WSk and WRk are the weight of S and R respectively,t1 is threshold. We used t1=0.42. Step2: We screen plagiarism candidate sentence to get structure similarity by using formula 2. 2   Min(N I S (t),N I R (t)) tI S  I R T  t2 IS  IR (2) where NIs(t) and NIR(t) are the number of the terms which are overlapping in the suspicious sentence and reference sentence, Min(NIs(t),NIR(t)) is the smallest one of NIs(t) and NIR(t) , t2 is the threshold. We used t2=0.32. Those sentence pairs which are not only in line with formula 1 but also formula 2 will be regarded as the plagiarism candidate sentence pairs. Step 3: Merge the scattered plagiarism candidate sentence pairs which are got by above process method. This process is the recovery of a complete plagiarism case. We design a Bilateral Alternating Sorting algorithm to merge the suspicious sentence and reference sentence which guarantee the suspicious sentence and reference sentence are adjacent. Because of patent pending, this method is inconvenient stated here. The passage pairs after merging are called the candidate result pairs. 4.3 Post-processing We use formula 2 to screen candidate result pairs which perhaps the non suspicious passage pairs in post-processing phase .We used t3=0.30 this time. 5 Results The results of Candidate Document Retrieval sub-task and the Detailed Comparison sub-task are summarized in Table 1 and Table 2. Table 1 Results of Candidate Document sub-task Reported Sources Downloaded Sources Retrieved Sources Team Precision Recall Precision Recall Precision Recall Gillam et al. 0.6266 0.2493 0.0182 0.5567 0.0182 0.5567 University of Surrey, UK Jayapal 0.6582 0.2775 0.0709 0.4342 0.0698 0.4342 University of Sheffield, UK Kong Leilei Heilongjiang Institute of 0.5720 0.2351 0.0178 0.3742 0.0141 0.3788 Technology, China Palkovskii et al. Zhytomyr State University, 0.4349 0.1203 0.0025 0.2133 0.0024 0.2133 Ukraine Suchomel et al. Masaryk University, Czech 0.5177 0.2087 0.0813 0.3513 0.0094 0.4519 Republic Table 2 Results of Detailed Comparison sub-task Detailed Comparison Task Runtime*[Se Rank Team PlagDet Precision Recall Granularity conds/Pair] Kong Leilei 1 Heilongjiang Institute of 0.7386159 0.8249708 0.6782238 1.0109503 5.9187108 Technology, China Kasprzak et al. 2 Masaryk University, Czech 0.6826726 0.8931670 0.5524708 1.0000000 5.3679195 Republic Grozea et al. 3 Fraunhofer Institute FIRST, 0.6787810 0.7747815 0.6351092 1.0396952 4.8279920 Germany Oberreuter 4 0.6735574 0.8673093 0.5553130 1.0073026 2.5899274 Universidad de Chile, Chile Rodrí guez Torrejón et al. 5 Universidad de Huelva, 0.6252024 0.8344227 0.5004208 1.0009596 0.1900923 Spain 6 Conclusions Our method is evaluated by PAN2012@CLEF and compared with the other plagiarism detection systems. The evaluation results of our method in the competition were excellent. With the PAN-09, PAN-10, PAN-11 and PAN-12 corpora, our method showed a great advantage and produced a high performance. Results show that our system’s overall performance, especially the recall is higher than most of the other methods for most kinds of plagiarism cases. The plagiarism detection method we proposed is flexible and scalable, the time limit is reasonable. In our case we only needed one mainstream server to run the complex plagiarism detection system. Furthermore, we will aim to determine the threshold boundary more reasonably. The synonym replacement and the translation would also be introduced into the plagiarism system. We will work on a better Candidate Document Retrieval method. Acknowledgements. This work is supported by NSF of China(60970057,60903083). Remark: This work was done in Heilongjiang Institute of Technology. Reference 1. Maurer, H., Kappe, F., Zaka, B.: Plagiarism-A Survey. Journal of Universal Computer Science 12(8), 1050–1084 (2006) 2. Martin Potthast, Benno Stein, Andreas Eiselt , Bauhaus-universität Weimar, Alberto Barrón- cedeño , Paolo Rosso. Overview of the 1st International Competition on Plagiarism Detection. SEPLN 2009 Workshop on Uncovering Plagiarism, Authorship, and Social Software Misuse (PAN 09), CEUR-WS.org 3. http://pan.webis.de/ 4. Ottenstein K J.An Algorithmic Approach to the Detection and Prevention of Plagiarism .ACM SIGCSE Bulletin.1976,8(4):30-41 5. Clough P. Plagiarism in Natural and Programming Languages: an Overview of Current Tools and Technologies.Research Memoranda:CS-00-05, Department of Computer Science, University of Shefield, 2000 6. PAN Qian Hong, WANG Ju, SHI Zhong Zhi .Text Similarity Computing Based on Attribute Theory.Chinese Journal of Computers, 1999, 22(6) : 651-655 7. Song Qin-Bao, Yang Xiang-Rong. A Detection Algorithm for the Illegal Coping and Distributing of Digital Goods.Chinese Journal of Computers, 2002, 25(11):1206-1211 8. Agire E, Rigau G. A Proposal for Word Sense Disambiguation Using Conceptual Distance. International Conference on Recent Advances in Natural Langu age Processing, Velingrad, 1995:258-264 9. Che Wanxiang, Liu Ting, Qin Bing, Li Sheng. Chinese Sentences Similarity Computation Oriented the Searching in Bilingual Sentence Pairs. The Seventh Joint Academic Conference for Conputational Lingguistics. Beijing: Tsinghua University Press.2003:520-526 10.Bao Junpeng, Shen Junyi, Liu Xiaodong, Liu Haiyan, Zhang Xiaodi. Document Copy Detection Based on Kernel Method. Proceedings of 2003 IEEE International Conference on Natural La nguage Processing an d Knowledge Engineering, Beijing, Oct.2003:250-256 11.http://lucene.apache.org/