Using Microsoft SQL Server platform for plagiarism detection

Vladislav Shcherbinin
American University of Nigeria
Lamido Zubairu way, Yola township by-pass, PMB 2250, Yola, Nigeria
vladislav.scherbinin@gmail.com

Sergey Butakov
SolBridge International School of Business
151-13 Samsung 1-Dong, Dong-gu, Daejeon, 300-814, South Korea
butakov@solbridge.ac.kr

Stein, Rosso, Stamatatos, Koppel, Agirre (Eds.): PAN'09, pp. 36-37, 2009.

Abstract: The paper presents an approach to plagiarism detection in a large corpus of documents using the Microsoft SQL Server platform. The approach was used for participation in the first international plagiarism detection competition, held as part of the PAN'09 workshop. The main advantages of the proposed approach are its high precision, good performance, and readiness for deployment into a production environment with a relatively low cost of the required third-party software. The approach uses a fingerprinting-based algorithm to compare documents and Levenshtein's metric to mark up plagiarized fragments in the texts.

Keywords: external plagiarism detection, Winnowing, document fingerprinting

1 Introduction

Digital plagiarism has remained a burning issue in both academia and industry over the last two decades. Methods and tools for uncovering plagiarism have evolved considerably, from the pioneering work on plagiarism in source code in the 1980s to the web-enabled anti-plagiarism services of today.

Plagiarism detection methods can be split into two large groups: external document analysis methods and intrinsic plagiarism detection methods, or stylometry (Maurer, Kappe, & Zaka 2006). The method and software proposed in this paper are aimed at external plagiarism detection, i.e. revealing text copied from other documents. The software was tested on the corpus of documents provided for the competition.

The rest of the paper is organized as follows: a detailed description of the software platform and the detection process can be found in the second section of the paper. The Conclusion section summarizes the results and proposes directions for future research.

2 Detection process

The document processing for the competition was performed by three nodes. Node 1 served as the DBMS platform, and Node 2 and Node 3 were used in the detection phase. The following subsections explain the detection steps in detail.

2.1 Loading and preprocessing of the documents

To perform the comparison on a large corpus of documents we decided to use Winnowing, one of the well-known fingerprinting-based algorithms (Schleimer et al., 2003). Following this algorithm, each document was represented by a set of its hashes for detection purposes.

The database designed to store documents and fingerprints consists of three tables: Folder, Document, and Fingerprint.

After the documents were loaded and their fingerprints compiled, the Fingerprint table was indexed with two indexes: a nonclustered index on hash value and document ID (index 1), and a clustered index on document ID, hash value, and the sequential number of a hash in the document (index 2). After the loading phase the Fingerprint table was populated with 137,981,386 records. The most time-consuming operation here was loading documents and compiling fingerprints.

2.2 Locating sources

The main objective of this step was to reduce the number of documents for the comparison phase. This step selects all pairs of documents that share at least one fingerprint and stores these pairs in a table for more detailed analysis. After this step the table that links the pairs of possible matches was populated with only 44,532 records, instead of the roughly 52,000,000 pairs the search would have had to process if it compared all suspicious documents against all source documents: 7214 * 7215 = 52,049,010. This step effectively replaced the "one-vs-all" comparison with "one-vs-suspicions". As this step consists of only one query, better performance could be achieved only by improving the MS SQL Server hardware. This step uses index 1.

2.3 Detecting plagiarized passages

At this point all the required information is ready for the main step: detection of the common fragments in documents. The result of this step was used to identify exact plagiarized excerpts and to establish anchors for further analysis. The main point here is the proper indexing of the Fingerprint table: this step used the clustered index created earlier (index 2), which provided the best possible execution plan.

After all common fingerprints had been identified, and the anchors thus established, the next task was to find common intervals for marking up the plagiarized passages. For better performance this process was distributed between two workstations (nodes 2 and 3), each running a console application that performed the following steps:

1. Retrieve an unprocessed document from the Document table and the corresponding records from the table that links it with possible sources.
2. For each record run the following steps:
   a. Execute the stored procedure to retrieve starting positions of the common excerpts.
   b. For each result skip forward character by character in both source and suspicious documents while the characters are equal. This identifies the exact excerpt.
   c. Skip forward n characters, and compare excerpts using Levenshtein's distance to identify near-similar and obfuscated excerpts.
3. Save the identified intervals into the DB.

Both nodes used several separate threads for this processing, and each thread processed a separate document retrieved in step 1 above. The detection time could be improved by increasing the computational power of the processing nodes (nodes 2 and 3) or by further increasing the number of nodes.

2.4 Compiling results

In the last step Microsoft SQL Server Integration Services was used to export information about the detected plagiarism to XML files in the required format.

3 Conclusion

As the competition results indicate, the proposed approach provides competitive results in terms of precision. Moreover, it comes in a ready-to-deploy form that can be easily implemented on relatively inexpensive third-party software (MS SQL Server). This will allow easy integration with virtually any university-wide course management system. The improvements required to reduce the granularity of results are planned for implementation in the next version of the software. At this stage of development the solution is publicly available for download as a desktop version at www.siberiasoft.info.

References

Maurer, H., Kappe, F., and Zaka, B. (2006). Plagiarism – A Survey. Journal of Universal Computer Science, vol. 12, no. 8, pp. 1050-1084.

Schleimer, S., Wilkerson, D., and Aiken, A. (2003). Winnowing: Local Algorithms for Document Fingerprinting. Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 76-85, June 2003.