=Paper=
{{Paper
|id=Vol-502/paper-7
|storemode=property
|title=Using Microsoft SQL Server Platform for Plagiarism Detection
|pdfUrl=https://ceur-ws.org/Vol-502/paper7.pdf
|volume=Vol-502
}}
==Using Microsoft SQL Server Platform for Plagiarism Detection==
Using Microsoft SQL Server platform for plagiarism detection
Vladislav Shcherbinin
American University of Nigeria, Lamido Zubairu way, Yola township by-pass, PMB 2250, Yola, Nigeria
vladislav.scherbinin@gmail.com

Sergey Butakov
SolBridge International School of Business, 151-13 Samsung 1-Dong, Dong-gu, Daejeon, 300-814, South Korea
butakov@solbridge.ac.kr
Abstract: The paper presents an approach to plagiarism detection in a large corpus of documents
using the Microsoft SQL Server platform. The approach was used to participate in the first
international plagiarism detection competition, held as part of the PAN’09 workshop. The
main advantages of the proposed approach are its high precision, good performance, and
readiness for deployment into a production environment at a relatively low cost for the required
third-party software. The approach uses a fingerprinting-based algorithm to compare documents
and the Levenshtein metric to mark up plagiarized fragments in the texts.
Keywords: external plagiarism detection, Winnowing, document fingerprinting
1 Introduction

Digital plagiarism has remained a burning issue in both academia and industry over the last two decades. Methods and tools for uncovering plagiarism have evolved considerably, from the pioneering work on plagiarism detection in source code in the 1980s to the web-enabled anti-plagiarism services of today.

Plagiarism detection methods can be split into two large groups: external document analysis methods and intrinsic plagiarism detection methods, or stylometry (Maurer, Kappe, & Zaka, 2006). The method and software proposed in this paper are aimed at external plagiarism detection, i.e. revealing text copied from other documents. The software was tested on the corpus of documents provided for the competition. The rest of the paper is organized as follows: the detailed description of the software platform and the detection process can be found in the second and third sections of the paper. The Conclusion section summarizes the results and proposes directions for future research.

2 Detection process

The document processing for the competition was performed by three nodes. Node 1 served as the DBMS platform, and Node 2 and Node 3 were used in the detection phase. The following subsections explain the detection steps in detail.

2.1 Loading and preprocessing of the documents

To perform the comparison on a large corpus of documents we decided to use Winnowing, one of the well-known fingerprinting-based algorithms (Schleimer et al., 2003). According to this algorithm, each document was substituted with a set of its hashes for detection purposes.

The database designed to store documents and fingerprints consists of three tables: Folder, Document, and Fingerprint.

After loading the documents and compiling their fingerprints, the Fingerprint table was indexed with two indexes: a nonclustered index on hash value and document ID (index 1) and a clustered index on document ID, hash value and the sequential number of a hash in the document (index 2). After the loading phase
the Fingerprint table was populated with 137,981,386 records. The most time-consuming operation here was loading the documents and compiling the fingerprints.

2.2 Locating sources

The main objective of this step was to reduce the number of documents for the comparison phase. This step selects all pairs of documents that share at least one fingerprint and stores these pairs in a table for more detailed analysis. After this step the table that links the pairs of possible matches was populated with only 44,532 records instead of about 52,000,000, the number of pairs the search would have had to process had it compared all suspicious documents against all source documents: 7214 * 7215 = 52,049,010. This step effectively replaced the “one-vs-all” comparison with “one-vs-suspicions”. As this step consists of only one query, better system performance could be achieved only by improving the MS SQL Server hardware. This step uses index 1.

2.3 Detecting plagiarized passages

At this point all the required information is ready for the main step: detection of the common fragments in the documents. The result of this step was used to identify exact plagiarized excerpts and to establish anchors for further analysis. The main point here is the proper indexing of the Fingerprint table: at this step the clustered index created earlier (index 2) was used, which provided the best possible execution plan.

After all common fingerprints had been identified, and anchors thus established, the next task was to find common intervals for marking up the plagiarized passages. For better performance this process was distributed between two workstations (nodes 2 and 3), each running a console application performing the following steps:

1. Retrieve an unprocessed document from the Document table and the corresponding records from the table that links it with possible sources.
2. For each record run the following steps:
   a. Execute the stored procedure to retrieve the starting positions of the common excerpts.
   b. For each result, skip forward character by character in both the source and suspicious documents while the characters are equal. This identifies the exact excerpt.
   c. Skip forward n characters and compare the excerpts using the Levenshtein distance to identify near-similar and obfuscated excerpts.
3. Save the identified intervals into the DB.

Both nodes used several separate threads for this processing, and each thread processed a separate document retrieved in step 1 above. The detection time could be improved by increasing the computational power of the processing nodes (nodes 2 and 3) or by further increasing the number of nodes.

2.4 Compiling results

In the last step, Microsoft SQL Server Integration Services was used to export information about the detected plagiarism to XML files in the required format.

3 Conclusion

As the competition results indicate, the proposed approach provides competitive results in terms of precision. Moreover, it comes in a ready-to-deploy form that can be easily implemented on relatively inexpensive third-party software (MS SQL Server). This will allow easy system integration with virtually any university-wide course management system. The required improvements to reduce the granularity of results are planned for implementation in the next version of the software. At this stage of development the solution is publicly available for download as a desktop version at www.siberiasoft.info.

References

Maurer, H., Kappe, F., Zaka, B. (2006). Plagiarism – A Survey. Journal of Universal Computer Science, vol. 12, no. 8, pp. 1050-1084.

Schleimer, S., Wilkerson, D., Aiken, A. (2003). Winnowing: Local Algorithms for Document Fingerprinting. Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 76-85, June 2003.

Stein, Rosso, Stamatatos, Koppel, Agirre (Eds.): PAN'09, pp. 36-37, 2009.
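The fingerprinting step of Section 2.1 can be illustrated with a minimal Python sketch of Winnowing as described by Schleimer et al. (2003). It is not the authors' implementation: the normalization rule, the k-gram length `k`, the window size `w`, and the use of CRC32 as the hash are all assumptions made for illustration, since the paper does not report these details.

```python
import zlib


def normalize(text: str) -> str:
    """Lowercase and keep only letters and digits (an assumed preprocessing step)."""
    return "".join(ch.lower() for ch in text if ch.isalnum())


def winnow(text: str, k: int = 5, w: int = 4) -> list[tuple[int, int]]:
    """Return (hash, position) fingerprints selected by Winnowing.

    k is the k-gram length and w the window size. Both values are
    illustrative; the paper does not state the ones it used.
    """
    s = normalize(text)
    # Hash every k-gram. CRC32 is a stand-in for the unspecified hash function.
    hashes = [zlib.crc32(s[i:i + k].encode("utf-8")) for i in range(len(s) - k + 1)]
    fingerprints: list[tuple[int, int]] = []
    last_pos = -1
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        smallest = min(window)
        # On ties select the rightmost minimum, as in Schleimer et al. (2003).
        pos = start + (w - 1 - window[::-1].index(smallest))
        if pos != last_pos:  # record each selected position only once
            fingerprints.append((smallest, pos))
            last_pos = pos
    return fingerprints


source_fp = winnow("The quick brown fox jumps over the lazy dog.")
suspicious_fp = winnow("He wrote that the quick brown fox jumps over the lazy dog!")
shared = {h for h, _ in source_fp} & {h for h, _ in suspicious_fp}
```

With these parameters any common run of at least `w + k - 1` normalized characters yields at least one shared hash, so `shared` is non-empty for the two strings above. Each `(hash, position)` pair corresponds to one row of a Fingerprint-like table.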
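The single-query candidate selection of Section 2.2 can be sketched as a self-join on the hash value. The example below uses Python's built-in `sqlite3` module as a stand-in for MS SQL Server, so it only approximates the setup: SQLite has no clustered indexes, and the column names, the `is_source` flag and the `CandidatePair` table are hypothetical, as the paper names only the Folder, Document and Fingerprint tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Document (id INTEGER PRIMARY KEY, is_source INTEGER NOT NULL);
    CREATE TABLE Fingerprint (doc_id INTEGER NOT NULL, hash INTEGER NOT NULL, seq INTEGER NOT NULL);
    -- Counterpart of "index 1": hash value first, then document ID.
    CREATE INDEX ix_fingerprint_hash_doc ON Fingerprint (hash, doc_id);
    CREATE TABLE CandidatePair (suspicious_id INTEGER, source_id INTEGER);
""")
conn.executemany("INSERT INTO Document VALUES (?, ?)", [(1, 1), (2, 1), (3, 0), (4, 0)])
conn.executemany("INSERT INTO Fingerprint VALUES (?, ?, ?)", [
    (1, 111, 0), (1, 222, 1),  # source document 1
    (2, 333, 0), (2, 444, 1),  # source document 2
    (3, 222, 0), (3, 555, 1),  # suspicious document 3 shares hash 222 with source 1
    (4, 666, 0), (4, 777, 1),  # suspicious document 4 shares nothing
])

# One set-based query: every (suspicious, source) pair with at least one common hash.
conn.execute("""
    INSERT INTO CandidatePair (suspicious_id, source_id)
    SELECT DISTINCT fs.doc_id, fr.doc_id
    FROM Fingerprint AS fs
    JOIN Fingerprint AS fr ON fr.hash = fs.hash
    JOIN Document AS ds ON ds.id = fs.doc_id AND ds.is_source = 0
    JOIN Document AS dr ON dr.id = fr.doc_id AND dr.is_source = 1
""")
pairs = conn.execute(
    "SELECT suspicious_id, source_id FROM CandidatePair ORDER BY 1, 2"
).fetchall()
```

In this toy data set two suspicious and two source documents give four possible pairs, and `pairs` keeps only `(3, 1)`. This mirrors, on a small scale, the reduction from 52,049,010 possible pairs to 44,532 reported above.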
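Steps b and c of the passage detection procedure in Section 2.3 can be sketched as follows. The paper does not specify how the Levenshtein comparison is thresholded or how far an interval is extended, so the chunk size `n`, the tolerance `max_ratio` and the lockstep advance through both texts are assumptions of this sketch rather than the authors' method.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def extend_exact(src: str, susp: str, i: int, j: int) -> int:
    """Step b: length of the identical run starting at src[i] and susp[j]."""
    n = 0
    while i + n < len(src) and j + n < len(susp) and src[i + n] == susp[j + n]:
        n += 1
    return n


def extend_fuzzy(src: str, susp: str, i: int, j: int,
                 n: int = 20, max_ratio: float = 0.3) -> int:
    """Steps b and c: grow an exact match, then keep adding n-character
    chunks while their Levenshtein distance does not exceed max_ratio
    times the chunk length. n and max_ratio are assumed values."""
    length = extend_exact(src, susp, i, j)
    while True:
        a = src[i + length:i + length + n]
        b = susp[j + length:j + length + n]
        if not a or not b or levenshtein(a, b) > max_ratio * max(len(a), len(b)):
            break
        length += min(len(a), len(b))
        length += extend_exact(src, susp, i + length, j + length)
    return length


src = "the quick brown fox jumps over the lazy dog"
susp = "the quick brown fax jumps over the lazy dog"
exact_len = extend_exact(src, susp, 0, 0)
fuzzy_len = extend_fuzzy(src, susp, 0, 0)
```

Here `exact_len` is 17, because the exact run stops at the changed letter, while `fuzzy_len` is 43: the single substitution stays within the tolerance, so the interval covers the whole sentence. A production version would also need to handle insertions and deletions that shift the alignment between the two texts.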