Using Microsoft SQL Server platform for plagiarism detection
Vladislav Shcherbinin
American University of Nigeria
Lamido Zubairu way, Yola township by-pass,
PMB 2250, Yola, Nigeria
vladislav.scherbinin@gmail.com

Sergey Butakov
SolBridge International School of Business,
151-13 Samsung 1-Dong, Dong-gu, Daejeon,
300-814, South Korea
butakov@solbridge.ac.kr


        Abstract: The paper presents an approach to plagiarism detection in a large corpus of
        documents using the Microsoft SQL Server platform. The approach was used for participation
        in the first international plagiarism detection competition, held as part of the PAN’09
        workshop. The main advantages of the proposed approach are its high precision, good
        performance and readiness for deployment into a production environment with a relatively
        low cost of the required third-party software. The approach uses a fingerprinting-based
        algorithm to compare documents and the Levenshtein metric to mark up plagiarized fragments
        in the texts.
        Keywords: external plagiarism detection, Winnowing, document fingerprinting


1     Introduction
Digital plagiarism has remained a pressing issue in both academia and industry
over the last two decades. Methods and tools for uncovering plagiarism have
evolved considerably, from the pioneering work on plagiarism detection in
source code in the 1980s to the web-enabled anti-plagiarism services of today.
   Plagiarism detection methods can be split into two large groups: external
document analysis methods and intrinsic plagiarism detection methods, or
stylometry (Maurer, Kappe, & Zaka 2006). The method and software proposed in
this paper are aimed at external plagiarism detection, i.e. revealing text
copied from other documents. The software was tested on the corpus of documents
provided for the competition. The rest of the paper is organized as follows:
the software platform and the detection process are described in detail in the
second section, and the Conclusion section summarizes the results and proposes
directions for future research.

2     Detection process
The documents for the competition were processed by three nodes. Node 1 served
as the DBMS platform, and Node 2 and Node 3 were used in the detection phase.
The following subsections explain the detection steps in detail.

2.1   Loading and preprocessing of the documents
To perform the comparison on a large corpus of documents we decided to use
Winnowing, one of the well-known fingerprinting-based algorithms (Schleimer et
al., 2003). Under this algorithm each document is represented by a set of its
hashes for detection purposes.
   The database designed to store documents and fingerprints consists of three
tables: Folder, Document, and Fingerprint.
   After loading the documents and compiling their fingerprints, the
Fingerprint table was indexed with two indexes: a nonclustered index on hash
value and document ID (index 1) and a clustered index on document ID, hash
value and sequential number of a hash in the document (index 2). After the
loading phase

Stein, Rosso, Stamatatos, Koppel, Agirre (Eds.): PAN'09, pp. 36-37, 2009.

the Fingerprint table was populated with 137,981,386 records. The most
time-consuming operation here was loading the documents and compiling the
fingerprints.

2.2   Locating sources
The main objective of this step was to reduce the number of documents for the
comparison phase. This step selects all pairs of documents that share at least
one fingerprint and stores these pairs in a table for more detailed analysis.
After this step the table that links the pairs of possible matches was
populated with only 44,532 records instead of the roughly 52,000,000 pairs the
search would have had to process had it compared all suspicious documents
against all source documents: 7214 * 7215 = 52,049,010. This step effectively
replaced the “one-vs-all” comparison with “one-vs-suspicions”. As this step
consists of a single query, better performance could be achieved only by
improving the MS SQL Server hardware. This step uses index 1.

2.3   Detecting plagiarized passages
At this point all the required information is ready for the main step:
detection of the common fragments in documents. The result of this step was
used to identify exact plagiarized excerpts and to establish anchors for
further analysis. The main point here is the proper indexing of the Fingerprint
table: this step used the clustered index created earlier (index 2), which
provided the best possible execution plan.
   After all common fingerprints had been identified, thereby establishing the
anchors, the next task was to find common intervals for marking up the
plagiarized passages. For better performance this process was distributed
between two workstations (nodes 2 and 3), each running a console application
that performed the following steps:
1. Retrieve an unprocessed document from the Document table and the
   corresponding records from the table that links it with possible sources.
2. For each record run the following steps:
     a. Execute the stored procedure to retrieve the starting positions of the
        common excerpts.
     b. For each result, step forward character by character in both the
        source and suspicious documents while the characters are equal. This
        identifies the exact excerpt.
     c. Skip forward n characters and compare excerpts using the Levenshtein
        distance to identify near-similar and obfuscated excerpts.
3. Save the identified intervals into the DB.
   Both nodes used several separate threads for this processing, and each
thread processed a separate document retrieved in step 1 above. The detection
time could be improved by increasing the computational power of the processing
nodes (nodes 2 and 3) or by further increasing the number of nodes.

2.4   Compiling results
In the last step Microsoft SQL Server Integration Services was used to export
information about the detected plagiarism to XML files in the required format.

3     Conclusion
As the competition results indicate, the proposed approach provides competitive
results in terms of precision. Moreover, it comes in a ready-to-deploy form
that can be easily implemented on relatively inexpensive third-party software
(MS SQL Server). This will allow easy integration with virtually any
university-wide course management system. The improvements required to reduce
the granularity of results are planned for the next version of the software.
At this stage of development the solution is publicly available for download
as a desktop version at www.siberiasoft.info.

References
Maurer, H., Kappe, F., Zaka, B. (2006). Plagiarism – A Survey. Journal of
   Universal Computer Science, vol. 12, no. 8, pp. 1050-1084.

Schleimer, S., Wilkerson, D., Aiken, A. (2003). Winnowing: Local Algorithms
   for Document Fingerprinting. Proceedings of the ACM SIGMOD International
   Conference on Management of Data, pp. 76-85, June 2003.
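
Steps 2b and 2c of Section 2.3 can be illustrated with a minimal Python sketch. It is not the authors' implementation, which ran as a console application against SQL Server. The function names, the window size n and the threshold max_ratio are assumptions introduced here for illustration only, since the paper does not specify them, and the exact form of the "skip forward n characters" rule is one possible reading of step 2c.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def extend_exact(src: str, sus: str, i: int, j: int) -> int:
    """Step 2b: length of the identical run starting at src[i] and sus[j]."""
    k = 0
    while i + k < len(src) and j + k < len(sus) and src[i + k] == sus[j + k]:
        k += 1
    return k


def extend_fuzzy(src: str, sus: str, i: int, j: int,
                 n: int = 20, max_ratio: float = 0.3) -> int:
    """Step 2c (assumed form): after the exact run, keep skipping forward
    n characters while the next n-character windows stay within
    max_ratio edits per character. n and max_ratio are hypothetical."""
    length = extend_exact(src, sus, i, j)
    while True:
        a = src[i + length:i + length + n]
        b = sus[j + length:j + length + n]
        if len(a) < n or len(b) < n:
            break  # ran out of text in one of the documents
        if levenshtein(a, b) > max_ratio * n:
            break  # windows differ too much; stop extending
        length += n
    return length
```

Here i and j stand for an anchor, i.e. the starting positions of a common excerpt as returned in step 2a. The value returned by extend_fuzzy is the length of the interval that would be saved in step 3. A real system would also need to extend backwards from the anchor and merge overlapping intervals, which this sketch omits.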