Plagiarism Detection Based on a Novel Trie-based Approach

Alireza Talebpour
Department of Computer Engineering, Faculty of Computer Science and Engineering
Shahid Beheshti University, Tehran, Iran
Talebpour@sbu.ac.ir

Mohammad Shirzadi Laskoukelayeh
CyberSpace Research Institute
Shahid Beheshti University, Tehran, Iran
m.shirzadi@email.kntu.ac.ir

Zahra Aminolroaya
CyberSpace Research Institute
Shahid Beheshti University, Tehran, Iran
z.aminolroaya@Mail.sbu.ac.ir

ABSTRACT
Nowadays, plagiarism detection has become one of the major problems in the text mining field. New technologies have made plagiarism easier and more feasible, so it is vital to develop automatic systems that detect plagiarism in different kinds of content.
   In this paper, we propose a trie to compare source and suspicious text documents, using the PersianPlagDet text documents as a case study. Both character-based and knowledge-based techniques have improved our detection method. Besides, our fast insertion and retrieval algorithm makes it possible to compare long documents at high speed.

CCS Concepts
•Computing methodologies → Natural language processing; •Information systems → Information systems applications;

Keywords
Plagiarism detection; Trie-based method; Text mining

1. INTRODUCTION
   Plagiarism means trying to pass off somebody else's words as your own [14]. Plagiarism detection is the process of locating text reuse within a suspicious document [5]. Nowadays, with the advent of technologies like the internet and the growth of digital content creation, plagiarism, especially of text taken from existing content, has become a growing problem and one of the major problems in the text mining field. For example, plagiarism used as a way to relieve the pressure to publish pushes down the quality of scientific papers. In [7], Lesk reports that, in some countries, 15% of submissions to arXiv contain duplicated material and are plagiarized. Due to these problems, it is urgent to provide a system that automatically detects plagiarism.
   Many approaches based on lexical and semantic methods have been proposed. On the one hand, the plagiarism problem can be reduced to finding exactly matched phrases; on the other hand, it can be as hard as finding restated phrases. Depending on what a problem asks for, different knowledge-based or character-based techniques can be applied. One lexical database for knowledge-based approaches is WordNet, in which words are grouped together based on their cognitive synonyms [4]; it can be used to find restated phrases. Words at different positions in a sentence may play different roles, so knowing the syntactic category (POS) of a word, i.e. noun, verb, etc., can simplify the problem of plagiarism detection.
   Plagiarized documents can be written in any language, and different languages need different detection policies due to their different semantics and grammars. In this paper, we propose a novel approach for the PAN FIRE shared task of Persian plagiarism detection in the international PersianPlagDet 2016 contest. We use a hybrid method that combines character-based and knowledge-based approaches. A Persian wordnet, FarsNet, serves as our knowledge database [13]. Besides, we apply POS tagging using the Hazm package [1]. By finding nouns and their synsets in FarsNet, we can more precisely store and retrieve suspicious words in our proposed tree structure. In our plagiarism detection methodology, we apply a novel extended prefix tree, i.e. a trie, to store and retrieve documents. We consider not only the quality of text plagiarism detection but also the algorithm's computation time as an important factor.

1.1 Related works
   Many studies have sought solutions to the problems of plagiarism detection and document matching. In the nineties, studies on copy detection mechanisms for digitalized documents led to computerized plagiarism detection [16]. With the growth of generated data, the speed of plagiarism detectors has become an important criterion; in [8], parameterized backward trie matching is proposed as a fast method for aligning source and suspicious documents.
   The plagiarism detection problem has also been studied for different languages. For Persian plagiarism detection, in [9], after preprocessing the source and suspicious documents, different similarity measures such as the Jaccard similarity coefficient, the Clough & Stevenson metric and LCS are used to compare source and suspicious documents. Also, by applying FarsNet, Rakian et al. propose the "Persian Fuzzy Plagiarism Detection (PFPD)" approach to detect plagiarized cases [17].
   Our fast trie-based approach is proposed for the problem of Persian plagiarism detection. We describe the problem data and the data preprocessing applied to the documents in section 2. Then, in section 3, the novel approach to plagiarism detection is described; in section 4, the evaluation measures are described and our approach is evaluated. Finally, the results are summarized in section 5.
2. DATA
   The data is a set of suspicious and source text documents released by the PersianPlagDet competition [2]. In the PersianPlagDet data, document plagiarism can happen in different ways: parts of a source text document can be copied exactly into a suspicious text, parts of a source text document can be copied into a suspicious text with some random changes, or parts of a restated source text document can appear in a suspicious text.

2.1 Data preparation
   Before applying the plagiarism detection method, the source and suspicious text documents should be prepared. We explain the required processing steps one by one:

Text tokenization and POS tagging
We tokenize the text documents into words. Tokenization is the procedure of splitting a text into words, phrases, or other meaningful parts, namely tokens [3]. In addition to tokenization, the exact position of each token, its word offset, is stored. A token offset represents the token's character-based distance from the beginning of the document. By applying the Hazm POS tagger, we also specify the part of speech of each word. The nouns are important for us and help us compare phrases for the purpose of plagiarism detection. Thus, nouns are flagged for the next stages of processing.
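As a rough illustration of this step, the following sketch (not our exact implementation) tokenizes a document with the Hazm toolkit, records each token's character offset, and keeps the POS tag. The path to the POS tagger model and the use of tags starting with "N" to identify nouns are assumptions about the local Hazm setup.

```python
from hazm import word_tokenize, POSTagger

# The model path is installation-specific (assumption).
tagger = POSTagger(model='resources/postagger.model')

def tokenize_with_offsets(text):
    """Return (token, offset, pos) triples, where offset is the token's
    character-based distance from the beginning of the document."""
    tokens = word_tokenize(text)
    triples, cursor = [], 0
    for token, pos in tagger.tag(tokens):
        offset = text.find(token, cursor)
        if offset == -1:            # the tokenizer may change the surface form
            offset = cursor
        triples.append((token, offset, pos))
        cursor = offset + len(token)
    return triples

def noun_tokens(text):
    """Flag nouns for the later stages (assumes Hazm noun tags start with 'N')."""
    return [t for t in tokenize_with_offsets(text) if t[2].startswith('N')]
```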
Text cleansing and normalization
First, we normalize the text documents. Normalization is the task of transforming text characters into a unique, normal form of the language. For example, we convert the Arabic "yaa" and "kaaf" to the Persian "ye" and "kaaf" when preprocessing Persian text documents, and we unify all digits written with different Persian and English Unicode code points. Punctuation is also removed from the text documents.
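A minimal sketch of this normalization, assuming the character set below is representative (the real mapping covers more characters): Arabic yeh and kaf are mapped to their Persian forms, Arabic-Indic digits are unified with Persian ones, and punctuation is dropped.

```python
import re

# Arabic yeh/kaf and Arabic-Indic digits mapped to their Persian counterparts.
CHAR_MAP = str.maketrans({
    '\u064A': '\u06CC',                                        # Arabic yeh -> Persian ye
    '\u0643': '\u06A9',                                        # Arabic kaf -> Persian kaf
    **{chr(0x0660 + d): chr(0x06F0 + d) for d in range(10)},   # unify digit code points
})

def normalize(text):
    """Map characters to a single normal form and strip punctuation."""
    text = text.translate(CHAR_MAP)
    return re.sub(r'[^\w\s]', ' ', text)
```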

Removing stop words and frequent words
Stop words are also removed from the text data. Stop words are words that are dropped during processing because they do not carry significant information. First, a group of stop words proposed by an expert is removed. Then, frequent words are also selected and removed according to a frequency threshold.
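A small sketch of the two filters, assuming the stop-word list is available as a plain text file and using an illustrative relative-frequency threshold (the actual threshold is tuned separately).

```python
from collections import Counter

def remove_stop_and_frequent(tokens, stopwords, freq_threshold=0.01):
    """Drop expert-listed stop words, then drop words whose relative
    frequency in the document exceeds freq_threshold."""
    kept = [t for t in tokens if t not in stopwords]
    counts, total = Counter(kept), max(len(kept), 1)
    return [t for t in kept if counts[t] / total <= freq_threshold]

# Example of loading the expert-provided list (file name is hypothetical):
# stopwords = set(open('stopwords.txt', encoding='utf-8').read().split())
```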

Stemming words
The next step is to determine word stems. There are many kinds of word inflections and derivations. The suffixes "haa", "aan", "yaan", "aat", "ien" and sometimes "gaan" make a single word plural, and we remove these suffixes from nouns. Arabic broken plurals are the most challenging kind of noun pluralization, since they cannot be recognized by removing suffixes. An expert has provided the word stems, with the help of the Dehkhoda and Moein dictionaries, which allows us to convert Arabic broken plural nouns to their singular forms.
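A simplified sketch of the suffix-based part of this step. Only the plural suffixes named above are stripped; Arabic broken plurals are resolved through the expert-provided stems, represented here as a plain dictionary, which is an assumption about the data format.

```python
# The plural suffixes mentioned above, in Persian script (longest first).
PLURAL_SUFFIXES = ('\u06CC\u0627\u0646',  # yaan
                   '\u06AF\u0627\u0646',  # gaan
                   '\u0647\u0627',        # haa
                   '\u0627\u062A',        # aat
                   '\u0627\u0646',        # aan
                   '\u06CC\u0646')        # ien

def stem_noun(word, broken_plurals=None):
    """Strip a plural suffix, or map an Arabic broken plural to its singular
    form via the expert-provided dictionary (e.g. {'کتب': 'کتاب'})."""
    broken_plurals = broken_plurals or {}
    if word in broken_plurals:
        return broken_plurals[word]
    for suffix in PLURAL_SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[:-len(suffix)]
    return word
```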
Acquiring words synsets
After determining the words' parts of speech, we search FarsNet for the nouns' cognitive synonym sets, i.e. synsets. We collect synsets because a word may be replaced by one of its synonyms at some positions. For example, "computers" may be used as "estimators" or "data processors". Like the word offsets, the synset offsets are stored; notice that a synset's offsets are equal to the offsets of the original word.
   The word token offsets, together with the noun synsets and the word stems, are the basic satellite data used in our proposed tree model, the trie, which is explained in the next section.

3. METHODOLOGY
   After the source and suspicious documents have been preprocessed, we use a method that finds similar fragments and their exact offsets in both the suspicious and source files. Before the source and suspicious documents are compared, the documents are saved to and retrieved from a trie data structure. The next subsections give a brief survey of trie trees and an explanation of our newly proposed trie.

3.1 Brief survey of trie trees
   A tokenized document is a set of words which can be stored in a dictionary. A trie data structure can be used to insert and find words in a dictionary in O(n), where n is the length of a single word. The word "trie" actually comes from "retrieval", which reflects its purpose. In a trie, or prefix tree, each node is a word or a prefix: every prefix character of a word is inserted as a node, and the node of the last letter is flagged as the word end. Tries can serve as a <key, value> data structure. Words with similar prefixes share subpaths. In the example shown in Figure 1, the value of the word "xy" is "2", and the words "xy" and "xyzb" share a subpath. The node values are defined based on the problem. In the following, we describe the proposed trie and the different key values.

Figure 1: An example of a trie data structure [15].
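To make the structure in Figure 1 concrete, here is a minimal <key, value> trie in Python: every character of a word becomes a node, the last character is flagged as a word end, and a value can be attached there. This is only the textbook structure; the value attached to "xyzb" below is made up for the demonstration, and our extensions are described in the next subsection.

```python
class TrieNode:
    def __init__(self):
        self.children = {}        # character -> TrieNode
        self.is_word_end = False  # flagged at the last letter of a word
        self.value = None         # payload attached to a complete word

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, key, value):
        node = self.root
        for ch in key:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word_end = True
        node.value = value

    def search(self, key):
        """Return the value stored for key, or None; runs in O(len(key))."""
        node = self.root
        for ch in key:
            node = node.children.get(ch)
            if node is None:
                return None
        return node.value if node.is_word_end else None

# The example of Figure 1: "xy" holds value 2 and shares a subpath with "xyzb".
trie = Trie()
trie.insert("xy", 2)
trie.insert("xyzb", 4)
assert trie.search("xy") == 2 and trie.search("xyz") is None
```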
3.2 Proposed trie trees
   In this paper, we use a trie data structure to insert and retrieve document words because of the trie's properties, i.e. fast insertion and searching, and because it adapts well to our problem, i.e. determining the offsets of plagiarized strings. Our plagiarism detection method is divided into two processes:

Inserting documents to data structures
After preprocessing both the source and suspicious documents, all words of the source document, together with their exact positions, are inserted into the trie, and the suspicious words are added to an ordered list based on their positions in the document. According to the trie definition, each trie node represents part of a preprocessed word. In the proposed trie, each word has a "word positions" list which records the positions at which the word occurs in the document. Notice that a word may occur at several positions in the document, but it is inserted into the trie only once, and its occurrence positions are appended to its word positions list. Also, word positions lists are only kept at the nodes which represent the last character of a word. The more repeated words the suspicious document contains, the faster the trie can be constructed.
   To increase search speed, it would be better to store the longer document in the trie; however, for simplicity we always store the source document in the trie.
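The sketch below illustrates, under the description above, how the source document might be inserted: each preprocessed word (and, optionally, its synonyms) is stored once, every occurrence offset is appended to the word-positions list kept at its last-character node, and an empty suspicious-positions list is reserved for the matching phase. Class and field names are illustrative, not the exact names used in our code.

```python
class _Node:
    def __init__(self):
        self.children = {}   # character -> _Node
        self.data = None     # satellite data at the last character of a word

class PlagTrie:
    def __init__(self):
        self.root = _Node()

    def add(self, word, source_offset, synonyms=()):
        """Insert the word (and its synset members) once; append every offset."""
        for form in (word, *synonyms):
            node = self.root
            for ch in form:
                node = node.children.setdefault(ch, _Node())
            if node.data is None:
                node.data = {'word_positions': [], 'suspicious_positions': []}
            # A synonym is stored with the offsets of the original source word.
            node.data['word_positions'].append(source_offset)

    def lookup(self, word):
        """Return the satellite data of a word, or None if it is absent."""
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return None
        return node.data

def build_source_trie(source_tokens):
    """source_tokens: (token, offset, synonyms) triples from preprocessing (assumed format)."""
    trie = PlagTrie()
    for token, offset, synonyms in source_tokens:
        trie.add(token, offset, synonyms)
    return trie
```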
Finding the longest plagiarized fragments
To report plagiarized sections, it is important to find similar words based on their sequential occurrences in the source and suspicious documents. The contiguity of words in the suspicious document is preserved simply by the applied data structure, i.e. the ordered list. For the source document, the word positions lists stored in the trie help us find the order of words in the plagiarized source sections.
   After constructing the document data structures, the longest plagiarized fragments should be found in both the source and suspicious documents. Thus, we iterate over the suspicious document words one by one and look each word up in the source trie. Finding a word in the trie directly yields its positions in the source document. Also, the detected plagiarized positions in the suspicious document are added to the corresponding word's "suspicious positions" list in the trie.
   When a similar word is found in both documents, information about the word's front and rear neighbours in the source document is also kept in the trie. Consider W_p = {wp_1, wp_2, ..., wp_n}, the ordered list of suspicious words, and W_s = {ws_1, ws_2, ..., ws_m}, the ordered list of source words inserted in the trie, where n is the number of words in the suspicious document and m is the number of distinct words in the source one. If wp_1 = ws_1 and wp_2 = ws_2, then "ws_2", "the position of ws_2 in the source" and "the position of wp_2 in the suspicious document" are added to the front node of ws_1 as its "value", "word positions list" and "suspicious positions list". Moreover, ws_1 is added to the "sentence list" as the beginning of a sentence. The same process applies to rear nodes.
   Traversing the whole suspicious document in this way generates a set of linked lists that help find the plagiarized fragments. The "sentence list" contains the beginnings of the plagiarized sections. By taking each sentence beginning in the sentence list and following it through the trie, all the plagiarized fragments can be found.
   Adding both the exact word and its synonyms (with the help of FarsNet) to the trie makes it possible to find potentially similar sections that are plagiarized by restatement. For example, for a potentially copied phrase P = {w_1, w_2, ..., w_5} in the source and suspicious documents, if a synonym of w_2 were used instead of w_2, both w_2 and its synonym would be present in the trie.
   Furthermore, if w_2 were deliberately added to or deleted from the suspicious document, our plagiarism detector could still detect the plagiarized section P correctly. This feature is achieved thanks to the linked lists, through which we can trace the front and rear neighbours of words. According to their POS in the sentence, words can also be weighted differently so that such additions and removals are handled intelligently.
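The front/rear links and the sentence list described above are what our implementation relies on; the following is a deliberately simplified sketch that only tracks runs of consecutive trie hits, to show how the lookups translate into reported fragments. It omits the linked-list bookkeeping and the handling of inserted or deleted words, and all names are illustrative.

```python
def find_plagiarized_fragments(suspicious_tokens, source_trie, min_len=3):
    """suspicious_tokens: ordered (token, offset) pairs from the suspicious document.
    Returns (susp_start, susp_end, source_positions) triples for runs of at
    least min_len consecutive suspicious tokens that are found in the source trie."""
    fragments, run = [], []
    for token, offset in suspicious_tokens:
        data = source_trie.lookup(token)
        if data is not None:
            data['suspicious_positions'].append(offset)   # kept for reporting
            run.append((token, offset, data['word_positions']))
        else:
            if len(run) >= min_len:
                fragments.append(_to_fragment(run))
            run = []
    if len(run) >= min_len:
        fragments.append(_to_fragment(run))
    return fragments

def _to_fragment(run):
    susp_start = run[0][1]
    susp_end = run[-1][1] + len(run[-1][0])
    source_positions = [positions[0] for _, _, positions in run]  # first occurrence only
    return (susp_start, susp_end, source_positions)
```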
4. EVALUATION
   We use macro-averaged precision and recall, granularity, and the plagdet score described in [11]. Precision and recall evaluate detection performance at the character level, while granularity considers the contiguity of the plagiarized phrases detected in the source and suspicious documents. The granularity of the detections R with respect to the true plagiarism cases S is defined as [11]

      gran(S, R) = \frac{1}{|S_R|} \sum_{s \in S_R} |R_s|                (1)

   where S_R ⊆ S are the true cases detected by at least one detection and R_s ⊆ R are the detections of a given case s:

      S_R = {s | s ∈ S ∧ ∃ r ∈ R : r detects s},
      R_s = {r | r ∈ R ∧ r detects s}.

   The plagdet score is an overall score which combines the other measures [11]:

      plagdet(S, R) = \frac{F_\alpha}{\log_2(1 + gran(S, R))}            (2)

   where S and R are the sets of true plagiarism cases and detections, respectively, and F_\alpha is the F_\alpha-measure, the weighted harmonic mean of precision and recall:

      F_\alpha = (1 + \alpha^2) \cdot \frac{precision \cdot recall}{\alpha^2 \cdot precision + recall}   (3)

   If α is not predefined, we use α = 1.

Table 1: The evaluation of our approach on the test data released by the PersianPlagDet 2016 contest [6].

      Measure   Plagdet   Granularity   Precision   Recall
      Value     0.775     1.228         0.964       0.837

   Table 1 shows the evaluation of our approach on the test data released by the PersianPlagDet 2016 competition, which is based on TIRA and the PAN evaluation setup [10, 18, 12]. The high precision and recall and the acceptable granularity of our approach lead to an admissible plagdet score.
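As a sanity check on Table 1, the plagdet score can be recomputed from the other three measures with α = 1; the short snippet below reproduces the reported 0.775 from the precision, recall and granularity values.

```python
from math import log2

def f_alpha(precision, recall, alpha=1.0):
    """Weighted harmonic mean of precision and recall (Eq. 3)."""
    return (1 + alpha ** 2) * precision * recall / (alpha ** 2 * precision + recall)

def plagdet(precision, recall, granularity, alpha=1.0):
    """Overall plagdet score (Eq. 2)."""
    return f_alpha(precision, recall, alpha) / log2(1 + granularity)

# Values from Table 1: precision 0.964, recall 0.837, granularity 1.228.
print(round(plagdet(0.964, 0.837, 1.228), 3))   # -> 0.775
```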
5. CONCLUSIONS
   The advent of digitalization and new technologies has simplified the act of plagiarizing. Thus, it is crucial to develop automatic systems that detect plagiarism in different kinds of content.
   We first prepared the text data released by the international PersianPlagDet 2016 contest. We made the data ready through preprocessing, tokenization and morphological analysis (e.g. POS tagging) before comparing documents. In this paper, we have proposed a novel trie-based approach for storing and retrieving the prepared source and suspicious documents in order to solve the plagiarism detection problem. Fast insertion and retrieval of long sentences were our reasons for adopting trie structures for the detection problem. Finding noun words and their synsets and saving them to our extended trie helped us improve our text comparison, especially in the case of restated-phrase matching.
   To evaluate our algorithm, we used macro-averaged precision and recall, granularity, and the plagdet score, as proposed by the PersianPlagDet competition. High precision and recall and acceptable granularity make the overall plagdet score of our algorithm admissible. Besides, thanks to our proposed trie, large documents can be compared for the purpose of plagiarism detection.
   In future work, we will work on the contiguity of plagiarized text phrases to obtain better granularity results. Besides, we will consider the synsets of other parts of speech, such as verb synsets, to improve our algorithm's performance.
References

[1] 2013. Hazm. https://github.com/sobhe/hazm. (2013).

[2] 2016. PersianPlagDet 2016. http://www.ictrc.ac.ir/plagdet. (2016).

[3] Uysal A. K. and Gunal S. 2014. The impact of preprocessing on text classification. Information Processing & Management 50, 1 (2014), 104–112.

[4] Miller G. A. 1995. WordNet: a lexical database for English. Commun. ACM 38, 11 (1995), 39–41.

[5] Asghari H., Khoshnava K., Fatemi O., and Faili H. 2015. Developing Bilingual Plagiarism Detection Corpus Using Sentence Aligned Parallel Corpus. Notebook for PAN at CLEF (2015).

[6] Asghari H., Mohtaj S., Fatemi O., Faili H., Rosso P., and Potthast M. 2016. Algorithms and Corpora for Persian Plagiarism Detection: Overview of PAN at FIRE 2016. In Working notes of FIRE 2016 - Forum for Information Retrieval Evaluation (CEUR Workshop Proceedings). CEUR-WS.org.

[7] Lesk M. 2015. How many scientific papers are not original? Proceedings of the National Academy of Sciences 112, 1 (2015), 6–7.

[8] Mozgovoy M. 2007. Enhancing computer-aided plagiarism detection. University of Joensuu.

[9] Mahmoodi M. and Varnamkhasti M. M. 2014. Design a Persian Automated Plagiarism Detector (AMZPPD). arXiv preprint arXiv:1403.1618 (2014).

[10] Potthast M., Stein B., Barrón-Cedeño A., and Rosso P. 2010. An Evaluation Framework for Plagiarism Detection. In 23rd International Conference on Computational Linguistics (COLING 10), Chu-Ren Huang and Dan Jurafsky (Eds.). Association for Computational Linguistics, Stroudsburg, Pennsylvania, 997–1005.

[11] Potthast M., Stein B., Barrón-Cedeño A., and Rosso P. 2010. An evaluation framework for plagiarism detection. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters. Association for Computational Linguistics, 997–1005.

[12] Potthast M., Gollub T., Rangel F., Rosso P., Stamatatos E., and Stein B. 2014. Improving the Reproducibility of PAN's Shared Tasks: Plagiarism Detection, Author Identification, and Author Profiling. In Information Access Evaluation meets Multilinguality, Multimodality, and Visualization. 5th International Conference of the CLEF Initiative (CLEF 14), Evangelos Kanoulas, Mihai Lupu, Paul Clough, Mark Sanderson, Mark Hall, Allan Hanbury, and Elaine Toms (Eds.). Springer, Berlin Heidelberg New York, 268–299. DOI: http://dx.doi.org/10.1007/978-3-319-11382-1_22

[13] Shamsfard M., Hesabi A., Fadaei H., Mansoory N., Famian A., Bagherbeigi S., Fekri E., Monshizadeh M., and Assi S. M. 2010. Semi-automatic development of FarsNet; the Persian WordNet. In Proceedings of the 5th Global WordNet Conference, Mumbai, India, Vol. 29.

[14] Lea M. R. and Street B. 2014. Understanding textual practices in higher education. Writing: Texts, processes and practices (2014), 62.

[15] More N. 2015. Trie Data Structure. http://www.ideserve.co.in/learn/trie-insert-and-search. (2015).

[16] Brin S., Davis J., and Garcia-Molina H. 1995. Copy detection mechanisms for digital documents. In ACM SIGMOD Record, Vol. 24. ACM, 398–409.

[17] Rakian Sh., Esfahani F. S., and Rastegari H. 2015. A Persian Fuzzy Plagiarism Detection Approach. Journal of Information Systems and Telecommunication (JIST) 3, 3 (2015), 182–190.

[18] Gollub T., Stein B., and Burrows S. 2012. Ousting Ivory Tower Research: Towards a Web Framework for Providing Experiments as a Service. In 35th International ACM Conference on Research and Development in Information Retrieval (SIGIR 12), Bill Hersh, Jamie Callan, Yoelle Maarek, and Mark Sanderson (Eds.). ACM, 1125–1126. DOI: http://dx.doi.org/10.1145/2348283.2348501