Graph-based Approach to Text Alignment for Plagiarism
             Detection in Persian Documents

        Mozhgan Momtaz                          Kayvan Bijari                  Mostafa Salehi                                                             Hadi Veisi
   Faculty of New Sciences and         Faculty of New Sciences and      Faculty of New Sciences and                                            Faculty of New Sciences and
           Technologies                        Technologies                     Technologies                                                           Technologies
      University of Tehran,                University of Tehran,           University of Tehran,                                                  University of Tehran,
              Tehran                              Tehran                           Tehran                                                                 Tehran
      m.momtaz92@ut.ac.ir                 kayvan.bijari@ut.ac.ir          mostafa_salehi@ut.ac.ir                                                    h.veisi@ut.ac.ir


ABSTRACT                                                               Categories of text alignment dataset are based on PAN
This paper presents a new approach for Persian plagiarism              Competitions [12].
detection. This approach uses a graph structure as well as one of
the graph similarity methods (iterative methods) for similarity
detection of two Persian documents. In this approach, documents
are represented by a graph with specified length, then each part of                                             Monolingual                                                     Source Retrieval
suspicious document is compared to that of the source document.          Different
The graph is made if these parts have more common bigrams than           kinds of
a predefined threshold. Once graphs are made, an iterative method       documents
is used to find analogous nodes in graphs. Two graphs are marked        plagiarism                         Cross-language                                                        Text alignment
as similar if they contain at least a certain number of similar
nodes. In order to evaluate the proposed method, it was run on
PAN2015 and PAN2016 Persian Text Alignment dataset. The
Plagdet score is defined to evaluate plagiarism detection methods
in PAN contest. The gained Plagdet of proposed method is 90%
on PAN2015 and 87% on PAN2016.


                                                                                                                                                      Translation Obfuscation
                                                                                                       None Obfuscation


                                                                                                                                                                                      Summary Obfuscation
                                                                                                                          Random Obfuscation
                                                                                     None plagiarism


CCS Concepts
• Information systems➝ Plagiarism Detection software
• Computing methodologies➝ Graph-based
Keywords
Plagiarism Detection, Text Based Graph Representation, Text
Alignment.
                                                                                Figure 1- Different kinds of documents plagiarism
1. INTRODUCTION

Nowadays, a large volume of information is a compound of               The proposed method, converts every document to a number of
different types of contextual data such as books, articles and other   parts with specific length using the graph approach idea and then
documents and this volume is growing increasingly. In many             if necessary, it converts each part to a graph for precise plagiarism
cases, we need to identify either the duplicated documents or the      examination, which creates graph based on simultaneous
ones which are near-copy documents among the many cases. In            occurrence of the words in fixed window size. After this step, the
this regard, Plagiarism detection in documents is one of the main      similarity between two graphs measured using node similarity
topics which gained attention among researchers in the recent          measures, if the rate of similarity is more than the specified
years. The act of plagiarism is to use other author's writing or       threshold, then that part is labeled as plagiarism. The proposed
ideas, without proper appreciation to the author or proper citation    method was run on the PAN2015 dataset [6] and PAN2016
to the original source [3]. In the recent years, identifying           dataset (Persian Plagdet2016 contest [5]).
plagiarism has become easier using different systems, but              The rest of paper is organized as follows: Section 2 is devoted to
different types of plagiarism is still an issue. In some types of      related works; Section 3 presents the graph-based methodology
plagiarism, the structure of the document is changed by                for plagiarism detection. In Section 4, experimental evaluation of
rearranging the words or using synonym words. Therefore, the           the proposed method is given, and finally Section 5 ends the paper
results of basic plagiarism detection methods are not acceptable.      with conclusion and future works.
So the need for more sophisticated methods for plagiarism
detection is growing. Different kinds of plagiarism are shown in
figure 1.
2. RELATED WORK                                                        model and the solutions based on graph model are summarized in
In order to detect monolingual plagiarism, various methods have        table 1.
been proposed. In this section, each of these methods is explained     Table1- Some issues of the bag-of-words model and the
briefly.                                                               solutions based on graph-based model for plagiarism
Character-based methods: The most important method is the              detection applications
fingerprint method. Fingerprint algorithms [20] consider the text      Issues of bag-of-words           Graph-based solution
as a series of characters and then they divide the characters in n-    methods to model text for
series groups, the most important algorithms include 16-gram, 8-       plagiarism detection
gram and 5-gram methods. In this method, the degree of similarity      applications
depends on the number of similar characters in a string. Although
this method ends up with a good result in detecting plagiarism, but    Ignoring order of the words      Using directed graph (step 3
when plagiarism has some paraphrasing or modified words, this                                           of proposed method)
method does not act in an effective way to detect plagiarism.          Ignoring the structure of the    Considering     the    whole
Structural-based methods: In the previous method, the only             text                             sentence as a graph (step 2
attention is on words as features of the document. However, in the                                      and 3 of proposed method)
structural-based methods [9], pays attention to the titles,            Neglecting word synonyms         Ability to add synonyms to
paragraphs, sections and resources. One of the most famous                                              the graph corresponding text
structural-based methods is tree-based method, which gained                                             (step 3 of proposed method)
much attention in the recent years. In the tree-structured model, a
two-layer model is defined that the top layer is for retrieving
documents and the bottom layer is considered for detecting
plagiarism between retrieved documents using the methods of            3. METHODOLOGY
similarity detection such as cosine similarity method.                 Any textual document can be presented via a corresponding
                                                                       graph. Graph based representation of text is important because it
Classification-based methods: In this method the documents             enables us to turn an unstructured text into a structural text, and
are classified based on specific words (or keywords) [19]. The         then the advantageous of graph representation can be applied to
primary goal in this approach is to retrieve similar documents and     text summarization, identifying similarities of the documents, and
to speed up the process of plagiarism detection.                       applications of text mining. For natural language processing
Semantic-Based methods: This method uses lexical network               applications text graph of documents should be built. In a text
in order to find semantic similarity for plagiarism detection [17].    graph, nodes represent words of the document, and the edges
The most famous lexical network in English language is WordNet         present the relation between different words. The relation of
[8]. By means of WordNet, it is possible to achieve more               words can vary from application to application. The proposed
information about a special word. This method is effective when        plagiarism detection method is consist of 5 steps that will be
plagiarism is done using synonym words. FarsNet [15] the Persian       discussed further in the following.
equivalent of WordNet, is also proposed for Persian language.          Step 1. preprocessing: normalization is one of the basic steps
Graph-based methods: In this method, the text will be                  in text mining and text processing. In the normalization process,
converted into a graph. Nodes in the graph can be words, phrases       punctuations and stop words are removed. In this paper, we have
or even sentences of the text. Edges in the graph represent the link   used Hazm package [10] for Persian text normalization.
or relation between nodes and they can show the semantic link          Step 2. turn text to set of clauses: suspicious document and
between words or the simultaneous occurrence in one sentence           the reference document are divided to a series of sentences. Each
[16]. This method will be discussed more in the proposed method        sentence of the suspected document will be compared to all the
section. Converting a text into a graph, enables us to detect          sentences of the reference document. In this step, a filtering will
plagiarism using the advantages of graph similarity algorithms         be done on sentences in order to reduce runtime. Finally, if the
[7].                                                                   two sentences at least have the cosine similarity of 0.4, then they
In this paper, the proposed method is a combination of structural-     go to the step of graph making process (this value is obtained
based method and graph-based method. Paying attention to the           experimentally), otherwise the comparison will continue to other
structure of the text leads to detection of plagiarism in the          sentences of the reference document.
document even if the plagiarism is the type in which the structure     Step 3. creating corresponding graph: in graph creation
of the text is changed.                                                step, each sentence will be converted to a graph. The nodes of the
One common and standard approach to model text document is             graph are main and unique words, and in this graph, an edge will
bag-of-words. This model is suitable for capturing word                be established between a specific word and 4 words before and
frequency [16]. Assuming that order of the word’s occurrence has       after it in the document. Igraph package [11] is used for graph
nothing to do with its meaning; this model has a proper result in      creation in this paper.
information retrieval. The drawback of this model is when it tries     Step 4. plagiarism detection: when graph creation is
to find the reused text and plagiarism between different parts of      complete, we are looking for nodes in the suspected document
the text, if a reused text is occurring by using synonym words then    that is common with the node of the reference document. An
this model could not properly detect the reused text. Furthermore,     iterative method based on simple idea indicates that the two
this model doesn’t express the meaning and the structure of the        graphs are similar when they have similar nodes, and nodes are
text [16]. However, Graph representation is mathematical               similar if they have analogous neighbors [18]. We use this method
constructs and can model different word’s relation and textual         for our specific graph. Then we find their similarity using
structure of the documents [16]. Some issues of the bag-of-words       equation 1:
                                                                         the two nodes. In this case, if plagiarism is done by add and
                    (     )
                                   (     ( )      ( ))             )1(   removal of the words, by considering minimum similarity
                                                                         threshold, these changes do not have much negative impact on
Where B is primary neighbors of the common node in the
                                                                         plagiarism detection. The result of Persian Plagdet contest is also
graph of reference document. And A is primary neighbors
                                                                         reported in [1].
of the common node in the graph of suspicious document.


If the similarities between two nodes is greater than the threshold
α (α = 0.4), then that node is selected as the similar node. Finally,                                  Preprocessing
if a sentence has more similar nodes than the threshold β
(β=1/3(the number of key words in suspicious document)), that
statement is labeled as one of the sentences which plagiarism has          Document suspected                                Reference documents
occurred. α and β are thresholds that are achieved experimentally            of plagiarism                                      in the database
and they are based on performances of the system.
Step 5. Integrate plagiarism labeled Sentences: In this                                   Separating text to a set of sentences
step, we integrate sentences with plagiarism label (output of step 4
of the algorithm) based on start and end offset of sentences in text.
This step important for granularity measure1 [2]. If there exist no
labeled sentences, we can assume that no plagiarism is occurred.           Sentence 1 of suspected                     Sentence 1 of reference
                                                                                 document                                    document
4. EXPERIMENTAL EVALUATION
In this section the results of the implementation of the proposed
method on the plagiarism data sets are given. Moreover, in the
following we are going to focus on analyzing results. The two                             Choosing words to build a graph based
data sets that were used for analyzing the proposed method are as                                on minimum similarity
follows.
1. Persian Text alignment dataset PAN2015: This dataset is
published on the website of PAN contest.
                                                                                 Comparing paired generated graphs with each other and
2. Persian Text Alignment dataset PAN2016: This dataset                          identify similar graphs. (Sentences of the similar graphs
contains 2749 training and testing documents and are related to                  will get similarity label). And Aggregation of similar
Persian Plagdet2016 international contest [14, 4, 13], which was                 labeled Sentences
organized by the Institute of Information and Communication
Technology (ICTRC) and contest results are available at the
contest site.                                                                                 Detecting the existence of plagiarism
4.1 Experimental Results                                                                      or not
Table 2 shows the results of the implementation of the proposed                           Figure 2- diagram of detecting plagiarism
method on the datasets.


Table2- Experimental Results on Persian document dataset                 5. CONCLUSION AND FUTURE WORK
                                                                         Using a graph structure, we proposed a method to convert
dataset                        Precision       Recall    Palgdet         unstructured text into graphs, Graph-based approach provides the
                                                                         ability that takes advantage of the benefits of graph algorithms
PAN2015                        0.91            0.89      0.90
                                                                         and use them in natural language processing algorithms. In this
PAN2016(training)              0.90            0.89      0.89            paper, we discussed and analyzed the results of the generalized
                                                                         method to detect plagiarism on the inner levels (Text Alignment).
PAN2016(contest)               0.89            0.85      0.87            By achieving the Benchmark of Plagdet 87% without using
                                                                         linguistic corpora and grammatical rules, it is expected that more
                                                                         works in graph based approaches achieve better results in
As shown in the results, according to the evaluation criteria, the       plagiarism detection. Furthermore, being independent from rules
graph-based method has achieved favorable results without using          and corpora enables this method to detect plagiarism in other
linguistic corpora and only due to the structure of the text. Graph      languages. As a future work we want to increase the accuracy of
approach has unique features to detect similar documents. Among          the algorithm to detect semantic plagiarism using FarsNet lexical
these features, one can mention paying attention to non-adjacent         network. Since it is possible to add word's synonyms to the
words in a sentence. This feature makes plagiarism detection easy,       corresponding graph of the document, by adding synonym words,
because plagiarism is done by rearranging words, but in the              the accuracy of detecting semantic plagiarism is increased.
character-based methods attention is just on the relationship            Another important category of modern plagiarism, is plagiarism
between adjacent words. Furthermore, another feature of the              on summary of a text. Due to the flexibility of graph approach in
graph is considering the minimum threshold of similarity between         detecting plagiarism, graph approach is also efficient in detecting
                                                                         plagiarism on summary of a text.
1 The logarithm of the granularity to smooth its influence on the
overall score.
5. REFERENCES                                                             Overview of the 5th international competition on plagiarism
                                                                          detection, in CLEF Conference on Multilingual and
                                                                          Multimodal Information Access Evaluation, CELCT.
[1] Asghari, H., Mohtaj, S., Fatemi, O., Faili, H., Rosso, P., and
    Potthast, M., 2016. Algorithms and Corpora for Persian           [13] Potthast, M., Gollub, T., Rangel, F., Rosso, P., Stamatatos, E.
    Plagiarism Detection: Overview of PAN at FIRE 2016. In                and Stein, B., 2014, September. Improving the
    Working notes of FIRE 2016 - Forum for Information                    Reproducibility of PAN’s Shared Tasks. In International
    Retrieval Evaluation, Kolkata, India, December 7-10, 2016,            Conference of the Cross-Language Evaluation Forum for
    CEUR Workshop Proceedings, CEUR-WS.org.                               European Languages (pp. 268-299). Springer International
                                                                          Publishing.
[2] Barrón-Cedeño, A., Potthast, M., Rosso, P., Stein, B., and
    Eiselt, A. 2010. Corpus and Evaluation Measures for              [14] Potthast, M., Stein, B., Barrón-Cedeño, A. and Rosso, P.,
    Automatic Plagiarism Detection, In LREC.                              2010, August. An evaluation framework for plagiarism
                                                                          detection. In Proceedings of the 23rd international
[3] Flowerdew, J and Li, Y. 2007. Plagiarism and second                   conference on computational linguistics: Posters (pp. 997-
    language writing in an electronic age, Annual Review of               1005). Association for Computational Linguistics.
    Applied Linguistics, vol. 27, pp. 161--183.
                                                                     [15] Shamsfard, M., 2008. Developing FarsNet: A lexical
[4] Gollub, T., Stein, B. and Burrows, S., 2012, August. Ousting          ontology for Persian. In 4th Global WordNet Conference,
    ivory tower research: towards a web framework for                     Szeged, Hungary.
    providing experiments as a service. In Proceedings of the
    35th international ACM SIGIR conference on Research and          [16] Sonawane, S.S. and Kulkarni, P.A., 2014. Graph based
    development in information retrieval (pp. 1125-1126). ACM.            Representation and Analysis of Text Document: A Survey of
                                                                          Techniques. International     Journal    of     Computer
[5] ICT Research Institute, ACECR ,Iran, 2016. [Online].                  Applications, 96(19).
    Available: http://ictrc.ac.ir/plagdet/.
                                                                     [17] Torres, S. and Gelbukh, A., 2009. Comparing similarity
[6] Khoshnavataher, K., Zarrabi, V., Mohtaj, S., and Asghari, H.          measures for original WSD lesk algorithm. Research In
    2015. Developing Monolingual Persian Corpus for Extrinsic             Computing Science, 43, pp.155-166.
    Plagiarism Detection Using Artificial Obfuscation. Notebook
    for PAN at CLEF2015, 8-11 September, Toulouse, France.           [18] Zager, L., 2005. Graph similarity and matching (Doctoral
                                                                          dissertation, Massachusetts Institute of Technology).
[7] Kumar, N. 2014.A Graph Based Automatic Plagiarism
    Detection Technique to Handle Artificial Word Reordering         [19] Zini ,M. 2006 Plagiarism Detection through Multilevel Text
    and Paraphrasing, Computational Linguistics and Intelligent           Comparison, In Automated Production of Cross Media
    Text Processing, pp. 481--494.                                        Content for Multi-Channel Distribution,Second International
                                                                          Conference.
[8] Leacock, C., Miller, G.A. and Chodorow, M., 1998. Using
    corpus statistics and WordNet relations for sense                [20] Zini, M., Fabbri, M., Moneglia, M., and Panunzi. 2006.
    identification. Computational Linguistics, 24(1), pp.147-165.         Alessandro, Plagiarism detection through multilevel text
                                                                          comparison, In Second International Conference, IEEE.
[9] Leilei, K., Haoliang, Q., Shuai, W., Cuixia, D., Suhong, W.,
    and Yong, H. 2012. Approaches for candidate document
    retrieval and detailed comparison of plagiarism detection, in
    Notebook for PAN at CLEF 2012.
[10] [Online].        Available:         hazm         package,
     http://www.sobhe.ir/hazm. [Accessed 2015-04-30].
[11] [Online].          Available:         igraph        package,
     http://igraph.org/python/. [Accessed 2015-03-15].
[12] Potthast, M and Hagen, M ., Gollub, T., Tippmann, M.,
     Kiesel, J., Rosso, P., Stamatatos, E., and Stein, B. 2013.