Graph-based Approach to Text Alignment for Plagiarism Detection in Persian Documents Mozhgan Momtaz Kayvan Bijari Mostafa Salehi Hadi Veisi Faculty of New Sciences and Faculty of New Sciences and Faculty of New Sciences and Faculty of New Sciences and Technologies Technologies Technologies Technologies University of Tehran, University of Tehran, University of Tehran, University of Tehran, Tehran Tehran Tehran Tehran m.momtaz92@ut.ac.ir kayvan.bijari@ut.ac.ir mostafa_salehi@ut.ac.ir h.veisi@ut.ac.ir ABSTRACT Categories of text alignment dataset are based on PAN This paper presents a new approach for Persian plagiarism Competitions [12]. detection. This approach uses a graph structure as well as one of the graph similarity methods (iterative methods) for similarity detection of two Persian documents. In this approach, documents are represented by a graph with specified length, then each part of Monolingual Source Retrieval suspicious document is compared to that of the source document. Different The graph is made if these parts have more common bigrams than kinds of a predefined threshold. Once graphs are made, an iterative method documents is used to find analogous nodes in graphs. Two graphs are marked plagiarism Cross-language Text alignment as similar if they contain at least a certain number of similar nodes. In order to evaluate the proposed method, it was run on PAN2015 and PAN2016 Persian Text Alignment dataset. The Plagdet score is defined to evaluate plagiarism detection methods in PAN contest. The gained Plagdet of proposed method is 90% on PAN2015 and 87% on PAN2016. Translation Obfuscation None Obfuscation Summary Obfuscation Random Obfuscation None plagiarism CCS Concepts • Information systems➝ Plagiarism Detection software • Computing methodologies➝ Graph-based Keywords Plagiarism Detection, Text Based Graph Representation, Text Alignment. Figure 1- Different kinds of documents plagiarism 1. INTRODUCTION Nowadays, a large volume of information is a compound of The proposed method, converts every document to a number of different types of contextual data such as books, articles and other parts with specific length using the graph approach idea and then documents and this volume is growing increasingly. In many if necessary, it converts each part to a graph for precise plagiarism cases, we need to identify either the duplicated documents or the examination, which creates graph based on simultaneous ones which are near-copy documents among the many cases. In occurrence of the words in fixed window size. After this step, the this regard, Plagiarism detection in documents is one of the main similarity between two graphs measured using node similarity topics which gained attention among researchers in the recent measures, if the rate of similarity is more than the specified years. The act of plagiarism is to use other author's writing or threshold, then that part is labeled as plagiarism. The proposed ideas, without proper appreciation to the author or proper citation method was run on the PAN2015 dataset [6] and PAN2016 to the original source [3]. In the recent years, identifying dataset (Persian Plagdet2016 contest [5]). plagiarism has become easier using different systems, but The rest of paper is organized as follows: Section 2 is devoted to different types of plagiarism is still an issue. In some types of related works; Section 3 presents the graph-based methodology plagiarism, the structure of the document is changed by for plagiarism detection. In Section 4, experimental evaluation of rearranging the words or using synonym words. Therefore, the the proposed method is given, and finally Section 5 ends the paper results of basic plagiarism detection methods are not acceptable. with conclusion and future works. So the need for more sophisticated methods for plagiarism detection is growing. Different kinds of plagiarism are shown in figure 1. 2. RELATED WORK model and the solutions based on graph model are summarized in In order to detect monolingual plagiarism, various methods have table 1. been proposed. In this section, each of these methods is explained Table1- Some issues of the bag-of-words model and the briefly. solutions based on graph-based model for plagiarism Character-based methods: The most important method is the detection applications fingerprint method. Fingerprint algorithms [20] consider the text Issues of bag-of-words Graph-based solution as a series of characters and then they divide the characters in n- methods to model text for series groups, the most important algorithms include 16-gram, 8- plagiarism detection gram and 5-gram methods. In this method, the degree of similarity applications depends on the number of similar characters in a string. Although this method ends up with a good result in detecting plagiarism, but Ignoring order of the words Using directed graph (step 3 when plagiarism has some paraphrasing or modified words, this of proposed method) method does not act in an effective way to detect plagiarism. Ignoring the structure of the Considering the whole Structural-based methods: In the previous method, the only text sentence as a graph (step 2 attention is on words as features of the document. However, in the and 3 of proposed method) structural-based methods [9], pays attention to the titles, Neglecting word synonyms Ability to add synonyms to paragraphs, sections and resources. One of the most famous the graph corresponding text structural-based methods is tree-based method, which gained (step 3 of proposed method) much attention in the recent years. In the tree-structured model, a two-layer model is defined that the top layer is for retrieving documents and the bottom layer is considered for detecting plagiarism between retrieved documents using the methods of 3. METHODOLOGY similarity detection such as cosine similarity method. Any textual document can be presented via a corresponding graph. Graph based representation of text is important because it Classification-based methods: In this method the documents enables us to turn an unstructured text into a structural text, and are classified based on specific words (or keywords) [19]. The then the advantageous of graph representation can be applied to primary goal in this approach is to retrieve similar documents and text summarization, identifying similarities of the documents, and to speed up the process of plagiarism detection. applications of text mining. For natural language processing Semantic-Based methods: This method uses lexical network applications text graph of documents should be built. In a text in order to find semantic similarity for plagiarism detection [17]. graph, nodes represent words of the document, and the edges The most famous lexical network in English language is WordNet present the relation between different words. The relation of [8]. By means of WordNet, it is possible to achieve more words can vary from application to application. The proposed information about a special word. This method is effective when plagiarism detection method is consist of 5 steps that will be plagiarism is done using synonym words. FarsNet [15] the Persian discussed further in the following. equivalent of WordNet, is also proposed for Persian language. Step 1. preprocessing: normalization is one of the basic steps Graph-based methods: In this method, the text will be in text mining and text processing. In the normalization process, converted into a graph. Nodes in the graph can be words, phrases punctuations and stop words are removed. In this paper, we have or even sentences of the text. Edges in the graph represent the link used Hazm package [10] for Persian text normalization. or relation between nodes and they can show the semantic link Step 2. turn text to set of clauses: suspicious document and between words or the simultaneous occurrence in one sentence the reference document are divided to a series of sentences. Each [16]. This method will be discussed more in the proposed method sentence of the suspected document will be compared to all the section. Converting a text into a graph, enables us to detect sentences of the reference document. In this step, a filtering will plagiarism using the advantages of graph similarity algorithms be done on sentences in order to reduce runtime. Finally, if the [7]. two sentences at least have the cosine similarity of 0.4, then they In this paper, the proposed method is a combination of structural- go to the step of graph making process (this value is obtained based method and graph-based method. Paying attention to the experimentally), otherwise the comparison will continue to other structure of the text leads to detection of plagiarism in the sentences of the reference document. document even if the plagiarism is the type in which the structure Step 3. creating corresponding graph: in graph creation of the text is changed. step, each sentence will be converted to a graph. The nodes of the One common and standard approach to model text document is graph are main and unique words, and in this graph, an edge will bag-of-words. This model is suitable for capturing word be established between a specific word and 4 words before and frequency [16]. Assuming that order of the word’s occurrence has after it in the document. Igraph package [11] is used for graph nothing to do with its meaning; this model has a proper result in creation in this paper. information retrieval. The drawback of this model is when it tries Step 4. plagiarism detection: when graph creation is to find the reused text and plagiarism between different parts of complete, we are looking for nodes in the suspected document the text, if a reused text is occurring by using synonym words then that is common with the node of the reference document. An this model could not properly detect the reused text. Furthermore, iterative method based on simple idea indicates that the two this model doesn’t express the meaning and the structure of the graphs are similar when they have similar nodes, and nodes are text [16]. However, Graph representation is mathematical similar if they have analogous neighbors [18]. We use this method constructs and can model different word’s relation and textual for our specific graph. Then we find their similarity using structure of the documents [16]. Some issues of the bag-of-words equation 1: the two nodes. In this case, if plagiarism is done by add and ( ) ( ( ) ( )) )1( removal of the words, by considering minimum similarity threshold, these changes do not have much negative impact on Where B is primary neighbors of the common node in the plagiarism detection. The result of Persian Plagdet contest is also graph of reference document. And A is primary neighbors reported in [1]. of the common node in the graph of suspicious document. If the similarities between two nodes is greater than the threshold α (α = 0.4), then that node is selected as the similar node. Finally, Preprocessing if a sentence has more similar nodes than the threshold β (β=1/3(the number of key words in suspicious document)), that statement is labeled as one of the sentences which plagiarism has Document suspected Reference documents occurred. α and β are thresholds that are achieved experimentally of plagiarism in the database and they are based on performances of the system. Step 5. Integrate plagiarism labeled Sentences: In this Separating text to a set of sentences step, we integrate sentences with plagiarism label (output of step 4 of the algorithm) based on start and end offset of sentences in text. This step important for granularity measure1 [2]. If there exist no labeled sentences, we can assume that no plagiarism is occurred. Sentence 1 of suspected Sentence 1 of reference document document 4. EXPERIMENTAL EVALUATION In this section the results of the implementation of the proposed method on the plagiarism data sets are given. Moreover, in the following we are going to focus on analyzing results. The two Choosing words to build a graph based data sets that were used for analyzing the proposed method are as on minimum similarity follows. 1. Persian Text alignment dataset PAN2015: This dataset is published on the website of PAN contest. Comparing paired generated graphs with each other and 2. Persian Text Alignment dataset PAN2016: This dataset identify similar graphs. (Sentences of the similar graphs contains 2749 training and testing documents and are related to will get similarity label). And Aggregation of similar Persian Plagdet2016 international contest [14, 4, 13], which was labeled Sentences organized by the Institute of Information and Communication Technology (ICTRC) and contest results are available at the contest site. Detecting the existence of plagiarism 4.1 Experimental Results or not Table 2 shows the results of the implementation of the proposed Figure 2- diagram of detecting plagiarism method on the datasets. Table2- Experimental Results on Persian document dataset 5. CONCLUSION AND FUTURE WORK Using a graph structure, we proposed a method to convert dataset Precision Recall Palgdet unstructured text into graphs, Graph-based approach provides the ability that takes advantage of the benefits of graph algorithms PAN2015 0.91 0.89 0.90 and use them in natural language processing algorithms. In this PAN2016(training) 0.90 0.89 0.89 paper, we discussed and analyzed the results of the generalized method to detect plagiarism on the inner levels (Text Alignment). PAN2016(contest) 0.89 0.85 0.87 By achieving the Benchmark of Plagdet 87% without using linguistic corpora and grammatical rules, it is expected that more works in graph based approaches achieve better results in As shown in the results, according to the evaluation criteria, the plagiarism detection. Furthermore, being independent from rules graph-based method has achieved favorable results without using and corpora enables this method to detect plagiarism in other linguistic corpora and only due to the structure of the text. Graph languages. As a future work we want to increase the accuracy of approach has unique features to detect similar documents. Among the algorithm to detect semantic plagiarism using FarsNet lexical these features, one can mention paying attention to non-adjacent network. Since it is possible to add word's synonyms to the words in a sentence. This feature makes plagiarism detection easy, corresponding graph of the document, by adding synonym words, because plagiarism is done by rearranging words, but in the the accuracy of detecting semantic plagiarism is increased. character-based methods attention is just on the relationship Another important category of modern plagiarism, is plagiarism between adjacent words. Furthermore, another feature of the on summary of a text. Due to the flexibility of graph approach in graph is considering the minimum threshold of similarity between detecting plagiarism, graph approach is also efficient in detecting plagiarism on summary of a text. 1 The logarithm of the granularity to smooth its influence on the overall score. 5. REFERENCES Overview of the 5th international competition on plagiarism detection, in CLEF Conference on Multilingual and Multimodal Information Access Evaluation, CELCT. [1] Asghari, H., Mohtaj, S., Fatemi, O., Faili, H., Rosso, P., and Potthast, M., 2016. Algorithms and Corpora for Persian [13] Potthast, M., Gollub, T., Rangel, F., Rosso, P., Stamatatos, E. Plagiarism Detection: Overview of PAN at FIRE 2016. In and Stein, B., 2014, September. Improving the Working notes of FIRE 2016 - Forum for Information Reproducibility of PAN’s Shared Tasks. In International Retrieval Evaluation, Kolkata, India, December 7-10, 2016, Conference of the Cross-Language Evaluation Forum for CEUR Workshop Proceedings, CEUR-WS.org. European Languages (pp. 268-299). Springer International Publishing. [2] Barrón-Cedeño, A., Potthast, M., Rosso, P., Stein, B., and Eiselt, A. 2010. Corpus and Evaluation Measures for [14] Potthast, M., Stein, B., Barrón-Cedeño, A. and Rosso, P., Automatic Plagiarism Detection, In LREC. 2010, August. An evaluation framework for plagiarism detection. In Proceedings of the 23rd international [3] Flowerdew, J and Li, Y. 2007. Plagiarism and second conference on computational linguistics: Posters (pp. 997- language writing in an electronic age, Annual Review of 1005). Association for Computational Linguistics. Applied Linguistics, vol. 27, pp. 161--183. [15] Shamsfard, M., 2008. Developing FarsNet: A lexical [4] Gollub, T., Stein, B. and Burrows, S., 2012, August. Ousting ontology for Persian. In 4th Global WordNet Conference, ivory tower research: towards a web framework for Szeged, Hungary. providing experiments as a service. In Proceedings of the 35th international ACM SIGIR conference on Research and [16] Sonawane, S.S. and Kulkarni, P.A., 2014. Graph based development in information retrieval (pp. 1125-1126). ACM. Representation and Analysis of Text Document: A Survey of Techniques. International Journal of Computer [5] ICT Research Institute, ACECR ,Iran, 2016. [Online]. Applications, 96(19). Available: http://ictrc.ac.ir/plagdet/. [17] Torres, S. and Gelbukh, A., 2009. Comparing similarity [6] Khoshnavataher, K., Zarrabi, V., Mohtaj, S., and Asghari, H. measures for original WSD lesk algorithm. Research In 2015. Developing Monolingual Persian Corpus for Extrinsic Computing Science, 43, pp.155-166. Plagiarism Detection Using Artificial Obfuscation. Notebook for PAN at CLEF2015, 8-11 September, Toulouse, France. [18] Zager, L., 2005. Graph similarity and matching (Doctoral dissertation, Massachusetts Institute of Technology). [7] Kumar, N. 2014.A Graph Based Automatic Plagiarism Detection Technique to Handle Artificial Word Reordering [19] Zini ,M. 2006 Plagiarism Detection through Multilevel Text and Paraphrasing, Computational Linguistics and Intelligent Comparison, In Automated Production of Cross Media Text Processing, pp. 481--494. Content for Multi-Channel Distribution,Second International Conference. [8] Leacock, C., Miller, G.A. and Chodorow, M., 1998. Using corpus statistics and WordNet relations for sense [20] Zini, M., Fabbri, M., Moneglia, M., and Panunzi. 2006. identification. Computational Linguistics, 24(1), pp.147-165. Alessandro, Plagiarism detection through multilevel text comparison, In Second International Conference, IEEE. [9] Leilei, K., Haoliang, Q., Shuai, W., Cuixia, D., Suhong, W., and Yong, H. 2012. Approaches for candidate document retrieval and detailed comparison of plagiarism detection, in Notebook for PAN at CLEF 2012. [10] [Online]. Available: hazm package, http://www.sobhe.ir/hazm. [Accessed 2015-04-30]. [11] [Online]. Available: igraph package, http://igraph.org/python/. [Accessed 2015-03-15]. [12] Potthast, M and Hagen, M ., Gollub, T., Tippmann, M., Kiesel, J., Rosso, P., Stamatatos, E., and Stein, B. 2013.