Persian Plagiarism Detection Using Sentence Correlations

Muharram Mansoorizadeh
Department of Computer Engineering
Bu-Ali Sina University, Hamedan, Iran
mansoorm@basu.ac.ir

Taher Rahgooy
Department of Computer Engineering
Bu-Ali Sina University, Hamedan, Iran
taher.rahgooy@gmail.com

ABSTRACT
This report describes the Persian plagiarism detection system with which we submitted our run to the Persian PlagDet competition at FIRE 2016. The system works in four main stages. The first is pre-processing and tokenization. The second builds a corpus of sentences from each source-suspicious document pair; each sentence is treated as a document and represented as a tf-idf vector. The third constructs a similarity matrix between the source and suspicious sentences. Finally, the most similar sentence pairs whose similarity exceeds a specific threshold are marked as plagiarized segments. Our performance measures on the training corpus were promising (precision = 0.914, recall = 0.848, granularity = 3.85).

CCS Concepts
• Information systems ➝ Information retrieval ➝ Retrieval tasks and goals ➝ Near-duplicate and plagiarism detection.

Keywords
Persian PlagDet, plagiarism detection, document retrieval, tf-idf

1. INTRODUCTION
Plagiarism in academia is rising, and multiple authors have worked to describe the phenomenon [11]. As Hunt comments in [7], "Internet plagiarism" is sometimes referred to as a consequence of the "information technology revolution", as it proves to be a big problem in academia. According to Park [11], plagiarism has been analyzed from various perspectives and is considered a problem that grows over time. To tackle this problem, the most common approach so far is to detect plagiarism with automated algorithms based on text processing and string matching.

Two main strategies for plagiarism detection have been considered by researchers, namely intrinsic and external plagiarism detection [17] [5]. Intrinsic plagiarism detection aims at discovering plagiarism by examining only the input document, deciding whether parts of it are not from the same author. External plagiarism detection compares suspicious documents against a set of possible references. From exact document copies to paraphrasing, different levels of plagiarism techniques can be used in several contexts, according to Meyer zu Eissen [17].

For external plagiarism detection, Stein, Meyer zu Eissen, and Potthast [15] introduce a generic three-step retrieval process. The authors consider that the source of a plagiarism case may be hidden in a large reference collection, and that the detection results may not be perfectly accurate. Figure 1 illustrates this retrieval process. In fact, all detection approaches submitted by the competition participants can be explained in terms of these building blocks [5].

The process starts with a suspicious document dq and a collection D of documents from which dq's author may have plagiarized. Within a so-called heuristic retrieval step, a small number of candidate documents Dx, which are likely to be sources for plagiarism, are retrieved from D. Note that D can be as large as the entire web, so it is impractical to compare dq with all of its members; usually an initial inspection is made to select a rough subset of D as prospective candidates. In a so-called detailed analysis step, dq is compared section-wise with the retrieved candidates: all pairs of sections (sq, sx) with sq ∈ dq and sx ∈ dx, dx ∈ Dx, are to be retrieved such that sq and sx have a high similarity under some retrieval model. In a knowledge-based post-processing step, those sections are filtered out for which certain exclusion criteria hold, such as the use of proper citation or literal speech. The remaining suspicious sections are presented to a human, who may decide whether or not a plagiarism offense is given.

Figure 1: Generic retrieval process for external plagiarism detection [5].

Given a set of suspicious and source documents written in Persian, the plagiarism detection task of the Persian PlagDet competition [2] at FIRE 2016, an external plagiarism detection setup, is to find all plagiarized sections in the suspicious documents and, if available, the corresponding source sections. The problem consists of a set of source and suspicious documents and a list of source-suspicious pairs to evaluate. The challenge is to find and locate plagiarized fragments in each suspicious and source document of each pair. We tackle this challenge with a four-step method, described in the following.

2. RELATED WORK
Text plagiarism is a long-lasting game between authors and publishers [8]. Automatic plagiarism detection has a history as long as modern machine learning and text mining [15]. In its simplest case, plagiarism is the exact copying of others' published or unpublished work; more advanced cases try to paraphrase the stolen ideas in different words [1]. Studies confirm that plagiarists usually copy and paste the original text or just paraphrase it, keeping most of the text's keywords [16]. Text mining methods can reveal these simpler but widely used cases of plagiarism [10]. Deep semantic analysis of texts has been studied for detecting more advanced paraphrasing cases [3].

3. PROPOSED ALGORITHM
In the following we describe the steps we took to find the plagiarized segments of the document pairs.

3.1 Motivation
In her MSc thesis [16], Zahra Taheri investigated many scientific plagiarism cases. Most of the cases were near-copies or slightly modified versions of the original sentences. Every scientific field has a common and widely accepted terminology that the related community adopts in writing papers, books, and other types of publications. For example, in the machine learning literature the term feature is used to describe attributes of objects. As a dictionary term, property is synonymous with feature, but no one uses property and property extraction instead of feature and feature extraction in scientific writing that discusses classification and pattern recognition. Hence, a plagiarized text inherits many of the main terms of the original text.

3.2 Preprocessing and Feature Extraction
The document retrieval task is defined as follows [9]: given a document collection D and a query document q, find its most similar documents in D. The task is usually solved by representing documents in some vector space and exploiting similarities between the query document and the collection members in this space. The bag-of-words representation of documents is a classic yet powerful method for document indexing and retrieval. In this method a document is assumed to be the set of its terms, ignoring their relative positions in paragraphs and sentences. The dictionary is the set of all distinct terms in the collection. A document d is represented by a one-dimensional vector v, in which v_i denotes the relative importance of the i-th dictionary term in d. Using this representation, similarities between documents can be estimated by well-known techniques such as the Euclidean distance or the cosine of their respective vectors. Usually v_i is defined as a function of the appearances of the word in the document and in the whole collection. The term frequency (TF) of a term in a document is the relative frequency of the term in the document:

tf_i = F_i / |d|    (Eq 1)

In this equation F_i is the frequency of the i-th term in the document and |d| denotes the document length, i.e., the total number of terms in the document.
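As an illustrative sketch, the sentence-level tf-idf matching described in this and the following subsections (Eq 1 for tf, the idf weighting and cosine similarity of Eq 2-5, and the 0.4 threshold of Section 3.4) can be written in plain Python. The whitespace tokenizer and the toy English sentences below are simplified stand-ins for illustration only, not the NLTK-based pipeline used in our system:

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Build tf-idf vectors (term -> weight dicts) for tokenized
    documents: tf = F_i / |d| (Eq 1), idf = log(N / N_i) (Eq 2)."""
    N = len(docs)
    df = Counter()                      # N_i: number of docs containing term i
    for d in docs:
        df.update(set(d))
    vectors = []
    for d in docs:
        counts = Counter(d)
        vectors.append({t: (c / len(d)) * math.log(N / df[t])
                        for t, c in counts.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors (Eq 5)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy sentences; whitespace tokenization stands in for NLTK here.
source = ["water boils at one hundred degrees", "ice is frozen water"]
suspicious = ["water boils at a hundred degrees", "the sky is blue"]
docs = [s.split() for s in source + suspicious]

vecs = tf_idf_vectors(docs)
src, sus = vecs[:len(source)], vecs[len(source):]

# N x M similarity matrix and thresholding at 0.4 (Section 3.4).
S = [[cosine(u, v) for v in sus] for u in src]
pairs = [(i, j) for i, row in enumerate(S)
         for j, sim in enumerate(row) if sim > 0.4]
```

On this toy input, only the near-copy pair (source sentence 0, suspicious sentence 0) passes the threshold, which mirrors the motivation above: a near-copy keeps most of the original key terms and therefore scores high under the cosine measure.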
Assuming that there are in total N documents in the collection and that N_i of them contain the i-th term, the inverse document frequency (IDF) is defined as:

idf_i = log(N / N_i)    (Eq 2)

Finally, as a measure of the importance of a word in a document, tf-idf is defined as:

tf-idf_i = tf_i × idf_i    (Eq 3)

tf-idf favors terms with a high frequency in the document but a low frequency in the collection. Common terms such as it, is, and what, which appear in most of the documents, receive a low tf-idf value. These so-called stop words cannot distinguish the documents efficiently; hence they are usually removed in the pre-processing steps of NLP tasks.

We adopt tf-idf for the representation and retrieval of matching documents. Each source and suspicious document is first tokenized and then split into sentences; this step is performed with the NLTK 3.0 toolkit [4]. Each sentence is then treated as an individual document, and all the sentences of the source and suspicious documents constitute the document collection.

A syntactic example is demonstrated in Figure 2. The key terms in the original sentence are water, boiling, and one hundred. To communicate the same concept, plagiarized versions of the sentence have to use these terms, with re-ordering and/or changed verb tenses. This fact narrows the task of plagiarism detection down to classic document matching and retrieval.

Original sentence
‫نقطه جوش آب صد درجه سانتیگراد است‬
Boiling point of water is one hundred degrees centigrade.

Plagiarized sentences
‫نقطه جوش آب صد درجه سلسیوس است‬
Boiling point of water is one hundred degrees Celsius.

‫ درجه سانتیگراد است‬011 ‫نقطه جوش آب‬
Boiling point of water is 100 degrees.

‫آب در صد درجه سانتیگراد به جوش می آید‬
Water boils at one hundred degrees.

‫اگر دمای آب به صد درجه سانتیگراد برسد، می جوشد‬
If the temperature of water reaches one hundred degrees, it will boil.

‫اگر آب را تا صد درجه حرارت دهید می جوشد‬
If you heat water up to one hundred degrees, it boils.

Figure 2: A syntactic example of a Persian sentence and its possible plagiarized versions.

3.3 Similarity Matrix Construction
In this step, the similarity matrix between the source sentences and the suspicious sentences is calculated. If there are N source sentences and M suspicious sentences, the similarity matrix S has N×M elements, in which S_i,j denotes the similarity between sentence i of the source document and sentence j of the suspicious document.

There are many different similarity and distance measures. One option is the Euclidean distance, which measures the distance between two vectors u and v in the encompassing Euclidean space. To convert this distance into a similarity measure we can use:

sim(u, v) = 1 / (1 + ‖u − v‖)    (Eq 4)

Another choice, which we used, is the cosine similarity:

cos(u, v) = u·v / (‖u‖ ‖v‖)    (Eq 5)

This measure ranges from -1 to 1. A value of 1 means that u and v point in the same direction, and -1 means that they point in opposite directions. We use the cosine similarity in the subsequent steps; initial inspection and evaluation confirmed that it gives slightly better results than the Euclidean distance.

3.4 Finding Plagiarized Fragments
We mark pairs of sentences whose similarity is greater than a pre-specified threshold as plagiarized fragments, for both the source and the suspicious documents. The threshold value we used was 0.4, obtained through cross-validation.

4. EVALUATION
The training results were obtained by running our application on the TIRA platform [6] [13]. The evaluation measures used for this task are precision, recall, and granularity. Another measure, called plagdet, combines the aforementioned measures in order to sort the results of all algorithms and compare them in a more objective way [12] [14].

The detection granularity of a collection R of detections with respect to a collection S of source cases is defined as:

gran(S, R) = (1 / |S_R|) Σ_{s ∈ S_R} |R_s|    (Eq 6)

where S_R ⊆ S are the cases detected by detections in R, and R_s ⊆ R are the detections of a given s:

S_R = {s | s ∈ S ∧ ∃r ∈ R: r detects s}
R_s = {r | r ∈ R ∧ r detects s}

The domain of gran(S, R) is [1, |R|], with 1 indicating the desired one-to-one correspondence and |R| indicating the extreme case, where a single s ∈ S is detected multiple times.

Precision, recall, and granularity allow for a partial ordering among plagiarism detection algorithms. To obtain an absolute order they must be combined into an overall score:

plagdet(S, R) = F / log_2(1 + gran(S, R))    (Eq 7)

where F denotes the F-measure, i.e., the weighted harmonic mean of precision and recall. Taking the logarithm of the granularity decreases its impact on the overall score [14]. The results obtained by our algorithm are shown in Table 1.

Table 1. Evaluation results for the proposed algorithm

Threshold | Precision | Recall | Granularity | Plagdet
0.4       | 0.914     | 0.848  | 3.859       | 0.385
0.5       | 0.821     | 0.926  | 4.481       | 0.354

5. CONCLUSION
In this paper we proposed a sentence-level algorithm for the plagiarism detection task, based on tf-idf features, which shows competitive results on the training data. The algorithm is designed for near-copy and paraphrasing types of plagiarism. It relies on the fact that a plagiarist who wants to publish an article in a scientific field must use the popular terminology of that field. Obviously, sophisticated cases such as cross-language plagiarism, or grabbing another one's idea and discussing it in one's own words, would be hard to detect with this algorithm. In future work we will consider improving the feature vectors of the sentences by incorporating more features, and using a method to combine overlapping fragments. Furthermore, we will study language modeling and semantic text normalization for detecting harder cases of plagiarism.

6. REFERENCES
[1] Alzahrani, S.M., Salim, N. and Abraham, A., 2012. Understanding plagiarism linguistic patterns, textual features, and detection methods. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 42(2), pp. 133-149.
[2] Asghari, H., Mohtaj, S., Fatemi, O., Faili, H., Rosso, P. and Potthast, M., 2016. Algorithms and corpora for Persian plagiarism detection: Overview of PAN at FIRE 2016. In Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings. CEUR-WS.org.
[3] Barrón-Cedeño, A., Vila, M., Martí, M.A. and Rosso, P., 2013. Plagiarism meets paraphrasing: Insights for the next generation in automatic plagiarism detection. Computational Linguistics, 39(4), pp. 917-947.
[4] Bird, S., 2006. NLTK: The Natural Language Toolkit. In Proceedings of the COLING/ACL on Interactive Presentation Sessions (pp. 69-72). Association for Computational Linguistics.
[5] Potthast, M., Stein, B., Eiselt, A., Barrón-Cedeño, A. and Rosso, P., 2009. Overview of the 1st international competition on plagiarism detection. In 3rd PAN Workshop. Uncovering Plagiarism, Authorship and Social Software Misuse (p. 1).
[6] Gollub, T., Stein, B. and Burrows, S., 2012. Ousting ivory tower research: Towards a web framework for providing experiments as a service. In Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 1125-1126). ACM.
[7] Hunt, R., 2003. Let's hear it for internet plagiarism. Teaching Learning Bridges, 2(3), pp. 2-5.
[8] Mali, Yaser, 2011. Crib from the novelist, or thirteen best practices for how to do a clean plagiarism (in Persian).
[9] Manning, C.D., Raghavan, P. and Schütze, H., 2008. Introduction to Information Retrieval. Cambridge University Press.
[10] Oberreuter, G. and Velásquez, J.D., 2013. Text mining applied to plagiarism detection: The use of words for detecting deviations in the writing style. Expert Systems with Applications, 40(9), pp. 3756-3763.
[11] Park, C., 2003. In other (people's) words: Plagiarism by university students--literature and lessons. Assessment & Evaluation in Higher Education, 28(5), pp. 471-488.
[12] Potthast, M., Eiselt, A., Barrón-Cedeño, A., Stein, B. and Rosso, P., 2011. Overview of the 3rd international competition on plagiarism detection. In CEUR Workshop Proceedings.
[13] Potthast, M., Gollub, T., Rangel, F., Rosso, P., Stamatatos, E. and Stein, B., 2014. Improving the reproducibility of PAN's shared tasks. In International Conference of the Cross-Language Evaluation Forum for European Languages (pp. 268-299). Springer International Publishing.
[14] Potthast, M., Stein, B., Barrón-Cedeño, A. and Rosso, P., 2010. An evaluation framework for plagiarism detection. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters (pp. 997-1005). Association for Computational Linguistics.
[15] Stein, B., Meyer zu Eissen, S. and Potthast, M., 2007. Strategies for retrieving plagiarized documents. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 825-826). ACM.
[16] Taheri, Zahra, 2012. Plagiarism Detection in Scientific Texts Using Structural and Semantic Relations. MSc thesis.
[17] Meyer zu Eissen, S., Stein, B. and Kulig, M., 2007. Plagiarism detection without reference collections. In Advances in Data Analysis (pp. 359-366). Springer Berlin Heidelberg.