=Paper=
{{Paper
|id=Vol-1391/144-CR
|storemode=property
|title=Developing Monolingual English Corpus for Plagiarism Detection using Human Annotated Paraphrase Corpus
|pdfUrl=https://ceur-ws.org/Vol-1391/144-CR.pdf
|volume=Vol-1391
|dblpUrl=https://dblp.org/rec/conf/clef/MohtajAZ15
}}
==Developing Monolingual English Corpus for Plagiarism Detection using Human Annotated Paraphrase Corpus==
Notebook for PAN at CLEF 2015

Salar Mohtaj, Habibollah Asghari, Vahid Zarrabi
ICT Research Institute, Academic Center for Education, Culture and Research (ACECR), Iran
{salar.mohtaj, habib.asghari, vahid.zarrabi}@ictrc.ir

Abstract. In this paper, we describe an approach to creating a monolingual English plagiarism detection corpus for the text alignment corpus construction task of the PAN 2015 competition. We propose two different methods of fragment obfuscation for creating the plagiarism cases. The first is an artificial obfuscation method that applies a variety of obfuscation strategies such as synonym substitution, random change of order, POS-preserving change of order, and addition/deletion. The second is a simulated obfuscation method, in which the SemEval dataset of sentence pairs annotated with similarity scores is used to create the plagiarism cases.

Keywords: Plagiarism Detection, Corpus Construction, Monolingual English Corpus, Human Annotated Paraphrase Corpus

1 Introduction

Plagiarism is defined as the re-use of another person's ideas, processes, results, or words without explicitly acknowledging the original source [1]. Plagiarism detection algorithms search large document collections to retrieve and extract patterns of text reuse [2]. Plagiarism detection systems are among the tools that have been used to fight plagiarism and the improper use of others' text [3]. When developing a plagiarism detection system, a plagiarism detection corpus, consisting of predefined, annotated plagiarized material, is used to evaluate the system.

The plagiarism detection task has been run for seven years in the PAN competition, and each year it provides a corpus for evaluating the submitted systems. The evaluation corpora in PAN are used for the text alignment and source retrieval subtasks of plagiarism detection [4]. A variety of obfuscation strategies have been used to create text alignment corpora, such as artificial obfuscation, simulated obfuscation, translation, and summary obfuscation [2, 4, 5, 6, 7].

In this lab report, we describe our approach to generating a monolingual English corpus for the text alignment corpus construction task. We employ two obfuscation strategies, an artificial and a simulated one. Our main contribution is the use of the SemEval dataset for constructing simulated plagiarism cases: the similarity scores of the paired sentences in the SemEval dataset are used to establish the degree of obfuscation of the plagiarism cases.

In the following, Section 2 describes our approach to corpus construction. Section 3 discusses the statistics of the resulting corpus, which is based on Wikipedia articles. Finally, Section 4 concludes and discusses future work.

2 Our Approach

In this section, an overview of our approach to constructing a monolingual English plagiarism detection corpus is presented. Our approach includes four main steps: document clustering, fragment extraction, fragment obfuscation, and insertion of the plagiarism cases into the source and suspicious documents. Each step is described in the following sections.

2.1 Documents Clustering

The documents used in the corpus are derived from the Wikipedia Internet encyclopedia project. In this step, the collection of Wikipedia documents is clustered into topically related groups. Since pages on similar subjects are intended to be grouped together via categories, a bipartite document-category graph is built and the documents are clustered by topic through its community structure. The Infomap community detection algorithm [9] is applied to the graph, and the documents within one community are treated as topically similar documents forming one cluster. Each suspicious document and its corresponding source documents are selected from the same cluster.
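As an illustration, the sketch below builds such a bipartite document-category graph and clusters it with Infomap using python-igraph. The document-category pairs and all identifiers are illustrative placeholders, not the actual implementation used for the corpus.

```python
# Sketch of the document clustering step: build a bipartite
# document-category graph and cluster it with Infomap (python-igraph).
# Names such as `doc_category_pairs` are illustrative placeholders.
from collections import defaultdict
import igraph as ig

# (document title, category name) pairs extracted from a Wikipedia dump
doc_category_pairs = [
    ("Albert Einstein", "Physicists"),
    ("Niels Bohr", "Physicists"),
    ("Graph theory", "Mathematics"),
    ("Leonhard Euler", "Mathematics"),
]

# Build an undirected graph; vertices are created from the names
# occurring in the edge list (documents on one side, categories on the other).
graph = ig.Graph.TupleList(doc_category_pairs, directed=False)

# Detect communities with the Infomap algorithm.
communities = graph.community_infomap()

# Group document vertices (not category vertices) by community id;
# each group becomes one topical cluster of the corpus.
documents = {doc for doc, _ in doc_category_pairs}
clusters = defaultdict(list)
for vertex, community_id in zip(graph.vs, communities.membership):
    if vertex["name"] in documents:
        clusters[community_id].append(vertex["name"])

for cid, docs in clusters.items():
    print(f"cluster {cid}: {docs}")
```

Suspicious documents and their sources can then be drawn from the same cluster, as described above.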
2.2 Fragments Extraction

The documents used in the corpus are divided into two categories: 50% of the documents are considered source documents and 50% are designated as suspicious documents. Note that only 25% of the suspicious documents contain plagiarism cases.

We used two different methods for fragment extraction. In the first method, the fragments are extracted from the source documents, while in the second method, the SemEval dataset is used for fragment extraction. The length of the fragments is evenly distributed between 3 and 12 sentences. The distribution of fragment lengths is shown in Table 1.

Table 1. Fragment lengths in sentences
  Short    3 - 5 sentences
  Medium   6 - 8 sentences
  Long     9 - 12 sentences

2.3 Fragments Obfuscation

We propose two strategies for obfuscating the extracted fragments: artificial obfuscation and simulated obfuscation. In the following, we describe each strategy.

Artificial Obfuscation. To generate artificial plagiarism, obfuscation strategies are applied to the fragments extracted from the source documents. We used the following five strategies:

None (no obfuscation). The source fragment is taken as the obfuscated fragment without any change; in other words, the obfuscated fragment is an exact copy of the source fragment.

Random change of order. Given a source fragment, the obfuscated fragment is created by shuffling its words at random.

POS-preserving change of order. The sequence of part-of-speech (POS) tags of the source fragment is determined first. The words are then shuffled at random while the original POS sequence is retained.

Synonym substitution. The plagiarized fragment is created by replacing some words with one of their synonyms.

Addition/deletion. The obfuscated fragment is created by inserting or removing words at random.
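A minimal sketch of the four non-trivial operations is given below, assuming NLTK with the averaged_perceptron_tagger and wordnet resources is available; the function names and rate parameters are illustrative and not those of the original implementation.

```python
# Sketch of the artificial obfuscation operations described above.
# Assumes NLTK with the 'averaged_perceptron_tagger' and 'wordnet'
# resources installed; function names and rates are illustrative.
import random
from collections import defaultdict

import nltk
from nltk.corpus import wordnet


def random_change_of_order(words):
    """Shuffle all words of the fragment at random."""
    shuffled = words[:]
    random.shuffle(shuffled)
    return shuffled


def pos_preserving_change_of_order(words):
    """Shuffle words while keeping the original POS-tag sequence."""
    tagged = nltk.pos_tag(words)
    # Pool the words of each POS tag and shuffle each pool ...
    pools = defaultdict(list)
    for word, tag in tagged:
        pools[tag].append(word)
    for pool in pools.values():
        random.shuffle(pool)
    # ... then re-emit one word per position, so the tag sequence is kept.
    return [pools[tag].pop() for _, tag in tagged]


def synonym_substitution(words, rate=0.3):
    """Replace a fraction of the words with a WordNet synonym."""
    result = []
    for word in words:
        lemmas = {l.name().replace("_", " ")
                  for s in wordnet.synsets(word) for l in s.lemmas()}
        lemmas.discard(word)
        if lemmas and random.random() < rate:
            result.append(random.choice(sorted(lemmas)))
        else:
            result.append(word)
    return result


def addition_deletion(words, rate=0.1):
    """Randomly delete some words and duplicate (insert) others."""
    result = []
    for word in words:
        if random.random() < rate:
            continue                  # deletion
        result.append(word)
        if random.random() < rate:
            result.append(word)       # insertion
    return result


fragment = "the quick brown fox jumps over the lazy dog".split()
print(" ".join(pos_preserving_change_of_order(fragment)))
```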
Simulated Obfuscation. Pairs of sentences from the dataset of the SemEval semantic textual similarity task are used to construct the simulated plagiarism cases. The dataset contains pairs of semantically similar sentences together with a similarity score; the score ranges from complete unrelatedness to exact semantic equivalence, quantified by values between zero and five [8]. To create the plagiarism cases, we ignore weakly related sentence pairs with a similarity score lower than 3.

In this strategy, both the source and the plagiarized fragments are built from SemEval sentences: a source fragment is constructed from the original sentences, and the corresponding plagiarized fragment is created from the sentences paired with them in the dataset. To control the degree of obfuscation of a plagiarized fragment, each fragment combines sentences with a variety of similarity scores. The number of sentences and their similarity scores determine the degree of obfuscation of the fragment; more precisely, using sentences with a higher similarity score (e.g. 5) leads to plagiarized fragments with a lower degree of obfuscation, and vice versa. The distribution of sentences used to create the different degrees of obfuscation (namely "Low", "Medium", and "High") is shown in Table 2.

Table 2. Obfuscation degree in simulated plagiarism cases (share of sentences per similarity score)
  Degree   Score 3      Score 4      Score 5
  Low      -            1% - 15%     85% - 100%
  Medium   -            25% - 45%    55% - 75%
  High     45% - 65%    35% - 55%    -
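As a rough sketch of how a single simulated case could be assembled from scored sentence pairs according to the ranges in Table 2, the code below samples sentences per similarity score; the data structures, names, and sampling details are illustrative assumptions, not the exact procedure used to build the corpus.

```python
# Sketch: assemble one simulated plagiarism case from SemEval-style
# scored sentence pairs, following the ranges of Table 2.
# `ScoredPair`, the profiles, and the sampling routine are illustrative.
import random
from dataclasses import dataclass


@dataclass
class ScoredPair:
    source: str      # original sentence
    paraphrase: str  # sentence paired with it in the STS data
    score: int       # similarity score, rounded to 3, 4 or 5


# Target share of sentences per similarity score (ranges from Table 2);
# each entry gives (low, high) bounds of the share for scores 3, 4, 5.
OBFUSCATION_PROFILES = {
    "low":    {3: (0.00, 0.00), 4: (0.01, 0.15), 5: (0.85, 1.00)},
    "medium": {3: (0.00, 0.00), 4: (0.25, 0.45), 5: (0.55, 0.75)},
    "high":   {3: (0.45, 0.65), 4: (0.35, 0.55), 5: (0.00, 0.00)},
}


def build_case(pairs, degree, n_sentences):
    """Pick roughly n_sentences pairs so that the share of each
    similarity score matches the requested obfuscation degree."""
    profile = OBFUSCATION_PROFILES[degree]
    by_score = {s: [p for p in pairs if p.score == s] for s in (3, 4, 5)}

    chosen = []
    for score in (3, 4, 5):
        low, high = profile[score]
        share = random.uniform(low, high)
        k = min(round(share * n_sentences), len(by_score[score]))
        chosen += random.sample(by_score[score], k)
    random.shuffle(chosen)
    chosen = chosen[:n_sentences]

    # The source fragment keeps the original sentences; the plagiarized
    # fragment uses the paired (paraphrased) sentences.
    source_fragment = " ".join(p.source for p in chosen)
    plagiarized_fragment = " ".join(p.paraphrase for p in chosen)
    return source_fragment, plagiarized_fragment
```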
2.4 Inserting Plagiarism Cases into Suspicious Documents

In this step, one or more plagiarism cases are selected from within the same cluster, according to the suspicious document's length. Each of them is then inserted at a random position in the suspicious document. For simulated plagiarism cases, the corresponding source fragments are also inserted at random positions in the source documents. The fraction of plagiarism in each document is not fixed: the percentage of plagiarism in each suspicious document is distributed between 5% and 60% of its length. The ratio of plagiarism per suspicious document is shown in Table 3.

Table 3. Ratio of plagiarism fragments in documents
  Hardly   5% - 20%
  Medium   20% - 40%
  Much     40% - 60%

Finally, for each pair of source and suspicious documents, a metadata file is created which contains meta-information about the plagiarism cases. The tags in the file include:

this_length: the length of the plagiarism case in the suspicious document.
this_offset: the start offset of the plagiarism case in the suspicious document.
source_reference: the name of the source document.
source_length: the length of the source fragment in the source document.
source_offset: the start offset of the source fragment in the source document.

3 Results

In this section, the results and statistics of the monolingual English corpus for the PAN 2015 competition are presented. The corpus is based on Wikipedia documents. The results of the corpus construction are shown in Table 4.

Table 4. Statistics of the Human Annotated Paraphrase Corpus

Document statistics
  Document purpose
    Number of source documents: 3309
    Number of suspicious documents: 952
  Plagiarism per document
    Hardly (5% - 20%):  60%
    Medium (20% - 40%): 25%
    Much (40% - 60%):   15%

Plagiarism case statistics
  Plagiarism cases
    No obfuscation:        10%
    Random obfuscation:    78%
    Simulated obfuscation: 12%
  Case length
    Short (3 - 5 sentences):  50%
    Medium (6 - 8 sentences): 32%
    Long (9 - 12 sentences):  18%

The established English monolingual plagiarism detection corpus is available for research purposes on the website of the Research Institute for Information and Communication Technology at http://www.ictrc.ir/plaglab/corpora/MonoLingual_English_Corpus(mohtaj15).zip.

4 Conclusion and Future Work

In this lab report, we described our approach to constructing a monolingual plagiarism detection corpus. We used two obfuscation strategies to create the corpus. The first is an artificial obfuscation strategy, in which the plagiarized fragments are created automatically. In the second strategy, named simulated obfuscation, both the source and the plagiarized fragments are created from the SemEval dataset, and the degree of obfuscation of the simulated plagiarism cases is based on the similarity scores of the paired sentences. The corpus is intended to be used for testing the performance of plagiarism detection systems for the English language. Although the corpus consists of English text, the obfuscation strategies can also be applied to other languages. In future work, we plan to improve the corpus by implementing other obfuscation techniques.

Acknowledgement

This work has been accomplished at the ICT Research Institute, ACECR, with the support of the Vice Presidency for Science and Technology of Iran, grant No. 1164331. The authors gratefully acknowledge the support of the aforementioned organizations. Special thanks go to the members of the ITBM research group for their valuable collaboration.

References

1. Barrón-Cedeño, Alberto, Marta Vila, M. Antònia Martí, and Paolo Rosso. "Plagiarism meets paraphrasing: Insights for the next generation in automatic plagiarism detection." Computational Linguistics 39, no. 4 (2013): 917-947.
2. Potthast, Martin, Matthias Hagen, Anna Beyer, Matthias Busse, Martin Tippmann, Paolo Rosso, and Benno Stein. "Overview of the 6th International Competition on Plagiarism Detection." In CLEF (Online Working Notes/Labs/Workshop), 2014.
3. Juričić, Vedran, Vanja Štefanec, and Siniša Bosanac. "Multilingual plagiarism detection corpus." In MIPRO, 2012 Proceedings of the 35th International Convention, pp. 1310-1314. IEEE, 2012.
4. Potthast, Martin, Matthias Hagen, Tim Gollub, Martin Tippmann, Johannes Kiesel, Paolo Rosso, Efstathios Stamatatos, and Benno Stein. "Overview of the 5th International Competition on Plagiarism Detection." In CLEF Conference on Multilingual and Multimodal Information Access Evaluation, pp. 301-331. CELCT, 2013.
5. Potthast, Martin, Alberto Barrón-Cedeño, Andreas Eiselt, Benno Stein, and Paolo Rosso. "Overview of the 2nd International Competition on Plagiarism Detection." In CLEF (Notebook Papers/LABs/Workshops), 2010.
6. Potthast, Martin, Alberto Barrón-Cedeño, Andreas Eiselt, Benno Stein, and Paolo Rosso. "Overview of the 3rd International Competition on Plagiarism Detection." In CLEF (Notebook Papers/LABs/Workshops), 2011.
7. Potthast, Martin, Tim Gollub, Matthias Hagen, Jan Graßegger, Johannes Kiesel, Maximilian Michel, Arnd Oberländer, Martin Tippmann, Alberto Barrón-Cedeño, Parth Gupta, Paolo Rosso, and Benno Stein. "Overview of the 4th International Competition on Plagiarism Detection." In Working Notes Papers of the CLEF 2012 Evaluation Labs, September 2012. ISBN 978-88-904810-3-1. ISSN 2038-4963.
8. Agirre, Eneko, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, and Weiwei Guo. "*SEM 2013 shared task: Semantic textual similarity, including a pilot on typed-similarity." In *SEM 2013: The Second Joint Conference on Lexical and Computational Semantics. Association for Computational Linguistics, 2013.
9. Rosvall, Martin, and Carl T. Bergstrom. "Maps of random walks on complex networks reveal community structure." Proceedings of the National Academy of Sciences 105, no. 4 (2008): 1118-1123.