=Paper=
{{Paper
|id=Vol-1587/T5-4
|storemode=property
|title=RDI System for Intrinsic Plagiarism Detection (RDI_RID), Working Notes for PANAraPlagDet at FIRE 2015
|pdfUrl=https://ceur-ws.org/Vol-1587/T5-4.pdf
|volume=Vol-1587
|authors=Ashraf Y. Mahgoub,Ahmed Magooda,Mohsen Rashwan,Magda B.Fayek,Hazem Raafat
|dblpUrl=https://dblp.org/rec/conf/fire/MahgoubMRFR15
}}
==RDI System for Intrinsic Plagiarism Detection (RDI_RID), Working Notes for PANAraPlagDet at FIRE 2015==
Ashraf Y. Mahgoub, Computer Engineering Department, Cairo University, ashraf.youssef.mahgoub@gmail.com

Ahmed Magooda, Computer Engineering Department, Cairo University, ahmed.ezzat.gawad@gmail.com

Mohsen Rashwan, Communication Department, Cairo University, rashwan@rdi-eg.com

Magda B. Fayek, Computer Engineering Department, Cairo University, magdafayek@ieee.org

Hazem Raafat, Computer Science Department, Kuwait University, hazem@cs.ku.edu.kw

===ABSTRACT===
Many researchers have recently investigated the task of plagiarism detection. In this paper we present the RDI system for intrinsic plagiarism detection (RDI_RID). RDI_RID was the only system that participated in the intrinsic track of the Arabic language plagiarism detection competition. It achieved a PlagDet (plagiarism detection) score of 19%, compared to 38% achieved by the baseline system. The proposed system is based on a vector representation of stylometric features extracted from the document's text.

===Keywords===
Intrinsic Plagiarism Detection; Stylometry; Natural Language Processing; Part of Speech.

===1. INTRODUCTION===
Due to major advances in plagiarism techniques, plagiarized documents have become too sophisticated to be detected by traditional plagiarism detection methodologies. Efficient plagiarism detection techniques are therefore needed to detect intelligently manipulated text.

Plagiarism detection systems are a special type of Information Retrieval (IR) system: their task is not limited to recognizing relevant documents, since they also analyze suspicious documents and detect chunks of text that are plagiarized from another source. Plagiarism detection can therefore be viewed as a process of reverse engineering applied to a suspicious document in order to trace it back to its origins.

There are two types of plagiarism detection systems. The first type uses stylometric features extracted from the suspicious document itself in order to highlight chunks of text that do not align with the rest of the document and hence are more likely to have been taken from another document. Systems of this type are called intrinsic plagiarism detection systems.

The second type is called extrinsic plagiarism detection systems. In these systems, the suspicious document is tested against a set of external sources in order to detect which parts have been plagiarized from which source document.

The system proposed in this paper deals with the first track (the intrinsic track). Intrinsic plagiarism detection is similar to the problem of authorship detection. To our knowledge, there is no reliable system that detects plagiarism intrinsically in Arabic documents.

While a considerable portion of the research in this field has targeted the English language, the well-constructed data sets, together with the Arabic processing knowledge previously gained in related tasks, were the main motivations for participating in this competition.

===2. METHOD===
The proposed RDI_RID system consists of three basic modules: (1) a document chunking module, (2) a vector representation module, and (3) a filtering module.

1. Chunking module. This module divides the document text using a sliding window with the following configuration:

(a) The window should contain (param1) alphanumeric characters (no punctuation or diacritics). To achieve this, the window is allowed to expand until it contains param1 alphanumeric characters or reaches the maximum size of 1.5*(param1) characters.

(b) The window slides by (param1)/2 characters at a time.

2. Vector representation module. After the chunking process, each chunk is converted to a vector. The following features are extracted to represent the chunk as a vector, according to the method proposed by Zechner et al. [1]. For each chunk the following are estimated:

(a) Stop word frequency: this covers all Arabic stop words, such as من (from), على (on), and إلى (to). Each stop word has a specific dimension in the final vector; each dimension is set to the corresponding frequency.

(b) Punctuation frequency: this covers all Arabic punctuation marks, such as the quotation mark, colon, period, and comma (" : . ,). Each punctuation type has a specific dimension in the final vector; each dimension is set to the corresponding frequency.

(c) Part of speech frequency: for each part of speech (POS) category, the module counts the frequency of this category in the considered chunk. Each POS category has a specific dimension in the final vector; each dimension is set to the corresponding frequency. Documents are processed using the RDI_POS tagger [2].

(d) Word type frequency: a corpus that combines Gigaword [3] and classical Arabic was used to calculate word frequencies for the Arabic language. The Arabic words were categorized into a fixed number of classes (26 classes in our case) by calculating a class index from each word's corpus frequency (the formula is missing from the source text; the examples given are consistent with floor(log2(f)) for a word with corpus frequency f). For example, class 1 represents words that occur between 2 and 4 times, class 2 represents words that occur between 4 and 8 times, and so on. For each class the module counts the frequency of this class in the considered chunk. Each class has a specific dimension in the final vector; each dimension is set to the corresponding frequency.

3. Filtering module. After all chunks have been represented as vectors, the filtering module constructs a mean vector over all chunk vectors. The cosine distance is then calculated between each chunk vector and the mean vector. The module next calculates the mean of these cosine distances and the corresponding standard deviation over all chunks. A chunk is classified as plagiarism if its cosine distance to the mean vector is less than (mean_value - (param2) * standard_deviation). Consecutive chunks classified as plagiarism are then combined and reported as one plagiarized passage.

===3. EVALUATION===
In the training phase, the RDI_RID system is trained in order to tune the two parameters that control it, (param1) and (param2). The training process was held using the following configuration: param1 = 500, param2 = 0.5. The best performance achieved is presented in Table 1.

{| class="wikitable"
|+ Table 1. Best results RDI_RID achieved on training and testing data
! Data !! Recall !! Precision !! Granularity !! PlagDet
|-
| Training || 0.14 || 0.18 || 1.0 || 0.16
|-
| Testing || 0.18 || 0.19 || 1.0 || 0.19
|}

Table 2 summarizes the final results of the intrinsic plagiarism detection sub-task at the AraPlagDet-2015 competition.

{| class="wikitable"
|+ Table 2. Results of the Arabic intrinsic plagiarism competition for year 2015
! Method !! Magooda !! Baseline
|-
| Macro precision || 0.19 || 0.27
|-
| Macro recall || 0.20 || 0.78
|-
| Micro precision || 0.15 || 0.29
|-
| Micro recall || 0.20 || 0.49
|-
| Granularity || 1.00 || 1.09
|-
| Plagdet (macro) || 0.19 || 0.38
|}

===4. TECHNICAL DETAILS===
The system evaluation over the training and test data was performed on a personal machine. The following specifications were used during the whole evaluation process.

Hardware specifications:
* CPU: Intel Core i7 4500U, 2 cores, 1.8 to 3.0 GHz
* RAM: 16 GB

Software specifications:
* Operating system: Windows 7 x64
* Development environment: Visual Studio 2013
* Programming language: .Net C#

RDI_RID was trained using the supplied training data to find the best set of values for the parameters param1 and param2. The training and testing times of the system are given in Table 3.

{| class="wikitable"
|+ Table 3. Training and testing time
! Phase !! Time
|-
| Training || 11527 seconds
|-
| Testing || 10440 seconds
|}

It should be noted that the reported training time is per iteration, not for the whole tuning process.

===5. CONCLUSION===
In this paper, the RDI_RID system was introduced for the intrinsic plagiarism detection task. The system relies on a vector representation of chunks using stylometric features. RDI_RID scored below the baseline system; the lack of Arabic language resources made developing such a system a hard task. RDI_RID was the only system that participated in the competition, and the unavailability of comparisons with other systems (other than the baseline) limited the scope of its evaluation. We intend to introduce additional features to RDI_RID in order to improve its performance substantially.

===6. REFERENCES===
[1] Zechner, Mario, et al. External and intrinsic plagiarism detection using vector space models. Proc. SEPLN. 2009.

[2] www.rdi-eg.com/index.htm

[3] catalog.ldc.upenn.edu/LDC2011T11
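As an illustration of the chunking and filtering modules described in Section 2, the following is a minimal Python sketch, not the authors' implementation (which was written in C#). The function names `chunk_spans`, `flag_chunks`, and `merge_flagged` are hypothetical. Two points are assumptions rather than statements from the paper: the comparison measure is taken to be cosine similarity, so that a low value marks a chunk that deviates from the document's overall style, and the standard deviation is taken over the per-chunk similarity values (population form). Feature extraction is omitted; chunk vectors are supplied by the caller.

```python
import math


def chunk_spans(text, param1):
    """Return (start, end) character spans of sliding windows.

    Each window grows until it holds param1 alphanumeric characters or
    reaches 1.5 * param1 characters in total; it slides by param1 / 2.
    """
    max_len = int(1.5 * param1)
    step = max(1, param1 // 2)
    spans = []
    start = 0
    while start < len(text):
        end, alnum = start, 0
        while end < len(text) and alnum < param1 and end - start < max_len:
            # Punctuation, spaces and combining diacritics are not
            # alphanumeric, so they do not count toward param1.
            if text[end].isalnum():
                alnum += 1
            end += 1
        spans.append((start, end))
        if end >= len(text):
            break
        start += step
    return spans


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if one is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def flag_chunks(vectors, param2):
    """Flag chunks whose similarity to the mean vector is below
    mean - param2 * std (assumed reading of the paper's threshold)."""
    n = len(vectors)
    dim = len(vectors[0])
    mean_vec = [sum(v[i] for v in vectors) / n for i in range(dim)]
    sims = [cosine_similarity(v, mean_vec) for v in vectors]
    mean_sim = sum(sims) / n
    # Assumption: population standard deviation of the similarity values.
    std = math.sqrt(sum((s - mean_sim) ** 2 for s in sims) / n)
    threshold = mean_sim - param2 * std
    return [s < threshold for s in sims]


def merge_flagged(spans, flags):
    """Combine runs of consecutive flagged chunks into single spans."""
    merged = []
    prev_flagged = False
    for (start, end), flagged in zip(spans, flags):
        if flagged and prev_flagged:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        elif flagged:
            merged.append((start, end))
        prev_flagged = flagged
    return merged
```

With the configuration reported in Section 3, one would call `chunk_spans(text, 500)`, build one feature vector per span, and pass the vectors to `flag_chunks(vectors, 0.5)`; `merge_flagged` then turns runs of flagged chunks into the reported passages.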