=Paper= {{Paper |id=Vol-1587/T5-4 |storemode=property |title=RDI System for Intrinsic Plagiarism Detection (RDI_RID), Working Notes for PANAraPlagDet at FIRE 2015 |pdfUrl=https://ceur-ws.org/Vol-1587/T5-4.pdf |volume=Vol-1587 |authors=Ashraf Y. Mahgoub,Ahmed Magooda,Mohsen Rashwan,Magda B.Fayek,Hazem Raafat |dblpUrl=https://dblp.org/rec/conf/fire/MahgoubMRFR15 }} ==RDI System for Intrinsic Plagiarism Detection (RDI_RID), Working Notes for PANAraPlagDet at FIRE 2015== https://ceur-ws.org/Vol-1587/T5-4.pdf
RDI System for Intrinsic Plagiarism Detection (RDI_RID)
Working Notes for PAN-AraPlagDet at FIRE 2015

Ashraf Y. Mahgoub
Computer Engineering Department, Cairo University
ashraf.youssef.mahgoub@gmail.com

Ahmed Magooda
Computer Engineering Department, Cairo University
ahmed.ezzat.gawad@gmail.com

Mohsen Rashwan
Communication Department, Cairo University
rashwan@rdi-eg.com

Magda B. Fayek
Computer Engineering Department, Cairo University
magdafayek@ieee.org

Hazem Raafat
Computer Science Department, Kuwait University
hazem@cs.ku.edu.kw



ABSTRACT
Many researchers have recently been investigating the task of plagiarism
detection. In this paper we present the RDI system for intrinsic
plagiarism detection (RDI_RID). RDI_RID was the only system that
participated in the intrinsic track of the Arabic language plagiarism
detection competition. It achieved a PlagDet (Plagiarism Detection
score) of 19%, compared to 38% achieved by the baseline system. The
proposed system is based on a vector representation of stylometric
features extracted from the document's text.

Keywords
Intrinsic Plagiarism Detection; Stylometry; Natural Language Processing;
Part of Speech.

1. INTRODUCTION
Due to major advances in plagiarism techniques, plagiarized documents
have become too sophisticated to be detected by traditional plagiarism
detection methodologies. Therefore, efficient plagiarism detection
techniques are needed to detect intelligently manipulated text.

Plagiarism detection systems are a special type of Information Retrieval
(IR) system: their task is not limited to recognizing relevant documents;
they also analyze suspicious documents and detect chunks of text that
are plagiarized from another source. Plagiarism detection can therefore
be viewed as a process of reverse engineering applied to a suspicious
document in order to trace it back to its origins.

There exist two types of plagiarism detection systems. The first type
uses stylometric features extracted from the suspicious document itself
in order to highlight chunks of text that do not align with the rest of
the document and hence are more likely to have been taken from another
document. Systems of this type are called intrinsic plagiarism detection
systems.

The second type is called extrinsic plagiarism detection systems. In
these systems, the suspicious document is tested against a set of
external sources in order to detect which parts have been plagiarized
from which source document.

The system proposed in this paper deals with the first track (the
intrinsic track). Intrinsic plagiarism detection is similar to the
problem of authorship detection. To our knowledge, there is no reliable
system that detects plagiarism intrinsically in Arabic documents.

While a considerable portion of research in this field has targeted the
English language, the well-constructed data sets, in addition to previous
Arabic processing knowledge gained in related tasks, were the main
motivations for participating in this competition.

2. METHOD
The proposed RDI_RID system consists of three basic modules: (1) a
document chunking module, (2) a vector representation module, and (3) a
filtering module.

   1. Chunking module. This module divides the document text using a
      sliding window with the following configuration:
      (a) The window should contain (param1) alphanumeric characters (no
          punctuation or diacritics). To achieve this, the window is
          allowed to expand until it either contains param1 alphanumeric
          characters or reaches the maximum size of 1.5*(param1)
          characters.
      (b) The window slides by (param1)/2 characters at a time.
   2. Vector representation module. After the chunking process, each
      chunk is converted to a vector. The following features are
      extracted to represent the chunk as a vector, according to the
      method proposed by Zechner et al. [1]. For each chunk the following
      is estimated:
      (a) Stop word frequency: this covers all Arabic stop words, such as
          من (from), على (on), إلى (to), ... Each stop word has a
          specific dimension in the final vector; each dimension is set
          with the corresponding frequency.
      (b) Punctuation frequency: this covers all Arabic punctuation
          marks, such as (" : . ,). Each punctuation type has a specific
          dimension in the final vector; each dimension is set with the
          corresponding frequency.
      (c) Part of speech frequency: for each part-of-speech (POS)
          category, the module counts the frequency of this POS category
          in the considered chunk. Each POS category has a specific
          dimension in the final vector; each dimension is set with


          the corresponding frequency. Documents are processed using the
          RDI_POS tagger [2].
      (d) Word type frequency: a corpus that combines Gigaword [3] and
          classical Arabic was used to calculate word frequencies for
          the Arabic language. We categorized the Arabic words into a
          fixed number of classes (26 classes in our case) according to
          their corpus frequency. For example, class 1 represents words
          that occur between 2 and 4 times, class 2 represents words that
          occur between 4 and 8 times, and so on. For each class, the
          module counts the frequency of this class in the considered
          chunk. Each class has a specific dimension in the final vector;
          each dimension is set with the corresponding frequency.

After all chunks have been represented as vectors, the filtering module
takes over. It constructs a mean vector from all chunk vectors and then
calculates the cosine distance between each chunk vector and the mean
vector. The module then calculates the mean of these cosine distances
and, over all chunks, their standard deviation. For each chunk vector,
if its cosine distance to the mean vector is less than
(mean_value - (param2) * standard_deviation), the chunk is classified as
plagiarism. Consecutive plagiarized chunks are then combined and
reported as one plagiarism part.

3. EVALUATION
In the training phase, the RDI_RID system is trained in order to tune the
two parameters that control it, (param1) and (param2). The training
process was carried out using the following configuration: param1 = 500,
param2 = 0.5. The best performance achieved is presented in Table 1.

Table 1. Best results RDI_RID achieved on training and testing data
             Recall   Precision   Granularity   PlagDet
 Training     0.14      0.18         1.0          0.16
 Testing      0.18      0.19         1.0          0.19

Table 2 summarizes the final results of the intrinsic plagiarism
detection sub-task of the AraPlagDet-2015 competition.

Table 2. Results of the Arabic intrinsic plagiarism competition for year
2015
 Method              Magooda   Baseline
 Macro precision      0.19       0.27
 Macro recall         0.20       0.78
 Micro precision      0.15       0.29
 Micro recall         0.20       0.49
 Granularity          1.00       1.09
 PlagDet (macro)      0.19       0.38

4. TECHNICAL DETAILS
The system evaluation over the training and test data was performed on a
personal machine with reasonable specifications. The following
specifications were used throughout the whole evaluation process:

Hardware Specifications:
─   CPU: Intel Core i7-4500U, 2 cores, 1.8 ~ 3.0 GHz
─   RAM: 16 GB

Software Specifications:
─   Operating System: Windows 7 x64
─   Development Environment: Visual Studio 2013
─   Programming Language: C# (.NET)

RDI_RID was trained using the provided training data to find the best
values for the parameters param1 and param2. The training and testing
times for the system are given in Table 3.

Table 3. Training time and testing time.
 Training    11527 seconds
 Testing     10440 seconds

It should be noted that the reported training time is per iteration, not
for the whole tuning process.

5. CONCLUSION
In this paper, the RDI_RID system was introduced for the intrinsic
plagiarism detection task. The RDI_RID system depends on a vector
representation of chunks using stylometric features. The system scored
below the baseline, and the lack of Arabic language resources made
developing such a system a hard task. RDI_RID was the only system that
participated in the competition; the unavailability of comparison to
other systems (other than the baseline) limited the evaluation scope of
RDI_RID. We intend to introduce other features to substantially improve
its performance.

6. REFERENCES
[1] Zechner, Mario, et al. External and intrinsic plagiarism detection
    using vector space models. Proc. SEPLN. 2009.
[2] www.rdi-eg.com/index.htm
[3] catalog.ldc.upenn.edu/LDC2011T11
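
As an illustration of the filtering module described in Section 2, the following is a minimal Python sketch, not the authors' C# implementation. It rests on two assumptions that the paper does not state explicitly: that the "cosine distance" behaves like a cosine similarity (so that unusually low values mark chunks whose style deviates from the document average), and that the standard deviation is taken over the per-chunk values. The function names flag_outlier_chunks and merge_consecutive are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def flag_outlier_chunks(vectors, param2):
    """Return one boolean per chunk: True if the chunk is flagged as plagiarism.

    Assumption: a chunk is flagged when its similarity to the mean vector is
    below mean_value - param2 * standard_deviation.
    """
    n = len(vectors)
    if n == 0:
        return []
    dims = len(vectors[0])
    # Mean vector over all chunk vectors.
    mean_vec = [sum(v[d] for v in vectors) / n for d in range(dims)]
    # Similarity of every chunk to the mean vector.
    sims = [cosine_similarity(v, mean_vec) for v in vectors]
    mean_value = sum(sims) / n
    # Population standard deviation of the similarities.
    std = math.sqrt(sum((s - mean_value) ** 2 for s in sims) / n)
    threshold = mean_value - param2 * std
    return [s < threshold for s in sims]


def merge_consecutive(flags):
    """Combine runs of consecutive flagged chunks into (first, last) index pairs."""
    runs = []
    start = None
    for i, flagged in enumerate(flags):
        if flagged and start is None:
            start = i
        elif not flagged and start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(flags) - 1))
    return runs


# Toy example: four chunks share one style, the fifth differs.
vectors = [[1, 0], [1, 0], [1, 0], [1, 0], [0, 1]]
flags = flag_outlier_chunks(vectors, param2=0.5)
print(flags, merge_consecutive(flags))
```

In this toy input only the last chunk falls below the threshold, so it is reported as a single plagiarism part. Mapping chunk indices back to character offsets would depend on the window settings (param1) and is left out here.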

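
The paper reports PlagDet scores without giving the formula. The sketch below assumes the standard definition used in the PAN evaluation framework, namely the F1 measure of precision and recall divided by log2(1 + granularity); this definition is not taken from the paper itself. Under that assumption, the macro figures in Table 2 and the training row of Table 1 can be recomputed as a consistency check. The testing row of Table 1 recomputes to roughly 0.18 rather than 0.19, a difference that two-decimal rounding of the inputs can explain.

```python
import math


def plagdet(precision, recall, granularity):
    """PlagDet under the assumed PAN definition: F1 / log2(1 + granularity)."""
    if precision + recall == 0:
        return 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return f1 / math.log2(1 + granularity)


# Values copied from Table 1 (training row) and Table 2 (macro rows).
training = plagdet(precision=0.18, recall=0.14, granularity=1.0)
rdi_macro = plagdet(precision=0.19, recall=0.20, granularity=1.0)
baseline_macro = plagdet(precision=0.27, recall=0.78, granularity=1.09)

print(round(training, 2), round(rdi_macro, 2), round(baseline_macro, 2))
```

With a granularity of 1.0 the denominator is exactly 1, so PlagDet reduces to F1; the baseline's granularity of 1.09 lowers its score slightly relative to its F1.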