A Text Alignment Algorithm Based on Prediction of Obfuscation Types Using SVM Neural Network

Fatemeh Mashhadirajab
NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University, Iran
f.mashhadirajab@mail.sbu.ac.ir

Mehrnoush Shamsfard
NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University, Iran
m-shams@sbu.ac.ir

ABSTRACT
In this paper, we describe our text alignment algorithm, which achieved the first rank in the Persian Plagdet 2016 competition. The Persian Plagdet corpus includes several obfuscation strategies, and information about the type of obfuscation helps plagiarism detection systems use their most suitable algorithm for each type. For this purpose, we use an SVM neural network to classify documents according to the type of obfuscation strategy used in a document pair. Then, we set the parameter values in our text alignment algorithm based on the detected type of obfuscation. The results of our algorithm on the test and training datasets of Persian Plagdet 2016 are reported in this article.

CCS Concepts
• Information systems → Near-duplicate and plagiarism detection • Text mining → Paraphrase detection → Plagiarism detection → Support Vector Machine.

Keywords
Plagiarism detection; Text alignment; SVM neural network.

1. INTRODUCTION
In recent years, the automatic discovery of plagiarism has attracted many researchers, and many plagiarism detection systems have been developed [5, 6, 7]. Plagiarism detection has been a task of the PAN competition (http://pan.webis.de/), held every year since 2009 to evaluate participants' plagiarism detection algorithms. At PAN, the plagiarism detection task is divided into two subtasks: source retrieval and text alignment. The task of source retrieval is to retrieve, from the set of source documents, the documents similar to a suspicious document; the task of text alignment is to extract all plagiarized passages from a given source-suspicious document pair. Figure 1 shows the different parts of a plagiarism detection system.

As mentioned, in text alignment the documents in the evaluation datasets are divided into two categories: source documents and suspicious documents. Each suspicious document contains one or more parts of a source document in its original, edited, or rephrased form. The duty of text alignment, which is the focus of this paper, is to find the plagiarized parts of the source document in the suspicious document for each pair of source and suspicious documents [8]. The Persian Plagdet 2016 competition (http://ictrc.ac.ir/plagdet/), a subtask of the PAN FIRE 2016 competition (http://fire.irsi.res.in/fire/2016/home), is held for the Persian language; that is, the text alignment algorithms are evaluated on a Persian corpus.

In this paper we discuss our proposed algorithm, which participated in Persian Plagdet 2016 and ranked first among the participants. Our approach first uses a neural network to detect the type of obfuscation in each document pair; it then sets the parameters of the text alignment algorithm based on the detected type. The rest of the paper explains the proposed algorithm, with special focus on the obfuscation type detection module, and then discusses the results of evaluating the system on the Persian Plagdet 2016 corpus.

2. RELATED WORK
At the PAN competitions, text alignment algorithms are evaluated on corpora that contain different types of obfuscation. For example, in the PAN 2013-2014 competitions the evaluation corpus contained the obfuscation types none, random, translation, and summary. In the PAN text alignment corpora it is assumed that just one type of obfuscation is employed in each document pair. Based on this assumption, most participants try to predict the obfuscation strategy used in a document pair and detect similarities based on the predicted type. At the PAN 2014 competition, Glinos' algorithm [10] divided all plagiarism documents into two categories: order-based and non-order-based. Order-based plagiarism involves the none and random obfuscations; non-order-based plagiarism involves the translation and summary obfuscations. They use the Smith-Waterman algorithm [13] to detect aligned sequences in document pairs and thereby find the order-based plagiarism cases. If no aligned sequences are found, the document pair is given to a clustering component that detects non-order-based plagiarism cases. Sanchez-Perez et al. [1, 9] categorize the document pairs of the PAN 2014 corpus into three categories, Verbatim, Summary, and Other plagiarism cases, and set the parameters of their algorithm based on the category. They use the Longest Common Substring (LCS) algorithm to find every single common sequence of words (th-Verbatim). If at least one Verbatim case has been found, the document pair is considered Verbatim plagiarism. If no Verbatim case has been found and the length of the plagiarism fragments in the suspicious document is much smaller than the length of the source fragments, the document pair is considered Summary plagiarism; otherwise it is considered an Other plagiarism case. Palkovskii et al. [11] use a graphical clustering algorithm to detect the type of plagiarism in a document pair. They classify the document pairs of the PAN 2014 text alignment corpus into four categories: Verbatim Plagiarism, Random Plagiarism, Summary type Plagiarism, and Undefined type. Afterward, they set the parameters based on the detected type of plagiarism. In the Persian Plagdet 2016 corpus there are three types of obfuscation: none, random, and simulated.
Figure 1. Plagiarism detection systems.

In our proposed approach, the document pairs of the Persian Plagdet 2016 corpus are classified into two categories: Verbatim plagiarism and Simulated plagiarism. We use an SVM neural network to detect the type of plagiarism; the network has been trained on the obfuscation types of the Persian Plagdet 2016 training corpus. We then set the parameters based on the detected type of plagiarism.

3. METHODOLOGY
Our proposed text alignment algorithm, like many other text alignment algorithms [12], includes four stages: preprocessing, seeding, extension, and filtering. Each of these four stages is explained in this section. In addition, Figure 2 gives an overall scheme of our text alignment algorithm showing these four stages.

3.1 Preprocessing
In the preprocessing stage, the text is first segmented into sentences and tokenized by STeP_1 [2]; stopwords [4] are removed, and the inflectional and derivational stems of the tokens are extracted and restored by STeP_1. Preprocessing is done for each pair of suspicious and source documents, and their sentences are given to the seeding stage.

3.2 Seeding
In this stage, the purpose is to extract similar sentence pairs, which we call seeds, from the source and suspicious documents. Our seeding method is initially based on the method introduced by [1]; we expanded that method with an SVM neural network that predicts the obfuscation type, in order to adjust the parameters and gain better results. In this approach, based on the vector space model (VSM), a tf-idf vector is first calculated for every sentence of the suspicious and source documents, where tf is the term frequency in the corresponding sentence and idf is the inverse sentence frequency. Then the similarity of each sentence pair of the suspicious and source documents is calculated using the cosine measure and the Dice coefficient according to Eqs. 1, 2 and 3:

cos(x_i, y_j) = (x_i · y_j) / (|x_i| |y_j|)                             (1)

dice(x_i, y_j) = 2 |d(x_i) ∩ d(y_j)| / (|d(x_i)| + |d(y_j)|)            (2)

d(v) = { t : w(t, v) > 0 }                                              (3)

where x_i is the vector of the ith sentence of the suspicious document, y_j is the vector of the jth sentence of the source document, |.| is the Euclidean length, and d(v) is the set of terms t with non-zero weight w(t, v) in vector v. The cosine measure and the Dice coefficient are calculated for all pairs of sentences. If the similarity of two vectors x_i and y_j is more than the threshold of 0.3 (chosen based on [1]) on both criteria above, the pair of sentences is considered a seed; for the pairs of sentences whose similarity is more than 0.1 and less than 0.3 (chosen based on our experiments), the similarity is evaluated semantically. For this purpose, the type of obfuscation strategy used in the document pair is first determined using an SVM neural network (http://www.csie.ntu.edu.tw/~cjlin/libsvm/). We use the cosine similarity between all pairs of sentences of the suspicious and source documents to create the SVM input vector, which is calculated as follows. An 8-bit vector b = (b_1, ..., b_8) is considered for each document pair: the range of similarities is divided into 8 intervals, and each bit of the vector corresponds to one of these intervals and indicates whether there are two sentences in the document pair whose similarity lies in the corresponding interval. In other words, b_k indicates a sentence similarity between the values (k-1)/8 and k/8: for k = 1, ..., 8, if there are sentence pairs in the document pair whose cosine similarity is between (k-1)/8 and k/8, then b_k = 1; otherwise, b_k = 0.

This vector is given to the SVM neural network, previously trained on the Persian Plagdet 2016 training dataset, and the obfuscation strategy of the document pair is predicted. We set the maximum and minimum similarity thresholds of the semantic similarity measure based on the type of obfuscation and the amount of similarity between the pairs of sentences. To calculate the semantic similarity, we use FarsNet [3] to extract the synsets of each term and STeP_1 to extract its inflectional and derivational stems. Thus, for each term t, a set of words syn(t) is considered, as shown in Figure 3. Then, for each term t of vector x_i, if syn(t) overlaps syn(t') for a term t' of vector y_j, then t in x_i is replaced by t' of y_j. Finally, the cosine and Dice similarities are calculated for the two resulting vectors and averaged with the cosine and Dice results of the previous stage; if the result is greater than the threshold, the pair x_i and y_j is considered a seed. The set of seeds obtained in this stage enters the extension stage.

3.3 Extension
The purpose of the extension stage is to extract the longest similar passages from the suspicious and source documents. As shown in Figure 2, extension consists of two parts: clustering and validation. In the clustering part, the document is clustered into pieces, so that each piece contains a number of seeds whose (similarity) distance does not exceed a threshold. In the validation part, among the pairs of passages created by clustering, those that are not similar enough are removed. Again, for the extension stage we adopt and enhance the method proposed by [1]; the difference is that in the validation stage we use the semantic similarity measure instead of the cosine measure to determine the similarity between pairs of passages.

Figure 2. The proposed text alignment algorithm: preprocessing (sentence splitting, tokenizing, stopword removal, stemming with STeP_1), seeding (tf-idf, cosine measure, Dice coefficient, SVM classification on the training dataset, parameter setting, semantic similarity with FarsNet), extension (clustering, validation), and filtering (resolving overlapping cases, removing small plagiarism passages).
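As an illustration, the core seeding computations described above (tf-idf sentence vectors, the cosine and Dice measures, and the 8-bit SVM input vector) can be sketched in Python. This is a minimal sketch, not the authors' C#.Net implementation: the whitespace tokenization stands in for STeP_1, and the equal-width similarity intervals are our reading of the 8-interval description.

```python
import math
from collections import Counter

def tfidf_vectors(sentences):
    """tf-idf vector per sentence; idf is the inverse *sentence* frequency,
    as in the seeding description above."""
    docs = [Counter(s.lower().split()) for s in sentences]
    n = len(docs)
    sf = Counter(t for d in docs for t in d)  # number of sentences containing each term
    return [{t: tf * math.log(n / sf[t]) for t, tf in d.items()} for d in docs]

def cosine(x, y):
    """Cosine measure over sparse vectors stored as dicts (Eq. 1)."""
    dot = sum(w * y.get(t, 0.0) for t, w in x.items())
    nx = math.sqrt(sum(w * w for w in x.values()))
    ny = math.sqrt(sum(w * w for w in y.values()))
    return dot / (nx * ny) if nx and ny else 0.0

def dice(x, y):
    """Dice coefficient: overlap of the sets of terms with non-zero weight (Eq. 2)."""
    dx = {t for t, w in x.items() if w}
    dy = {t for t, w in y.items() if w}
    return 2 * len(dx & dy) / (len(dx) + len(dy)) if dx or dy else 0.0

def svm_input_vector(susp_vecs, src_vecs, bins=8):
    """8-bit SVM input vector: bit k is set iff some sentence pair's cosine
    similarity falls in the k-th of `bins` equal-width intervals of [0, 1]."""
    bits = [0] * bins
    for x in susp_vecs:
        for y in src_vecs:
            k = min(int(cosine(x, y) * bins), bins - 1)
            bits[k] = 1
    return bits

# Tiny demonstration: one copied sentence and one unrelated sentence per document.
susp = ["the cat sat on the mat", "a dog barked loudly"]
src = ["the cat sat on the mat", "birds sing at dawn"]
vecs = tfidf_vectors(susp + src)  # idf computed over all sentences of the pair
sx, sy = vecs[:2], vecs[2:]
print(dice(sx[0], sy[0]))        # identical sentences -> 1.0
print(svm_input_vector(sx, sy))  # -> [1, 0, 0, 0, 0, 0, 0, 1]
```

In the demonstration, the copied sentence pair lights the top similarity interval while the unrelated pairs light the bottom one, which is exactly the kind of signature the SVM classifier can separate by obfuscation type.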
3.4 Filtering
The filtering stage removes passages that either overlap or are too short. To remove overlapping passages, we use the method proposed in [1]. To remove too-short passages, we use a recursive algorithm: if the length of a passage is less than a threshold, we first assume that other seeds existed in this passage but were not identified. We therefore decrease the threshold of the semantic similarity measure, go back to the seeding stage, extract the seeds based on the new threshold, and repeat all the stages. If the passage is still not long enough this time, it is removed.

Figure 3. New vectors for semantic similarity (the vectors x_i and y_j rewritten using the synset overlaps; n and m denote the numbers of terms in x_i and y_j, respectively).

4. RESULTS
We implemented our algorithm in C#.Net and evaluated it based on the PAN evaluation setup [15, 16, 17]. In the evaluation stage, we ran our algorithm on the Persian Plagdet 2016 training and test datasets [14]. The results of this evaluation are shown in Table 1. As can be seen, the results of our algorithm on the training and test corpora are very close. The training corpus in this competition includes a variety of obfuscation strategies: None, Random, and Simulated. Table 2 shows the results of our algorithm on each of the obfuscation strategies in the training dataset. In Table 2, column P_1 shows the results of our algorithm on the types of obfuscation in the training dataset when the semantic similarity measure is not used. Column P_2 shows the results using the semantic similarity measure. Column P_3 shows the results after adding the semantic similarity criterion and also adjusting the parameters based on the type of obfuscation detected by the neural network. As can be seen, in column P_2, adding the semantic similarity criterion improves the recall for all types of obfuscation in the training corpus, but the precision declines in some cases. In column P_3, it is seen that by adding the neural network for diagnosing the type of obfuscation, and setting the parameters based on the detected type, precision and recall improve dramatically on all types of obfuscation.

Table 1. The results of the proposed text alignment algorithm on the Persian Plagdet 2016 corpus

Corpus    Plagdet  Recall  Precision  Granularity
Test      0.92     0.92    0.93       1.00
Training  0.94     0.94    0.93       1.00

Table 2. The proposed algorithm on the types of obfuscation in the Persian Plagdet 2016 training dataset

Obfuscation  Measure      P_1   P_2   P_3
None         Plagdet      0.94  0.96  0.97
             Recall       0.96  0.98  0.99
             Precision    0.92  0.95  0.94
             Granularity  1     1     1
Random       Plagdet      0.81  0.84  0.94
             Recall       0.78  0.84  0.93
             Precision    0.85  0.84  0.94
             Granularity  1     1     1
Simulated    Plagdet      0.55  0.69  0.86
             Recall       0.41  0.61  0.83
             Precision    0.84  0.80  0.91
             Granularity  1     1     1

5. CONCLUSIONS AND FUTURE WORK
We described our algorithm for the task of text alignment and presented the results of its evaluation on the training and test datasets of Persian Plagdet 2016, where it achieved the best result among the participants. In our method, we used an SVM neural network to identify the type of obfuscation and then set the parameters on that basis; the results showed that this is effective in improving precision and recall. In the future, we are going to improve the semantic similarity measure in the seeding stage of our system; we want to use a neural network to estimate the semantic similarity of pairs of sentences. We also want to use methods such as genetic algorithms to adjust the parameters automatically.

6. REFERENCES
[1] Sanchez-Perez, M. A., Gelbukh, A. F., Sidorov, G. 2015. Dynamically adjustable approach through obfuscation type recognition. In Working Notes of CLEF 2015 - Conference and Labs of the Evaluation Forum (Toulouse, France, September 8-11, 2015). CEUR Workshop Proceedings, vol. 1391. CEUR-WS.org.
[2] Shamsfard, M., Kiani, S. and Shahedi, Y. STeP-1: Standard text preparation for Persian language. In CAASL3 - Third Workshop on Computational Approaches to Arabic Script-based Languages.
[3] Shamsfard, M. 2008. Developing FarsNet: A lexical ontology for Persian. In Proceedings of the 4th Global WordNet Conference.
[4] Davarpanah, M. R., Sanji, M. and Aramideh, M. 2009. Farsi lexical analysis and stop word list. Library Hi Tech, vol. 27, pp. 435-449.
[5] Fiedler, R. and Kaner, C. 2010. Plagiarism detection services: How well do they actually perform? IEEE Technology and Society Magazine, pp. 37-43.
[6] Alzahrani, M., Salim, N. and Abraham, A. 2012. Understanding plagiarism linguistic patterns, textual features, and detection methods. IEEE Transactions on Systems, Man, and Cybernetics - Part C: Applications and Reviews, vol. 42, no. 2.
[7] Ali, A. M. E. T., Abdulla, H. M. D. and Snasel, V. 2011. Survey of plagiarism detection methods. In IEEE Fifth Asia Modelling Symposium (AMS), pp. 39-42.
[8] Potthast, M., Göring, S. 2015. Towards data submissions for shared tasks: First experiences for the task of text alignment. In Working Notes Papers of the CLEF 2015 Evaluation Labs. CEUR Workshop Proceedings. CEUR-WS.org. ISSN 1613-0073.
[9] Sanchez-Perez, M., Sidorov, G., Gelbukh, A. 2014. The winning approach to text alignment for text reuse detection at PAN 2014. In Notebook for PAN at CLEF 2014 (15-18 September, Sheffield, UK). CEUR Workshop Proceedings, vol. 1180, pp. 1004-1011. CEUR-WS.org. ISSN 1613-0073.
[10] Glinos, D. 2014. A hybrid architecture for plagiarism detection - Notebook for PAN at CLEF 2014. In CLEF 2014 Evaluation Labs and Workshop - Working Notes Papers (15-18 September, Sheffield, UK). CEUR-WS.org. ISSN 1613-0073.
[11] Palkovskii, Y. and Belov, A. 2014. Developing high-resolution universal multi-type n-gram plagiarism detector - Notebook for PAN at CLEF 2014. In CLEF 2014 Evaluation Labs and Workshop - Working Notes Papers (15-18 September, Sheffield, UK). CEUR-WS.org. ISSN 1613-0073.
[12] Potthast, M., Hagen, M., Beyer, A., Busse, M., Tippmann, M., Rosso, P., Stein, B. 2014. Overview of the 6th International Competition on Plagiarism Detection. In Working Notes for CLEF 2014 Conference (Sheffield, UK, 15-18 September). CEUR Workshop Proceedings, vol. 1180, pp. 845-876. CEUR-WS.org.
[13] Smith, T., Waterman, M. 1981. Identification of common molecular subsequences. Journal of Molecular Biology, vol. 147(1), pp. 195-197.
[14] Asghari, H., Mohtaj, S., Fatemi, O., Faili, H., Rosso, P., and Potthast, M. 2016. Algorithms and corpora for Persian plagiarism detection: Overview of PAN at FIRE 2016. In Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation (Kolkata, India, December 7-10, 2016). CEUR Workshop Proceedings. CEUR-WS.org.
[15] Potthast, M., Stein, B., Barrón-Cedeño, A., and Rosso, P. 2010. An evaluation framework for plagiarism detection. In 23rd International Conference on Computational Linguistics (COLING 10), pp. 997-1005.
[16] Gollub, T., Stein, B. and Burrows, S. 2012. Ousting ivory tower research: Towards a web framework for providing experiments as a service. In 35th International ACM Conference on Research and Development in Information Retrieval (SIGIR 12), pp. 1125-1126. ACM. ISBN 978-1-4503-1472-5.
[17] Potthast, M., Gollub, T., Rangel, F., Rosso, P., Stamatatos, E., and Stein, B. 2014. Improving the reproducibility of PAN's shared tasks: Plagiarism detection, author identification, and author profiling. In Information Access Evaluation meets Multilinguality, Multimodality, and Visualization. 5th International Conference of the CLEF Initiative (CLEF 14), pp. 268-299. Springer, Berlin Heidelberg New York. ISBN 978-3-319-11381-4.