=Paper= {{Paper |id=Vol-1737/T4-4 |storemode=property |title=A Deep Learning Approach to Persian Plagiarism Detection |pdfUrl=https://ceur-ws.org/Vol-1737/T4-4.pdf |volume=Vol-1737 |authors=Erfaneh Gharavi,Kayvan Bijari,Kiarash Zahirnia,Hadi Veisi |dblpUrl=https://dblp.org/rec/conf/fire/GharaviBZV16 }} ==A Deep Learning Approach to Persian Plagiarism Detection== https://ceur-ws.org/Vol-1737/T4-4.pdf
           A Deep Learning Approach to Persian Plagiarism
                             Detection
             Erfaneh Gharavi                                   Kayvan Bijari                              Kiarash Zahirnia
           University of Tehran                            University of Tehran                         University of Tehran
 Faculty of new Science and Technology                  Faculty of new Science and                   Faculty of new Science and
     Data & Signal processing Lab                               Technology                                   Technology
            e.gharavi@ut.ac.ir                            kayvan.bijari@ut.ac.ir                        zahirnia.kia@ut.ac.ir


                                                                 Hadi Veisi
                                                           University of Tehran
                                                 Faculty of new Science and Technology
                                                     Data & Signal processing Lab
                                                             h.veisi@ut.ac.ir


ABSTRACT                                                                presenting them as one's own without explicitly acknowledging
                                                                        the original source which is considered immoral and illegal [9]. In
Plagiarism detection is defined as the automatic identification of      this regard, detecting and preventing such duplication is of vital
reused text materials. General availability of the internet and easy    importance.
access to textual information enhances the need for automated
                                                                        In order to be processed in natural language processing
plagiarism detection. In this regard, different algorithms have
                                                                        algorithms, textual data should be numerically described. In
been proposed to perform the task of plagiarism detection in text
                                                                        traditional approaches, individual words are treated as distinct
documents. Due to the drawbacks and inefficiency of traditional
                                                                        features for the textual data. In such methods, the similarity
methods and the lack of proper algorithms for Persian plagiarism
                                                                        between synonymous words is not taken into account.
detection, in this paper, we propose a deep learning based method
                                                                        Furthermore, due to the sparseness of new feature space and time
to detect plagiarism. In the proposed method, words are
                                                                        complexity of feature extraction, these approaches are not
represented as multi-dimensional vectors, and simple aggregation
                                                                        efficient [5]. To overcome deficiencies of the traditional feature
methods are used to combine the word vectors for sentence
                                                                        extraction methods, deep learning techniques are used which have
representation. By comparing representations of source and
                                                                        resulted in promising performance in many applications such as
suspicious sentences, pair sentences with the highest similarity are
                                                                        NLP [11]. The essential goal of deep learning [19] is to improve
considered as the candidates for plagiarism. The final decision on
                                                                        the processing and pre-processing methods of NLP in an
plagiarism is made using a two-level evaluation method. Our
                                                                        automatic, efficient, and fast way. In text mining applications,
method has been used in PAN2016 Persian plagiarism detection
                                                                        deep learning methods represent words as a vector of numerical
contest and achieves 90.6% plagdet, 85.8% recall, and 95.9%
                                                                        values [9]. This new representation captures a major part of the
precision on the provided data sets.
                                                                        syntactic as well as semantic rules of the text data. In applications
CCS Concepts                                                            such as similarity detection and text classification, much larger
                                                                        units such as phrases, sentences and documents should be
• Information systems → Near-duplicate and plagiarism                   described as a vector. For this purpose, there are a number of
detection • Information systems → Evaluation of retrieval               methods ranging from simple mathematical approaches [30] to
results.                                                                neural-network-based combination functions [36]. Vectorized
                                                                        representation of text data makes it easy to compare words and
Keywords                                                                sentences as well as minimizing the need to use lexicons. In this
Deep Learning; Word Vector Representation; Persian Plagiarism           paper, a deep learning approach is used for Persian plagiarism
Detection.                                                              detection in the PAN plagiarism detection contest. This method
                                                                        achieves 90.6% plagdet, 85.8% recall, and 95.9% precision on the
                                                                        PAN provided data sets.
1. INTRODUCTION                                                         The rest of this paper is organized as follows: in Section 2 we
Due to the growth and expansion of global networks and the
                                                                        describe plagiarism and the task of plagiarism detection, followed
increasing volume of unstructured data produced by both humans and machines,
                                                                        by related work in Section 3. Section 4 is devoted to
an automated intelligent processing and knowledge extraction
                                                                        deep learning and the approach of using it in NLP
system is required. The primary goal of language processing
                                                                        applications. Section 5 defines the proposed method and Section 6
methods is to achieve direct human-computer interaction, which is the
                                                                        presents the experimental results. Finally, we explain the
main purpose of artificial intelligence [26]. Natural language
                                                                        advantages of our method in Section 7.
processing (NLP) encompasses a wide variety of tasks and
applications, including part-of-speech (POS) tagging, text              2. PLAGIARISM DETECTION
classification, machine translation, and text similarity detection.
One well-known application of text similarity detection is to           Plagiarism is an attempt to use another's idea and present it as
identify plagiarism, especially in scientific documents. Plagiarism     one's own work, which is considered both illegal and immoral.
is defined as the act of taking someone else's works or ideas and       The era of the internet and quick access to a wide range of
information exacerbates acts such as plagiarism. Plagiarism is          Syntactic changes: Changes in the structure include rearranging
carried out in various ways, and it is often difficult to prove         words and expressions, and turning sentences from active to
whether a text is plagiarized or not. Previously, plagiarism was        passive and vice versa.
detected only manually, based on the reviewer’s knowledge.              Semantic changes: This kind of plagiarism is more fundamental
Nowadays, due to the gap between human cognitive capacity               and usually includes paraphrasing as well as semantic and
and the vast amount of information, plagiarism                          vocabulary changes. Detecting such changes requires semantic
detection is very challenging to perform manually.                      analysis of the information in the text data to see whether or not
Therefore, automated plagiarism detection has received wide attention in   the texts convey the same sense.
recent years [8, 9].
                                                                        Plagiarism detection can also be divided into two main categories:
In 2000, only five systems had been developed for the purpose of        external plagiarism detection, and intrinsic plagiarism detection.
plagiarism detection, four of which were used to detect plagiarism      External plagiarism detection tries to extract plagiarism in a text
in text and one of which was used to detect copied programming          by checking all given source documents. Intrinsic plagiarism
code [22]. This number grew to 47 in 2010, which indicates an           detection analyzes the given suspicious document, and tries to
increase in demand for such systems as well as the need to improve      discover parts of the input document which are not written by the
speed and efficiency. It should be noted that previous approaches       same author. In this study we propose a new method to detect
often rely on a string matching scheme in order to detect               external plagiarism for Persian documents using a deep learning
copied texts. The inadequacy of existing systems steers                 approach [21].
research toward new approaches for plagiarism detection.
The main drawback in this area is the systems' inability to recognize   3. RELATED WORK
syntactic and semantic changes in the text data. Although this
seems very simple for human beings, a computer faces
                                                                        In this section some plagiarism detection methods are reviewed.
many difficulties in this detection, especially when the detection is
                                                                        These methods are categorized based on the features that are used to
dependent on exact text matching. The plagiarism detection steps are
                                                                        determine the similarity between two documents, which address
outlined in the algorithm below.
                                                                        different kinds of plagiarism:
   Algorithm: Plagiarism detection steps                                • Lexical methods: These methods consider text as a sequence
                                                                          of characters or terms. In these methods the assumption is that
   • Data pre-processing: preparation of the input data                   the more terms both documents have in common, the more
     including original and plagiarized text.                             similar they are. Methods that use features such as longest
   • Similarity comparison: In this step, texts from the original         common subsequence, n-grams and fingerprints are
     and plagiarized sources are compared based on a                      considered to be of this kind. These methods usually
     similarity measure. The output of this step is a score which         perform well when the words are not replaced
     indicates the similarity of the input texts.                         by their synonyms [2, 7, 13, 14, 17, 21, 31, 38 and 40].
   • Filtering: based on a predefined threshold, the scores             • Syntactical methods: Some methods use the text’s syntactical
     generated in the previous step are used to identify candidate
                                                                          units for comparing the similarity between documents. This
     pairs.
                                                                          is a realization of the intuition that similar documents would
   • Further processing: at this point, pairs are evaluated based
                                                                          have similar syntactical structure. These methods make use of
     on other similarity measures.
                                                                          characteristics such as POS tags to compare the similarity
   • Classification: The final step is to assign a label
                                                                          between different documents [24, 25].
     indicating whether the texts are plagiarized or not. This
                                                                        • Semantic methods: These methods use semantic similarity
     can be done using the score calculated in the fourth
                                                                          for comparing documents. Methods that use synonyms,
     step.
                                                                          antonyms, hypernyms, and hyponyms are placed in this
                                                                          category [7, 39].
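
The limitation of lexical methods noted above can be illustrated with a minimal sketch. It assumes whitespace tokenization and word tri-grams compared with a Jaccard overlap; the function names and the English example sentences are hypothetical and are not taken from any of the cited systems.

```python
def word_ngrams(text, n=3):
    """Return the set of word n-grams of a whitespace-tokenized text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(a, b, n=3):
    """Jaccard overlap of the word n-gram sets of two texts (0.0 if both are empty)."""
    grams_a, grams_b = word_ngrams(a, n), word_ngrams(b, n)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


# Hypothetical example sentences (English, for readability only).
source = "the quick brown fox jumps over the lazy dog"
copied = "the quick brown fox jumps over the lazy dog"
reworded = "the fast brown fox leaps over the idle dog"  # three synonym swaps

exact_score = ngram_overlap(source, copied)       # verbatim copy
reworded_score = ngram_overlap(source, reworded)  # no tri-gram survives the swaps
```

In this toy case every tri-gram of the reworded sentence contains at least one replaced word, so the lexical score drops to zero although the meaning is unchanged.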
Plagiarized scientific text consists of word sequences, including
n-grams, that are exact copies or paraphrased forms of the              To the best of our knowledge, due to the lack of a Persian corpus
original text. These word sequences can have different lengths and      (Persian tagged data) [16], there exist only a few studies on Persian
cover the whole or a part of the original documents. Examples of        plagiarism detection. Mahdavi et al. [24] introduce a Persian
the ways in which plagiarism occurs in scientific fields                plagiarism detector based on the bag-of-words model. Their approach
are provided in the following [27].                                     has two steps: first, the most relevant source documents are
                                                                        retrieved by using cosine similarity; then, using the overlap
         • Inadequate referencing                                       coefficient and a tri-gram model, plagiarism is identified.
         • Direct copy from one or more sources of text                 Mahmoodi et al. [25] use different combinations of n-grams, the
         • Displacement of words in a sentence                          Clough metric [9] and the Jaccard similarity coefficient for automatic
         • Paraphrasing and rewriting the text, presenting others'      Persian plagiarism detection.
           ideas with different words
                                                                        Most of the studies conducted on Persian plagiarism detection are
         • Translation, expressing an idea from one language in
                                                                        placed among lexical methods. As mentioned earlier, this
           another one
                                                                        kind of method does not perform well when the words are changed
Plagiarism can include changes in the vocabulary, or syntactic,         and rewritten. Applying semantic similarity in Persian language
and semantic representation of the text. These types will be            has some limitations due to the constraints of the Persian
discussed further in the following:                                     WordNets.
Vocabulary changes: Including the addition, deletion or                 Socher et al. propose a deep method for paraphrase detection based
replacement of words in a given text. Such changes would be             on recursive autoencoder networks [37]. In this article a deep
indistinguishable by a string matching approach.                        learning approach is introduced which uses semantic and lexical
features to detect plagiarism in Persian documents. To the best of      and semantic rules, but also the relationship between words can be
our knowledge there is no reported study that uses deep learning        modeled by the vectors’ offset. This offset can also represent the
for Persian plagiarism detection.                                       plurality, syntactic label (noun, verb, etc.), or semantic feature (pet,
                                                                        animal, car, etc.) of a word.

4. DEEP LEARNING FOR FEATURE                                            This representation is used in many NLP tasks such as Named-Entity
                                                                        Recognition (NER), word-sense disambiguation, parsing, and
EXTRACTION                                                              machine translation [10].
                                                                        There are two approaches to learning word vector representation:
Deep learning is a branch of machine learning which tries to find       1) General matrix decomposition methods such as Latent
more abstract features using a deep, multi-layer graph. Each layer      Semantic Analysis (LSA) and 2) context-based methods such as
has a linear or non-linear function to transform data into more         skip-grams and continuous bag of words [28, 32].
abstract ones [3]. One of the reasons that deep learning helps
to improve NLP is the hierarchical nature of concepts. Concepts         Skip-grams and continuous bag of words, which are employed by
that exist in the natural world are generally hierarchical. For         this study, are two-layer neural networks that are trained for
example, a cat is a domestic animal, which itself is a branch of        a language modeling task. Continuous bag of words takes the one-on
animals. In most, though not all, cases the word “cat” can be           representations of the words in a limited window as input and tries
replaced by “dog” in any sentence with no change in the resulting       to predict the middle word of the context. The other version of this
sentence. So abstract concepts at a higher level are less               network, skip-gram, is used to predict the context
sensitive to changes [4].                                               given the middle word. The resulting vectors, which are the
Recently, three factors contributed to the better performance of        weights of the neural network, are similar for semantically
deep architecture: large datasets, faster computers and parallel        similar words.
processing in addition to the increasing number of machine
learning methods for normalization and improvement of
algorithms [12].                                                        4.2 Text Document Vector Representation
Due to the large amount of textual data and the mentioned problems
for natural language processing tasks, using automatic methods          There are many algorithms which can be used as the composition
like deep learning seems necessary. Advantages of using deep            function for combining word vectors to generate a representation
methods for NLP tasks are listed below:                                 for a text document.
                                                                        Paragraph Vector is an unsupervised algorithm that learns
         • No hand-crafted feature engineering is required
                                                                        representations for variable-length pieces of text, such as
         • Fewer features
                                                                        sentences, paragraphs, and documents. The algorithm uses the
         • No labeled data is required                                  idea of word vector training and considers a matrix for each
Multi-layer networks in deep learning, called deep belief networks,     piece of text. This matrix is also updated during the language modeling
can also lead to an analogous set of features for all natural language  task. Paragraph Vector outperforms other methods such as
processing tasks [10]. Using these representations reduces the          bag-of-words models in many applications [23].
number of features, and the text can be described by far fewer          Socher [36] introduces Recursive Deep Learning methods, which
features through combination functions.                                 are variations and extensions of unsupervised and supervised
                                                                        recursive neural networks (RNNs). This method uses the idea of
                                                                        hierarchical structure of the text and encodes two word vectors
4.1 Word Vector Representation                                          into one vector by auto-encoder networks. Socher also presents
                                                                        many variations of these deep combination functions such as
Most of language processing algorithms consider words as single
                                                                        Recurrent Neural Network (RNN) and Matrix-Vector Recursive
symbols. This kind of representation suffers from sparsity since
                                                                        Neural Networks (MV-RNN).
the length of the vector corresponds to the size of the word glossary. This
vector has zero in all elements except one. This approach, called       There are also some simple mathematical methods which are applied
One-On, is unable to capture the similarity between two synonymous      as composition functions and are generally used as benchmarks [30].
words. To address this challenge, an idea of representing a word
by its neighbors was introduced by Firth [15].
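
The contrast between the One-On representation and distributed word vectors described in Section 4.1 can be illustrated with a small sketch. The three-word glossary and the two-dimensional dense vectors below are hypothetical values chosen by hand for illustration; they are not vectors learned by the method in this paper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical glossary; a real glossary has many thousands of entries.
vocab = ["car", "automobile", "banana"]


def one_on(word):
    """One-On vector: as long as the glossary, zero everywhere except one position."""
    return [1.0 if w == word else 0.0 for w in vocab]


# Hypothetical dense vectors, hand-picked so that the two synonyms lie close together.
dense = {"car": [0.9, 0.1], "automobile": [0.8, 0.2], "banana": [0.1, 0.9]}

one_on_sim = cosine(one_on("car"), one_on("automobile"))  # distinct words never overlap
dense_syn_sim = cosine(dense["car"], dense["automobile"])
dense_other_sim = cosine(dense["car"], dense["banana"])
```

With One-On vectors any two distinct words are orthogonal, so synonyms look as unrelated as any other pair, while the dense vectors can place them close together.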
                                                                        5. PROPOSED METHOD
In applications of deep learning to natural language processing,
each word is described by its surrounding context. The vector           In this study, in order to detect plagiarism, a sentence-by-sentence
is generated automatically by a deep neural network and contains        comparison is carried out in two phases. We first extract word
semantic and syntactic information about the word. Distributed          vectors with the word2vec algorithm [28], then remove Persian stop
word representation, generally known as word embedding, is used         words during text pre-processing. After that, for each sentence the
to solve the aforementioned problems of high dimensionality and         average of all word vectors is calculated as in equation 1.
sparsity in language models. Here, similar words have
similar vectors [36].
                                                                             S = \frac{1}{n} \sum_{i=1}^{n} w_i                       (1)
 Distributed representation learning was introduced by Hinton for the
first time [20] and developed in the language modeling context by
Bengio [6]. Collobert [11] shows that distributed representations of    Where S is the vector representation of the sentence, w_i is the
words with almost no engineered features can be shared by               word vector of the i-th word of the sentence, and n is the number of
several NLP tasks, resulting in accuracy equal to or better than the    words in that sentence.
state-of-the-art methods. Finally, the authors of [29] indicate that this   After feature extraction, in phase 1, each sentence in a suspicious
kind of representation not only encompasses a large part of syntactic   document is compared with all the sentences in the source
documents. Cosine similarity is used as a comparison metric,           6. EXPERIMENTS
which is described in equation 2.
                                                                       6.1 Dataset
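                                                                       We tune our parameters on the Persian PAN2016 dataset. Since
                                                                       the PAN2016 dataset has not been released yet, detailed
                                                                       information cannot be given here. More detail is available in [1].

   \cos(S_1, S_2) = \frac{S_1 \cdot S_2}{\|S_1\| \|S_2\|} = \frac{\sum_{i=1}^{K} S_{1i} S_{2i}}{\sqrt{\sum_{i=1}^{K} S_{1i}^2} \sqrt{\sum_{i=1}^{K} S_{2i}^2}}      (2)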
                                                                       6.2 Parameter Definition
                                                                       In this paper there are two parameters to be optimized. The task is
                                                                       to answer the following questions.
Where S1 is the sentence vector of the sentence from the suspicious      • What is the optimal threshold for the cosine similarity
documents, S2 is the sentence vector of the sentence from the              measure?
source documents, and K denotes the dimension of the vectors.
                                                                         • What is the optimal threshold for the Jaccard similarity
After this step, which helps us to find the nearest sentences              measure?
quickly, in phase 2 the lexical similarity of the two sentences is
evaluated by the Jaccard similarity measure. The Jaccard similarity    Two sentences are considered as plagiarism if they pass the cosine
score is calculated as in equation 3.                                  similarity threshold (α). The second threshold (β) filters the
                                                                       selected sentences to ensure lexical similarity. These thresholds
   J(S_1, S_2) = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|}      (3)         were fine-tuned by several trials on the training corpus. The results
                                                                       were achieved with α=0.3 and β=0.2.

Where S1 is the set of unique words in the first sentence and S2 is    6.3 Evaluation Metrics
the set of unique words in the second sentence.                        Evaluation measures on this text alignment task include:
                                                                       Precision, recall, and granularity, which are combined into the
Two sentences that pass the Jaccard similarity threshold are considered   plagdet score [34].
as plagiarism at the final step. We used the training corpus to fine-tune
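the thresholds. The workflow of our method is represented in
figure 1.

                                                                       prec(S, R) = \frac{1}{|R|} \sum_{r \in R} \frac{|\bigcup_{s \in S} (s \sqcap r)|}{|r|}      (4)

                                                                       rec(S, R) = \frac{1}{|S|} \sum_{s \in S} \frac{|\bigcup_{r \in R} (s \sqcap r)|}{|s|}       (5)

                                                                       Where s \sqcap r = \begin{cases} s \cap r & \text{if } r \text{ detects } s \\ \emptyset & \text{otherwise} \end{cases}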


                                                                       Where S is the set of plagiarism cases in the corpus and R is the
                                                                       set of detected plagiarism.
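                                                                       Granularity is defined to address overlapping or multiple detections
                                                                       for one plagiarism case and is defined as below.

                                                                       gran(S, R) = \frac{1}{|S_R|} \sum_{s \in S_R} |R_s|              (6)

                                                                       Where S_R \subseteq S is the set of cases detected by detections in R,
                                                                       and R_s \subseteq R is the set of detections of case s.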


                                                                       All these measures are combined into a single score, plagdet, as
                                                                       follows:

                                                                       plagdet(S, R) = \frac{F_1}{\log_2(1 + gran(S, R))}               (7)


                                                                       Where F1 is the harmonic mean of precision and recall.
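
The evaluation measures of equations 4 to 7 can be sketched as follows. This is a simplified illustration, not the official evaluation script of [34]: each plagiarism case and each detection is assumed to be a non-empty set of integer character positions in a single document, and a detection is assumed to "detect" a case whenever the two sets intersect. Returning a granularity of 1.0 when nothing is detected is also an assumption.

```python
import math


def precision(cases, detections):
    """Equation 4: mean fraction of each detection that is covered by true cases."""
    if not detections:
        return 0.0
    total = 0.0
    for r in detections:
        covered = set().union(*(s & r for s in cases))
        total += len(covered) / len(r)
    return total / len(detections)


def recall(cases, detections):
    """Equation 5: mean fraction of each true case that is covered by detections."""
    if not cases:
        return 0.0
    total = 0.0
    for s in cases:
        covered = set().union(*(s & r for r in detections))
        total += len(covered) / len(s)
    return total / len(cases)


def granularity(cases, detections):
    """Equation 6: average number of detections per detected case."""
    hits = [sum(1 for r in detections if s & r) for s in cases]
    hits = [h for h in hits if h > 0]  # keep only detected cases (S_R)
    return sum(hits) / len(hits) if hits else 1.0  # assumption: 1.0 if none detected


def plagdet(prec, rec, gran):
    """Equation 7: F1 of precision and recall, discounted by log2(1 + granularity)."""
    if prec + rec == 0:
        return 0.0
    f1 = 2 * prec * rec / (prec + rec)
    return f1 / math.log2(1 + gran)


# Hypothetical toy data: one 100-character case found as two separate fragments.
cases = [set(range(0, 100))]
detections = [set(range(0, 50)), set(range(60, 90))]

toy_prec = precision(cases, detections)
toy_rec = recall(cases, detections)
toy_gran = granularity(cases, detections)
toy_score = plagdet(toy_prec, toy_rec, toy_gran)
```

As a consistency check, calling `plagdet(0.959, 0.858, 1.000)` with the precision, recall, and granularity reported for this method reproduces the reported plagdet of 0.906 after rounding to three decimals.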


                                                                       6.4 Results
                                                                       The results of applying this method to the Persian PAN2016 corpus are
                                                                       presented in Table 1 (rank 2), and are also reported in [1]. Persian
                                                                       plagiarism detection contest, PAN2016, was hosted on Tira [18,
    Figure 1: Steps of our plagiarism detection method                 33], a framework for shared tasks, and evaluated based on
                                                                       evaluation framework presented in [34].
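
The two-phase comparison of Section 5 can be sketched as below. This is a minimal illustration under stated assumptions, not the submitted system: the embedding table `EMB` holds hypothetical hand-made three-dimensional vectors in place of word2vec vectors [28], tokenization is a plain whitespace split, the stop word set is empty by default, and English words are used for readability. Only the defaults α=0.3 and β=0.2 are taken from Section 6.2.

```python
import math

# Hypothetical toy embeddings; in the paper the vectors come from word2vec [28].
EMB = {
    "students": [0.90, 0.10, 0.00], "pupils": [0.85, 0.15, 0.05],
    "read": [0.10, 0.90, 0.00], "books": [0.00, 0.20, 0.90],
    "rain": [-0.80, 0.10, -0.20], "falls": [-0.10, -0.90, 0.10],
    "today": [0.10, 0.00, -0.90],
    "sun": [-0.90, 0.20, 0.10], "shines": [0.10, 0.10, -0.90],
}


def sentence_vector(tokens, emb):
    """Equation 1: average of the word vectors of the tokens found in emb."""
    vecs = [emb[t] for t in tokens if t in emb]
    if not vecs:
        return None
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def cosine(u, v):
    """Equation 2: cosine similarity of two sentence vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def jaccard(a, b):
    """Equation 3: Jaccard similarity of the unique words of two sentences."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def detect(suspicious, sources, emb, stop_words=frozenset(), alpha=0.3, beta=0.2):
    """Return (suspicious, source) sentence pairs that pass both thresholds."""
    def prep(sentence):
        return [t for t in sentence.lower().split() if t not in stop_words]

    indexed = [(s, prep(s)) for s in sources]
    indexed = [(s, toks, sentence_vector(toks, emb)) for s, toks in indexed]
    pairs = []
    for sent in suspicious:
        toks = prep(sent)
        vec = sentence_vector(toks, emb)
        if vec is None:
            continue
        # Phase 1: nearest source sentence by cosine similarity.
        best = None
        for src, src_toks, src_vec in indexed:
            if src_vec is None:
                continue
            sim = cosine(vec, src_vec)
            if best is None or sim > best[0]:
                best = (sim, src, src_toks)
        # Phase 2: keep the candidate only if it is also lexically similar.
        if best and best[0] >= alpha and jaccard(toks, best[2]) >= beta:
            pairs.append((sent, best[1]))
    return pairs


sources = ["students read books", "rain falls today"]
suspicious = ["books pupils read", "sun shines"]
found = detect(suspicious, sources, EMB)
```

In this toy run the reordered sentence with one synonym is matched to its source, because averaging ignores word order and the synonym vectors are close. The second suspicious sentence shares no words with its nearest source, so the Jaccard check in phase 2 rejects it.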
 7. CONCLUSION                                                        average of two same sentences word vectors are exactly the same.
                                                                      This methods also detect plagiarism with synthetic changes,
 In this paper, we used deep representation of words for plagiarism   include change of word's order, which have the same average
 detection task. Sentence-by-sentence comparison is used to find      vectors, as well. Vocabulary change, include adding or omitting
 text similarities. Advantages of this method among others are its    words, which would be indistinguishable by string matching
 simplicity and its fast sentence comparison. This methods has        approach, could be identify by the proposed method. The reason is
 resulted in %90.6 plagdet, %85.8 recall, %95.9 precision on the      that the average vector is insensitive to few number of changes in
 PAN2016 provided data sets.                                          a sentence vocabulary. On semantic changes, which is our main
                                                                      privilege in this task among others, plagiarism could easily be
 Why our method works? Since our comparison transformed from          detected due to the similarity of synonym word vectors which
 word-by-word or n-gram-by-n-gram representation of text to           make no or little changes on final sentence vector. Therefore, time
 numerical one, the calculation of similarity execute in a much       consuming synonym word retrieval from lexicon has become
 faster and more convenient way. Our method could easily and          inessential.
 immediately address plagiarism with no obfuscation since the
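
 The argument about averaged word vectors can be illustrated with a
 minimal sketch. It is not the submitted system: the three-dimensional
 vectors in TOY_VECTORS are hypothetical values chosen for
 illustration, English words stand in for Persian ones, and cosine
 similarity is assumed as the comparison measure.

```python
import math

# Hypothetical toy word vectors, for illustration only. A real system
# would load vectors trained on a large corpus.
TOY_VECTORS = {
    "the":        [0.10, 0.00, 0.10],
    "car":        [0.90, 0.10, 0.00],
    "automobile": [0.85, 0.15, 0.05],  # deliberately close to "car"
    "is":         [0.00, 0.10, 0.10],
    "fast":       [0.10, 0.90, 0.20],
    "banana":     [0.00, 0.10, 0.90],
    "yellow":     [0.10, 0.00, 0.80],
}


def sentence_vector(tokens, vectors):
    """Average the vectors of all known tokens; None if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity(sentence_a, sentence_b, vectors=TOY_VECTORS):
    """Compare two whitespace-tokenized sentences by their average vectors."""
    vec_a = sentence_vector(sentence_a.split(), vectors)
    vec_b = sentence_vector(sentence_b.split(), vectors)
    if vec_a is None or vec_b is None:
        return 0.0
    return cosine(vec_a, vec_b)


original = "the car is fast"
print(similarity(original, "fast is the car"))         # reordered words
print(similarity(original, "the automobile is fast"))  # synonym swapped in
print(similarity(original, "the banana is yellow"))    # unrelated sentence
```

 With these toy values the reordered sentence scores essentially 1,
 the synonym variant stays very close to 1, and the unrelated sentence
 scores much lower. A decision threshold on this score would have to
 be tuned on real data.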

 Table 1: Results of text alignment software submissions in PersianPlagDet-2016 (PAN16)

 Rank  Team (members, then affiliation)                              Plagdet  Granularity  Precision  Recall
 1     Fatemeh Mashhadi, Mehrnoush Shamsfard                         0.922    1.001        0.927      0.919
         Shahid Beheshti University, NLP Research Lab
 2     Hadi Veisi, Kayvan Bijari, Kiarash Zahirnia, Erfaneh Gharavi  0.906    1.000        0.959      0.858
         University of Tehran, Data & Signal processing Lab
 3     Mozhgan Momtaz, Kayvan Bijari, Davood Heidarpour              0.871    1.000        0.893      0.850
         University of Tehran, COIN Lab
 4     Mahdi Niknam                                                  0.830    1.040        0.920      0.796
         University of Qom
 5     Faezeh Esteki, Faramarz Safi Esfahani                         0.801    1.000        0.933      0.701
         Najafabad Branch, Islamic Azad University
 6     Alireza Talebpour, Mohammad Shirzadi, Zahra Aminolroaya,      0.775    1.228        0.964      0.836
       Mohammad Adibi, Ahmad Mahmoudi-Aznaveh
         Shahid Beheshti University, Content lab / cyberspace research institute
 7     Nava Ehsan                                                    0.727    1.000        0.750      0.705
         University of Tehran
 8     Lee Gillam, Anna Vartapetiance                                0.400    1.528        0.755      0.414
         University of Surrey
 9     Muharram Mansoorizadeh                                        0.390    3.537        0.900      0.807
         Bu-Ali Sina University
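
 The plagdet column in Table 1 combines the other three measures.
 Under the definition in the evaluation framework of [34], plagdet is
 the F1 score of precision and recall divided by log2(1 +
 granularity). The sketch below recomputes it from the rounded table
 values; the function names are our own, and small deviations in the
 last digit are possible because the inputs are rounded.

```python
import math


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


def plagdet(precision, recall, granularity):
    """plagdet = F1 / log2(1 + granularity), with granularity >= 1."""
    return f1_score(precision, recall) / math.log2(1.0 + granularity)


# Row 2 of Table 1: precision 0.959, recall 0.858, granularity 1.000
print(round(plagdet(0.959, 0.858, 1.000), 3))  # 0.906
```

 A granularity of 1.000 leaves the F1 score unchanged, while a higher
 granularity (one plagiarism case reported as several fragments)
 lowers plagdet, which is why the rank 9 entry scores 0.390 despite
 high precision and recall.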



 8. REFERENCES

[1] Asghari, H., Mohtaj, S., Fatemi, O., Faili, H., Rosso, P., and
    Potthast, M., 2016. Algorithms and Corpora for Persian
    Plagiarism Detection: Overview of PAN at FIRE 2016. In
    Working notes of FIRE 2016 - Forum for Information
    Retrieval Evaluation, Kolkata, India, December 7-10, 2016,
    CEUR Workshop Proceedings, CEUR-WS.org.
[2] Barrón-Cedeño, A., Vila, M., Martí, M.A., and Rosso, P.,
    2013. Plagiarism meets paraphrasing: Insights for the next
    generation in automatic plagiarism detection. Computational
    Linguistics 39, 4, 917-947.
[3] Bengio, Y., 2009. Learning deep architectures for
    AI. Foundations and trends® in Machine Learning, 2(1), 1-127.
[4] Bengio, Y., Courville, A., and Vincent, P., 2013.
    Representation learning: A review and new perspectives.
    IEEE transactions on pattern analysis and machine
    intelligence 35, 8, 1798-1828.
[5] Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C., 2003.
    A neural probabilistic language model. Journal of machine
    learning research, 3(Feb), 1137-1155.
[6] Bengio, Y., Schwenk, H., Senécal, J.-S., Morin, F., and
    Gauvain, J.-L., 2006. Neural probabilistic language models.
    In Innovations in Machine Learning, 137-186.
[7] Chen, C.-Y., Yeh, J.-Y., and Ke, H.-R., 2010. Plagiarism
    detection using ROUGE and WordNet. arXiv preprint
    arXiv:1003.4065.
[8] Chong, M.Y.M., 2013. A study on plagiarism detection and
    plagiarism direction identification using natural language
    processing techniques.
[9] Clough, P., 2003. Old and new challenges in automatic
    plagiarism detection. In National Plagiarism Advisory
    Service, 2003; http://ir.shef.ac.uk/cloughie/index.html.
[10] Collobert, R. and Weston, J., 2008. A unified architecture
     for natural language processing: Deep neural networks with
     multitask learning. In Proceedings of the 25th international
     conference on Machine learning ACM, 160-167.
[11] Collobert, R., Weston, J., Bottou, L., Karlen, M.,
     Kavukcuoglu, K., and Kuksa, P., 2011. Natural language
     processing (almost) from scratch. Journal of machine
     learning research 12, Aug, 2493-2537.
[12] Dahl, G., Mohamed, A.-R., and Hinton, G.E., 2010. Phone
     recognition with the mean-covariance restricted Boltzmann
     machine. In Advances in neural information processing
     systems, 469-477.
[13] Elhadi, M. and Al-Tobi, A., 2008. Use of text syntactical
     structures in detection of document duplicates. In Digital
     Information Management, ICDIM 2008. Third International
     Conference on IEEE, 520-525.
[14] Elhadi, M. and Al-Tobi, A., 2009. Duplicate detection in
     documents and webpages using improved longest common
     subsequence and documents syntactical structures. In
     Computer Sciences and Convergence Information
     Technology, ICCIT'09. Fourth International Conference on
     IEEE, 679-684.
[15] Firth, J.R., 1957. A synopsis of linguistic theory. In Studies
     in Linguistic Analysis, Philological Society, Oxford.
[16] Franco-Salvador, M., Bensalem, I., Flores, E., Gupta, P., and
     Rosso, P., 2015. PAN 2015 Shared Task on Plagiarism
     Detection: Evaluation of Corpora for Text Alignment. In
     Volume 1391 of CEUR workshop proceedings CLEF and
     CEUR-WS.org.
[17] Glinos, D.S., 2014. A Hybrid Architecture for Plagiarism
     Detection. In CLEF (Working Notes), 958-965.
[18] Gollub, T., Stein, B., and Burrows, S., 2012. Ousting ivory
     tower research: towards a web framework for providing
     experiments as a service. In Proceedings of the 35th
     international ACM SIGIR conference on Research and
     development in information retrieval ACM, 1125-1126.
[19] Hinton, G.E., Osindero, S., and Teh, Y.W., 2006. A fast
     learning algorithm for deep belief nets. Neural
     computation, 18(7), 1527-1554.
[20] Hinton, G.E., 1986. Learning distributed representations of
     concepts. In Proceedings of the eighth annual conference of
     the cognitive science society Amherst, MA, 12.
[21] Hoad, T.C. and Zobel, J., 2003. Methods for identifying
     versioned and plagiarized documents. Journal of the
     American society for information science and technology 54,
     3, 203-215.
[22] Lathrop, A. and Foss, K., 2000. Student Cheating and
     Plagiarism in the Internet Era. A Wake-Up Call. ERIC.
[23] Le, Q.V. and Mikolov, T., 2014. Distributed Representations
     of Sentences and Documents. In ICML, 1188-1196.
[24] Mahdavi, P., Siadati, Z., and Yaghmaee, F., 2014. Automatic
     external Persian plagiarism detection using vector space
     model. In Computer and Knowledge Engineering (ICCKE),
     2014 4th International eConference on IEEE, 697-702.
[25] Mahmoodi, M. and Varnamkhasti, M.M., 2014. Design a
     Persian Automated Plagiarism Detector (AMZPPD). arXiv
     preprint arXiv:1403.1618.
[26] Manning, C.D. and Schütze, H., 1999. Foundations of
     statistical natural language processing (Vol. 999).
     Cambridge: MIT press.
[27] Maurer, H. and Zaka, B., 2007. Plagiarism - a problem and
     how to fight it. Proceeding of Ed-Media 2007, 4451-4458.
[28] Mikolov, T., Chen, K., Corrado, G., and Dean, J., 2013.
     Efficient estimation of word representations in vector space.
     arXiv preprint arXiv:1301.3781.
[29] Mikolov, T., Yih, W.-T., and Zweig, G., 2013. Linguistic
     Regularities in Continuous Space Word Representations. In
     HLT-NAACL, 746-751.
[30] Mitchell, J. and Lapata, M., 2010. Composition in
     distributional models of semantics. Cognitive science, 34(8),
     1388-1429.
[31] Nahnsen, T., Uzuner, O., and Katz, B., 2005. Lexical chains
     and sliding locality windows in content-based text similarity
     detection.
[32] Pennington, J., Socher, R., and Manning, C.D., 2014. Glove:
     Global Vectors for Word Representation. In EMNLP,
     1532-1543.
[33] Potthast, M., Gollub, T., Rangel, F., Rosso, P.,
     Stamatatos, E., and Stein, B., 2014. Improving the
     Reproducibility of PAN’s Shared Tasks. In International
     Conference of the Cross-Language Evaluation Forum for
     European Languages Springer, 268-299.
[34] Potthast, M., Stein, B., Barrón-Cedeño, A., and Rosso, P.,
     2010. An evaluation framework for plagiarism detection. In
     Proceedings of the 23rd international conference on
     computational linguistics: Posters Association for
     Computational Linguistics, 997-1005.
[35] Sanchez-Perez, M.A., Gelbukh, A., and Sidorov, G.
     Dynamically Adjustable Approach through Obfuscation
     Type Recognition.
[36] Socher, R., 2014. Recursive Deep Learning for Natural
     Language Processing and Computer Vision. PhD thesis,
     Stanford University.
[37] Socher, R., Huang, E.H., Pennington, J., Manning, C.D., and
     Ng, A.Y., 2011. Dynamic pooling and unfolding recursive
     autoencoders for paraphrase detection. In Advances in
     Neural Information Processing Systems, 801-809.
[38] Suchomel, S., Kasprzak, J., and Brandejs, M., 2012. Three
     Way Search Engine Queries with Multi-feature Document
     Comparison for Plagiarism Detection. In CLEF (Online
     Working Notes/Labs/Workshop) Citeseer, 1-8.
[39] Torres, S. and Gelbukh, A., 2009. Comparing similarity
     measures for original WSD lesk algorithm. Research in
     Computing Science 43, 155-166.
[40] Zini, M., Fabbri, M., Moneglia, M., and Panunzi, A., 2006.
     Plagiarism detection through multilevel text comparison. In
     2006 Second International Conference on Automated
     Production of Cross Media Content for Multi-Channel
     Distribution (AXMEDIS'06) IEEE, 181-185.