NLP-NITMZ@DPIL-FIRE2016: Language Independent Paraphrase Detection

Sandip Sarkar (Computer Science and Engineering, Jadavpur University, sandipsarkar.ju@gmail.com), Saurav Saha (Computer Science and Engineering, NIT Mizoram, me@sauravsaha.in), Jereemi Bentham (Computer Science and Engineering, NIT Mizoram, jereemibentham@gmail.com), Partha Pakray (Computer Science and Engineering, NIT Mizoram, parthapakray@gmail.com), Dipankar Das (Computer Science and Engineering, Jadavpur University, dipankar.dipnil2005@gmail.com), Alexander Gelbukh (CIC, Instituto Politécnico Nacional, Mexico, gelbukh@gelbukh.com)

ABSTRACT
In this paper we describe in detail the NLP-NITMZ system and its participation in the DPIL shared task at the Forum for Information Retrieval Evaluation (FIRE 2016). The main aim of the DPIL shared task is to detect paraphrases in Indian languages. Paraphrase detection is an important component of information retrieval, document summarization, question answering, plagiarism detection, and related applications. In our approach, we use a language-independent feature set, based mainly on lexical similarity, to detect paraphrases in Indian languages. Our system's three features are Jaccard similarity, length-normalized edit distance, and cosine similarity. These features are used to train a Probabilistic Neural Network (PNN) that detects the paraphrases. With this feature set, we achieved 88.13% average accuracy in Sub-Task 1 and 71.98% average accuracy in Sub-Task 2.

Keywords
Probabilistic Neural Network (PNN), Plagiarism Detection, DPIL, Jaccard Similarity.

1. INTRODUCTION
Ambiguity is one of the major difficulties in Natural Language Processing (NLP): the same content can be expressed in many forms, both lexical and semantic. This is known as paraphrasing. Here we consider only lexical-level similarity for paraphrase detection. Paraphrase detection is a very important and challenging task in information retrieval, question answering, text simplification, plagiarism detection, text summarization, and even paraphrase detection on SMS messages [1]. In information retrieval, relevant documents can be retrieved using paraphrase detection. Similarly, in a question answering system, the best answer can be identified using paraphrase detection. Paraphrase detection is also used in plagiarism detection, to find sentences that are paraphrases of each other.

Researchers have used different types of approaches [2] [3] [4], such as lexical similarity, syntactic similarity [5], and others, to detect paraphrases. Research problems based on paraphrasing can be divided into three categories: paraphrase generation, paraphrase extraction, and paraphrase recognition.

This paper describes the NLP-NITMZ system, which participated in the DPIL shared task [6] (http://nlp.amrita.edu/dpil_cen/). The DPIL (Detecting Paraphrases in Indian Languages) task focuses on sentence-level paraphrase identification for four Indian languages: Tamil, Malayalam, Hindi, and Punjabi. The shared task is divided into two sub-tasks. In Sub-Task 1, the participants have to classify sentence pairs into two categories, Paraphrase (P) and Non-Paraphrase (NP). In Sub-Task 2, the participants have to classify sentence pairs on a three-point scale: Completely Equivalent (E), Roughly Equivalent (RE), or Not Equivalent (NE), i.e., paraphrase, semi-paraphrase, and non-paraphrase. Table 1 shows examples from the DPIL training dataset.

Table 1. Sentence pairs with classification tags

Malayalam pair (Tag: P)
  Sentence 1: പിഞ്ചുകുഞ്ഞുങ്ങളെ വിഷം കൊടുത്തു കൊന്ന് യുവതി ആത്മഹത്യ ചെയ്തു.
  Sentence 2: രണ്ടു മക്കളെ വിഷം കൊടുത്തു കൊന്നശേഷം യുവതി ആത്മഹത്യ ചെയ്തു.

Tamil pair (Tag: NP)
  Sentence 1: மும்பை குண்டுவெடிப்பு வழக்கில் மேலும் ஒருவர் கைது.
  Sentence 2: பிரசெல்ஸ் குண்டுவெடிப்பு முக்கிய குற்றவாளி நஜீம் லாஷ்ராவி ஐ.எஸ் அமைப்பில் ஜெயிலராக இருந்தார்.

Punjabi pair (Tag: SP)
  Sentence 1: ਹੁਣ ਵਿਭਾਗ ਨੂੰ ਬਣਦਾ ਕਿਰਾਇਆ ਅਦਾ ਕਰਨ ਲਈ ਕੇਸ ਬਣਾ ਕੇ ਭੇਜ ਦਿੱਤਾ ਹੈ ਤੇ ਜਲਦ ਹੀ ਕਿਰਾਇਆ ਅਦਾ ਕਰ ਦਿੱਤਾ ਜਾਵੇਗਾ।
  Sentence 2: ਹੁਣ ਵਿਭਾਗ ਨੂੰ ਬਣਦਾ ਕਿਰਾਇਆ ਅਦਾ ਕਰਨ ਲਈ ਕੇਸ ਬਣਾ ਕੇ ਭੇਜ ਦਿੱਤਾ ਹੈ|

Hindi pair (Tag: P)
  Sentence 1: क्रिकेट के भगवान सचिन को जन्मदिन मुबारक हो, दीजिए बधाई|
  Sentence 2: के हुए सचिन तेंदुलकर जन्मदिन मुबारक हो, दीजिए बधाई|

In Section 2 we present the detailed architecture of our system, including the feature set and the machine-learning technique. Section 3 gives detailed statistics of the training and test data used by our system. The results on the test data are reported in Section 4. Section 5 presents the conclusion and future work.

2. SYSTEM ARCHITECTURE
In this section we elaborate on our proposed architecture. As shown in Figure 1, our system NLP-NITMZ is based on three language-independent features: Jaccard similarity, Levenshtein ratio, and cosine similarity. To find the Jaccard similarity, we first count the unigrams shared by the two texts; the similarity score is then obtained by dividing this count by the total number of unique unigrams in the two sentences. The second feature is the Levenshtein ratio, which is based on the total number of operations required to change one string into another. The final feature is cosine similarity, where each sentence is represented in a vector space model.

Figure 1. System Architecture of NLP-NITMZ.

For the machine-learning part, we used a Probabilistic Neural Network (PNN) to predict the class. The PNN is derived from the Bayesian network. It is normally used for classification problems and has four layers: the input layer, pattern layer, summation layer, and output layer. An advantage of the PNN is that it is much faster than a feed-forward neural network.

2.1 Features
Our system NLP-NITMZ uses three language-independent features. These are lexical features, which are widely used to measure the similarity between sentences in any language [7] [8].

2.1.1 Jaccard Similarity
The Jaccard similarity coefficient measures the similarity and difference of two sets. For our task, the Jaccard similarity coefficient of two sentences is the ratio of the number of matching unigrams to the total number of unique words in the two sentences. If S1 and S2 are the two sets, the Jaccard similarity is defined by the following equation:

\[ \mathrm{Jaccard\ Similarity}(S_1, S_2) = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|} \]

Table 2 shows an example of Jaccard similarity.

Table 2. Jaccard Similarity
  Sentence 1: भारतीय मुस्लिमों की वजह से नहीं पनप सकता आईएस|
  Sentence 2: भारत में कभी वर्चस्व कायम नहीं कर सकता आईएस|
  Score: 0.2
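For illustration, the following is a minimal Python sketch of this feature (not the authors' code; whitespace tokenization is our assumption, since the paper does not specify how sentences were tokenized):

    def jaccard_similarity(sentence1: str, sentence2: str) -> float:
        """Unigram Jaccard similarity: |S1 ∩ S2| / |S1 ∪ S2|."""
        # Tokenization by whitespace is an assumption, not stated in the paper.
        s1, s2 = set(sentence1.split()), set(sentence2.split())
        if not s1 and not s2:
            return 0.0
        return len(s1 & s2) / len(s1 | s2)

With this tokenization, the pair in Table 2 shares 3 unigrams out of 15 unique ones, which reproduces the reported score of 0.2.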
2.1.2 Levenshtein Ratio
The most common feature for comparing two strings is the Levenshtein distance, the minimum number of edit operations (replacements, insertions, and deletions) required to convert one string into the other [9]. In our task we assign the same weight, 1, to all operations, and we consider the character-level distance between the words of the sentences. The probability that two sentences are paraphrases is high when their edit distance is small. The distance is length-normalized into the Levenshtein ratio:

\[ \mathrm{EditRatio}(a, b) = 1 - \frac{\mathrm{EditDistance}(a, b)}{|a| + |b|} \]

An example of the Levenshtein ratio is given in Table 3.

Table 3. Levenshtein Ratio
  Sentence 1: भारतीय मुस्लिमों की वजह से नहीं पनप सकता आईएस|
  Sentence 2: भारत में कभी वर्चस्व कायम नहीं कर सकता आईएस|
  Score: 0.7712
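As an illustrative sketch (our function names, not the authors' implementation), the length-normalized edit distance above can be computed in Python as follows, with unit cost for every insertion, deletion, and substitution as stated:

    def edit_distance(a: str, b: str) -> int:
        """Character-level Levenshtein distance with unit costs."""
        # previous[j] holds the distance between a[:i-1] and b[:j].
        previous = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            current = [i]
            for j, cb in enumerate(b, start=1):
                cost = 0 if ca == cb else 1
                current.append(min(previous[j] + 1,          # deletion
                                   current[j - 1] + 1,       # insertion
                                   previous[j - 1] + cost))  # substitution
            previous = current
        return previous[-1]

    def edit_ratio(a: str, b: str) -> float:
        """EditRatio(a, b) = 1 - EditDistance(a, b) / (|a| + |b|)."""
        if not a and not b:
            return 1.0
        return 1.0 - edit_distance(a, b) / (len(a) + len(b))

The paper does not state the exact preprocessing behind the 0.7712 score in Table 3, so this sketch need not reproduce it exactly.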
2.1.3 Cosine Similarity
Cosine similarity is another widely used feature for measuring the similarity between two sentences. Each sentence is represented as a word vector, whose components are the frequencies of the words in the sentence. The cosine similarity is then the dot product of the two word vectors divided by the product of their lengths:

\[ \mathrm{Cosine\ Similarity}(A, B) = \frac{A \cdot B}{|A|\,|B|} \]

Table 4 shows the cosine similarity of a Hindi sentence pair.

Table 4. Cosine Similarity
  Sentence 1: भारतीय मुस्लिमों की वजह से नहीं पनप सकता आईएस|
  Sentence 2: भारत में कभी वर्चस्व कायम नहीं कर सकता आईएस|
  Score: 0.523
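A small Python sketch of this term-frequency cosine similarity is given below (whitespace tokenization is again our assumption; the exact preprocessing behind the 0.523 score in Table 4 is not specified in the paper, so the sketch need not reproduce it):

    import math
    from collections import Counter

    def cosine_similarity(sentence1: str, sentence2: str) -> float:
        """Cosine similarity over term-frequency vectors: A·B / (|A||B|)."""
        a = Counter(sentence1.split())   # word -> frequency
        b = Counter(sentence2.split())
        dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
        norm_a = math.sqrt(sum(v * v for v in a.values()))
        norm_b = math.sqrt(sum(v * v for v in b.values()))
        if norm_a == 0 or norm_b == 0:
            return 0.0
        return dot / (norm_a * norm_b)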
2.2 Classification Approach
For this classification task we used a Probabilistic Neural Network (PNN) to classify the sentence pairs. The PNN was first introduced by Specht [10] and is based mainly on Bayes-Parzen classification. The PNN is a supervised learning network. It is implemented using a probabilistic model, like a Bayesian classifier, and it does not require the initial weights of the network to be set. The overall structure of the PNN is illustrated in Figure 2. The PNN [11] has four layers: the input layer, pattern layer, summation layer, and output layer. The PNN has many advantages: it trains much faster than the well-known back-propagation algorithm, it has a simple structure, it generates accurate predicted target probability scores, and it approaches Bayes-optimal classification [12]. At the same time, it is robust to noisy examples. Since the training process is easy and quick, PNNs can be used in real time. For our experiments we used the existing MATLAB toolkit (http://in.mathworks.com/help/nnet/ref/newpnn.html) to classify the test data.

Figure 2. Architecture of the PNN.

A simple probability density function (pdf) for class k is given below, where X is the unknown input, X_k is the k-th sample, σ is the smoothing parameter, and p is the length of the vectors:

\[ f_k(X) = \frac{1}{(2\pi)^{p/2}\,\sigma^{p}}\; e^{-\frac{\lVert X - X_k \rVert^2}{2\sigma^2}} \]

The accuracy of PNN classification depends mainly on this probability density function. The density function for a single population i is obtained by averaging over its training samples, where n_i is the number of samples X_k in population i:

\[ g_i(X) = \frac{1}{(2\pi)^{p/2}\,\sigma^{p}}\, \frac{1}{n_i} \sum_{k=1}^{n_i} e^{-\frac{\lVert X - X_k \rVert^2}{2\sigma^2}} \]

If there are two classes i and j, the classification criterion is decided by the comparison g_i(X) > g_j(X) for all j ≠ i.
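The authors used the MATLAB newpnn toolkit for this step. Purely as an illustrative sketch of the decision rule above (not the submitted implementation), a minimal PNN classifier could be written in Python as follows; the feature values and the smoothing parameter sigma below are made-up examples:

    import numpy as np

    def pnn_classify(X_train: np.ndarray, y_train: np.ndarray,
                     x: np.ndarray, sigma: float = 0.1) -> int:
        """PNN decision rule: return the label i maximizing g_i(x).

        The constant factor 1 / ((2 pi)^(p/2) sigma^p) is the same for
        every class, so it is dropped from the comparison.
        """
        scores = {}
        for label in np.unique(y_train):
            samples = X_train[y_train == label]
            dists = np.sum((samples - x) ** 2, axis=1)       # ||x - x_k||^2
            kernels = np.exp(-dists / (2.0 * sigma ** 2))    # pattern layer
            scores[label] = kernels.mean()                   # summation layer
        return max(scores, key=scores.get)                   # output layer

    # Hypothetical example: features are [Jaccard, EditRatio, Cosine]
    # per sentence pair; the numbers are invented for illustration.
    X_train = np.array([[0.80, 0.90, 0.85],   # paraphrase-like pair
                        [0.10, 0.40, 0.20]])  # non-paraphrase-like pair
    y_train = np.array([1, 0])                # 1 = P, 0 = NP
    print(pnn_classify(X_train, y_train, np.array([0.7, 0.8, 0.9])))  # -> 1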
3. DATASET
The DPIL shared task includes sentence pairs in four languages: Tamil, Malayalam, Hindi, and Punjabi. The shared task is divided into two sub-tasks. In Sub-Task 1, the main aim is to classify each sentence pair as paraphrase (P) or non-paraphrase (NP). Similarly, the aim of Sub-Task 2 is to assign each sentence pair to one of three categories: completely equivalent (E), roughly equivalent (RE), or not equivalent (NE). Table 5 gives detailed statistics of the training and test datasets.

Table 5. Statistics of the training and test datasets

  LANGUAGE    TASK      Count (Train)   Count (Test)
  Hindi       Task 1    2500            900
  Hindi       Task 2    3500            1400
  Malayalam   Task 1    2500            900
  Malayalam   Task 2    3500            1400
  Punjabi     Task 1    1700            500
  Punjabi     Task 2    2200            750
  Tamil       Task 1    2500            900
  Tamil       Task 2    3500            1400

4. RESULT
The accuracy and F1 score of our system for each language and task are given in Table 6, together with the scores of the winning system. Our proposed method achieved very good results for Punjabi and Hindi, whereas it struggled on Malayalam and Tamil. The F1 score is the harmonic mean of precision and recall; the macro F1 score is used for the Task 2 evaluation. Precision, recall, F1 score, and accuracy are defined by the following equations, where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively:

\[ \mathrm{Precision} = \frac{TP}{TP + FP} \]

\[ \mathrm{Recall} = \frac{TP}{TP + FN} \]

\[ F_1 = \frac{2\,TP}{2\,TP + FP + FN} \]

\[ \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]

Table 6. Comparison between the winner's scores and our system's scores

                        Our System              Winner's System
  LANGUAGE    TASK      Accuracy    F1 Score    Accuracy    F1 Score
  Hindi       Task 1    0.91555     0.91        0.92        0.91
  Hindi       Task 2    0.78571     0.76422     0.90142     0.90001
  Malayalam   Task 1    0.83444     0.79        0.83777     0.81
  Malayalam   Task 2    0.62428     0.60677     0.74857     0.74597
  Punjabi     Task 1    0.942       0.94        0.946       0.95
  Punjabi     Task 2    0.812       0.8086      0.92266     0.923
  Tamil       Task 1    0.83333     0.79        0.8333      0.79
  Tamil       Task 2    0.65714     0.63067     0.755       0.73979

5. CONCLUSION AND FUTURE WORK
In this paper we presented our NLP-NITMZ system used for the DPIL shared task. Overall, our approach looks promising, but it needs some improvement. The PNN also has some disadvantages, such as its large memory requirement and its slow execution at classification time. In future work we want to overcome those problems using a better machine-learning approach, and we also want to implement semantic features for all languages to increase performance. We can also identify the stop words of all four languages so that we can omit them from the corpus. Since our approach is based on a language-independent feature set, our methodology can be extended to other languages.

6. ACKNOWLEDGMENTS
The work presented here is under Research Project Grant No. YSS/2015/000988 and is supported by the Department of Science & Technology (DST) and the Science and Engineering Research Board (SERB), Govt. of India. The authors also acknowledge the Department of Computer Science & Engineering of the National Institute of Technology Mizoram, India, for providing infrastructural facilities and support.

7. REFERENCES
[1] Wu, W., Ju, Y., Li, X., and Wang, Y. 2010. Paraphrase detection on SMS messages in automobiles. In Acoustics, Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on, pp. 5326-5329.
[2] Dolan, B., and Brockett, C. 2005. Automatically Constructing a Corpus of Sentential Paraphrases. In Third International Workshop on Paraphrasing (IWP2005).
[3] Sundaram, M. S., Madasamy, A. K., and Padannayil, S. K. 2015. AMRITA_CEN@SemEval-2015: Paraphrase Detection for Twitter using Unsupervised Feature Learning with Recursive Autoencoders. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pp. 45-50.
[4] Mahalakshmi, S., Anand Kumar, M., and Soman, K. P. 2015. Paraphrase detection for Tamil language using deep learning algorithm. International Journal of Applied Engineering Research, 10 (17), pp. 13929-13934.
[5] Socher, R., Huang, E., Pennin, J., Manning, C. D., and Ng, A. Y. 2011. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In Advances in Neural Information Processing Systems, pp. 801-809.
[6] Anand Kumar, M., Singh, S., Kavirajan, B., and Soman, K. P. 2016. DPIL@FIRE2016: Overview of shared task on Detecting Paraphrases in Indian Languages. In Working notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings, CEUR-WS.org.
[7] Pakray, P., and Sojka, P. 2014. An Architecture for Scientific Document Retrieval Using Textual and Math Entailment Modules. In RASLAN 2014: Recent Advances in Slavonic Natural Language Processing, Karlova Studánka, Czech Republic, December 5-7, 2014.
[8] Lynum, A., Pakray, P., Gamback, B., and Jimenez, S. 2014. NTNU: Measuring semantic similarity with sublexical feature representations and soft cardinality. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pp. 448-453, Dublin, Ireland, August 23-24, 2014.
[9] Sarkar, S., Das, D., Pakray, P., and Gelbukh, A. 2016. JUNITMZ at SemEval-2016 Task 1: Identifying Semantic Similarity Using Levenshtein Ratio. In Proceedings of SemEval-2016, pp. 702-705, San Diego, California.
[10] Specht, D. F. 1990. Probabilistic neural networks. Neural Networks, 3, pp. 109-118.
[11] Specht, D. F. 1990. Probabilistic Neural Networks and the Polynomial Adaline as Complementary Techniques for Classification. IEEE Transactions on Neural Networks, vol. 1, no. 1, March 1990.
[12] Hajmeer, M., and Basheer, I. 2002. A probabilistic neural network approach for modeling and classification of bacterial growth/no-growth data. Journal of Microbiological Methods, vol. 51, no. 2, pp. 217-226.