NLP-NITMZ@DPIL-FIRE2016: Language Independent Paraphrase Detection

Sandip Sarkar (Computer Science and Engineering, Jadavpur University, sandipsarkar.ju@gmail.com), Saurav Saha (Computer Science and Engineering, NIT Mizoram, me@sauravsaha.in), Jereemi Bentham (Computer Science and Engineering, NIT Mizoram, jereemibentham@gmail.com), Partha Pakray (Computer Science and Engineering, NIT Mizoram, parthapakray@gmail.com), Dipankar Das (Computer Science and Engineering, Jadavpur University, dipankar.dipnil2005@gmail.com), Alexander Gelbukh (CIC, Instituto Politécnico Nacional, Mexico, gelbukh@gelbukh.com)

ABSTRACT
In this paper we describe in detail the NLP-NITMZ system and its participation in the DPIL shared task at the Forum for Information Retrieval Evaluation (FIRE 2016). The main aim of the DPIL shared task is to detect paraphrases in Indian languages. Paraphrase detection is an important component of information retrieval, document summarization, question answering, plagiarism detection, and related applications. In our approach, we use a language-independent feature set, based mainly on lexical similarity, to detect paraphrases in Indian languages. Our system's three features are Jaccard similarity, length-normalized edit distance, and cosine similarity. These features are used to train a Probabilistic Neural Network (PNN) that detects the paraphrases. With this feature set, we achieved 88.13% average accuracy in Sub-Task 1 and 71.98% average accuracy in Sub-Task 2.

Keywords
Probabilistic Neural Network (PNN), Plagiarism Detection, DPIL, Jaccard Similarity.

1. INTRODUCTION
Ambiguity is one of the major difficulties in Natural Language Processing (NLP): the same content can be expressed in many forms, both lexical and semantic. This is known as paraphrasing. Here we consider only lexical-level similarity for paraphrase detection. Paraphrase detection is a very important and challenging task in information retrieval, question answering, text simplification, plagiarism detection, text summarization, and even paraphrase detection on SMS messages [1]. In information retrieval, relevant documents can be retrieved using paraphrase detection. Similarly, in a question answering system, the best answer can be identified using paraphrase detection. Paraphrase detection is also used in plagiarism detection, to find sentences that are paraphrases of each other.

Researchers have used different types of approaches [2] [3] [4], such as lexical similarity, syntactic similarity [5], and others, to detect paraphrases. Research problems based on paraphrasing can be divided into three categories: paraphrase generation, paraphrase extraction, and paraphrase recognition.

This paper describes the NLP-NITMZ system, which participated in the DPIL shared task [6] (http://nlp.amrita.edu/dpil_cen/). The DPIL (Detecting Paraphrases in Indian Languages) task focuses on sentence-level paraphrase identification for four Indian languages: Tamil, Malayalam, Hindi, and Punjabi. The shared task is divided into two sub-tasks. In Sub-Task 1, the participants have to classify sentence pairs into two categories, Paraphrase (P) and Non-Paraphrase (NP). In Sub-Task 2, the participants have to classify sentence pairs on a three-point scale: Completely Equivalent (E), Roughly Equivalent (RE), or Not Equivalent (NE), i.e., paraphrase, semi-paraphrase, and non-paraphrase. Table 1 shows examples from the DPIL training dataset.

Table 1. Sentence pairs with classification tags

Malayalam pair (Tag: P)
  Sentence 1: പിഞ്ചുകുഞ്ഞുങ്ങളെ വിഷം കൊടുത്തു കൊന്ന് യുവതി ആത്മഹത്യ ചെയ്തു.
  Sentence 2: രണ്ടു മക്കളെ വിഷം കൊടുത്തു കൊന്നശേഷം യുവതി ആത്മഹത്യ ചെയ്തു.

Tamil pair (Tag: NP)
  Sentence 1: மும்பை குண்டுவெடிப்பு வழக்கில் மேலும் ஒருவர் கைது.
  Sentence 2: பிரசெல்ஸ் குண்டுவெடிப்பு முக்கிய குற்றவாளி நஜீம் லாஷ்ராவி ஐ.எஸ் அமைப்பில் ஜெயிலராக இருந்தார்.

Punjabi pair (Tag: SP)
  Sentence 1: ਹੁਣ ਵਿਭਾਗ ਨੂੰ ਬਣਦਾ ਕਿਰਾਇਆ ਅਦਾ ਕਰਨ ਲਈ ਕੇਸ ਬਣਾ ਕੇ ਭੇਜ ਦਿੱਤਾ ਹੈ ਤੇ ਜਲਦ ਹੀ ਕਿਰਾਇਆ ਅਦਾ ਕਰ ਦਿੱਤਾ ਜਾਵੇਗਾ।
  Sentence 2: ਹੁਣ ਵਿਭਾਗ ਨੂੰ ਬਣਦਾ ਕਿਰਾਇਆ ਅਦਾ ਕਰਨ ਲਈ ਕੇਸ ਬਣਾ ਕੇ ਭੇਜ ਦਿੱਤਾ ਹੈ|

Hindi pair (Tag: P)
  Sentence 1: क्रिकेट के भगवान सचिन को जन्मदिन मुबारक हो, दीजिए बधाई|
  Sentence 2: के हुए सचिन तेंदुलकर जन्मदिन मुबारक हो, दीजिए बधाई|

In Section 2 we present the detailed architecture of our system, including the feature set and the machine-learning technique. Section 3 gives detailed statistics of the training and test data used by our system. The results on the test data are reported in Section 4. Section 5 presents the conclusion and future work.

2. SYSTEM ARCHITECTURE
In this section we elaborate on our proposed architecture. As shown in Figure 1, our system NLP-NITMZ is based on three language-independent features: Jaccard similarity, Levenshtein ratio, and cosine similarity. To find the Jaccard similarity, we first count the unigrams shared by the two texts; the similarity score is then obtained by dividing this count by the total number of unique unigrams in the two sentences. The second feature is the Levenshtein ratio, which is based on the total number of operations required to change one string into another. The final feature is cosine similarity, where each sentence is represented in a vector space model.

Figure 1. System Architecture of NLP-NITMZ.

For the machine-learning part, we used a Probabilistic Neural Network (PNN) to predict the class. The PNN is derived from the Bayesian network. It is normally used for classification problems and has four layers: the input layer, pattern layer, summation layer, and output layer. An advantage of the PNN is that it is much faster than a feed-forward neural network.

2.1 Features
Our system NLP-NITMZ uses three language-independent features. These are lexical features, which are widely used to measure the similarity between sentences in any language [7] [8].

2.1.1 Jaccard Similarity
The Jaccard similarity coefficient measures the similarity and difference of two sets. For our task, the Jaccard similarity coefficient of two sentences is the ratio of the number of matching unigrams to the total number of unique words in the two sentences. If S1 and S2 are the two sets, the Jaccard similarity is defined by the following equation:

\[ \mathrm{Jaccard\ Similarity}(S_1, S_2) = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|} \]

Table 2 shows an example of Jaccard similarity.

Table 2. Jaccard Similarity
  Sentence 1: भारतीय मुस्लिमों की वजह से नहीं पनप सकता आईएस|
  Sentence 2: भारत में कभी वर्चस्व कायम नहीं कर सकता आईएस|
  Score: 0.2
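For illustration, the following is a minimal Python sketch of this feature (not the authors' code; whitespace tokenization is our assumption, since the paper does not specify how sentences were tokenized):

    def jaccard_similarity(sentence1: str, sentence2: str) -> float:
        """Unigram Jaccard similarity: |S1 ∩ S2| / |S1 ∪ S2|."""
        # Tokenization by whitespace is an assumption, not stated in the paper.
        s1, s2 = set(sentence1.split()), set(sentence2.split())
        if not s1 and not s2:
            return 0.0
        return len(s1 & s2) / len(s1 | s2)

With this tokenization, the pair in Table 2 shares 3 unigrams out of 15 unique ones, which reproduces the reported score of 0.2.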
2.1.2 Levenshtein Ratio
The most common feature for comparing two strings is the Levenshtein distance, the minimum number of edit operations (replacements, insertions, and deletions) required to convert one string into the other [9]. In our task we assign the same weight, 1, to all operations, and we consider the character-level distance between the words of the sentences. The probability that two sentences are paraphrases is high when their edit distance is small. The distance is length-normalized into the Levenshtein ratio:

\[ \mathrm{EditRatio}(a, b) = 1 - \frac{\mathrm{EditDistance}(a, b)}{|a| + |b|} \]

An example of the Levenshtein ratio is given in Table 3.

Table 3. Levenshtein Ratio
  Sentence 1: भारतीय मुस्लिमों की वजह से नहीं पनप सकता आईएस|
  Sentence 2: भारत में कभी वर्चस्व कायम नहीं कर सकता आईएस|
  Score: 0.7712
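As an illustrative sketch (our function names, not the authors' implementation), the length-normalized edit distance above can be computed in Python as follows, with unit cost for every insertion, deletion, and substitution as stated:

    def edit_distance(a: str, b: str) -> int:
        """Character-level Levenshtein distance with unit costs."""
        # previous[j] holds the distance between a[:i-1] and b[:j].
        previous = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            current = [i]
            for j, cb in enumerate(b, start=1):
                cost = 0 if ca == cb else 1
                current.append(min(previous[j] + 1,          # deletion
                                   current[j - 1] + 1,       # insertion
                                   previous[j - 1] + cost))  # substitution
            previous = current
        return previous[-1]

    def edit_ratio(a: str, b: str) -> float:
        """EditRatio(a, b) = 1 - EditDistance(a, b) / (|a| + |b|)."""
        if not a and not b:
            return 1.0
        return 1.0 - edit_distance(a, b) / (len(a) + len(b))

The paper does not state the exact preprocessing behind the 0.7712 score in Table 3, so this sketch need not reproduce it exactly.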
2.1.3 Cosine Similarity
Cosine similarity is another widely used feature for measuring the similarity between two sentences. Each sentence is represented as a word vector, whose components are the frequencies of the words in the sentence. The cosine similarity is then the dot product of the two word vectors divided by the product of their lengths:

\[ \mathrm{Cosine\ Similarity}(A, B) = \frac{A \cdot B}{|A|\,|B|} \]

Table 4 shows the cosine similarity of a Hindi sentence pair.

Table 4. Cosine Similarity
  Sentence 1: भारतीय मुस्लिमों की वजह से नहीं पनप सकता आईएस|
  Sentence 2: भारत में कभी वर्चस्व कायम नहीं कर सकता आईएस|
  Score: 0.523
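A small Python sketch of this term-frequency cosine similarity is given below (whitespace tokenization is again our assumption; the exact preprocessing behind the 0.523 score in Table 4 is not specified in the paper, so the sketch need not reproduce it):

    import math
    from collections import Counter

    def cosine_similarity(sentence1: str, sentence2: str) -> float:
        """Cosine similarity over term-frequency vectors: A·B / (|A||B|)."""
        a = Counter(sentence1.split())   # word -> frequency
        b = Counter(sentence2.split())
        dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
        norm_a = math.sqrt(sum(v * v for v in a.values()))
        norm_b = math.sqrt(sum(v * v for v in b.values()))
        if norm_a == 0 or norm_b == 0:
            return 0.0
        return dot / (norm_a * norm_b)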
2.2 Classification Approach
For this classification task we used a Probabilistic Neural Network (PNN) to classify the sentence pairs. The PNN was first introduced by Specht [10] and is based mainly on Bayes-Parzen classification. The PNN is a supervised learning network. It is implemented using a probabilistic model, like a Bayesian classifier, and it does not require the initial weights of the network to be set. The overall structure of the PNN is illustrated in Figure 2. The PNN [11] has four layers: the input layer, pattern layer, summation layer, and output layer. The PNN has many advantages: it trains much faster than the well-known back-propagation algorithm, it has a simple structure, it generates accurate predicted target probability scores, and it approaches Bayes-optimal classification [12]. At the same time, it is robust to noisy examples. Since the training process is easy and quick, PNNs can be used in real time. For our experiments we used the existing MATLAB toolkit (http://in.mathworks.com/help/nnet/ref/newpnn.html) to classify the test data.

Figure 2. Architecture of the PNN.

A simple probability density function (pdf) for class k is given below, where X is the unknown input, X_k is the k-th sample, σ is the smoothing parameter, and p is the length of the vectors:

\[ f_k(X) = \frac{1}{(2\pi)^{p/2}\,\sigma^{p}}\; e^{-\frac{\lVert X - X_k \rVert^2}{2\sigma^2}} \]

The accuracy of PNN classification depends mainly on this probability density function. The density function for a single population i is obtained by averaging over its training samples, where n_i is the number of samples X_k in population i:

\[ g_i(X) = \frac{1}{(2\pi)^{p/2}\,\sigma^{p}}\, \frac{1}{n_i} \sum_{k=1}^{n_i} e^{-\frac{\lVert X - X_k \rVert^2}{2\sigma^2}} \]

If there are two classes i and j, the classification criterion is decided by the comparison g_i(X) > g_j(X) for all j ≠ i.
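The authors used the MATLAB newpnn toolkit for this step. Purely as an illustrative sketch of the decision rule above (not the submitted implementation), a minimal PNN classifier could be written in Python as follows; the feature values and the smoothing parameter sigma below are made-up examples:

    import numpy as np

    def pnn_classify(X_train: np.ndarray, y_train: np.ndarray,
                     x: np.ndarray, sigma: float = 0.1) -> int:
        """PNN decision rule: return the label i maximizing g_i(x).

        The constant factor 1 / ((2 pi)^(p/2) sigma^p) is the same for
        every class, so it is dropped from the comparison.
        """
        scores = {}
        for label in np.unique(y_train):
            samples = X_train[y_train == label]
            dists = np.sum((samples - x) ** 2, axis=1)       # ||x - x_k||^2
            kernels = np.exp(-dists / (2.0 * sigma ** 2))    # pattern layer
            scores[label] = kernels.mean()                   # summation layer
        return max(scores, key=scores.get)                   # output layer

    # Hypothetical example: features are [Jaccard, EditRatio, Cosine]
    # per sentence pair; the numbers are invented for illustration.
    X_train = np.array([[0.80, 0.90, 0.85],   # paraphrase-like pair
                        [0.10, 0.40, 0.20]])  # non-paraphrase-like pair
    y_train = np.array([1, 0])                # 1 = P, 0 = NP
    print(pnn_classify(X_train, y_train, np.array([0.7, 0.8, 0.9])))  # -> 1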
3. DATASET
The DPIL shared task includes sentence pairs in four languages: Tamil, Malayalam, Hindi, and Punjabi. The shared task is divided into two sub-tasks. In Sub-Task 1, the main aim is to classify each sentence pair as paraphrase (P) or non-paraphrase (NP). Similarly, the aim of Sub-Task 2 is to assign each sentence pair to one of three categories: completely equivalent (E), roughly equivalent (RE), or not equivalent (NE). Table 5 gives detailed statistics of the training and test datasets.

Table 5. Statistics of the training and test datasets

  LANGUAGE    TASK      Count (Train)   Count (Test)
  Hindi       Task 1    2500            900
  Hindi       Task 2    3500            1400
  Malayalam   Task 1    2500            900
  Malayalam   Task 2    3500            1400
  Punjabi     Task 1    1700            500
  Punjabi     Task 2    2200            750
  Tamil       Task 1    2500            900
  Tamil       Task 2    3500            1400

4. RESULT
The accuracy and F1 score of our system for each language and task are given in Table 6, together with the scores of the winning system. Our proposed method achieved very good results for Punjabi and Hindi, whereas it struggled on Malayalam and Tamil. The F1 score is the harmonic mean of precision and recall; the macro F1 score is used for the Task 2 evaluation. Precision, recall, F1 score, and accuracy are defined by the following equations, where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively:

\[ \mathrm{Precision} = \frac{TP}{TP + FP} \]

\[ \mathrm{Recall} = \frac{TP}{TP + FN} \]

\[ F_1 = \frac{2\,TP}{2\,TP + FP + FN} \]

\[ \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]

Table 6. Comparison between the winner's scores and our system's scores

                        Our System              Winner's System
  LANGUAGE    TASK      Accuracy    F1 Score    Accuracy    F1 Score
  Hindi       Task 1    0.91555     0.91        0.92        0.91
  Hindi       Task 2    0.78571     0.76422     0.90142     0.90001
  Malayalam   Task 1    0.83444     0.79        0.83777     0.81
  Malayalam   Task 2    0.62428     0.60677     0.74857     0.74597
  Punjabi     Task 1    0.942       0.94        0.946       0.95
  Punjabi     Task 2    0.812       0.8086      0.92266     0.923
  Tamil       Task 1    0.83333     0.79        0.8333      0.79
  Tamil       Task 2    0.65714     0.63067     0.755       0.73979

5. CONCLUSION AND FUTURE WORK
In this paper we presented our NLP-NITMZ system used for the DPIL shared task. Overall, our approach looks promising, but it needs some improvement. The PNN also has some disadvantages, such as its large memory requirement and its slow execution at classification time. In future work we want to overcome those problems using a better machine-learning approach, and we also want to implement semantic features for all languages to increase performance. We can also identify the stop words of all four languages so that we can omit them from the corpus. Since our approach is based on a language-independent feature set, our methodology can be extended to other languages.

6. ACKNOWLEDGMENTS
The work presented here is under Research Project Grant No. YSS/2015/000988 and is supported by the Department of Science & Technology (DST) and the Science and Engineering Research Board (SERB), Govt. of India. The authors also acknowledge the Department of Computer Science & Engineering of the National Institute of Technology Mizoram, India, for providing infrastructural facilities and support.

7. REFERENCES
[1] Wu, W., Ju, Y., Li, X., and Wang, Y. 2010. Paraphrase detection on SMS messages in automobiles. In Acoustics, Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on, pp. 5326-5329.
[2] Dolan, B., and Brockett, C. 2005. Automatically Constructing a Corpus of Sentential Paraphrases. In Third International Workshop on Paraphrasing (IWP2005).
[3] Sundaram, M. S., Madasamy, A. K., and Padannayil, S. K. 2015. AMRITA_CEN@SemEval-2015: Paraphrase Detection for Twitter using Unsupervised Feature Learning with Recursive Autoencoders. In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pp. 45-50.
[4] Mahalakshmi, S., Anand Kumar, M., and Soman, K. P. 2015. Paraphrase detection for Tamil language using deep learning algorithm. International Journal of Applied Engineering Research, 10 (17), pp. 13929-13934.
[5] Socher, R., Huang, E., Pennin, J., Manning, C. D., and Ng, A. Y. 2011. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In Advances in Neural Information Processing Systems, pp. 801-809.
[6] Anand Kumar, M., Singh, S., Kavirajan, B., and Soman, K. P. 2016. DPIL@FIRE2016: Overview of shared task on Detecting Paraphrases in Indian Languages. In Working notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings, CEUR-WS.org.
[7] Pakray, P., and Sojka, P. 2014. An Architecture for Scientific Document Retrieval Using Textual and Math Entailment Modules. In RASLAN 2014: Recent Advances in Slavonic Natural Language Processing, Karlova Studánka, Czech Republic, December 5-7, 2014.
[8] Lynum, A., Pakray, P., Gamback, B., and Jimenez, S. 2014. NTNU: Measuring semantic similarity with sublexical feature representations and soft cardinality. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pp. 448-453, Dublin, Ireland, August 23-24, 2014.
[9] Sarkar, S., Das, D., Pakray, P., and Gelbukh, A. 2016. JUNITMZ at SemEval-2016 Task 1: Identifying Semantic Similarity Using Levenshtein Ratio. In Proceedings of SemEval-2016, pp. 702-705, San Diego, California.
[10] Specht, D. F. 1990. Probabilistic neural networks. Neural Networks, 3, pp. 109-118.
[11] Specht, D. F. 1990. Probabilistic Neural Networks and the Polynomial Adaline as Complementary Techniques for Classification. IEEE Transactions on Neural Networks, vol. 1, no. 1, March 1990.
[12] Hajmeer, M., and Basheer, I. 2002. A probabilistic neural network approach for modeling and classification of bacterial growth/no-growth data. Journal of Microbiological Methods, vol. 51, no. 2, pp. 217-226.