JU_NLP@DPIL-FIRE2016: Paraphrase Detection in Indian Languages - A Machine Learning Approach

Tanik Saikh, Sudip Kumar Naskar, Sivaji Bandyopadhyay
Computer Science and Engineering Department, Jadavpur University, Kolkata, India
tanik4u@gmail.com, {sudip.naskar, sbandyopadhyay}@cse.jdvu.ac.in

ABSTRACT
This paper presents a report on our participation in the shared task on "Detecting Paraphrases in Indian Languages (DPIL)" organized at the "Forum for Information Retrieval Evaluation (FIRE)" 2016, in both tasks (Task1 and Task2) defined in the shared task and in all four Indian languages (Tamil, Malayalam, Hindi and Punjabi). We used different similarity measures and machine translation evaluation metrics as features in a machine learning framework to take the paraphrase decision between a pair of text snippets. We obtained accuracies of 97.08%, 94.2%, 97.32% and 98.29% in Task1 and 86.68%, 77.37%, 84% and 98.77% in Task2 for Tamil, Malayalam, Hindi and Punjabi respectively on the corresponding training sets using 10-fold cross validation.

Keywords
Paraphrase detection; Similarity Measures; Machine Learning.

1. INTRODUCTION
A paraphrase is a restatement of a text expressed using other words. Alternatively, two texts, say text1 and text2, can be defined as paraphrases if they textually entail each other bi-directionally, i.e. if text1 entails text2 and text2 entails text1. Textual entailment and paraphrase relations between a pair of text snippets are therefore highly correlated. There are two different tasks related to paraphrases: paraphrase identification and paraphrase generation. In this shared task [19] the focus is on identifying sentence level paraphrases in four Indian languages, namely Tamil, Malayalam, Hindi and Punjabi. Two subtasks, Task1 and Task2, were defined. Given a pair of sentences from the newspaper domain, Task1 is to classify them as paraphrases (P) or not paraphrases (NP), and Task2 is to identify whether they are completely equivalent (E), not equivalent (NE) or roughly equivalent (RE). Task2 is thus similar to Task1 except that it is a ternary classification problem. Identifying paraphrases between a pair of sentences in Indian languages is not an easy task due to the scarcity of tools and resources for these languages. Semantic knowledge bases like WordNet [13] and BabelNet [10] are very useful resources for knowledge based approaches to detecting paraphrases and textual entailment, but Indian languages suffer from very poor coverage in this respect as well. Therefore, for this study we adopted lexical level analysis of the sentences, which is considered a shallow level of processing in any Natural Language Processing task.

For both tasks (i.e., Task1 and Task2) we made use of various kinds of lexical similarity measures, namely cosine similarity, unigram matching with respect to sentence1, unigram matching with respect to sentence2, Jaccard similarity [12], Dice coefficient [11], overlap and harmonic mean, together with the machine translation (MT) evaluation metrics BLEU and METEOR. The scores of these measures were used as feature values to build the models, and the models were used to train machine learning classifiers; Naïve Bayes, SVM and SMO were employed for this purpose. The work reported in [14] made use of the same kinds of features for taking textual entailment decisions between pairs of texts on the datasets released in the RTE-1, RTE-2 and RTE-3 shared tasks on recognizing textual entailment. In the present work we demonstrate that the same features and techniques are also effective for taking paraphrase decisions between two sentences.

2. DATA
The shared task on detecting paraphrases in Indian languages (DPIL) defined two subtasks, namely Task1 and Task2. Training datasets were provided for each of the subtasks in four Indian languages: Hindi, Punjabi, Tamil and Malayalam. The statistics of the training and test datasets are shown in Table 1 and Table 2 respectively.

Table 1. Statistics of the training sets (number of sentence pairs).

                     Task1                    Task2
Language     Total    P      NP       Total    E      NE     RE
Tamil        2500     1000   1500     3500     1000   1500   1000
Malayalam    2500     1000   1500     3500     1000   1500   1000
Hindi        2500     1000   1500     3500     1000   1500   1000
Punjabi      1700     700    1000     2200     700    1000   500

Table 2. Statistics of the test sets (number of sentence pairs).

Language     Task1    Task2
Tamil        900      1400
Malayalam    900      1400
Hindi        900      1400
Punjabi      500      750
3. Features
Features play a pivotal role in machine learning based frameworks; analysis of the features that take part in predicting the target class is therefore crucial. The features used in this study can be broadly divided into two categories: similarity based features and MT evaluation metric based features.

3.1 Similarity Based Features
For the present study, we considered different similarity based features: vector based (cosine similarity, Dice similarity), lexical based (unigram matching with respect to sentence1 and sentence2) and set based (Jaccard, overlap and harmonic), which are discussed below.

3.1.1 Cosine Similarity
Cosine similarity measures the similarity between two vectors of an inner product space as the cosine of the angle between them. The smaller the angle between the two vectors, the more similar they are.

3.1.2 Unigram Matching
Unigram (i.e., word) matches between the two sentences are taken into consideration. Two variations of unigram matching are considered: unigram matching with respect to sentence1 and with respect to sentence2. These are calculated as the number of unigrams matching between the two sentences normalized by the number of unigrams in sentence1 and in sentence2 respectively.

3.1.3 Jaccard
Jaccard similarity is a set based measure which quantifies the number of common elements between two sets. It is defined as

    J(A, B) = |A ∩ B| / |A ∪ B|

where A and B are two sets of elements.

3.1.4 Dice
Dice is a vector based similarity measure whose value lies between 0 and 1. It is calculated by the following equation

    Dice(A, B) = 2 · |A ∩ B| / (|A| + |B|)

where A and B are two sets.

3.1.5 Overlap
Overlap is a set based text similarity metric where a text is represented as a set and the set elements are words. It is similar to Dice, with the minor difference that it assumes a full match between two strings if one is a subset of the other. Its value lies in the range 0 to 1 and it is measured by the following equation.

    Overlap(A, B) = |A ∩ B| / min(|A|, |B|)

where A and B are two sets.

3.1.6 Harmonic
The harmonic measure is calculated by employing the following equation

    Harmonic(A, B) = |A ∩ B| · (|A| + |B|) / (2 · |A| · |B|)

where A and B are two sets.
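A minimal Python sketch of the similarity based features above is given below. Whitespace tokenisation is an assumption (the paper does not specify the preprocessing), and the Dice and harmonic formulas follow the reconstructions given above, so details may differ from the authors' implementation.

```python
# Sketch of the Section 3.1 similarity features for one sentence pair.
import math
from collections import Counter


def cosine(s1, s2):
    """Cosine of the angle between the word-count vectors of two sentences."""
    v1, v2 = Counter(s1.split()), Counter(s2.split())
    dot = sum(v1[w] * v2[w] for w in v1)
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0


def similarity_features(s1, s2):
    """Return the seven lexical features (LF) used in Model1."""
    a, b = set(s1.split()), set(s2.split())
    common = len(a & b)
    return {
        "cosine": cosine(s1, s2),
        "unigram_wrt_s1": common / len(a) if a else 0.0,
        "unigram_wrt_s2": common / len(b) if b else 0.0,
        "jaccard": common / len(a | b) if (a | b) else 0.0,
        "dice": 2 * common / (len(a) + len(b)) if (a or b) else 0.0,
        "overlap": common / min(len(a), len(b)) if (a and b) else 0.0,
        "harmonic": common * (len(a) + len(b)) / (2 * len(a) * len(b)) if (a and b) else 0.0,
    }


if __name__ == "__main__":
    print(similarity_features("the cat sat on the mat", "the cat is on the mat"))
```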
3.2 Machine Translation Evaluation Metrics
Machine translation (MT) evaluation metrics are generally used to measure the closeness between MT translation hypotheses and reference translations: the closer a translation hypothesis is to the reference translation, the better the translation quality. Several MT evaluation metrics are available, such as Word Error Rate (WER) [7], Position independent word Error Rate (PER) [8], BLEU [1], METEOR [2][3], Translation Edit Rate (TER) [4], NIST [6] and General Text Matcher (GTM) [5]. Among these, BLEU and METEOR are perhaps the most popular and widely used. In the present work, we used these two MT evaluation metrics as features to predict the paraphrase relation between a pair of sentences.

3.2.1 BLEU
BiLingual Evaluation Understudy (BLEU) is an algorithm for evaluating the quality of machine translated text. It compares the n-grams of the translation hypothesis with the n-grams of the reference translation(s) and counts the number of n-gram matches; it is essentially a modified n-gram precision measure. To avoid bias towards shorter translation candidates, BLEU uses a brevity penalty that penalizes candidate translations whose length differs significantly from that of the reference translation.

3.2.2 METEOR
Metric for Evaluation of Translation with Explicit ORdering (METEOR) calculates n-gram overlaps between a translation hypothesis and the reference translation(s). If multiple reference translations are available for an MT output, the translation is scored against each reference independently and the best scoring pair is used for evaluation. Given a pair of sentences to be compared, METEOR creates a word alignment between the two sentences. An alignment is a mapping between the sentences such that every word in each sentence maps to at most one word in the other sentence. This alignment is produced incrementally by a sequence of word mapping modules which consider exact matching, stem matching and synonym matching. Based on the number of mapped unigrams found between the two strings (m), the total number of unigrams in the translation (t) and the total number of unigrams in the reference (r), unigram precision is calculated as P = m/t and unigram recall as R = m/r. METEOR then computes an F score as a parameterized harmonic mean of P and R:

    Fmean = (P · R) / (α · P + (1 − α) · R)

where α is a constant. To address word ordering, METEOR calculates a reordering penalty based on how many chunks in the translation hypothesis need to be moved around to reproduce the reference text. Finally, the METEOR score for the alignment between the two sentences is calculated as

    score = (1 − Penalty) · Fmean
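For illustration, a compact sentence-level BLEU sketch follows, assuming uniform weights over 1- to 4-grams, a single reference and no smoothing. The paper does not state which BLEU and METEOR implementations were used, so this is a reference sketch rather than the authors' scorer.

```python
# Sentence-level BLEU: clipped n-gram precision with a brevity penalty.
import math
from collections import Counter


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def sentence_bleu(reference, hypothesis, max_n=4):
    ref, hyp = reference.split(), hypothesis.split()
    precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hyp, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clipped (modified) n-gram precision.
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = max(sum(hyp_counts.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: penalise hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * geo_mean


if __name__ == "__main__":
    print(sentence_bleu("the cat sat on the mat", "the cat is on the mat"))
```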
4. System Description
We extract the sentence pairs (say sentence1 and sentence2) from the XML format training dataset for each language. An example training sentence pair from the Hindi dataset is shown below.

    Sentence1: भारतीय मुस्लिमों की वजह से नहीं पनप सकता आईएस।
    Sentence2: भारत में कभी वर्चस्व कायम नहीं कर सकता आईएस।
    Tag: P

(Roughly: "IS cannot flourish because of Indian Muslims." / "IS can never establish dominance in India.")

In the above example the relation between sentence1 and sentence2 is paraphrase, which is tagged as "P" in the training dataset. We calculate the different similarity scores between the sentence pairs extracted from the training datasets released in the shared task. These scores are used as feature values to build the models for the four languages. Four models were prepared for every language by combining different features. Model1 considers only the lexical features (LF), i.e., cosine similarity, unigram matching with respect to sentence1, unigram matching with respect to sentence2, Jaccard, Dice, overlap and harmonic. Model2 considers LF and BLEU. LF and METEOR were considered for Model3. Model4 considers all the features, i.e., LF, BLEU and METEOR. These models were used to train three machine learning classifiers, namely Naïve Bayes [17], support vector machine (SVM) [16] and the sequential minimal optimization algorithm (SMO) [9]. The freely available Weka toolkit [15] (http://www.cs.waikato.ac.nz/ml/weka/) was used for this purpose; it provides Java implementations of various machine learning algorithms, including the classifiers used in our experiments. We carried out experiments based on 10-fold cross validation on the training sets, so that the system can predict the paraphrase class of an unseen sentence pair based on what it has learned. Finally, the classifier-model combinations producing the best results were applied to the corresponding test sets.
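A rough Python analog of this training setup is sketched below. The original experiments used Weka's Java implementations of Naïve Bayes, SVM and SMO; the scikit-learn classifiers and the random placeholder feature matrix here are illustrative stand-ins only.

```python
# Rough scikit-learn analog of the Weka setup described above.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score


def evaluate_models(X, y):
    """Report 10-fold cross-validation accuracy for the three classifier families."""
    classifiers = {
        "Naive Bayes": GaussianNB(),
        "SVM": SVC(kernel="linear"),
        # SVC is trained with an SMO-style solver (libsvm), loosely mirroring Weka's SMO.
        "SMO-style SVC": SVC(kernel="rbf"),
    }
    for name, clf in classifiers.items():
        scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
        print(f"{name}: mean accuracy = {scores.mean():.4f}")


if __name__ == "__main__":
    # In the real pipeline each row would hold the Model4 (LF + BLEU + METEOR)
    # feature values for one sentence pair, and y its P/NP (Task1) or E/RE/NE (Task2) tag.
    rng = np.random.default_rng(0)
    X = rng.random((200, 9))            # 7 lexical features + BLEU + METEOR
    y = rng.integers(0, 2, size=200)    # binary Task1-style labels
    evaluate_models(X, y)
```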
5. Results and Discussion
We took part in both tasks (Task1 and Task2) defined in the shared task, in all four languages (Hindi, Punjabi, Tamil and Malayalam). The accuracies obtained on the training sets (using 10-fold cross validation) for Task1 and Task2 with the different classifiers and models are shown in Table 3 and Table 4 respectively.

Table 3. Accuracies (%) on the training sets in Task1.

Language     Classifier     LF      LF+B    LF+M    LF+B+M
Hindi        Naïve Bayes    90.04   92.52   90.16   74.97
             SVM            90.36   92.52   90.36   82.85
             SMO            90.84   97.32   90.84   84
Tamil        Naïve Bayes    92.44   94.16   92.2    93.92
             SVM            93.24   94.44   93.24   94.24
             SMO            76.71   97.04   93.36   97.08
Malayalam    Naïve Bayes    83.48   88.36   83.56   87.88
             SVM            84.24   86.88   83.96   86.56
             SMO            84.68   94.2    84.72   94.2
Punjabi      Naïve Bayes    97.64   97.64   97.70   97.58
             SVM            97.76   97.82   97.70   97.7
             SMO            98.17   98.23   98.23   98.29

Table 4. Accuracies (%) on the training sets in Task2.

Language     Classifier     LF      LF+B    LF+M    LF+B+M
Hindi        Naïve Bayes    73.14   75.37   73.45   74.97
             SVM            82.6    82.82   82.71   82.85
             SMO            82.88   83.17   83.37   84
Tamil        Naïve Bayes    71.6    78.82   71.25   78.54
             SVM            76.57   77.77   76.54   77.77
             SMO            76.71   86.4    77.11   86.68
Malayalam    Naïve Bayes    65.02   72.65   65.54   72.22
             SVM            67.31   68.51   67.25   68.48
             SMO            68      77.25   68      77.37
Punjabi      Naïve Bayes    93.81   97.5    94.45   97.31
             SVM            96.22   98.04   97.09   98.04
             SMO            97.18   98.77   97.72   98.77

In Task1 the best accuracy for Hindi, 97.32%, was obtained by SMO with the LF+B model. Tamil achieved its highest accuracy of 97.08% with the LF+B+M model and SMO. Malayalam reached its highest accuracy of 94.2% with both the LF+B and LF+B+M models in SMO, and for Punjabi we obtained the highest accuracy of 98.29% with the LF+B+M model and SMO. Thus SMO provided the best results for all four languages on the training datasets.

In Task2, the SMO classifier with the LF+B+M model produced the best accuracies for all four languages: Hindi (84%), Tamil (86.68%), Malayalam (77.37%) and Punjabi (98.77%). For Punjabi the LF+B model (with SMO) also achieved the same highest accuracy as the LF+B+M model.

The features used in these experiments are independent of each other; Naïve Bayes makes the simplifying assumption that features are independent given the class [17]. SVMs are comparatively new machine learning approaches for solving two-class pattern recognition problems. In the field of NLP, SVMs have been employed for many tasks, including text classification, and are reported to obtain high accuracy without overfitting even with a large number of features [18]. Since SVMs perform better on binary classification problems, our results in Task1 (a binary classification problem) also outperform those obtained in Task2 (a ternary classification problem). SMO is essentially another way of training SVMs; it implements John Platt's sequential minimal optimization algorithm for training a support vector classifier, using heuristics to partition the training problem into smaller subproblems.

The official results of our submissions on the test sets, released by the task organizers, for the two subtasks in the four languages are reported in Table 5.

Table 5. Results on the test sets in Task1 and Task2.

Language     Task     Count    Accuracy    F1 / Macro F1 Measure
Hindi        Task1    900      0.8222      0.74
Hindi        Task2    1400     0.68571     0.6841
Malayalam    Task1    900      0.59        0.16
Malayalam    Task2    1400     0.42214     0.3078
Punjabi      Task1    500      0.942       0.94
Punjabi      Task2    750      0.88666     0.88664
Tamil        Task1    900      0.57555     0.09
Tamil        Task2    1400     0.55071     0.4319

6. Conclusions
This paper presented our submissions to the DPIL shared task organized at FIRE 2016. We took part in both subtasks in all four languages. Different lexical level similarity measures and two machine translation evaluation metrics, namely BLEU and METEOR, were employed as features capturing the similarity between pairs of sentences. Four different models were built by combining these features, and the models were used to train three classifiers, namely Naïve Bayes, SVM and SMO. We carried out experiments using 10-fold cross validation on the training sets. The SMO classifier produced the best results when all the features were combined.

7. Acknowledgements
The research work has received funding from the project "Development of Tree Bank in Indian Languages" funded by the Department of Electronics and Information Technology (DeitY), Ministry of Communication and Information Technology, Government of India.

8. REFERENCES
[1] Papineni, K., Roukos, S., Ward, T., and Zhu, W.J. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, PA, pp. 311–318.
[2] Banerjee, S. and Lavie, A. 2005. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Ann Arbor, Michigan, pp. 65–72.
[3] Lavie, A. and Agarwal, A. 2007. METEOR: An Automatic Metric for MT Evaluation with High Levels of Correlation with Human Judgments. In Proceedings of the Second ACL Workshop on Statistical Machine Translation, Prague, Czech Republic, pp. 228–231.
[4] Snover, M., Dorr, B., Schwartz, R., Micciulla, L., and Makhoul, J. 2006. A Study of Translation Edit Rate with Targeted Human Annotation. In Proceedings of the Association for Machine Translation in the Americas, Cambridge, Massachusetts, USA, pp. 223–231.
[5] Turian, J.P., Shen, L., and Melamed, I.D. 2003. Evaluation of Machine Translation and Its Evaluation. In Proceedings of MT Summit, New Orleans, Louisiana, pp. 386–393.
[6] Doddington, G. 2002. Automatic Evaluation of Machine Translation Quality Using N-gram Co-occurrence Statistics. In Proceedings of the Second International Conference on Human Language Technology Research, Morgan Kaufmann Publishers Inc., pp. 138–145.
[7] Vidal, E. 1997. Finite-State Speech-to-Speech Translation. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, Munich, Germany, pp. 111–114.
[8] Tillmann, C., Vogel, S., Ney, H., Sawaf, H., and Zubiaga, A. 1997. Accelerated DP Based Search for Statistical Translation. In Proceedings of the 5th European Conference on Speech Communication and Technology, Rhodes, Greece, pp. 2667–2670.
[9] Platt, J.C. 1999. Fast Training of Support Vector Machines Using Sequential Minimal Optimization. In Bernhard Schölkopf, Christopher J.C. Burges and Alexander J. Smola (eds.), Advances in Kernel Methods: Support Vector Learning. The MIT Press, Cambridge, MA, pp. 185–208.
[10] Navigli, R. and Ponzetto, S.P. 2012. BabelNet: The Automatic Construction, Evaluation and Application of a Wide-Coverage Multilingual Semantic Network. Artificial Intelligence 193, pp. 217–250.
[11] Dice, L. 1945. Measures of the Amount of Ecologic Association Between Species. Ecology, vol. 26, no. 3, pp. 297–302.
[12] Jaccard, P. 1901. Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bulletin de la Société Vaudoise des Sciences Naturelles 37, pp. 547–579.
[13] Fellbaum, C. (ed.). 1998. WordNet: An Electronic Lexical Database. Cambridge, MA: MIT Press.
[14] Saikh, T., Naskar, S.K., Giri, C., and Bandyopadhyay, S. 2015. Textual Entailment Using Different Similarity Metrics. In Computational Linguistics and Intelligent Text Processing, pp. 491–501.
[15] Witten, I.H. and Frank, E. 2005. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, San Francisco, 2nd edition.
[16] Vapnik, V.N. 1995. The Nature of Statistical Learning Theory. Springer.
[17] Rish, I. 2001. An Empirical Study of the Naive Bayes Classifier. In IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence, vol. 22, pp. 41–46.
[18] Ekbal, A. and Bandyopadhyay, S. 2008. Bengali Named Entity Recognition Using Support Vector Machine. In International Joint Conference on Natural Language Processing (IJCNLP), pp. 51–58.
[19] Anand Kumar, M., Singh, S., Kavirajan, B., and Soman, K.P. 2016. DPIL@FIRE2016: Overview of the Shared Task on Detecting Paraphrases in Indian Languages. Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, CEUR Workshop Proceedings, CEUR-WS.org.