ASE@DPIL-FIRE2016: Hindi Paraphrase Detection using Natural Language Processing Techniques & Semantic Similarity Computations

Vani K
Department of Computer Science & Engineering
Amrita School of Engineering
Amrita Vishwa Vidyapeetham, Amrita University
Bangalore, India
k_vani@blr.amrita.edu

Deepa Gupta
Department of Mathematics
Amrita School of Engineering
Amrita Vishwa Vidyapeetham, Amrita University
Bangalore, India
g_deepa@blr.amrita.edu

ABSTRACT
The paper reports the approaches utilized and the results achieved by our system in the FIRE-2016 shared task on paraphrase identification in Indian languages (DPIL). Since Indian languages have a complex inherent nature, paraphrase identification in these languages is a challenging task. In the DPIL task, the challenge is to detect whether a given sentence pair is paraphrased or not. In the proposed work, natural language processing with semantic concept extraction is explored for paraphrase detection in Hindi. Stop word removal, stemming and part-of-speech tagging are employed. Similarity between the sentence pairs is then computed by extracting semantic concepts using the WordNet lexical database. The proposed approach is evaluated over the given training sets using different machine learning classifiers, and the testing phase predicts the classes using the proposed features. The results are promising, which shows the potency of natural language processing techniques and semantic concept extraction in detecting paraphrases.

CCS Concepts
• Computing methodologies → Natural language processing
• Information systems → Document analysis and feature selection; Near-duplicate and paraphrase detection

Keywords
Paraphrase Detection; Semantic Concepts; POS Tags; Weka

1. INTRODUCTION
Paraphrasing is the process of restating the meaning of a text using other words, or adopting the idea and completely rewriting the text. Paraphrase detection is widely explored in the English language. The Microsoft Research Paraphrase corpus (MSRP) is the most commonly used benchmark database in English paraphrase detection [1]. Vector space models (VSM), Latent Semantic Analysis (LSA), graph structures and semantic similarity based paraphrase detection have been explored for English [2-7]. Even in English, detection of paraphrasing becomes more complex when the idea is adopted and rewritten; effective techniques incorporating syntax-semantic features, deeper NLP techniques and soft computing approaches may be required to tackle such scenarios [8]. When it comes to Indian languages, the task becomes more intricate. A paraphrase detection approach using deep learning for the Tamil language was proposed in [9]. Paraphrase detection in Twitter data and in SMS messages was explored in [10] and [11] respectively. A method for paraphrasing Hindi sentences by synonym and antonym replacements and substitutions was proposed in [12].

In FIRE 2016, a shared task on Detecting Paraphrases in Indian Languages (DPIL) [13] was organized. The tasks are defined for four Indian languages: Tamil, Malayalam, Hindi and Punjabi. For each language, two subtasks are defined as follows:

• Task-1: Given a pair of sentences from the newspaper domain in the specific language, classify them as paraphrases (P) or not paraphrases (NP).
• Task-2: Given two sentences from the newspaper domain in the specific language, identify whether they are completely equivalent (E), roughly equivalent (RE) or not equivalent (NE). It is defined with three classes, viz., paraphrase (P), semi-paraphrase (SP) and not paraphrase (NP) respectively.

Task-1 is a binary classification problem, while Task-2 is a multi-class problem. The proposed work is carried out for the identification of paraphrases in the Hindi language. An approach that utilizes natural language processing (NLP) techniques with semantic similarity computations is adopted. The main focus is given to Task-1, and the same model is applied for evaluating Task-2.

The paper is organized as follows. Section 2 describes the proposed approach in detail. In Section 3, data statistics and evaluation measures are discussed. Section 4 discusses and analyzes the results obtained. Section 5 concludes the paper with some insights into future work.

2. PROPOSED APPROACH
Fig. 1 depicts the general work-flow of the proposed approach. The three main modules and their sub-modules are described in the following subsections.

Figure 1. General Work-Flow of Proposed Approach

Table 1. Hindi Stop Word List
क, न, का, हो, एस, थ, कुछ, थं, क, क, था, कक, जो, कर, म, गया, करन, ककया, लिय, अपन, होता, ्वारा, हुआ, तक, साथ, करना, वाि, बाद, लिए, आप, व, करत, बहुत, कहा, वगग, कई, करं, होत, अपन, उनक, न, अभ, जस, सभ, करता, उनकी, तरह, उस, आदद, कुि, एक, मं, की, ह, यह, और, स, हं, को, पर, बन, नह ं, तो, ह, या, एवं, ददया, हो, इसका, इस, सकत, ककस, य, इसक, सबस, इसमं, थ, दो, होन, वह, यदद, हुई, जा, ना, इस, कहत, जब, होत, कोई, हुए, रहा, इसकी, सकता, रह, उनका, इस, रखं, अपना, प, उसक

Table 2. Hindi Suffix List used for Stemming Process
"ोो", "ो", "ो", "ोु", "ो ", "िो", "ोा", "कर", "ोाओ", "िोए", "ोाई", "ोाए", "न", "न ", "ना", "त", "ो ों", "त ", "ता", "ोाो", "ोाों", "ोोों", "ोों", "ोाकर", "ोाइए", "ोां", "ोाया", "ोग ", "ोगा", "ोोग ", "ोोग", "ोान", "ोाना", "ोात", "ोात ", "ोाता", "त ं", "ोाओं", "ोाएं", "ोुओं", "ोुएं", "ोुआं", "ोाएग ", "ोाएगा", "ोाओग ", "ोाओग", "एंग ", "ोोंग ", "एंग", "ोोंग", "ोोंग ", "ोोंगा", "ोात ं", "नाओं", "नाएं", "ताओं", "ताएं", "िोया", "िोयं", "िोयां", "ोाएंग ", "ोाएंग", "ोाऊंग ", "ोाऊंगा", "ोाइया", "ोाइयं", "ोाइयां"

2.1. Feature Extraction
Initially, feature extraction is applied to extract traits from the given sentence pairs. These features are given as the input to the classifier. For extracting the features, sentences are processed using various pre-processing procedures with the incorporation of NLP techniques. The processed sentences are then passed on for pair-wise semantic similarity computations and comparisons.

2.1.1. Pre-processing
Initially, the input sentence pairs are tokenized. Then part-of-speech tagging is carried out.

POS Tagging & POS-based Pruning: The word tokens are tagged with their respective classes using the NLTK POS tagger (http://www.nltk.org/) [14]. The word classes include noun, verb, adjective, adverb, preposition, conjunction, etc. This is followed by POS-based pruning. In this pruning process, only the tags that can possibly convey some meaning or semantics are retained, while the others are pruned out. The retained tags are noun, verb, adjective and adverb. The cardinality tag, which covers numbers and indicates years, cost, etc., is also retained. The remaining tags are pruned out and not considered further. This is followed by stop word removal and stemming.

Stop Word Removal: Stop words are frequent and irrelevant words appearing within a document. The Hindi stop word list used in the reported work is given in Table 1. Prior to stemming, punctuation removal is done. Since punctuation plays an important role in the structural composition of documents, its removal can alter the results of NLP applications; the effect is stronger for NLP techniques that operate at the document level, such as parsing, chunking, semantic role labeling (SRL), etc. Considering the dependence of NLP techniques on these structures, punctuation removal is applied after POS tagging in our approach.

Stemming: Stemming is the process of removing affixes from a given word. Stemming is done using the suffix list given in Table 2. An example illustration of all these processing steps is given using a sample sentence S.

S: 43 के हुए सचिन तेंदुलकर, जन्मदिन मुबारक हो, दीजिए बधाई |

After tokenization: [43, के, हुए, सचिन, तेंदुलकर, जन्मदिन, मुबारक, हो, ",", दीजिए, बधाई, |]
After POS tagging: [43(CD), के(IN), हुए(IN), सचिन(NNP), तेंदुलकर(NNP), जन्मदिन(NNP), मुबारक(NNP), हो(IN), ,(,), दीजिए(VP), बधाई(NN), |(|)]
After POS-based pruning: [43(CD), सचिन(NNP), तेंदुलकर(NNP), जन्मदिन(NNP), मुबारक(NNP), दीजिए(VP), बधाई(NN)]
After stop word removal: [43(CD), सचिन(NNP), तेंदुलकर(NNP), जन्मदिन(NNP), मुबारक(NNP), दीजिए(VP), बधाई(NN)]
After punctuation removal: [43(CD), सचिन(NNP), तेंदुलकर(NNP), जन्मदिन(NNP), मुबारक(NNP), दीजिए(VP), बधाई(NN)]
After stemming: [43(CD), सचिन(NNP), तेंदुलकर(NNP), जन्मदिन(NNP), मुबारक(NNP), दीज(VP), बधाई(NN)]

2.1.3. Semantic Similarity Computations
Once the basic pre-processing and NLP-based processing is done, the semantic similarity between the processed sentence pairs is computed. The metric used extracts semantic concepts in the form of synonyms of a given word. Instead of considering only surface-level word matching, synonym-level matching is also done. This facilitates paraphrase detection, since in many cases paraphrasing is done by replacing words with their synonyms. The synonyms are extracted using the WordNet lexical database (http://wordnet.princeton.edu/) [15-18]. The semantic similarity is computed in the following steps.

1. For each processed sentence pair (S1, S2), repeat steps 2 to 8.
2. Initialize Count = 0.
3. For each word w in S1, do steps 4 to 7.
4. If w is in S2, then Count = Count + 1 and go to step 3; else go to step 5.
5. Extract the synonyms of the word w from WordNet.
6. For each synonym syn of word w, do step 7.
7. If syn is in S2, then Count = Count + 1 and go to step 3 (next word); else continue with the next synonym (step 6).
8. Compute the similarity, sim, using Equation (1).

sim = Count / max(|S1|, |S2|)    (1)

Equation (1) computes the similarity between the processed sentences (S1, S2) as the ratio of the Count value to the maximum of the lengths of the given sentence pair. For illustration, consider two sentences S1 and S2.

S1: 43 के हुए सचिन तेंदुलकर, जन्मदिन मुबारक हो, दीजिए बधाई |
S2: क्रिकेट के भगवान सचिन को जन्मदिवस मुबारक हो, दीजिए बधाई |

The sentences after tokenization, POS tagging, pruning, stop word removal and stemming are given below. The procedure is the same as explained in subsection 2.1.1.

Processed S1: [43(CD), सचिन(NNP), तेंदुलकर(NNP), जन्मदिन(NNP), मुबारक(NNP), दीज(VP), बधाई(NN)]
Processed S2: [क्रिकेट(NN), भगवान(NNP), सचिन(NNP), जन्मदिवस(NNP), मुबारक(NNP), दीज(VP), बधाई(NN)]

Each word in S1 is checked for its presence in S2. If the word is not present, synonym checking is done. In the given example, four exact matches are found, viz., सचिन, मुबारक, दीज and बधाई. One word is identified through a synonym: जन्मदिवस is a synonym of जन्मदिन. Thus the count value is Count = 5, and the similarity computed using Equation (1) is sim = 5 / max(7, 7) = 0.7142.

The similarity output obtained is considered as the feature for a sentence pair, and it forms the input to the machine learning classifier.
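The similarity procedure of steps 1-8 and Equation (1) can be sketched in Python as follows. This is a minimal sketch: the synonym lookup here is a stand-in dictionary, whereas the reported system queries the WordNet lexical database, and the token lists are the processed forms of the example pair with POS tags dropped for brevity.

```python
def semantic_similarity(s1, s2, synonyms):
    """Steps 1-8 of Section 2.1.3: count exact word matches, fall back to
    synonym matches, and normalize by the longer sentence (Equation 1)."""
    s2_words = set(s2)
    count = 0
    for w in s1:
        if w in s2_words:                                  # step 4: exact match
            count += 1
        elif any(syn in s2_words for syn in synonyms.get(w, ())):
            count += 1                                     # steps 5-7: synonym match
    return count / max(len(s1), len(s2))                   # step 8: Equation (1)

# Processed forms of the example sentence pair (POS tags omitted).
S1 = ["43", "सचिन", "तेंदुलकर", "जन्मदिन", "मुबारक", "दीज", "बधाई"]
S2 = ["क्रिकेट", "भगवान", "सचिन", "जन्मदिवस", "मुबारक", "दीज", "बधाई"]
synonyms = {"जन्मदिन": {"जन्मदिवस"}}                       # stand-in for a WordNet lookup

sim = semantic_similarity(S1, S2, synonyms)                # 4 exact + 1 synonym = 5
print(round(sim, 4))                                       # 5 / max(7, 7) -> 0.7143
```

As in the worked example, four exact matches plus one synonym match give Count = 5 and sim = 5/7 ≈ 0.714.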
2.2. Machine Learning Classifiers
Machine learning (ML) based classifiers are used for the paraphrase identification task. The similarity score, which is the feature input, is fed to the classifier, and the classification task is carried out. In the proposed work, different classifiers are tested, and the best among them is selected based on accuracy.

2.3. Decision Making
Using the training data, the training phase is implemented first. This is followed by testing, where decision making is done. In Task-1, the decision is whether a given sentence pair is paraphrased or not. In Task-2, multi-class classification is done to decide whether the sentence pair is paraphrased, semi-paraphrased or non-paraphrased.

Section 3 describes the data statistics used in evaluation (training and test data) and the evaluation measures.

3. DATA STATISTICS & EVALUATION MEASURES
In DPIL, Task-1 provides 2500 sentence pairs for training. The sentences are labeled as either paraphrased (P) or non-paraphrased (NP). The set includes 1000 instances of the P class and 1500 instances of the NP class. Task-2 provides 3500 sentence pairs, out of which 1000 are paraphrased (P), 1000 semi-paraphrased (SP) and 1500 non-paraphrased (NP). Our main focus was Task-1, while we implemented the same model for Task-2 as well. The feature input is the semantic similarity, sim, computed using Equation (1). Result evaluation is carried out using the classification measures, viz., recall, precision, F-measure and % accuracy.

A confusion matrix is mainly used to evaluate classification problems. The true positives (TP), false negatives (FN), true negatives (TN) and false positives (FP) are obtained from this matrix. The general confusion matrix for a binary-class problem is shown in Equation (2). In the proposed work, TP indicates the number of paraphrased documents correctly classified as paraphrased. FN indicates the number of paraphrased documents misclassified as non-paraphrased. TN is the number of non-paraphrased documents correctly classified as non-paraphrased, and FP indicates the number of non-paraphrased documents misclassified as paraphrased.

           P     NP
    P      TP    FN
    NP     FP    TN        (2)

The total population is computed using Equation (3). Accuracy, measured using Equation (4), is the fraction of correctly classified instances to the total number of instances in the population. Precision, recall and F-measure are computed using Equations (5), (6) and (7) respectively. Recall is defined as the number of correctly classified documents to the actual number of correct documents to be identified with respect to a particular class. Precision is defined as the number of correctly classified documents to the total number of documents identified by the system as belonging to that class. F-measure is the harmonic mean of precision and recall.

Total Population = TP + TN + FP + FN                              (3)
Accuracy = (TP + TN) / Total Population                           (4)
Precision = TP / (TP + FP)                                        (5)
Recall = TP / (TP + FN)                                           (6)
F-measure = 2 * (recall * precision) / (recall + precision)       (7)

The Receiver Operating Characteristic (ROC) curve is also plotted for better understanding. The ROC curve plots sensitivity vs. 1-specificity. Sensitivity is the same as recall, the true positive rate (TPR), while specificity is the true negative rate (TNR), defined as the fraction of documents correctly rejected to the total number of documents to be rejected. 1-specificity is termed fall-out, the false positive rate (FPR), defined as the fraction of documents incorrectly accepted to the total number of documents to be rejected. ROC curves help us understand the discriminative power of the classifier. Using these measures, the performance of the proposed approach is evaluated on Task-1 and Task-2.

4. EXPERIMENTAL RESULTS & ANALYSIS
Initially, the proposed approach is evaluated using different classifiers in Weka, an open source machine learning suite (http://www.cs.waikato.ac.nz/ml/weka/). The accuracy obtained by the tested classifiers using 10-fold cross-validation on Task-1 and Task-2 is reported in Table 3. It is observed that the decision tree exhibits the maximum accuracy in both tasks; thus, the decision tree is considered for further evaluations. The Weka implementation of the C4.5 decision tree, viz., J48, is used in the proposed work.

For better understanding, the ROC curves obtained using J48 on Task-1 and Task-2 are also plotted. Figures 2 and 3 plot the ROC curve for class P in Task-1 and Task-2 respectively. From Figures 2 and 3, it is observed that the area under the ROC curve (AUC) is 0.9 for Task-1 and 0.799 for Task-2. These values show that the J48 classifier discriminates the classes well in Task-1, and reasonably well in Task-2.
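The measures of Equations (3)-(7) can be sketched as follows. The confusion-matrix counts used in the demonstration are back-calculated from the Task-1 training split (1000 P, 1500 NP) and the per-class figures reported for the J48 classifier, so they are an inference rather than counts stated in the paper.

```python
def classification_measures(tp, tn, fp, fn):
    """Compute the measures of Equations (3)-(7) from confusion-matrix counts."""
    total = tp + tn + fp + fn                                  # Equation (3)
    accuracy = (tp + tn) / total                               # Equation (4)
    precision = tp / (tp + fp)                                 # Equation (5)
    recall = tp / (tp + fn)                                    # Equation (6)
    f_measure = 2 * recall * precision / (recall + precision)  # Equation (7)
    return accuracy, precision, recall, f_measure

# Inferred Task-1 training counts for class P: 942 of 1000 paraphrased pairs
# found (recall 0.942), 179 of 1500 non-paraphrased pairs misclassified.
acc, p, r, f = classification_measures(tp=942, tn=1321, fp=179, fn=58)
print(f"{acc:.4f} {p:.3f} {r:.3f} {f:.3f}")  # 0.9052 0.840 0.942 0.888
```

These values reproduce the 90.52% training accuracy and the class-P recall, precision and F-measure reported for Task-1.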
Table 3. Comparison of Accuracy between Multiple Classifiers

Classifier                        %Accuracy Task-1    %Accuracy Task-2
Naïve Bayes (NB)                  90.38%              64.21%
Support Vector Machine (SVM)      90.36%              64.45%
Decision Tree (J48)               90.52%              66.77%
Logistic Regression               90.25%              64.89%
Multilayer Perceptron             90.38%              65.09%

Table 4. Classification Measures using J48 Classifier on Training Set - Task 1

Class              Recall    Precision    F-measure
P                  0.942     0.84         0.888
NP                 0.881     0.958        0.918
Weighted Average   0.905     0.911        0.906

Table 5. Classification Measures using J48 Classifier on Training Set - Task 2

Class              Recall    Precision    F-measure
P                  0.518     0.540        0.529
SP                 0.515     0.478        0.496
NP                 0.869     0.892        0.880
Weighted Average   0.668     0.673        0.670

Figure 2. ROC Curve with Decision Tree (J48) for Class P in Task-1
Figure 3. ROC Curve with Decision Tree (J48) for Class P in Task-2

Table 4 and Table 5 report the per-class classification performance using the J48 classifier for Task-1 and Task-2 respectively. In Task-1, an accuracy of 90.52% is obtained in the training phase, while in Task-2 an accuracy of 66.77% is obtained. The proposed approach was originally developed for plagiarism identification and classification in the English language; the results obtained in Task-1 reflect the potency of our model to be extended to other languages. Task-2 can be further improved by extracting more significant features and using advanced NLP techniques.

Figure 4. Results with Run 1 and Run 2 Submission on Test Data

Compared to the training results, our results exhibited significant variation during the testing phase. Figure 4 plots the test results obtained using the proposed approach. The test results presented a considerable drop: in contrast to the 90.52% accuracy on the training set (Task-1), the test set presented only 35.88% accuracy. A similar drop was noted in Task-2 (Run-1 in Figure 4). On rechecking the submission, we found that the results had been submitted wrongly. In the Run-1 submission, the first sentence pair was not written to the final output file, making the second pair the first, the third the second, and so on, thus completely altering our results.

On request to DPIL, our results were re-evaluated. The results of Run 2 are the final results of the proposed approach. It is observed from Figure 4 that an accuracy of 89% in Task-1 and 66.6% in Task-2 is obtained on the test sets for Run-2, which is consistent with the training results.

5. CONCLUSIONS & FUTURE WORK
In the proposed approach, natural language processing techniques and semantic similarity computations are used to classify a Hindi sentence pair as paraphrased or not. Part-of-speech tagging is used so that only relevant tags are compared within each sentence pair. A semantic similarity metric is employed which extracts word synonyms from WordNet to check whether the compared words are synonyms. This facilitates detailed analysis and comparison, and helps in unmasking paraphrasing imposed by synonym replacements. The metric as a whole helps in detecting and classifying paraphrased and non-paraphrased sentence pairs effectively. In future, deeper natural language processing techniques and intelligent computing techniques can be explored; such advanced techniques remain little explored for paraphrase detection in Indian languages.

6. ACKNOWLEDGMENTS
The authors gratefully acknowledge the Department of Science and Technology, Govt. of India (www.dst.gov.in), for sponsoring this research project, Sanction No. SB/FTP/ETA-0212/2014-2016.

7. REFERENCES
[1] Dolan, W.B., Quirk, C., and Brockett, C. 2004. Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th International Conference on Computational Linguistics, Geneva, Switzerland.
[2] Mihalcea, R., Corley, C., and Strapparava, C. 2006. Corpus-based and knowledge-based measures of text semantic similarity. In Proceedings of the National Conference on Artificial Intelligence (AAAI 2006), Boston, Massachusetts, 775-780.
[3] Rus, V., McCarthy, P.M., Lintean, M.C., McNamara, D.S., and Graesser, A.C. 2008. Paraphrase identification with lexico-syntactic graph subsumption. FLAIRS 2008, 201-206.
[4] Fernando, S., and Stevenson, M. 2008. A semantic similarity approach to paraphrase detection. Computational Linguistics UK (CLUK 2008) 11th Annual Research Colloquium.
[5] Blacoe, W., and Lapata, M. 2012. A comparison of vector-based representations for semantic composition. In Proceedings of EMNLP, Jeju Island, Korea, 546-556.
[6] Islam, A., and Inkpen, D. 2007. Semantic similarity of short texts. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2007), Borovets, Bulgaria, 291-297.
[7] Ul-Qayyum, Z., and Altaf, W. 2012. Paraphrase identification using semantic heuristic features. Res. J. of Appld. Sci., Engg. and Tech., 4(22), 4894-4904.
[8] Vani, K., and Gupta, D. 2016. Study on extrinsic text plagiarism detection techniques and tools. J. of Engg. Sci. and Tech. Review, 9(4), 150-164.
[9] Mahalakshmi, S., Anand Kumar, M., and Soman, K.P. 2015. Paraphrase detection for Tamil language using deep learning algorithm. Int. J. of Appld. Engg. Res., 10(17), 13929-13934.
[10] Mahalakshmi, S., Anand Kumar, M., and Soman, K.P. 2015. AMRITA CEN@SemEval-2015: Paraphrase detection for Twitter using unsupervised feature learning with recursive autoencoders. SemEval-2015, 45.
[11] Wu, W., Ju, Y.-C., Li, X., and Wang, Y.-Y. 2010. Paraphrase detection on SMS messages in automobiles. In Acoustics, Speech and Signal Processing (ICASSP), 2010, 5326-5329.
[12] Sethi, N., Agrawal, P., Madaan, V., and Kumar Singh, S. 2016. A novel approach to paraphrase Hindi sentences using natural language processing. Ind. J. of Sci. and Tech., 9(28).
[13] Anand Kumar, M., Singh, S., Kavirajan, B., and Soman, K.P. 2016. DPIL@FIRE2016: Overview of shared task on Detecting Paraphrases in Indian Languages. In Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India.
[14] Charniak, E. 1997. Statistical techniques for natural language parsing. AI Magazine, 18(4), 33-44.
[15] Miller, G.A. 1995. WordNet: A lexical database for English. Commun. of the ACM, 38(11), 39-41.
[16] Bhingardive, S., Shukla, R., Saraswati, J., Kashyap, L., Singh, D., and Bhattacharyya, P. 2016. Synset ranking of Hindi WordNet. In Proceedings of the Language Resources and Evaluation Conference, Portorož, Slovenia.
[17] Gupta, D., Vani, K., and Singh, C.K. 2014. Using natural language processing techniques and fuzzy-semantic similarity for automatic external plagiarism detection. In Proceedings of the International Conference on Advances in Computing, Communications and Informatics, Noida, 2694-2699.
[18] Vani, K., and Gupta, D. 2015. Investigating the impact of combined similarity metrics and POS tagging in extrinsic text plagiarism detection system. In Proceedings of the International Conference on Advances in Computing, Communications and Informatics, Kochi, India, 1578-1584.