BITS_PILANI@DPIL-FIRE2016:Paraphrase Detection in Hindi Language using Syntactic Features of Phrase Rupal Bhargava1 Anushka Baoni2 Harshit Jain3 Yashvardhan Sharma4 WiSoc Lab, Department of Computer Science Birla Institute of Technology and Science, Pilani Campus Pilani-333031 {rupal.bhargava1 , f20136832 , f20132893 ,yash4 } @pilani.bits-pilani.ac.in ABSTRACT Hindi and other Indian languages, not much work has been Paraphrasing means expressing or conveying the same mean- done and there is a lot of scope for research. The most com- ing or essence of a sentence or text using different words or mon way of detecting paraphrases is modeling the problem rearrangement of words. Paraphrase detection is a chal- as a classification problem. This paper implements a super- lenge, especially in Indian languages like Hindi, because it is vised classification model for detecting Paraphrases. POS very essential to understand the semantics of the language. Tags, Stems of Words and Sound-ex codes corresponding to Detecting paraphrases is very relevant in real life because the words in sentences are used as features. it has a lot of importance in applications like Information The rest of the paper is organized as follows: Section 2 Retrieval, Extraction and Text Summarization. This paper discusses related work in the area of Paraphrase Detection. focuses on using Machine Learning classification techniques Section 3 presents the analysis of the Data set provided by for detecting paraphrases in Hindi language for the DPIL DPIL task organizers. Section 4 discusses the methodology Task in Fire 2016. A feature vector based approach has been used and Section 5 explains the proposed algorithm. Section used for detecting paraphrases. The task involves checking 6 gives a detailed analysis of the results obtained and error whether a given pair of sentences conveys the same informa- analysis. Section 7 presents the conclusion and possible fu- tion and meaning even if they are written in different forms. ture work. Given a pair of sentences in Hindi, the proposed technique labels whether the pair of sentences are Paraphrases (P), 2. RELATED WORK Semi-Paraphrases (SP) or Not Paraphrases (NP). Paraphrase detection has been a major area of research in the recent times because of its significance in many ar- CCS Concepts eas of Natural Language Processing. Few of the approaches •Information systems → Summarization; Information adopted for English language are mentioned in this section. integration; Data analytics; Data mining; Huang et al. [4] has proposed an unsupervised recursive auto-encoder architecture for paraphrase detection. The re- Keywords cursive auto-encoder uses tanh as the sigmoid-like activation function and gives the representation of sentences along with Paraphrase Detection, Text Summarization, Classification, their sub-phrases. These representations are then used for Machine Learning paraphrase detection. To extract the same number of fea- tures for different sentence pairs, two approaches are used, 1. INTRODUCTION aggregating representations to form a single feature and us- The word ’paraphrase’ means rephrasing or restating the ing a similarity matrix approach. With first approach they meaning of a paragraph or text using some other words or achieved 66.49% accuracy while with the second method ac- vocabulary. Paraphrase detection is an important task for curacy of 68.06% was achieved . Kotti et al.[10] also pro- many natural language processing applications. Some of the posed an unsupervised feature learning technique with Re- applications involve question-answering systems, machine cursive Auto-encoders (RAE) for detecting paraphrases on translation systems, systems used for plagiarism checks, find- twitter. In their proposed technique they first converted ing similarities between sentences, text summarizers, etc. data to parse trees using phrase-structure parser and then Plagiarized texts usually copy phrases as it is or replace passed it to the RAE for training. The vector generated from some words with similar words. Paraphrase detection will the RAE is converted to form a similarity matrix and thus help in detecting plagiarized work and ensure that the doc- paraphrase detection is done using this matrix. Fernando et uments written are unique and not copied. Question An- al.[3] presented an algorithm using word similarities whereas swering system makes use of paraphrases to find the correct Ngoc et al. [11] proposed simple features like n-grams, edit answers to asked questions. A lot of work has been done distance scores, METEOR word alignment, BLEU for de- in paraphrase detection for English language. However for tecting paraphrases and semantic similarity tasks on twitter data. Similarly, analysis of various similarity measures like sentence-level edit distance measure, simple n-gram overlap measure, exclusive longest common prefix (LCP) n-gram measure, BLEU measure and sumo measure along with a paraphrase detection based on abductive machine learning has been proposed in [2]. Sethi et al. [9] proposed a tech- nique for paraphrasing or re-framing Hindi sentences using NLP. The main steps involved dividing the paragraph into sentences, tokenizing the sentences into words, applying re- framing rules and then combining the results to form new paragraphs. Malakasiotis et al. [5] proposed three methods for paraphrase detection using string similarity measures. 3. DATA ANALYSIS The data-set provided by the task organizers [1] is from newspaper domain and contains pairs of sentences. There are two Subtasks and each Subtask has its own training and testing data. Figure 2: Data Analysis of Paraphrase and Semi 3.1 SubTask 1 Paraphrase for SubTask 2 The pairs of sentences in the Training Data set contains 1000 ’Paraphrases’ (P) and 1500 ’Not Paraphrases’ (NP). is done which involves converting the xml format Data Set Test Data set for SubTask 1 consisted of 900 pairs for Hindi into csv format so that the data can be read from the csv Language.The number of paraphrases with common words file and processed for extracting features. versus the number of common words is shown in Figure 1. Second phase processes the training data to extract im- For e.g A point (5,72) represents that there are 72 such para- portant features from the data so that the proposed classifi- phrases which have five common words. cation model could be trained. The following three features were extracted for the proposed training model: 1. POS Tags: POS (Part-Of-Speech) Tags are labels that are given to words to identify the part of speech or lexical categories of words. The eight parts of speech are: the verb, the noun, the pronoun, the adjective, the adverb, the preposition, the conjunction, and the interjection. Words that have the same POS Tags play similar roles in the grammatical structure of sen- tences. For obtaining the respective POS tags for the Hindi words, RDRPOSTagger1 [6] was used. The in- put passed to the RDRPOSTagger contains the pairs of sentences and the output generated by RDRPOSTag- ger had the respective POS Tags next to each word. Only the POS Tags corresponding to each word in the sentence are extracted from the output and appended to form a string thus obtaining POS Tags for each sen- tence in the data set. Figure 1: Data Analysis of Paraphrase for SubTask 2. Stem of the words: Stemming is a process of ex- 1 tracting the ’word stem’ or ’root’ of the word. For extracting the stem of the Hindi words, a Hindi stem- mer2 was used which implements the suffix-stripping 3.2 SubTask 2 algorithm described in [8]. A string for each sentence For Subtask 2,Training Data set consisted of 1000 pairs in the data set with the corresponding stems of the of sentences that are Paraphrases (P), 1000 pairs that are Hindi words is then obtained. Semi-Paraphrases (SP) and 1500 that are Not Paraphrases (NP). For Test Data set, 1400 pairs of Hindi sentences were 3. Soundex codes: Soundex is a phonetic algorithm provided.The number of Paraphrases and Semi-Paraphrases for indexing names by sound as pronounced in En- with common words versus the number of common words is glish. Soundex3 provides an implementation of the shown in Figure 2. modified version of soundex algorithm for Indian lan- guages including Hindi. This package is used for the 4. PROPOSED TECHNIQUE 1 https://rdrpostagger.sourceforge.net 2 The proposed work has been divided in multiple phases http://research.variancia.com/hindi stemmer/ 3 as shown in Figure 3. Initially pre-processing of the data https://pypi.python.org/pypi/soundex/ corresponding soundex codes for the words in the sen- tences. Using soundex codes for words in the sentence, a string comprising of soundex codes corresponding to each sentence is generated. After extracting these three features, the similarity scores corresponding to each feature has been calculated. The python package, fuzzywuzzy 4 is used to calculate the simi- larity scores. Each similarity score lies in the range [0,1] and uses Levenshtein Distance to calculate the differences be- tween string sequences. The Levenshtein distance between two words is the minimum number of single-character ed- its (i.e. insertions, deletions or substitutions) required to change one word into the other. The similarity score is cal- culated for each pair of POS Tags sentences (feature 1), sentences with stem of the words (feature 2) and sentences with soundex codes corresponding to the Hindi words (fea- ture 3) hence creating a feature vector with the similarity scores corresponding to the sentence pair. After feature vector generation, different machine learn- ing techniques are used for training so that the best model for predicting the labels could be chosen after analysis. For SubTask 1 and SubTask 2, Logistic Regression, Naive Bayes, Random Forest Classifier and Support Vector Machine were used for classification. These models were implemented us- ing the python library sklearn [7]. 5. ALGORITHM Algorithm 1 takes Paraphrases as input where each Para- phrase(P[i]) contains two Hindi sentences (P[i].Sentence)and outputs a Label for its corresponding Paraphrases. The functions PosTags, WordStem and Soundex, each take Sen- tences of Paraphrase as its parameter and return the array of corresponding POS Tagged Sentences, WordStem Sentences and Sentences with Soundex Codes respectively. Similari- tyScore generates the similarity score for each of its input Figure 3: Block diagram for Paraphrase Detection array. SimScore1, SimScore2 and SimScore3 are the indi- vidual vectors for the three features, which are then passed to the CreateVector function to form the final FeatureVec- by changing and adjusting parameters in the functions pro- tor. Classifier function takes the FeatureVector as input, vided by sklearn [7] for Logistic Regression. As SubTask 1 assigns labels to the Paraphrases and then returns a La- was a binary classification problem hence results obtained belVector. Classifier function implements different models via Logistic Regression were better than the others. On (Logistic Regression, Naive Bayes, SVM and Random For- the other hand, SubTask 2 was a multi-class classification est) for predicting labels. problem (Labels-P, NP or SP). Hence in this case, Random Forest gave the best results with 69.2% accuracy and 68.8% 6. EXPERIMENTS AND RESULTS F-measure followed by Naive Bayes (64.6% accuracy and 62.4% F-measure) as shown in Figure 5. Random Forest 6.1 Evaluation and Discussion calculates labels by using sub samples of the data set and uses averaging to improve the accuracy whereas Naive Bayes To test the accuracy and F-measure, data set provided by uses a conditional probability approach for assigning labels. the task organizer was divided into a ratio of 75% and 25% Hence runs submitted for SubTask 1 used Logistic Re- for training and testing respectively. The results (Accuracy gression classifier and SubTask 2 used Random Forest. As and F-Measure) were evaluated using sklearn [7] for the dif- per the final results declared by the Task organizers, the ferent models (Logistic Regression, Naive Bayes, SVM and proposed technique was ranked third when compared with Random Forest). Results obtained for SubTask 1 is shown other teams with Accuracy of 0.897 and F-measure of 0.89 in Figure4. Proposed system gave an accuracy of 90.4% and as shown in Figure 6 and 7 respectively for SubTask 1. In F-measure 87.6% for Logistic Regression followed by Naive SubTask 2, the proposed technique is ranked fifth with Ac- Bayes and Random Forest, both with 89.5% accuracy. For curacy and F-measure of 0.717 and 0.712 as shown in Figure binary classification problems, logistic regression gives the 8 and Figure 9 respectively. best results in most cases because it assigns labels by cal- culating odds ratio and then applies a non-linear log trans- 6.2 Error Analysis formation. Moreover, the performance can be fine-tuned Few errors that could have attributed to the decrease in 4 https://pypi.python.org/pypi/fuzzywuzzy evaluation measures can be- Algorithm 1 Algorithm for Detecting paraphrases 1: Input: Paraphrase P, where all paraphrases have a unique id and contains two sentences (Hindi) 2: Output: LabelVector gives the corresponding labels for the paraphrases. Depending upon the task it can have value of P, NP and SP 3: Initialization: SimScore1[]=0,SimScore2[]=0,SimScore3[]=0 4: for i=0 to P.Count do 5: Pos[]=PosTags (P[i].Sentence) 6: Stem[]=WordStem (P[i].Sentence) 7: Sound[]=Soundex (P[i].Sentence) 8: SimScore1.append (SimilarityScore(Pos[])) 9: SimScore2.append (SimilarityScore(Stem[])) 10: SimScore3.append (SimilarityScore(Sound[])) 11: end for 12: FeatureVector=CreateVector(SimScore1, SimScore2, SimScore3) 13: LabelVector=Classifier(FeatureVector) Figure 5: Results for Subtask 2 using different clas- sifier for proposed system Figure 4: Results for Subtask 1 using different clas- sifier for proposed system Figure 6: Accuracy comparison for all teams in Sub- 1. RDRPOS Tagger- Nguyen et al.[6] states that the RDR- Task 1 POSTagger achieves a very competitive accuracy in comparison to the state-of-the-art results. But a dif- ferent Hindi POS Tagger can also be used to improve this phase. Also RDRPOSTagger can be combined of 89% for SubTask 1 using Logistic Regression. For Sub- with an external initial tagger to increase its accuracy. Task 2, proposed system gave an accuracy of 71.7% and F-measure of 71.2% using Random Forest Classifier as eval- 2. Similarly, the Hindi Stemmer used might have incor- uated by task organizers. The model accuracy can be fur- rectly returned the stem words, which can be a reason ther improved by incorporating more features like calculat- for wrongly classified Paraphrases. The algorithm for ing similarity between two strings having only nouns of the extracting the root words can be improved further to original sentences as identified by the POS Tagger, replac- better the results. ing the nouns by their soundex codes or their stems. Only 3. Other factors that could have led to errors are accuracy verbs of the original sentences can also be used to obtain fea- of soundex library and similarity measure used. tures where the verbs are replaced by their soundex codes or stems. The current model has been trained on the data set provided by task organizers. We can incorporate more 7. CONCLUSIONS AND FUTURE WORK data to extend the model. Using an ensemble classifier and In this paper, a feature vector based approach with three combining different models like Decision Trees, Naive Bayes, features (POS Tags, Word Stems and Soundex codes) is dis- SVM, etc. can be used for predicting labels that may further cussed for paraphrase detection of Hindi Language. Leven- improve results. Moreover the proposed technique only uses shtein Distance was used to calculate the similarity measure. syntactic features, semantic features can be incorporated for Proposed system achieved accuracy of 89.7% and F-measure improvising the algorithm. Figure 7: F-Measure comparison for all teams in Figure 9: F-Measure comparison for all teams in SubTask 1 SubTask 2 sive autoencoder. Source:[http://nlp. stanford. edu/courses/cs224n/2011/reports/ehhuang. pdf ], 2011. [5] P. Malakasiotis. Paraphrase recognition using machine learning to combine similarity measures. In Proceedings of the ACL-IJCNLP 2009 Student Research Workshop, pages 27–35. Association for Computational Linguis- tics, 2009. [6] D. Q. Nguyen, D. D. P. Dai Quoc Nguyen, and S. B. Pham. Rdrpostagger: A ripple down rules-based part- of-speech tagger. In Proceedings of the Demonstra- tions at the 14th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2014, pages 17–20. Citeseer, 2014. [7] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, Figure 8: Accuracy comparison for all teams in Sub- R. Weiss, V. Dubourg, et al. Scikit-learn: Machine Task 2 learning in python. Journal of Machine Learning Re- search, 12(Oct):2825–2830, 2011. References [8] A. Ramanathan and D. D. Rao. A lightweight stemmer for hindi. In the Proceedings of EACL, 2003. [1] M. Anand Kumar, S. Shivkaran, B. Kavirajan, and K. P. Soman. DPIL@FIRE2016: Overview of shared [9] N. Sethi, P. Agrawal, V. Madaan, and S. K. Singh. task on detecting paraphrases in indian languages. In A novel approach to paraphrase hindi sentences using Working notes of FIRE 2016 - Forum for Information natural language processing. Indian Journal of Science Retrieval Evaluation, Kolkata, India, December 7-10, and Technology, 9(28), 2016. 2016, CEUR Workshop Proceedings. CEUR-WS.org, 2016. [10] M. S. Sundaram, K. Madasamy, and S. K. Padannayil. : Paraphrase detection for twitter using unsupervised [2] E.-S. M. El-Alfy, R. E. Abdel-Aal, W. G. Al-Khatib, feature learning with recursive autoencoders. In Work- and F. Alvi. Boosting paraphrase detection through shop Proceedings of the International Workshop on Se- textual similarity metrics with abductive networks. Ap- mantic Evaluation 2015 (Sem Eval-2015), Denver, Col- plied Soft Computing, 26:444–453, 2015. orado, US, pages 45–50. Citeseer, 2009. [11] N. P. A. Vo, S. Magnolini, and O. Popescu. Paraphrase [3] S. Fernando and M. Stevenson. A semantic similarity identification and semantic similarity in twitter with approach to paraphrase detection. In Proceedings of the simple features. In The 3rd International Workshop on 11th Annual Research Colloquium of the UK Special In- Natural Language Processing for Social Media, page 10, terest Group for Computational Linguistics, pages 45– 2015. 52. Citeseer, 2008. [4] E. Huang. Paraphrase detection using recur-