=Paper=
{{Paper
|id=Vol-1737/T6-8
|storemode=property
|title=Anuj@DPIL-FIRE2016: A Novel Paraphrase Detection Method in Hindi Language using Machine Learning
|pdfUrl=https://ceur-ws.org/Vol-1737/T6-8.pdf
|volume=Vol-1737
|authors=Anuj Saini
|dblpUrl=https://dblp.org/rec/conf/fire/Saini16a
}}
==Anuj@DPIL-FIRE2016: A Novel Paraphrase Detection Method in Hindi Language using Machine Learning==
Anuj Saini
Sapient Global Markets, Gurgaon, Haryana
asaini13@sapient.com

ABSTRACT

Every language admits several plausible interpretations, and with the evolution of the web, smart devices and social media, identifying these syntactic and semantic ambiguities has become a challenging task. In Natural Language Processing, two statements written with different words but carrying the same meaning are termed paraphrases. At FIRE 2016, we worked on the problem of detecting paraphrases for the shared task DPIL (Detecting Paraphrases in Indian Languages), specifically in the Hindi language. This paper proposes a novel approach to identify whether two statements are paraphrased, using several machine learning algorithms, namely Random Forest, Support Vector Machine, Gradient Boosting and Gaussian Naïve Bayes, on the training data sets of the two subtasks. In cross-validation experiments, Random Forest leads the other methods in terms of F1-score. The experimental results show that the ensemble learning method gives better performance than the individual approaches for this classification problem. Paraphrase detection can be used in applications such as question-answering systems, document clustering, machine translation, text summarization, plagiarism detection and many more.

CCS Concepts
• Theory of computation → Random Forest
• Computing methodologies → Natural Language Processing

Keywords
Paraphrase detection; Machine Learning; Natural Language Processing; Soundex; Semantic Similarity; Random Forest

1. INTRODUCTION

With the plethora of information generated by the web these days, it is challenging to understand the semantics of different languages, since each language has its own script and linguistic rules. The Hindi language is still at an early stage with respect to natural language processing and its applications [1]. A significant amount of research has already been done for English, but there is a huge scope for Hindi. Paraphrases are sentences or phrases which convey the same meaning using different words [2]. Paraphrase detection is an important building block in the Natural Language Processing (NLP) pipeline [8]. Previously, many researchers have investigated ways of automatically detecting paraphrases in formal texts [3]. Various state-of-the-art paraphrase identification techniques have been summarized by the ACL wiki [4]. Our work is motivated by the shared task DPIL [11] organized by the Forum for Information Retrieval Evaluation (FIRE 2016).

Two subtasks were given for the classification problem. Subtask 1 is to classify a pair of Hindi sentences into two classes: P (Paraphrased) or NP (Non-Paraphrased). Subtask 2 is to classify a pair into three classes: P (Paraphrased), SP (Semi-Paraphrased) or NP (Non-Paraphrased). For detecting paraphrases, it is very important to understand the language at an early stage: tokenization, stemming, lemmatization, stop words, phonetics, POS tags and so on need to be handled before two texts can be compared. Since machine learning algorithms work on numeric data, the textual data must be converted into numbers, known as text vectors; here a vector is the numerical representation of the comparison of two sentences. We generate a set of vectors for each data point using the first two steps of our proposed approach and train the model in the third step to obtain the results.

The paper is organized as follows. Section 2 describes the training dataset provided by the organizing committee. Section 3 presents the proposed approach for paraphrase detection. The experimental evaluation is carried out in Section 4. Section 5 concludes our work, followed by the acknowledgments and references in Sections 6 and 7 respectively.

2. DATA SET DESCRIPTION

There were a total of 2500 data points in the training set with 2 classes (P and NP) for classification in Subtask 1, whereas Subtask 2 had 3 classes and a total of 3500 data points. The distribution of classes and their counts for both tasks is given in Table 1.

Table 1. Classes and their counts for Subtask 1 and Subtask 2

  Class   Subtask 1   Subtask 2
  P       1500        1500
  NP      1000        1000
  SP      -           1000

Each data point contains a pID (a unique id for the data point), two Hindi sentences, and their final tagged class.
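For concreteness, such a record can be represented as a small typed tuple. The sketch below is illustrative only: the `DataPoint` type and `load_pairs` helper are ours, and it assumes the pairs have been exported to a tab-separated file, which is not necessarily the format the organizers actually distributed.

```python
from typing import List, NamedTuple

class DataPoint(NamedTuple):
    pid: str        # unique id, e.g. "HIN2802"
    sentence1: str  # first Hindi sentence
    sentence2: str  # second Hindi sentence
    label: str      # "P", "NP" or, in Subtask 2 only, "SP"

def load_pairs(path: str) -> List[DataPoint]:
    """Read one tab-separated record per line (hypothetical export format)."""
    points = []
    with open(path, encoding="utf-8") as f:  # UTF-8 handles the Devanagari script
        for line in f:
            pid, s1, s2, label = line.rstrip("\n").split("\t")
            points.append(DataPoint(pid, s1, s2, label))
    return points
```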
The initial analysis of the data surfaced some noise, mainly in the NP class. A few examples of false negatives and false positives were identified and are listed in Table 2: the examples tagged NP are false negatives and the example tagged P is a false positive.

Table 2. False negatives and false positives in the dataset

  HIN2802 (Class: NP)
    Sentence1: रविशंकर ने कहा मुझे लगता है कि आईएसआईएस कोई पीस टॉक नहीं चाहता
    Sentence2: मुझे लगता है कि आईएसआईएस कोई पीस टॉक नहीं चाहता: रविशंकर

  HIN2852 (Class: NP)
    Sentence1: पाक के फॉरेन सेक्रेटरी एजाज अहमद चौधरी मंगलवार को हार्ट ऑफ एशिया के ऑफिशियल्स की बैठक में भाग लेने के लिए भारत पहुंच रहे हैं
    Sentence2: पाकिस्तान के विदेश सचिव एजाज अहमद चौधरी हैदराबाद हाउस में हार्ट ऑफ एशिया सम्मेलन में भाग लेंगे

  HIN2958 (Class: NP)
    Sentence1: इस किताब में कांग्रेस उपाध्यक्ष राहुल गांधी को करिश्माई नेता भी बताया गया है
    Sentence2: इस किताब में कांग्रेस उपाध्यक्ष राहुल गांधी को करिश्माई नेता भी बताया गया और इसको लेकर लोकसभा में बुधवार को सत्तापक्ष और विपक्ष के बीच काफी नोकझोंक हुई

  HIN3032 (Class: NP)
    Sentence1: बॉलीवुड अभिनेत्री बिपाशा बसु ने शनिवार को एक निजी समारोह में करण सिंह ग्रोवर के साथ शादी रचाई
    Sentence2: बॉलीवुड एक्ट्रैस बिपाशा बसु और करण सिंह ग्रोवर ने शनिवार को मुंबई में की शादी

  HIN0230 (Class: P)
    Sentence1: मुद्राकोष में बढ़ेगा भारत का रुतबा
    Sentence2: आईएमएफ के दस बड़े सदस्य देशों में भारत भी शामिल

We accepted this noise in the given training data without any filtering of such misclassified records and trained the model on all the data points.

3. THE PROPOSED METHOD

The proposed approach consists of three major steps: text preprocessing, feature generation and the classification model. Text preprocessing is done in several steps such as tokenization, stemming, soundex normalization, stop-word removal and the handling of synonyms. Feature generation involves the creation of five new features which are used as input to the classifier for the classification of paraphrases. The classification model step covers the training of the model with the four machine learning algorithms. All the steps are described below.

3.1 Text Preprocessing

For each data point, the preprocessing steps are as follows; a sketch of the complete pipeline is given after Table 3.

3.1.1 Text Encoding
We encode the sentences using the standard UTF-8 encoding, which handles the script of the Hindi language.

3.1.2 Tokenization
We tokenize the sentences into words using the NLTK library.

3.1.3 Phonetics Transformation
We apply a custom set of rules for phonetics. The normalization of phonetics is done using a Soundex-style algorithm [9]: it looks for specific characters and replaces them with their phonetic equivalents, for example न् → न or ज़ → ज.

3.1.4 Token Stemming
We apply a further set of rules to stem each token to its base form: a set of Hindi suffixes is stripped to obtain the normalized Hindi word. For example, the suffix lists include

  ["ो", "े", "ू", "ु", "ी", "ि", "ा"]
  ["कर", "ाओ", "िए", "ाई", "ाए", "ने", "नी", "ना", "ते", "ीं", "ती", "ता", "ाँ", "ां", "ों", "ें"]
  ["ाकर", "ाइए", "ाईं", "ाया", "ेगी", "ेगा", "ोगी", "ोगे", "ाने", "ाना", "ाते", "ाती", "ाता", "तीं", "ाओं", "ाएं", "ुओं", "ुएं", "ुआं"]

3.1.5 Stop Words Filtering
We remove irrelevant words from the sentences using a standard list of 164 Hindi stop words, for example [सारा, से, सो, संग, ही, हुआ, हुई, हुए].

3.1.6 Synonyms Expansion
We use Hindi WordNet, an extensive lexical dictionary of the Hindi language with around 40K synsets, to fetch synonyms for Hindi words. It was developed by researchers at the Center for Indian Language Technology, Computer Science and Engineering Department, IIT Bombay, and we downloaded it from the link mentioned in [5].

All the text preprocessing steps are summarized with examples in Table 3.

Table 3. Text preprocessing steps

  Step                 Input                                              Preprocessed output
  Tokenization         कपिल शर्मा फोर्ब्स                                     [कपिल, शर्मा, फोर्ब्स]
  Soundex              हजअर लोट                                           हज़अर लौट
  Stemming             अनियमितताओं, दिल्ली                                  अनियमित, दिल्ल
  Stop Words Removal   काफी निराश था और ड्रेसिंग रूम में लोटते हुए रोने लगा        काफी निराश ड्रेसिंग रूम लोटते हुए रोने लगा
  Synonyms             इंडिया आखिर                                         भारत फाइनल
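This preprocessing pipeline can be sketched in Python roughly as follows. It is a minimal illustration rather than the paper's exact code: the phonetic rules, suffix list and stop words shown are small subsets of the rule sets described above, and the function names are ours.

```python
import nltk  # requires a one-time nltk.download("punkt") for the tokenizer

# Small illustrative subsets; the paper uses fuller rule sets and a
# standard list of 164 Hindi stop words.
PHONETIC_RULES = {"ज़": "ज", "क़": "क", "फ़": "फ"}  # Soundex-style replacements
SUFFIXES = ["ताओं", "ाओं", "ता", "ती", "ने", "ी", "े", "ो"]  # longest first
STOP_WORDS = {"से", "सो", "संग", "ही", "हुआ", "हुई", "हुए", "था", "और", "में"}

def normalize_phonetics(token: str) -> str:
    """Replace characters with their phonetic equivalents (Section 3.1.3)."""
    for src, dst in PHONETIC_RULES.items():
        token = token.replace(src, dst)
    return token

def stem(token: str) -> str:
    """Strip the first matching Hindi suffix (Section 3.1.4)."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 1:
            return token[: -len(suffix)]
    return token

def preprocess(sentence: str) -> list:
    """Tokenize, phonetically normalize, stem and stop-word-filter a sentence."""
    tokens = nltk.word_tokenize(sentence)
    tokens = [stem(normalize_phonetics(t)) for t in tokens]
    return [t for t in tokens if t not in STOP_WORDS]
```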
3.2 Feature Generation

After the preprocessing of the sentences, the following features are generated in the form of vectors to be passed as input to the classifier.

3.2.1 Common Tokens
The number of common tokens between the two sentences is used as a feature. These tokens are obtained by comparing the preprocessed tokens after stop-word removal and taking their intersection:

  Tokens(sentence1) ∩ Tokens(sentence2)

3.2.2 Normalized Common Tokens
We normalize the common-token count from the first feature to compute the proportion of token overlap between the two sentences. The value lies in the range 0 to 1: it is 0 when there are no common tokens between the two sentences and 1 when the token sets of the two sentences are exactly the same. Mathematically, it is the number of common tokens divided by the number of unique tokens across both sentences:

  |Common tokens| / |Unique tokens(sentence1, sentence2)|
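A minimal sketch of these two overlap features, assuming the sentences have already been preprocessed into token lists as in Section 3.1 (the function name is ours):

```python
def common_token_features(tokens1: list, tokens2: list) -> tuple:
    """Return (common-token count, normalized overlap in [0, 1])."""
    set1, set2 = set(tokens1), set(tokens2)
    common = set1 & set2  # Tokens(sentence1) ∩ Tokens(sentence2)
    unique = set1 | set2  # unique tokens across both sentences
    normalized = len(common) / len(unique) if unique else 0.0
    return len(common), normalized
```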
3.2.3 Common IDF Score
The sum of the IDF scores of the tokens common to the two sentences is used as a numeric similarity feature:

  Σ IDF(t) over the common tokens t

IDF (Inverse Document Frequency) is the inverse of document frequency and is used to identify the importance of a token in a given corpus; in the standard formulation, IDF(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing token t. Here a document is an individual sentence. IDF is first calculated for all tokens and kept for reference; during feature generation, the IDF values of the common tokens are looked up and summed to form the feature.

3.2.4 Normalized Common IDF Score
Here the proportion of the IDF score of the common tokens between the two sentences is computed for normalization and used as a feature. This gives a normalized common IDF score in the range 0 to 1:

  Σ IDF(common tokens) / Σ IDF(unique tokens of sentence1 and sentence2)

3.2.5 Sentence Lengths
The token counts of sentence 1 and sentence 2 are used as two separate feature columns.

3.3 Classification Model

The features generated in Section 3.2 are used as training data to train the classifiers with the Python Scikit-learn library. We use four popular machine learning algorithms, namely Random Forest, Support Vector Machine, Gradient Boosting and Gaussian Naïve Bayes, and compare their performance. Random Forest, being an ensemble learning method, outperforms the other individual methods; it is implemented by growing many classification trees and having them "vote" for a final decision according to a majority rule [6]. We focused on tuning its hyperparameters to enhance the predictive ability of the model. The key parameters are as follows.

3.3.1 n_estimators
The number of trees to build before taking the vote for the final decision. More trees generally give better performance, but also increase the time complexity.

3.3.2 max_depth
The maximum depth of each tree, which needs to be tuned.

3.3.3 min_samples_leaf
The minimum number of samples (observations) required in a terminal node of a tree.

3.3.4 min_samples_split
The minimum number of samples (observations) needed in a node for it to be considered for splitting [7].

We selected the best set of hyperparameters for Random Forest using Scikit-learn's grid search, which resulted in the values n_estimators = 500, max_depth = 10, min_samples_leaf = 4 and min_samples_split = 4, and trained on our training data. The overall training time of the model is less than 1 second on a quad-core machine with 8 GB of RAM.
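This tuning and evaluation step can be sketched with scikit-learn as follows. The placeholder data and the search grid are assumptions made for illustration; only the reported best values (500, 10, 4, 4) come from the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

# Placeholder data so the sketch runs; replace with the real five-column
# feature matrix from Section 3.2 and the P/NP (or P/SP/NP) labels.
rng = np.random.default_rng(0)
X = rng.random((200, 5))
y = rng.choice(["P", "NP"], size=200)

# Assumed search space around the reported best values.
param_grid = {
    "n_estimators": [100, 500],
    "max_depth": [5, 10, 20],
    "min_samples_leaf": [2, 4],
    "min_samples_split": [2, 4],
}

search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, scoring="f1_weighted", cv=10)
search.fit(X, y)
print(search.best_params_)  # paper reports 500, 10, 4, 4

# 10-fold cross validation of the tuned model, as used in Section 4.
scores = cross_val_score(search.best_estimator_, X, y, cv=10,
                         scoring="accuracy")
print("overall accuracy: %.2f" % scores.mean())
```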
4. EXPERIMENTAL RESULTS

We used 10-fold cross validation to compute the overall accuracy of the system. Three evaluation metrics are considered in this work: precision, recall and F1-score. We calculated the values of these metrics for each of the four algorithms. For Subtask 1 we obtained an overall accuracy of 0.92, with a maximum F1-score of 0.94 for the Random Forest algorithm. The detailed performance matrix of the model for the 2-class prediction is given in Table 4.

Table 4. Subtask 1 scores summary

  Algorithm              Class       Precision   Recall   F1-Score
  Random Forest          P           0.94        0.93     0.94
                         NP          0.90        0.91     0.90
                         Avg/Total   0.92        0.92     0.92
  Gradient Boosting      NP          0.93        0.91     0.92
                         P           0.87        0.89     0.88
                         Avg/Total   0.90        0.90     0.90
  SVM                    NP          0.92        0.84     0.88
                         P           0.79        0.89     0.84
                         Avg/Total   0.87        0.86     0.86
  Gaussian Naïve Bayes   NP          0.92        0.94     0.93
                         P           0.91        0.88     0.90
                         Avg/Total   0.92        0.92     0.92

Subtask 2 posed the same problem with 3 classes to predict, and we used a similar approach and the same feature set for training our model. With 3 classes and the larger training set of 3500 data points, we obtained an overall accuracy of 0.85 with 10-fold cross validation and an F1-score of 0.91, again the maximum for the Random Forest algorithm. A detailed summary of the performance matrix for Subtask 2 is given in Table 5.

Table 5. Subtask 2 scores summary

  Algorithm              Class       Precision   Recall   F1-Score
  Random Forest          NP          0.90        0.91     0.91
                         P           0.81        0.80     0.80
                         SP          0.83        0.82     0.82
                         Avg/Total   0.85        0.85     0.85
  Gradient Boosting      NP          0.89        0.90     0.89
                         P           0.79        0.80     0.79
                         SP          0.84        0.81     0.83
                         Avg/Total   0.85        0.85     0.85
  SVM                    NP          0.89        0.82     0.86
                         P           0.74        0.67     0.70
                         SP          0.68        0.82     0.74
                         Avg/Total   0.79        0.78     0.78
  Gaussian Naïve Bayes   NP          0.87        0.93     0.90
                         P           0.68        0.73     0.71
                         SP          0.76        0.62     0.68
                         Avg/Total   0.78        0.79     0.78

The following figures give a summarized view of the performance of the various machine learning algorithms on both subtasks.

Figure 1: Subtask 1 results comparison (precision, recall and F1-score per algorithm; figure omitted).

Figure 2: Subtask 2 results comparison (precision, recall and F1-score per algorithm; figure omitted).

5. CONCLUSION

In this paper we have proposed a novel approach for the detection of Hindi paraphrases, which is a very important building block of semantic text analysis. Question-answering systems, document clustering, knowledge extraction, plagiarism detection, the building of ontologies and similar tasks are potential applications of paraphrase identification in NLP [10]. After comparing the four machine learning algorithms used in our model, Random Forest gives the best results, with an F1-score of 0.94 for Subtask 1 and 0.91 for Subtask 2, which could be further improved by using more robust phonetic and synonym replacements. One limitation of our work is that we have not removed outliers from the training data, which could slightly improve the system performance. In future work we will include part-of-speech tagging in the feature generation, as it plays an important role in paraphrase detection: nouns and verbs are key elements of paraphrasing.

6. ACKNOWLEDGMENTS

We would like to thank the organizers for conducting this shared task and building the training data. We would also like to thank Sapient Corporation for giving us the opportunity to work in and explore the world of text analytics.

7. REFERENCES

[1] Sethi, N., Agrawal, P., Madaan, V., and Singh, S.K. July 2016. A Novel Approach to Paraphrase Hindi Sentences using Natural Language Processing. Indian Journal of Science and Technology, Vol 9(28). DOI: 10.17485/ijst/2016/v9i28/98374.

[2] Kumar, N. 2014. A Graph Based Automatic Plagiarism Detection Technique to Handle Artificial Word Reordering and Paraphrasing. In A. Gelbukh (Ed.): CICLing 2014, Part II, LNCS 8404, pp. 481-494. Springer-Verlag Berlin Heidelberg.

[3] Xu, W., Callison-Burch, C., and Dolan, W. B. 2015. SemEval-2015 Task 1: Paraphrase and Semantic Similarity in Twitter (PIT). In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 1-11, Denver, Colorado, June 4-5. Association for Computational Linguistics.

[4] https://www.aclweb.org/aclwiki/index.php?title=Paraphrase_Identification_(State_of_the_art)

[5] http://www.cfilt.iitb.ac.in/wordnet/webhwn/downloaderInfo.php

[6] Zhang, W., Zeng, F., Wu, X., Zhang, X., and Jiang, R. 2009. A comparative study of ensemble learning approaches in the classification of breast cancer metastasis. In International Joint Conference on Bioinformatics, Systems Biology and Intelligent Computing.

[7] Banfield, R.E., Hall, L.O., Bowyer, K.W., and Kegelmeyer, W.P. January 2007. A Comparison of Decision Tree Ensemble Creation Techniques. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 29, No. 1.

[8] Shanmuga Sundaram, M., Anand Kumar, M., and Soman, K.P. 2015. AMRITA_CEN@SemEval-2015: Paraphrase Detection for Twitter using Unsupervised Feature Learning with Recursive Autoencoders. SemEval-2015 (2015): 45.

[9] Mahalakshmi, S., Anand Kumar, M., and Soman, K.P. 2015. Paraphrase detection for Tamil language using deep learning algorithm. International Journal of Applied Engineering Research, 10 (17), pp. 13929-13934.

[10] Socher, R., Huang, E. H., Pennin, J., Manning, C.D., and Ng, A.Y. 2011. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In Advances in Neural Information Processing Systems (pp. 801-809).

[11] Anand Kumar, M., Singh, S., Kavirajan, B., and Soman, K.P. 2016. DPIL@FIRE2016: Overview of the shared task on Detecting Paraphrases in Indian Languages. Working notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10. CEUR Workshop Proceedings, CEUR-WS.org.