KEC@DPIL-FIRE2016: Detection of Paraphrases on Indian Languages

R. Thangarajan (Professor), S. V. Kogilavani (Assistant Professor (SrG)), A. Karthic (UG Student), S. Jawahar (UG Student)
Department of CSE, Kongu Engineering College, Erode, Tamilnadu
+919443014942, +919486153223, +919443840800, +918973744171
rt.cse@kongu.edu, kogilavani.sv@gmail.com, karthic1011@gmail.com, jawahar273@gmail.com

ABSTRACT
This paper presents a report on Detecting Paraphrases in Indian Languages (DPIL), in particular the Tamil language, by the team NLP@KEC of Kongu Engineering College. Automatic paraphrase detection is an intellectual task with immense applications such as plagiarism detection and new event detection. A paraphrase is the expression of a given fact in more than one way by means of different phrases. Paraphrase identification is a classic natural language processing task of the classification type. Though there are several algorithms for paraphrase identification, reflecting the semantic relations between the constituent parts of a sentence plays a very important role. In this paper we utilize sixteen different features to best represent the similarity between sentences. The proposed approach applies machine learning algorithms, Support Vector Machine and Maximum Entropy, to classify a given sentence pair into Paraphrase and Not-a-Paraphrase for task1, and into Paraphrase, Not-a-Paraphrase and Semi-Paraphrase for task2. The performance of these methods is measured with evaluation parameters such as accuracy, precision, recall, f-measure and macro f-measure. Our methodology obtained 2nd place in the DPIL evaluation track.

Keywords
Natural Language Processing; Paraphrase Identification; Machine Learning Approach; Support Vector Machine; Maximum Entropy; Shallow Parser.

1. INTRODUCTION
Variability of semantic expression is a fundamental phenomenon of natural language wherein the same meaning can be expressed by, or inferred from, different texts. Paraphrases are alternative ways to express or convey the same information. One can express a single event in many different ways in natural language sentences, depending on the command the writer or speaker has of the language in consideration. Identifying paraphrases is an important task that is used in information retrieval, question answering, text summarization and plagiarism detection. This work mainly focuses on the detection of paraphrases in the Tamil language. For example, the sentences in Table 1 express the same meaning; therefore, they are paraphrases.

Table 1 Sample Sentence Pair in the Tamil Language
கேரளமாநில்திு்ூி்ூட்மாணி்ே் கோயி்திுவிழாுவ்ேியு.
ூட்மாணி்ே்கோயி்திுவிழாகோலாேலமாேுவ்ேியு.

Our proposed system applies two supervised machine learning approaches, Support Vector Machine (SVM) and Maximum Entropy (ME), and learns classifiers based on sixteen features, such as lexical and POS-tagging features, in order to detect paraphrase sentences. The structure of the report is as follows: Section 2 describes the literature review, Section 3 gives the task description, Section 4 presents the overview of the proposed system, Section 5 presents the performance evaluation results and Section 6 concludes the work.

2. LITERATURE REVIEW
In this section, recent research carried out in the field of paraphrase identification in general, and paraphrase identification in the Tamil language in particular, is discussed. Many researchers on paraphrase identification make use of existing Natural Language Processing (NLP) tools to identify paraphrases. [2] exploits the NLP tools of a QA system to identify paraphrases.
[3-6] have employed lexical semantic similarity information based on resources such as WordNet [7]. Tamil paraphrase detection has also been attempted with a deep learning method [18].

A properly written paraphrase expresses the ideas of a fact or event in words and sentence structure inherent to the writer or speaker. It is similar to summarization, but a key difference is that a paraphrase includes both key points and sub-points; because it includes detailed information, it can sometimes be as long as the original source. A paraphrase typically explains or clarifies the text that is being paraphrased. This greatly adds to the difficulty of detecting paraphrases. Paraphrase detection is the task of determining whether two or more sentences represent the same meaning or not [1]. Paraphrase detection systems improve the performance of paraphrase generation by choosing the best sentence from a list of candidate paraphrases; plagiarism detection is another task which needs paraphrase identification. The ability to identify paraphrases, in which one sentence expresses the same meaning as another but with different words, has proven useful for a wide variety of natural language processing applications. Among the different approaches, supervised methods [8-9] are the most promising.

Most existing systems use thresholds to determine whether two sentences are similar and represent the same meaning, but the choice of an exact threshold depends on the training data. Machine Learning (ML) techniques have been applied to overcome the problems of setting such thresholds; the benefit of the ML approach lies in exploiting the morphological, syntactic and semantic features of a sentence. ME: the training data are used to set constraints on a conditional distribution [14].
Each constraint expresses a characteristic of the training data; these constraints are then applied to the test data. The results obtained from this analysis are compared using different performance evaluation measures.

In general, supervised and unsupervised machine learning techniques are both useful in paraphrase detection. In supervised learning, the dataset is labeled and a model is trained to obtain a reasonable output which helps in proper decision making. Unlike supervised learning, unsupervised learning does not need any labeled data; however, such data cannot be processed as easily. This report presents the impact of two supervised learning methods on the given dataset.

3. TASK DESCRIPTION
One of the most commonly used corpora for paraphrase detection is the Microsoft Research Paraphrase (MSRP) corpus [10], which contains 5,801 English sentence pairs from news articles, manually labeled with 67% paraphrases and 33% non-paraphrases. Since there are no annotated corpora or automated semantic interpretation systems available for Indian languages to date, creating benchmark data for paraphrases and utilizing that data in open shared-task competitions will motivate the research community towards further research in Indian languages [11]. We participated in the DPIL task, which focuses on sentence-level paraphrase identification for Indian languages (Tamil, Malayalam, Hindi and Punjabi). In this context, the task is divided into two subtasks.

3.1 Sub Task 1
Given a pair of sentences, the system is required to assess whether the two sentences carry the same meaning or not, and to classify them as Paraphrase (P) or Not Paraphrase (NP).
3.2 Sub Task 2
Given two sentences from the newspaper domain, the task is to identify whether they are completely equivalent (P), roughly equivalent (SP) or not equivalent (NP). This task is similar to subtask 1, but the main difference is the 3-point scale of paraphrase tags.

The training data set given to us is the News Corpus, which contains 2,500 Tamil sentence pairs for subtask1 and 3,500 Tamil sentence pairs for subtask2. The test set consisted of 900 Tamil sentence pairs for subtask1 and 1,400 Tamil sentence pairs for subtask2. Both the training and test corpora are available at the link specified in [12].

4. OVERVIEW OF PROPOSED SYSTEM
The subtask1 and subtask2 datasets are processed using a Tamil shallow parser. The processed results are in text format; for classification of sentences using the machine learning algorithms, the text values are converted into a numerical matrix, which is then given as input to SVM and ME for further classification.

Figure 1: Overview of the proposed system (subtask1/subtask2 datasets, shallow parsing, features file construction, classification using SVM and Maximum Entropy, results for subtask1 and subtask2)

4.1 SHALLOW PARSING
Shallow parsing, also called chunking or light parsing, is an analysis of a sentence which first identifies basic parts of sentences such as nouns, verbs and adjectives, and then links them to higher-order units that have discrete grammatical meanings. While the most elementary chunking algorithms simply link constituent parts on the basis of elementary search patterns, approaches that use ML techniques can take contextual information into account and thus compose chunks in such a way that they better reflect the semantic relations between the basic constituents. That is, these more advanced methods get around the problem that combinations of elementary constituents can have different higher-level meanings depending on the context of the sentence. The proposed system utilizes the Tamil shallow parser developed by IIIT Hyderabad [15]. For example, the following Tamil sentence
செய்கே்கோ்ேகள ஏவி இ்ு இ்கரா ுதிய ொதகை பகட்ு்ளு.

would be chunked as shown in Table 2.

Table 2 Chunking Result of the Sample Sentence
1.1 செய்கே் NN
1.2 கோ்ேகள NN
))
4.1 ுதிய JJ
))
5.2 . SYM

4.2 FEATURES FILE CONSTRUCTION
From the output of the shallow parsing process, a features file is constructed for both the training and test datasets of subtask1 and subtask2. The sixteen features are counts derived from the parser output, covering POS tags (NNP, NN, JJ, RB, VM), case markers (nom, acc, dat, gen, loc, soc, ben), tense (past, present, future) and a count feature. A sample features file is presented in Table 3.

Table 3 Sample Features File (Id, sixteen feature counts, paraphrase class)
TAM0001  0 10 2 0 2 0 2 3 0 0 2 2 11 0 6 6  P
TAM0002  0  8 1 0 2 0 0 1 0 0 1 1  9 0 2 4  P
TAM0003  0 12 1 2 3 0 0 0 0 5 0 0 19 0 5 6  P
TAM0004  0 15 2 0 1 0 3 2 0 3 2 0 20 0 3 4  P
TAM0005  0 11 0 1 0 0 0 0 0 0 1 3 16 0 1 7  P
TAM0006  0 13 3 2 1 0 0 0 0 1 0 0 17 0 7 3  P
TAM0007  0  6 0 0 1 0 1 3 1 0 0 0 14 0 2 2  P
TAM0008  0  5 1 1 1 0 0 3 0 2 1 0 12 0 2 7  P
TAM0009  0  8 0 1 0 0 1 0 0 3 1 1 12 0 2 4  P
TAM0010  0  7 0 0 2 1 0 4 0 0 0 1 15 0 3 7  P

4.3 TEXT CLASSIFICATION USING MACHINE LEARNING ALGORITHMS
When supervised machine learning algorithms are considered for classification, the input dataset is desired to be a labeled one. SVM: the given data are analyzed and decision boundaries are defined using hyperplanes; in the two-category case, the hyperplane separates the document vectors of one class from those of the other, and the separation is maintained to be as large as possible [13]. In multi-class sentence classification (subtask2), each sentence pair is classified with a label C, where C = {P, NP, SP}: P specifies that the sentence pair is a paraphrase, NP denotes that it is not a paraphrase, and SP specifies that it is a semi-paraphrase. In this study, two different supervised learning techniques, SVM and ME, are applied for classification.
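As a concrete illustration of this setup, the sketch below trains both classifiers on a few rows of sixteen feature counts in the Table 3 layout. It uses scikit-learn stand-ins, which are assumptions not named in the original work: SVC for the SVM, and LogisticRegression, which is a maximum-entropy classifier. The NP feature rows are invented for illustration, since Table 3 shows only P examples.

```python
# Hypothetical sketch: SVM and maximum-entropy (logistic regression)
# classifiers over 16-dimensional count features (Table 3 layout).
# scikit-learn is an assumed stand-in for the tools used in the paper.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Each row: sixteen POS/morphological counts for one sentence pair.
X_train = np.array([
    [0, 10, 2, 0, 2, 0, 2, 3, 0, 0, 2, 2, 11, 0, 6, 6],  # from Table 3
    [0,  8, 1, 0, 2, 0, 0, 1, 0, 0, 1, 1,  9, 0, 2, 4],  # from Table 3
    [0,  3, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,  2, 0, 1, 1],  # invented NP row
    [0,  2, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,  1, 0, 0, 2],  # invented NP row
])
y_train = ["P", "P", "NP", "NP"]  # subtask1: binary labels

svm_clf = SVC(kernel="linear").fit(X_train, y_train)
me_clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Classify an unseen feature vector with both models.
X_test = np.array([[0, 9, 1, 0, 2, 0, 1, 2, 0, 0, 1, 1, 10, 0, 4, 5]])
print(svm_clf.predict(X_test), me_clf.predict(X_test))
```

For subtask2, the same pipeline applies with the three-way label set {P, NP, SP}; both classifiers handle the multi-class case directly.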
Classification of sentences may be categorized into two types, i.e., binary sentence classification and multi-class sentence classification [16]. For the given dataset, in binary classification (subtask1), each sentence pair is classified with a label C, where C = {P, NP}: P denotes that the given sentence pair is a paraphrase and NP denotes that it is not a paraphrase.

4.3.1 SVM Classification Method
SVM is based on the structural risk minimization principle from computational learning theory. This method analyzes data and defines decision boundaries by means of hyper-planes. In a binary classification problem, the hyper-plane separates the vectors of one class from those of the other, and the separation between the classes is desired to be kept as large as possible. One property of SVMs is that their ability to learn can be independent of the dimensionality of the feature space: SVMs measure the complexity of hypotheses based on the margin with which they separate the data, not the number of features [17]. Since SVM requires input in the form of a vector of numbers, the constructed features file is given as input to the SVM. Table 5 and Table 6 present the confusion matrices for the subtask1 and subtask2 sentences.

Table 5 Confusion Matrix for Subtask1 Sentences
                   Actual Status (SVM)    Actual Status (ME)
Predicted status   NP     P               NP     P
NP                 409    129             420    120
P                  117    245             106    254

4.3.2 ME Classification Method
ME is a general technique for estimating probability distributions from data. The over-riding principle in maximum entropy is that when nothing is known, the distribution should be as uniform as possible, that is, have maximal entropy. Labeled training data are used to derive a set of constraints for the model that characterize the class-specific expectations for the distribution.
Constraints are represented as expected values of "features", where a feature is any real-valued function of an example. The improved iterative scaling algorithm finds the maximum entropy distribution that is consistent with the given constraints.

Due to the minimal assumptions that the ME classifier makes, it is regularly used when nothing is known about the prior distributions and when it is unsafe to make any such assumptions. Moreover, the ME classifier is used when conditional independence of the features cannot be assumed. This is particularly true in text classification problems, where the features are usually words, which obviously are not independent. The ME method requires more time to train compared to Naive Bayes, primarily due to the optimization problem that must be solved to estimate the parameters of the model. Nevertheless, after computing these parameters, the method provides robust results and is competitive in terms of CPU and memory consumption. In our text classification scenario, maximum entropy estimates the conditional distribution of the class label given a pair of sentences. The entire document is represented by a features file, and the labeled training data are used to estimate the expected value on a class-by-class basis.

Table 6 Confusion Matrix for Subtask2 Sentences
                   Actual Status (SVM)        Actual Status (ME)
Predicted status   NP     P      SP           NP     P      SP
NP                 499    47     125          514    43     116
P                  51     113    150          44     131    164
SP                 95     113    207          87     99     202

In the SVM method, out of 900 sentence pairs, 409 NP pairs are correctly classified as NP, while 117 NP pairs are misclassified as P. Similarly, 245 P pairs are correctly classified as P, while 129 P pairs are misclassified as NP. We believe that this misclassification is mainly due to the higher lexical similarity in false paraphrase pairs, which makes them hard to differentiate from true paraphrase pairs.
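As a sanity check on the reported numbers, reading Table 5 in the Table 4 layout (rows are predicted labels, columns are actual labels) and plugging the SVM counts into the Section 5 formulas reproduces the published scores; a minimal sketch in plain Python:

```python
# Subtask1 SVM counts from Table 5, read in the Table 4 layout:
# rows are predicted labels, columns are actual labels.
tn, fn = 409, 129   # predicted NP: actually NP, actually P
fp, tp = 117, 245   # predicted P:  actually NP, actually P

precision = tp / (tp + fp)                                  # equation (1)
recall = tp / (tp + fn)                                     # equation (2)
f_measure = 2 * precision * recall / (precision + recall)   # equation (3)
accuracy = (tp + tn) / (tp + tn + fp + fn)                  # equation (4)

print(round(precision, 2), round(recall, 2),
      round(f_measure, 2), round(accuracy, 2))
# -> 0.68 0.66 0.67 0.73, matching the SVM-P row of Table 7
#    and the SVM Sub Task1 accuracy in Table 9
```

The same reading applied to the ME columns of Table 5 yields the ME rows of Table 7 (precision 0.71, recall 0.68) and the ME Sub Task1 accuracy of 0.75 in Table 9, confirming that the tables are mutually consistent.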
In the ME method, out of 900 sentence pairs, 420 NP pairs are correctly classified as NP, while 106 NP pairs are misclassified as P. Similarly, 254 P pairs are correctly classified as P, while 120 P pairs are misclassified as NP. The ME system thus performs fairly well compared to the SVM method at identifying true paraphrase pairs, as shown in Table 5 for subtask1 and Table 6 for subtask2.

5. PERFORMANCE EVALUATION PARAMETERS AND RESULTS
The parameters used to evaluate the performance of a supervised machine learning algorithm are based on the elements of a matrix known as the confusion matrix or contingency table. From the classification point of view, the terms "True Positive" (TP), "False Positive" (FP), "True Negative" (TN) and "False Negative" (FN) are used to compare the class labels in this matrix, as shown in Table 4.

Table 4 Confusion Matrix Format
                                Actual Status
Predicted Status        Not a Paraphrase       Paraphrase
Not a Paraphrase        TN (True Negative)     FN (False Negative)
Paraphrase              FP (False Positive)    TP (True Positive)

In Table 4, True Positive counts the sentence pairs which are paraphrases and are also classified as paraphrases by the classifier, whereas False Positive counts pairs which are not paraphrases but which the classifier classifies as paraphrases. Similarly, True Negative counts the pairs which are not paraphrases and are also classified as not paraphrases, whereas False Negative counts pairs which are paraphrases but which the classifier classifies as not paraphrases.

Based on the values obtained from the confusion matrix, parameters such as "precision", "recall", "f-measure" and "accuracy" are computed to evaluate the performance of a classifier.

• Precision: It measures the exactness of the classifier result. It is the ratio of the number of pairs correctly labeled as paraphrase to the total number of pairs labeled as paraphrase. It can be calculated using equation 1.

Precision = TP / (TP + FP)    (1)

• Recall: It measures the completeness of the classifier result. It is the ratio of the number of pairs correctly labeled as paraphrase to the total number of pairs which are truly paraphrases. It can be calculated using equation 2.

Recall = TP / (TP + FN)    (2)

• F-Measure: It is the harmonic mean of precision and recall; it is useful when the system needs to be optimized towards either precision or recall, whichever has more influence on the final result. It can be calculated using equation 3.

F-Measure = 2 × (Precision × Recall) / (Precision + Recall)    (3)

• Accuracy: It is the most common measure of the classification process. It is the ratio of correctly classified sentence pairs to the total number of pairs. It can be calculated using equation 4.

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (4)

• F1 Macro Measure: It can be used to assess how the system performs overall across the classes; it is the unweighted average of the per-class F1 scores.

Tables 7 and 8 present the precision, recall and F-measure summary for subtask1 and subtask2 in identifying paraphrase and not-a-paraphrase sentences. In Table 7, SVM-NP stands for the SVM used to identify Not Paraphrase (NP) sentences and SVM-P stands for the SVM used to identify Paraphrase (P) sentences. Similarly, ME-NP stands for the ME classifier used to identify Not Paraphrase (NP) sentences and ME-P stands for the ME classifier used to identify Paraphrase (P) sentences. The results show that the precision, recall and F-measure values are high for Not Paraphrase identification by both the SVM and ME systems.

Table 7 Precision, Recall, F-Measure Summary for Subtask1
Method     Precision   Recall   F-Measure
SVM – NP   0.76        0.78     0.77
SVM – P    0.68        0.66     0.67
ME – NP    0.78        0.80     0.79
ME – P     0.71        0.68     0.69

In Table 8, SVM-SP stands for the SVM used to identify Semi-Paraphrase (SP) sentences and ME-SP stands for the ME classifier used to identify Semi-Paraphrase (SP) sentences. The results show that both systems identify sentences best in the order NP, SP and P.

Table 8 Precision, Recall, F-Measure Summary for Subtask2
Method     Precision   Recall   F-Measure
SVM – NP   0.74        0.77     0.75
SVM – P    0.36        0.41     0.38
SVM – SP   0.50        0.44     0.47
ME – NP    0.76        0.80     0.78
ME – P     0.39        0.48     0.43
ME – SP    0.52        0.42     0.46

The comparative analysis based on the results obtained using the proposed approaches is shown in Table 9. It can be seen that the accuracy obtained using the ME method is better than that of SVM, owing to the dependent features and also the high dimensionality and sparseness of the text data.

Table 9 Accuracy of Machine Learning Approaches
Method/Task       Accuracy   F1 Measure / F1 Macro Measure
SVM – Sub Task1   0.73       0.72
SVM – Sub Task2   0.59       0.53
ME – Sub Task1    0.75       0.74
ME – Sub Task2    0.61       0.56

6. CONCLUSION
This work makes an attempt to classify given sentence pairs as paraphrases or not using two supervised machine learning algorithms, SVM and ME. We utilize sixteen different semantic features to best represent the similarity between sentences. The two algorithms classify a given sentence pair into Paraphrase (P) and Not-a-Paraphrase (NP) for task1, and into Paraphrase (P), Not-a-Paraphrase (NP) and Semi-Paraphrase (SP) for task2. The accuracy and performance of these methods are measured on the basis of parameters such as accuracy, precision, recall, f-measure and macro f-measure. The results show that the ME method outperforms SVM in identifying paraphrases.

7. REFERENCES
[1] Qayyum, Z., and Altaf, W. 2012. Paraphrase Identification using Semantic Heuristic Features. Research Journal of Applied Sciences, Engineering and Technology, 4(22), pp. 4894-4904.
[2] Duclaye, F., Yvon, F., Collin, O., and Cedex, L. 2002. Using the Web as a Linguistic Resource for Learning Reformulations Automatically. In Proceedings of the Third International Conference on Language Resources and Evaluation, pp. 390-396.
[3] Finch, A., Hwang, Y., and Sumita, E. 2005. Using Machine Translation Evaluation Techniques to Determine Sentence-level Semantic Equivalence. In Proceedings of the Third International Workshop on Paraphrasing, pp. 17-24.
[4] Mihalcea, R., Corley, C., and Strapparava, C. 2006. Corpus-based and Knowledge-based Measures of Text Semantic Similarity. In Proceedings of the 21st National Conference on Artificial Intelligence, Vol. 1, pp. 775-780.
[5] Fernando, S., and Stevenson, M. 2008. A Semantic Similarity Approach to Paraphrase Detection. In Proceedings of the 11th Annual Research Colloquium of the UK Special Interest Group for Computational Linguistics, pp. 45-52.
[6] Malakasiotis, P. 2009. Paraphrase Recognition using Machine Learning to Combine Similarity Measures. In Proceedings of the ACL-IJCNLP 2009 Student Research Workshop, pp. 27-35.
[7] Miller, G. A. 1995. WordNet: A Lexical Database for English. Communications of the ACM, 38(11).
[8] Madnani, N., and Chodorow, M. 2012. Re-examining Machine Translation Metrics for Paraphrase Identification. In Proceedings of NAACL HLT 2012, pp. 182-190.
[9] Socher, R., Huang, E., and Manning, C. D. 2011. Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection. In Advances in Neural Information Processing Systems, pp. 1-9.
[10] Dolan, W. B., and Brockett, C. 2005. Automatically Constructing a Corpus of Sentential Paraphrases. In Proceedings of IWP 2005.
[11] Anand Kumar, M., Singh, S., Kavirajan, B., and Soman, K. P. 2016. DPIL@FIRE2016: Overview of the Shared Task on Detecting Paraphrases in Indian Languages. Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India.
[12] nlp.amrita.edu/dpil_cen/index.html#dataset
[13] Hsu, C. W., Chang, C. C., and Lin, C. J. 2003. A Practical Guide to Support Vector Classification. Technical report.
[14] Nigam, K., Lafferty, J., and McCallum, A. 1999. Using Maximum Entropy for Text Classification. In IJCAI-99 Workshop on Machine Learning for Information Filtering, pp. 61-67.
[15] http://ltrc.iiit.ac.in/analyzer/tamil/
[16] Tang, H., Tan, S., and Cheng, X. 2009. A Survey on Sentiment Detection of Reviews. Expert Systems with Applications, 36(7), pp. 10760-10773.
[17] Zhang, J., Zhao, H., Zhang, L., and Lu, B.-L. 2011. An Empirical Comparative Study on Two Large-scale Hierarchical Text Classification Approaches. International Journal of Computer Processing of Languages.
[18] Mahalakshmi, S., Anand Kumar, M., and Soman, K. P. 2015. Paraphrase Detection for Tamil Language using Deep Learning Algorithm. International Journal of Applied Engineering Research, 10(17), pp. 13929-13934.