=Paper=
{{Paper
|id=Vol-1737/T6-6
|storemode=property
|title=HIT2016@DPIL-FIRE2016: Detecting Paraphrases in Indian Languages based on Gradient Tree Boosting
|pdfUrl=https://ceur-ws.org/Vol-1737/T6-6.pdf
|volume=Vol-1737
|authors=Leilei Kong,Kaisheng Chen,Liuyang Tian,Zhenyuan Hao,Zhongyuan Han,Haoliang Qi
|dblpUrl=https://dblp.org/rec/conf/fire/KongCTHHQ16
}}
==HIT2016@DPIL-FIRE2016: Detecting Paraphrases in Indian Languages based on Gradient Tree Boosting==
* Leilei Kong (corresponding author): (1) College of Information and Communication Engineering, Harbin Engineering University, Harbin, China; (2) School of Computer Science and Technology, Heilongjiang Institute of Technology, Harbin, China; +86 451 88028910; kongleilei1979@gmail.com
* Kaisheng Chen: School of Computer Science and Technology, Heilongjiang Institute of Technology, Harbin, China; +86 451 88028910; kaishengchen1997@outlook.com
* Liuyang Tian: College of Information and Communication Engineering, Harbin Engineering University, Harbin, China; +86 451 88028910; tianliuyang2016@outlook.com
* Zhenyuan Hao: School of Computer Science and Technology, Heilongjiang Institute of Technology, Harbin, China; +86 451 88028910; zhenyuan_hao@163.com
* Zhongyuan Han: School of Computer Science and Technology, Heilongjiang Institute of Technology, Harbin, China; +86 451 88028910; Hanzhongyuan@gmail.com
* Haoliang Qi: School of Computer Science and Technology, Heilongjiang Institute of Technology, Harbin, China; +86 451 88028910; haoliang.qi@gmail.com

===ABSTRACT===
Detecting paraphrases is an important and challenging task. It can be used in paraphrase generation and extraction, machine translation, question answering and plagiarism detection. Because a paraphrase expresses the meaning of a sentence in another sentence using different words, traditional methods based on lexical similarity become ineffective. In this paper, we describe our strategy for Detecting Paraphrases in Indian Languages, a workshop track proposed by the Forum for Information Retrieval Evaluation 2016. We formalize the task as a classification problem and use a supervised learning method based on Gradient Tree Boosting to classify the types of paraphrase. Inspired by the METEOR evaluation metric for machine translation, we use METEOR-like features in the classifier. In the evaluation, our approach achieved the highest Overall Score (0.77), the highest F1 measure for both Task1 and Task2 on Malayalam and Tamil, and the highest F1 measure on Punjabi Task2 in the 2016 FIRE Detecting Paraphrases in Indian Languages task.

'''CCS Concepts:''' Information systems → Information retrieval

'''Keywords:''' Paraphrase; Classification; Indian Languages; Gradient Tree Boosting.

===1. INTRODUCTION===
Paraphrase detection has attracted the attention of researchers in recent years. It is widely used in paraphrase generation and extraction, machine translation, question answering and plagiarism detection.

In the task description of Detecting Paraphrases in Indian Languages of the Forum for Information Retrieval Evaluation 2016 (FIRE 2016, http://nlp.amrita.edu/dpil_cen/), a paraphrase is defined as “the same meaning of a sentence is expressed in another sentence using different words”. The task focuses on sentence-level paraphrase identification for Indian languages (Tamil, Malayalam, Hindi and Punjabi). FIRE proposes two subtasks. The first is: given a pair of sentences from the newspaper domain, classify them as paraphrases (P) or not paraphrases (NP). The second is: given two sentences from the newspaper domain, identify whether they are completely equivalent (E), roughly equivalent (RE) or not equivalent (NE) [6].

Paraphrased sentences retain the semantic meaning of the original but are usually obfuscated by manipulating the text and changing most of its appearance. Words in the original sentence are replaced with synonyms or antonyms, and short phrases are inserted to change the appearance, but not the idea, of the text (Alzahrani et al., 2012). In addition, sentence reduction, combination, restructuring, paraphrasing, concept generalization and concept specification are also used to paraphrase a sentence. All of these operations make paraphrase identification difficult, because it involves semantic similarity, lexical comprehension, syntactic identification, morphological analysis, and so on.

Since the appearance of a paraphrased sentence may have changed beyond recognition, methods that rely only on term matching or on a single feature may become ineffective. More features should be integrated into the model, so we consider a machine learning method based on classification to address the problem. Intuitively, the first subtask can be viewed as a two-category classification and the second as a multi-category classification. Once paraphrase detection is formalized as a classification problem, our objectives focus on answering two questions: (1) which classification-based methods can be applied effectively to paraphrase detection, and (2) which features should be used in the classifier.

For the first question, we choose Gradient Tree Boosting to learn the classifier [2,3]. For the second, inspired by the METEOR evaluation metric for machine translation [4], we design METEOR-like features for our classifier. Integrating some classical similarity measures, we develop the feature set.

Using the training and testing corpora of Detecting Paraphrases in Indian Languages proposed by FIRE, we evaluate various aspects of our classification method for detecting paraphrases. Experimental results show that the proposed method can effectively classify paraphrase pairs.

The rest of this paper is organized as follows. In Section 2, we analyze the problem of Detecting Paraphrases in Indian Languages, introduce the model we used, and describe the features which the classifier uses.
In Section 3, we report the experimental results and compare performance with the other detection methods. In the last section, we conclude our study.

===2. CLASSIFICATION FOR DPIL===
We now explore machine-learning methods for Detecting Paraphrases in Indian Languages (DPIL). In this section, we first analyze the main issues of the task. Then we propose a classification method based on boosting trees. Finally, we describe the features the classifier uses.

====2.1 Problem Analysis====
As discussed in the previous section, paraphrases are difficult to detect. Traditional similarity measures, such as Cosine Distance, Jaccard Coefficient and Dice Distance, may be ineffective for paraphrases. Figure 1 shows an example of a paraphrase.

Figure 1. A paraphrase case.

From Figure 1, we can see that two sentences in a paraphrase relationship differ in their appearance. Furthermore, we analyzed 1000 randomly selected paraphrase cases from the Malayalam sub corpus and from the corpora of all four languages. Figure 2 displays the distribution, with Jaccard Coefficient and METEOR-F1 as the y-coordinate.

Figure 2. Score distribution of Jaccard coefficient on the Malayalam sub corpus (top) and on the corpora of all four languages (bottom).

Figure 2 shows that the Jaccard coefficient scores are all very low; the average score is only 0.1332. Since the two sentences share few terms, considering term similarity alone may be inadequate. Our analysis therefore suggests that more features should be considered to identify the relationship between the two sentences.

====2.2 Problem Definition====
Following the description of the paraphrase detection task, we formalize the problem as follows. Denote a pair of sentences as <math>s_i = (o_i, p_i)</math>, where <math>o_i</math> is the original sentence and <math>p_i</math> is the paraphrased sentence. For every pair <math>(o_i, p_i)</math> in the training data we know its label, which makes it possible to learn a classification model. Let the training corpus be <math>D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_i, y_i), \ldots, (x_N, y_N)\}</math>, where <math>x_i \in \mathbb{R}^n</math> is the feature vector of <math>s_i</math> and <math>x_i = (x_i^{(1)}, x_i^{(2)}, \ldots, x_i^{(n)})^T</math>, <math>i = 1, 2, \ldots, N</math>. Each <math>x_i</math> is obtained with a function defined as follows:

:<math>x_i = \Phi(o_i, p_i) \qquad (1)</math>

where <math>\Phi(o_i, p_i)</math> is a mapping onto features that describes the paraphrase relation between the i-th original sentence <math>o_i</math> and the paraphrased sentence <math>p_i</math>.

<math>y_i</math> is the label of <math>x_i</math> and denotes its category. For Task 1 we define <math>y_i \in \{P, NP\}</math>, and for Task 2 we define <math>y_i \in \{E, RE, NE\}</math>. The framework of the learning problem is depicted in Figure 3.

Figure 3. The framework of paraphrase detection.

Given <math>D</math> as training data, the learning system learns a conditional probability <math>P(Y|X)</math>. Then, given a new input <math>x_{N+1}</math>, the classification system outputs the corresponding label <math>y_{N+1}</math> according to the learned classifier.

====2.3 Classification Model: Gradient Tree Boosting====
Boosting trees are among the best methods for improving the performance of statistical learning [2,3]. In this experiment, we use Gradient Tree Boosting as the classification algorithm to learn the classifier. Gradient boosting is typically used with decision trees (especially CART trees) of a fixed size as base learners.

====2.4 Features====
Two groups of features, the similarity-based features and the METEOR-like features, are used to define <math>\Phi(o_i, p_i)</math>. The similarity-based features capture the degree of matching between <math>o_i</math> and <math>p_i</math>, and the METEOR-like features describe their semantic similarity. The METEOR-like features are inspired by METEOR, a metric for machine translation that is used to evaluate the performance of a translator. Table 1 lists these features in detail.

{| class="wikitable"
|+ Table 1. Features for detecting paraphrases
! Features !! Computing methods !! Description
|-
| Jaccard Coefficient || <math>JC(s_i, r_j) = \frac{|s_i \cap r_j|}{|s_i \cup r_j|}</math> || The ratio of the number of shared terms to the total number of terms.
|-
| Cosine Similarity || <math>CS(x_i, y_i) = \frac{x_i \cdot y_i}{\|x_i\| \, \|y_i\|}</math> || <math>x_i \cdot y_i</math> is the inner product of the two vectors, and <math>\|x\|</math> is the length of vector <math>x</math>.
|-
| Dice Coefficient || <math>DC(s, r) = \frac{2 \cdot common(s, r)}{len(s) + len(r)}</math> || <math>common(s, r)</math> is the total number of common unigrams in s and r; <math>len(r)</math> and <math>len(s)</math> are the total numbers of unigrams in r and s.
|-
| METEOR Precision || <math>P = \frac{common(s, r)}{len(r)}</math> || <math>common(s, r)</math> is the total number of common unigrams in s and r, and <math>len(r)</math> is the total number of unigrams in r.
|-
| METEOR Recall || <math>R = \frac{common(s, r)}{len(s)}</math> || <math>len(s)</math> is the total number of unigrams in s.
|-
| METEOR F1 || <math>F_1 = \frac{2PR}{R + P}</math> || Combines the precision and recall.
|-
| METEOR Fmean || <math>F_{mean} = \frac{10PR}{R + 9P}</math> || Combines the precision and recall.
|-
| METEOR Penalty || <math>Penalty = 0.5 \left( \frac{len(chunks)}{common(s, r)} \right)^3</math> || <math>len(chunks)</math> is the number of chunks, where each chunk is a longest run of adjacent matched unigrams.
|-
| METEOR score || <math>Score = F_{mean} (1 - Penalty)</math> || The overall METEOR score.
|}

===3. EXPERIMENTS===
====3.1 Dataset====
The evaluation dataset is the Detecting Paraphrases in Indian Languages (DPIL) corpus, which is mainly obtained from newspapers. Details of the corpus can be found at http://nlp.amrita.edu/dpil_cen/.

The corpus is divided into two subsets, the Task1-set and the Task2-set, and each subset covers four Indian languages: Tamil, Malayalam, Hindi and Punjabi. The Task1-set contains 12400 samples, including 9200 training samples and 3200 test samples. The Task2-set contains 17650 samples, including 12700 training samples and 4950 test samples. The statistics of the training and test data are shown in Table 2 and Table 3.

{| class="wikitable"
|+ Table 3. Corpus statistics of DPIL 2016 on Task2
! rowspan="2" | Statistic !! colspan="5" | Train !! colspan="5" | Test
|-
! Hin !! Mal !! Pun !! Tam !! All !! Hin !! Mal !! Pun !! Tam !! All
|-
| Number of samples || 3500 || 3500 || 2200 || 3500 || 12700 || 1400 || 1400 || 750 || 1400 || 4950
|-
| Avg. blank terms || 34 || 18 || 41 || 24 || 28 || 42 || 19 || 41 || 28 || 31
|-
| Avg. 4-gram terms || 131 || 164 || 156 || 178 || 158 || 154 || 177 || 157 || 207 || 176
|}

====3.2 Experimental Settings====
=====3.2.1 Pre-processing=====
For each sentence pair in the training and test data, we first remove numbers, punctuation and blank spaces. Then we adopt two types of word segmentation: one takes each word as a term unit, and the other is based on n-grams, in which the words of the sentence are segmented into n-grams. Figure 4 shows an example with 4-grams. In the experiments, n is set empirically.

Figure 4. The example of 4-gram.

=====3.2.2 Parameter Tuning=====
On the training corpus, the classifier is trained with the scikit-learn<sup>2</sup> Gradient Boosting Classifier. The learning rate (which shrinks the contribution of each tree) is set to 1.0, max_depth (the maximum depth, which limits the number of nodes in each tree) is set to 1, and the random state (the seed used by the random number generator) is set to 0. All other classifier parameters keep their default values, except n_estimators (the number of boosting stages to perform).

The other settings, including the word segmentation method, the pre-processing method and the value of n for the n-grams, are set experimentally.

We use cross validation to tune the parameter n_estimators: the training corpus is randomly divided into two equal parts, one of which is used as training data and the other as validation data.

====3.3 Performance Measures====
In this evaluation, the experimental results are assessed according to [5], using the following four counts:
# TP: the sample is positive, and the predicted result is positive.
# FP: the sample is negative, and the predicted result is positive.
# FN: the sample is positive, and the predicted result is negative.
# TN: the sample is negative, and the predicted result is negative.
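As a hedged illustration of how the four counts of Section 3.3 (TP, FP, FN, TN) are typically tallied and combined, the sketch below counts them for a binary task (P versus NP) and derives precision, recall, accuracy and F1. The function names `confusion_counts` and `scores` and the toy labels are assumptions made for this example; they are not taken from the original system.

```python
from typing import Dict, Sequence


def confusion_counts(gold: Sequence[str], pred: Sequence[str],
                     positive: str = "P") -> Dict[str, int]:
    """Tally TP, FP, FN and TN, treating `positive` as the positive class."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for g, p in zip(gold, pred):
        if p == positive:
            # Predicted positive: correct (TP) or a false alarm (FP).
            counts["TP" if g == positive else "FP"] += 1
        else:
            # Predicted negative: a miss (FN) or a correct rejection (TN).
            counts["FN" if g == positive else "TN"] += 1
    return counts


def scores(c: Dict[str, int]) -> Dict[str, float]:
    """Precision, recall, accuracy and F1 from the four counts."""
    tp, fp, fn, tn = c["TP"], c["FP"], c["FN"], c["TN"]
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}


# Toy labels, invented for illustration only.
gold = ["P", "P", "NP", "NP", "P", "NP"]
pred = ["P", "NP", "NP", "P", "P", "NP"]
counts = confusion_counts(gold, pred)
result = scores(counts)
print(counts, result)
```

The zero-denominator guards are a design choice of this sketch, so that a classifier that never predicts the positive class yields 0.0 rather than an error. For the three-class Task 2, the same counts would have to be computed per class and then averaged, and the averaging scheme is not stated in the text.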
{| class="wikitable"
|+ Table 2. Corpus statistics of DPIL 2016 on Task1
! rowspan="2" | Statistic !! colspan="5" | Train !! colspan="5" | Test
|-
! Hin !! Mal !! Pun !! Tam !! All !! Hin !! Mal !! Pun !! Tam !! All
|-
| Number of samples || 2500 || 2500 || 1700 || 2500 || 9200 || 900 || 900 || 500 || 900 || 3200
|-
| Avg. blank terms || 32 || 18 || 39 || 24 || 27 || 32 || 19 || 43 || 23 || 28
|-
| Avg. 4-gram terms || 126 || 166 || 150 || 175 || 155 || 120 || 181 || 164 || 176 || 160
|}

<sup>2</sup> http://scikit-learn.org/stable/

From the four counts TP, FP, FN and TN of Section 3.3, Precision and Recall are defined as follows:

:<math>\mathrm{precision} = \frac{TP}{TP + FP} \qquad (5)</math>

:<math>\mathrm{recall} = \frac{TP}{TP + FN} \qquad (6)</math>

The main evaluation metrics adopted by DPIL are Accuracy and the F1 measure, defined as follows:

:<math>\mathrm{accuracy} = \frac{TP + TN}{TP + FN + FP + TN} \qquad (7)</math>

:<math>F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \qquad (8)</math>

====3.4 Experimental Results====
=====3.4.1 Experimental results on sub corpora=====
Table 4 shows the experimental results released by FIRE. A hyphen marks a language for which a team has no reported result.

{| class="wikitable"
|+ Table 4(a). Experimental results on DPIL@FIRE2016, Task 1 sub corpus
! rowspan="2" | Team !! colspan="4" | Accuracy !! colspan="4" | F1 Measure
|-
! Mal !! Tam !! Hin !! Pun !! Mal !! Tam !! Hin !! Pun
|-
| HIT2016 || 0.8377 || 0.8211 || 0.8966 || 0.9440 || 0.8100 || 0.7900 || 0.8900 || 0.9400
|-
| KS_JU || 0.8100 || 0.7888 || 0.9066 || 0.9460 || 0.7900 || 0.7500 || 0.9000 || 0.9500
|-
| NLP-NITMZ || 0.8344 || 0.8333 || 0.9155 || 0.9420 || 0.7900 || 0.7900 || 0.9100 || 0.9400
|-
| JU-NLP || 0.5900 || 0.5755 || 0.8222 || 0.9420 || 0.1600 || 0.0900 || 0.7400 || 0.9400
|-
| Anuj || - || - || 0.9200 || - || - || - || 0.9100 || -
|-
| DAVPBI || - || - || - || 0.9380 || - || - || - || 0.9400
|-
| BITS-PILANI || - || - || 0.8977 || - || - || - || 0.8900 || -
|-
| NLP@KEC || - || 0.8233 || - || - || - || 0.7900 || - || -
|-
| ASE || - || - || 0.3588 || - || - || - || 0.3400 || -
|-
| CUSAT TEAM || 0.8044 || - || - || - || 0.7600 || - || - || -
|-
| CUSAT NLP || 0.7622 || - || - || - || 0.7500 || - || - || -
|}

{| class="wikitable"
|+ Table 4(b). Experimental results on DPIL@FIRE2016, Task 2 sub corpus
! rowspan="2" | Team !! colspan="4" | Accuracy !! colspan="4" | F1 Measure
|-
! Mal !! Tam !! Hin !! Pun !! Mal !! Tam !! Hin !! Pun
|-
| HIT2016 || 0.7486 || 0.7550 || 0.9000 || 0.9226 || 0.7460 || 0.7398 || 0.8984 || 0.9230
|-
| KS_JU || 0.6614 || 0.6735 || 0.8521 || 0.8960 || 0.6578 || 0.6645 || 0.8482 || 0.8960
|-
| NLP-NITMZ || 0.6243 || 0.6571 || 0.7857 || 0.8120 || 0.6068 || 0.6307 || 0.7642 || 0.8086
|-
| JU-NLP || 0.4221 || 0.5507 || 0.6857 || 0.8866 || 0.3078 || 0.4319 || 0.6841 || 0.8866
|-
| Anuj || - || - || 0.9014 || - || - || - || 0.9000 || -
|-
| DAVPBI || - || - || - || 0.7466 || - || - || - || 0.7274
|-
| BITS-PILANI || - || - || 0.7171 || - || - || - || 0.7123 || -
|-
| NLP@KEC || - || 0.6857 || - || - || - || 0.6674 || - || -
|-
| ASE || - || - || 0.3543 || - || - || - || 0.3535 || -
|-
| CUSAT TEAM || 0.5086 || - || - || - || 0.4658 || - || - || -
|-
| CUSAT NLP || 0.5207 || - || - || - || 0.5130 || - || - || -
|}

The experimental results show that the proposed method achieves the best Accuracy on Malayalam for Task 1 and on Malayalam, Tamil and Punjabi for Task 2. It also achieves the highest F1 measure for both Task1 and Task2 on Malayalam and Tamil, and the highest F1 measure on Punjabi Task2, in the 2016 FIRE Detecting Paraphrases in Indian Languages task.

=====3.4.2 Effect of word segmentation=====
For word segmentation, we use two processing methods. One segments words at spaces, and the other is based on n-grams. We compare the two kinds of word segmentation in Table 5.

{| class="wikitable"
|+ Table 5. Comparison of two different preprocessing
! rowspan="2" | Task !! rowspan="2" | Metric !! colspan="4" | 4-gram !! colspan="4" | space
|-
! Mal !! Tam !! Hindi !! Pun !! Mal !! Tam !! Hindi !! Pun
|-
| Task1 || Precision || 0.8993 || 0.9587 || 0.9235 || 0.9884 || 0.8771 || 0.9543 || 0.9340 || 0.9911
|-
| Task1 || Recall || 0.9301 || 0.9606 || 0.9187 || 0.9921 || 0.9279 || 0.9574 || 0.9289 || 0.9921
|-
| Task1 || Accuracy || 0.8957 || 0.9517 || 0.9054 || 0.9885 || 0.8785 || 0.9469 || 0.9178 || 0.9901
|-
| Task1 || F1 || 0.9143 || 0.9596 || 0.9210 || 0.9902 || 0.9017 || 0.9558 || 0.9314 || 0.9916
|-
| Task2 || Precision || 0.7298 || 0.7873 || 0.8499 || 0.9810 || 0.7135 || 0.7917 || 0.8553 || 0.9814
|-
| Task2 || Recall || 0.7370 || 0.7918 || 0.8484 || 0.9808 || 0.7227 || 0.7949 || 0.8545 || 0.9813
|-
| Task2 || Accuracy || 0.7370 || 0.7918 || 0.8484 || 0.9808 || 0.7227 || 0.7949 || 0.8545 || 0.9813
|-
| Task2 || F1 || 0.7309 || 0.7878 || 0.8483 || 0.9808 || 0.7134 || 0.7923 || 0.8541 || 0.9813
|}

From the experimental results, 4-gram segmentation achieves a higher F1 measure than space segmentation on Malayalam for both tasks and on Tamil for Task1; on the remaining settings space segmentation is marginally better, by about 0.01 or less. We therefore use the n-gram method in the following experiments on the Indian language corpora.

=====3.4.3 Effects of pre-processing=====
Our experiments compare several pre-processing methods. To investigate the contribution of each pre-processing method on each language, we analyze their effects. Taking 4-gram word segmentation as the example, Table 6 gives the experimental results. "Remove all" means that punctuation, numbers and spaces are all removed, and "Reserved X" means that X is kept and the others are removed. For example, "Reserved punctuation" means that punctuation is kept while numbers and spaces are removed.

{| class="wikitable"
|+ Table 6. Effects of pre-processing
! Language !! Task !! Metric !! Reserved punctuation !! Reserved number !! Reserved space !! Remove all
|-
| Mal || Task1 || Precision || 0.9013 || 0.8995 || 0.8992 || 0.8988
|-
| Mal || Task1 || Recall || 0.9280 || 0.9276 || 0.9325 || 0.9335
|-
| Mal || Task1 || Accuracy || 0.8956 || 0.8944 || 0.8966 || 0.8968
|-
| Mal || Task1 || F1 Measure || 0.9144 || 0.9133 || 0.9154 || 0.9157
|-
| Mal || Task2 || Precision || 0.7304 || 0.7258 || 0.7253 || 0.7289
|-
| Mal || Task2 || Recall || 0.7380 || 0.7340 || 0.7321 || 0.7362
|-
| Mal || Task2 || Accuracy || 0.7380 || 0.7340 || 0.7321 || 0.7362
|-
| Mal || Task2 || F1 Measure || 0.7316 || 0.7273 || 0.7264 || 0.7299
|-
| Tam || Task1 || Precision || 0.9585 || 0.9591 || 0.9535 || 0.9570
|-
| Tam || Task1 || Recall || 0.9593 || 0.9590 || 0.9558 || 0.9607
|-
| Tam || Task1 || Accuracy || 0.9506 || 0.9507 || 0.9455 || 0.9506
|-
| Tam || Task1 || F1 Measure || 0.9589 || 0.9590 || 0.9546 || 0.9588
|-
| Tam || Task2 || Precision || 0.7855 || 0.7874 || 0.7864 || 0.7871
|-
| Tam || Task2 || Recall || 0.7901 || 0.7915 || 0.7897 || 0.7917
|-
| Tam || Task2 || Accuracy || 0.7901 || 0.7915 || 0.7897 || 0.7917
|-
| Tam || Task2 || F1 Measure || 0.7861 || 0.7880 || 0.7866 || 0.7880
|-
| Hindi || Task1 || Precision || 0.9218 || 0.9242 || 0.9310 || 0.9230
|-
| Hindi || Task1 || Recall || 0.9136 || 0.9151 || 0.9244 || 0.9195
|-
| Hindi || Task1 || Accuracy || 0.9018 || 0.9039 || 0.9133 || 0.9054
|-
| Hindi || Task1 || F1 Measure || 0.9176 || 0.9195 || 0.9275 || 0.9211
|-
| Hindi || Task2 || Precision || 0.8490 || 0.8502 || 0.8495 || 0.8500
|-
| Hindi || Task2 || Recall || 0.8477 || 0.8481 || 0.8487 || 0.8486
|-
| Hindi || Task2 || Accuracy || 0.8477 || 0.8481 || 0.8487 || 0.8486
|-
| Hindi || Task2 || F1 Measure || 0.8475 || 0.8480 || 0.8484 || 0.8484
|-
| Pun || Task1 || Precision || 0.9909 || 0.9904 || 0.9867 || 0.9903
|-
| Pun || Task1 || Recall || 0.9914 || 0.9908 || 0.9895 || 0.9905
|-
| Pun || Task1 || Accuracy || 0.9895 || 0.9889 || 0.9859 || 0.9887
|-
| Pun || Task1 || F1 Measure || 0.9911 || 0.9906 || 0.9881 || 0.9904
|-
| Pun || Task2 || Precision || 0.9810 || 0.9774 || 0.9812 || 0.9812
|-
| Pun || Task2 || Recall || 0.9808 || 0.9772 || 0.9810 || 0.9811
|-
| Pun || Task2 || Accuracy || 0.9808 || 0.9772 || 0.9810 || 0.9811
|-
| Pun || Task2 || F1 Measure || 0.9808 || 0.9772 || 0.9810 || 0.9811
|}

According to the experimental results in Table 6, keeping or removing punctuation, numbers and spaces makes little difference. We nevertheless adopt the best-performing pre-processing method on the test dataset.

=====3.4.4 Effects of n-gram=====
To analyze the effect of n, we carry out experiments from 1-gram to 10-gram, with Precision, Recall and F1 measure as evaluation indicators. The experimental results are shown in Figure 5.

Figure 5. The effects of n-gram: (a) the experimental results on Task 1; (b) the experimental results on Task 2.

According to these experimental results, 4-gram achieves the best results, so we set n=4 on the testing corpora of DPIL 2016.

=====3.4.5 Effects of n_estimators=====
The parameter n_estimators is the number of boosting iterations performed when the classification model is trained. It is set empirically. Figure 6 shows the results on the training datasets.

Figure 6. Effects of n_estimators. Panels (a) to (d) show the experimental results of Malayalam, Tamil, Hindi and Punjabi on Task1, and panels (e) to (h) show the same four languages on Task2. Panel (a), titled dpil-mal-train-Task1, plots Accuracy and F1 Measure for n_estimators from 0 to 100.

According to Figure 6, we obtain the value of the parameter n_estimators for each language. Details are shown in Table 7, which is used on the testing datasets of DPIL.
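The tuning procedure of Sections 3.2.2 and 3.4.5 (split the training data into two equal halves, then vary n_estimators) might look roughly like the sketch below. It uses scikit-learn's `GradientBoostingClassifier` with the settings stated in Section 3.2.2 (`learning_rate=1.0`, `max_depth=1`, `random_state=0`). Everything else is an assumption made for illustration: the synthetic data from `make_classification` stands in for the real feature vectors, the nine columns assume one column per feature in Table 1, and both the candidate grid and the use of F1 as the selection criterion are guesses rather than the authors' procedure.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the paraphrase feature vectors (assumption):
# 9 columns, one per feature listed in Table 1.
X, y = make_classification(n_samples=400, n_features=9, n_informative=5,
                           random_state=0)

# Randomly divide the data into two equal parts: training and validation.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.5, random_state=0)

# Hypothetical candidate grid; the paper's plots cover roughly 0 to 100.
candidates = [10, 20, 40, 60, 80, 100]
val_f1 = {}
for n in candidates:
    clf = GradientBoostingClassifier(
        n_estimators=n,      # number of boosting stages being tuned
        learning_rate=1.0,   # as stated in Section 3.2.2
        max_depth=1,         # decision stumps as base learners
        random_state=0,
    )
    clf.fit(X_train, y_train)
    val_f1[n] = f1_score(y_val, clf.predict(X_val))

# Keep the candidate with the best validation F1.
best_n = max(val_f1, key=val_f1.get)
print(best_n, round(val_f1[best_n], 4))
```

Refitting once per candidate keeps the sketch simple. Since boosting stages are added sequentially, a single fit with the largest value, followed by `staged_predict`, would give the same validation curve at lower cost.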
{| class="wikitable"
|+ Table 7. n_estimators setting
! Language !! Task1 !! Task2
|-
| Malayalam || 55 || 40
|-
| Tamil || 20 || 20
|-
| Hindi || 45 || 45
|-
| Punjabi || 10 || 25
|}

===4. CONCLUSIONS===
We describe an approach to the problem of Detecting Paraphrases in Indian Languages that makes use of Gradient Tree Boosting. Overall, the approach was very competitive: it achieved the highest Overall Score (0.77) among all task participants, together with the best Accuracy and F1 measure on several language and task combinations.

===5. ACKNOWLEDGMENTS===
This work is supported by the Youth National Social Science Fund of China (No. 14CTQ032), the National Natural Science Foundation of China (No. 61370170), and the Research Project of the Heilongjiang Provincial Department of Education (No. 12541677, 12541649).

===6. REFERENCES===
[1] Alzahrani, S. M., Salim, N., and Abraham, A. 2012. Understanding plagiarism linguistic patterns, textual features, and detection methods. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 42(2), 133-149.

[2] Friedman, J. H. 2001. Greedy function approximation: a gradient boosting machine. Annals of Statistics, 1189-1232.

[3] Friedman, J. H. 2002. Stochastic gradient boosting. Computational Statistics & Data Analysis, 38(4), 367-378.

[4] Banerjee, S., and Lavie, A. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 29: 65-72.

[5] Li, Hang. 2012. Statistical learning methods. Tsinghua University Press (in Chinese).

[6] Anand Kumar, M., Singh, S., Kavirajan, B., and Soman, K. P. 2016. DPIL@FIRE2016: Overview of shared task on Detecting Paraphrases in Indian Languages. Working notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 2016, CEUR Workshop Proceedings, CEUR-WS.org.