<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>HIT2016@DPIL-FIRE2016: Detecting Paraphrases in Indian Languages based on Gradient Tree Boosting</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Leilei Kong*</string-name>
          <email>kongleilei1979@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zhenyuan Hao</string-name>
          <email>zhenyuan_hao@163.com</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kaisheng Chen</string-name>
          <email>kaishengchen1997@outlook.com</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Liuyang Tian</string-name>
          <email>tianliuyang2016@outlook.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zhongyuan Han</string-name>
          <email>Hanzhongyuan@gmail.com</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Haoliang Qi</string-name>
          <email>haoliang.qi@gmail.com</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>College of Information and Communication Engineering, Harbin Engineering University</institution>
          ,
          <addr-line>Harbin</addr-line>
          ,
          <country country="CN">China</country>
          ;
          <institution>School of Computer Science and Technology, Heilongjiang Institute of Technology</institution>
          ,
          <addr-line>Harbin, China; +86 451 88028910</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>College of Information and Communication Engineering, Harbin Engineering University</institution>
          ,
          <addr-line>Harbin, China; +86 451 88028910</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>School of Computer Science and Technology, Heilongjiang Institute of Technology</institution>
          ,
          <addr-line>Harbin, China; +86 451 88028910</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Detecting paraphrases is an important and challenging task with applications in paraphrase generation and extraction, machine translation, question answering, and plagiarism detection. Because a paraphrase expresses the meaning of a sentence in another sentence using different words, traditional methods based on lexical similarity are often ineffective. In this paper, we describe our strategy for Detecting Paraphrases in Indian Languages (DPIL), a shared task of the Forum for Information Retrieval Evaluation 2016. We formalize this task as a classification problem, and a supervised learning method based on Gradient Tree Boosting is utilized to classify the paraphrase types. Inspired by the METEOR evaluation metric for machine translation, METEOR-like features are used in the classifier. Evaluation shows the effectiveness of our approach, which achieved the highest Overall Score (0.77), the highest F1 measure for both Task 1 and Task 2 on Malayalam and Tamil, and the highest F1 measure on Punjabi Task 2 in the 2016 FIRE Detecting Paraphrases in Indian Languages task.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>CCS Concepts</title>
      <p>• Information systems➝Information retrieval</p>
    </sec>
    <sec id="sec-2">
      <title>1. INTRODUCTION</title>
      <p>Detecting paraphrases has attracted the attention of researchers
in recent years. It is widely used in paraphrase generation and
extraction, machine translation, question answering, and
plagiarism detection.</p>
      <p>
        In the task description of Detecting Paraphrases in Indian
Languages at the Forum for Information Retrieval Evaluation 2016
(FIRE 2016), a paraphrase is defined as "the same meaning of a
sentence is expressed in another sentence using different words".
The proposed task focuses on sentence-level paraphrase
identification for Indian languages (Tamil, Malayalam, Hindi and
Punjabi). Two subtasks are proposed by FIRE. In the first subtask,
given a pair of sentences from the newspaper domain, the goal
is to classify them as paraphrases (P) or not paraphrases (NP); in
the second, given two sentences from the newspaper domain,
the goal is to identify whether they are completely equivalent (E),
roughly equivalent (RE), or not equivalent (NE) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
Paraphrased sentences retain the semantic meaning
but are usually obfuscated by manipulating the text and changing
most of its appearance. Words in the original sentence are
replaced with synonyms/antonyms, and short phrases are inserted
to change the appearance, but not the idea, of the text (Alzahrani
et al., 2012). In addition, sentence reduction, combination,
restructuring, paraphrasing, concept generalization, and concept
specification are also used to paraphrase a sentence. All of these
operations make paraphrase identification difficult, because it
involves semantic similarity, lexical comprehension,
syntactic identification, morphological analysis, and so on.
Since the surface form of a paraphrased sentence may change
beyond recognition, methods relying only on term matching or a
single feature can become ineffective at detecting paraphrases.
More features should be integrated into the model. We therefore
consider a classification-based machine learning method to
address this problem.
      </p>
      <p>Intuitively, the former subtask can be viewed as a two-category
classification problem and the latter as a multi-category classification
problem. If we formalize the task of detecting paraphrases as a
classification problem, our objectives focus on answering the following
two questions: (1) which classification-based methods can effectively
be applied to the paraphrase detection problem, and (2) which
features should be used in the classifier.</p>
      <p>
        For the first question, we choose Gradient Tree Boosting to learn the classifier [
        <xref ref-type="bibr" rid="ref2 ref3">2,3</xref>
        ]. Regarding the second, inspired by the
METEOR evaluation metric for machine translation [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], we design
METEOR-like features for our classifier. Integrating some
classical similarity measures, we develop the feature set.
Using the training and testing corpora of Detecting Paraphrases in
Indian Languages proposed by FIRE, we rigorously evaluate
various aspects of our classification method for detecting
paraphrases. Experimental results show that the proposed method
can effectively classify paraphrase pairs.
      </p>
      <p>The rest of this paper is organized as follows. In Section 2, we analyze the problem of Detecting Paraphrases in Indian Languages, introduce the model we use, and describe the features the classifier uses. In Section 3, we report the experimental results and performance comparisons with the other detection methods. In the last section we conclude our study.</p>
    </sec>
    <sec id="sec-2b">
      <title>2. CLASSIFICATION FOR DPIL</title>
      <p>We now explore machine-learning methods for Detecting Paraphrases in Indian Languages. In this section, we first analyze the main issues of the task. Then a classification method based on boosting trees is proposed. Finally, we describe the features the classifier uses.</p>
    </sec>
    <sec id="sec-3">
      <title>2.1 Problem Analysis</title>
      <p>As discussed in the previous section, paraphrases are difficult to
detect. Traditional similarity measures, such as Cosine Distance, the
Jaccard Coefficient, and Dice Distance, may be ineffective for
paraphrases. Figure 1 exemplifies such paraphrase cases.
From Figure 1, we can see that two sentences in a paraphrasing
relationship differ in their surface form. Furthermore, we conducted
an analysis on 1000 randomly selected paraphrase pairs from the
Malayalam sub-corpus and from all four language corpora. Figure 2
displays the distribution with the Jaccard Coefficient and
METEOR-F1 as the y-coordinate. It is easy to see from Figure 2
that the Jaccard coefficient scores are all very low; the average
score is only 0.1332. Since there are few identical terms between
the two sentences, considering only term similarity may be
inadequate. Our analysis suggests that, to identify the relationship
between the sentences, more features should be considered.</p>
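      <p>The low-overlap observation above can be reproduced with a few lines of code; a minimal sketch of the Jaccard coefficient over unigram sets, with an invented English sentence pair standing in for the corpus sentences:</p>

```python
def jaccard(s, r):
    """Jaccard coefficient over unigram sets: |intersection| / |union|."""
    a, b = set(s.split()), set(r.split())
    if not a and not b:
        return 0.0
    return len(a.intersection(b)) / len(a.union(b))

# A paraphrase pair shares meaning but few surface terms,
# so the coefficient stays low (hypothetical example, not corpus data).
o = "the minister opened the new bridge on monday"
p = "a new crossing was inaugurated at the start of the week"
print(round(jaccard(o, p), 4))  # → 0.1333
```

Only 2 of the 15 distinct unigrams are shared here, which mirrors the low average Jaccard score (0.1332) observed on the corpus.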
    </sec>
    <sec id="sec-4">
      <title>2.2 Problem Definition</title>
      <p>According to the description of paraphrase detection, we formalize
the problem as follows. Denote a sentence pair as si = (oi, pi),
where oi is the original sentence and pi is the paraphrased sentence.
Note that for each pair (oi, pi) in the training data we can obtain its
label, which makes it possible to learn a classification model. Let
the training corpus be D = {(x1, y1), (x2, y2), ..., (xi, yi), ..., (xN, yN)},
where xi ∈ R^n is the feature vector of si, with
xi = (xi(1), xi(2), ..., xi(n))^T, i = 1, 2, ..., N. We use a function to obtain each
xi, defined as follows:</p>
      <p>xi = Φ(oi, pi)    (1)</p>
      <p>where Φ(oi, pi) is a mapping onto features that describes the
paraphrase relation between the i-th original sentence oi and the
paraphrased sentence pi.</p>
      <p>And yi is the label of xi, denoting the category of each xi. For
Task 1 we define yi ∈ {P, NP}, and for Task 2 we define yi ∈ {E,
RE, NE}.</p>
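      <p>The mapping of Eq. (1) can be sketched as follows; a minimal illustration in which the two concrete features are hypothetical stand-ins for the real feature set of Section 2.4:</p>

```python
def phi(o, p):
    """Map a sentence pair (o_i, p_i) to a feature vector x_i.
    The two features here are placeholders for the similarity-based
    and METEOR-like features described in Section 2.4."""
    so, sp = set(o.split()), set(p.split())
    overlap = len(so.intersection(sp)) / max(len(so.union(sp)), 1)
    len_ratio = min(len(so), len(sp)) / max(len(so), len(sp), 1)
    return [overlap, len_ratio]

# A training example is the feature vector plus its label y_i in {P, NP}.
o = "the match was postponed due to rain"
p = "rain forced the match to be postponed"
example = (phi(o, p), "P")
print(example)
```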
      <p>
        The framework of the learning problem is depicted in Figure 3.
Given D as training data, the learning system learns a conditional
probability P(Y|X) from the training data. Then, given a new input
xn+1, the classification system outputs the corresponding label
yn+1 according to the learned classifier.
      </p>
    </sec>
    <sec id="sec-4b">
      <title>2.3 Classification Model: Gradient Tree Boosting</title>
      <p>
        Boosting trees are among the best methods for improving the performance of statistical learning [
        <xref ref-type="bibr" rid="ref2 ref3">2,3</xref>
        ]. In this experiment, we use Gradient Tree Boosting as the
classification algorithm to learn the classifier. Gradient boosting is
typically used with decision trees (especially CART trees) of a
fixed size as base learners.
      </p>
    </sec>
    <sec id="sec-5">
      <title>2.4 Features</title>
      <p>Two groups of features, the similarity-based features and
the METEOR-like features, are utilized to define xi = Φ(oi, pi).
The similarity-based features capture the matching
degree of oi and pi, and the METEOR-like features describe
the semantic similarity. Specifically, the METEOR-like features are
inspired by METEOR, the evaluation metric used to assess the
quality of machine translation. Table 1 lists these features in detail.</p>
      <p>Cosine Similarity: CS(xi, yi) = (xi · yi) / (||xi|| · ||yi||), where xi · yi is the inner product of the two vectors and ||x|| represents the length of a vector.</p>
      <p>Dice Coefficient: DC(s, r) = 2 · common(s, r) / (len(s) + len(r)), the ratio of the number of shared terms to the total number of terms, where common(s, r) is the total number of common unigrams in s and r, and len(s) and len(r) are the total numbers of unigrams in s and r.</p>
      <p>Precision: P = common(s, r) / len(r).</p>
      <p>Recall: R = common(s, r) / len(s).</p>
      <p>F1 = 2PR / (R + P), combining precision and recall.</p>
      <p>Fmean = 10PR / (R + 9P), combining precision and recall with a higher weight on recall.</p>
      <p>Penalty = 0.5 · (len(chunks) / common(s, r))^3, where len(chunks) is the number of chunks of longer matches.</p>
      <p>Score = Fmean · (1 − Penalty), the overall METEOR score.</p>
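      <p>The Table 1 formulas can be sketched in code; a minimal unigram-level version, assuming whitespace tokenization, set-based unigram matching, and a greedy chunk count (all simplifications of the full METEOR definition):</p>

```python
def meteor_like(s, r):
    """METEOR-like features for a candidate r against a reference s."""
    s_tok, r_tok = s.split(), r.split()
    common = set(s_tok).intersection(set(r_tok))
    m = len(common)  # common(s, r): matched unigrams
    if m == 0:
        return {"P": 0.0, "R": 0.0, "F1": 0.0, "Fmean": 0.0, "Score": 0.0}
    prec = m / len(r_tok)                       # P = common(s, r) / len(r)
    rec = m / len(s_tok)                        # R = common(s, r) / len(s)
    f1 = 2 * prec * rec / (prec + rec)          # F1 = 2PR / (R + P)
    fmean = 10 * prec * rec / (rec + 9 * prec)  # Fmean = 10PR / (R + 9P)
    # Greedy chunk count: maximal runs of matched tokens in r.
    chunks, in_chunk = 0, False
    for tok in r_tok:
        if tok in common:
            if not in_chunk:
                chunks += 1
            in_chunk = True
        else:
            in_chunk = False
    penalty = 0.5 * (chunks / m) ** 3           # Penalty
    return {"P": prec, "R": rec, "F1": f1, "Fmean": fmean,
            "Score": fmean * (1 - penalty)}     # Score = Fmean * (1 - Penalty)
```

For example, meteor_like("a b c d", "a b x d") matches three unigrams in two chunks, giving P = R = 0.75 and a penalized overall score below Fmean.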
    </sec>
    <sec id="sec-6">
      <title>3. EXPERIMENTS</title>
      <sec id="sec-6-1">
        <title>3.1 Dataset</title>
        <p>The evaluation dataset is the Detecting Paraphrases in Indian
Languages (DPIL) corpus, which is mainly obtained from newspapers.
The details of this corpus can be found at
http://nlp.amrita.edu/dpil_cen/.</p>
        <p>The corpus is divided into two subsets: the Task1 set and the
Task2 set, and each subset covers four Indian languages: Tamil,
Malayalam, Hindi and Punjabi. The Task1 set contains 12400
samples, including 9200 training samples and 3200 test samples,
and the Task2 set contains 17650 samples, including 12700 training
samples and 4950 test samples. The statistics of the training and
testing data are shown in Table 2 and Table 3.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>3.2 Experimental Settings</title>
      <sec id="sec-7-0">
        <title>3.2.1 Pre-processing</title>
        <p>For each sentence pair in the training and test data, we first
remove numbers, punctuation and blank spaces. Then, we adopt
two types of word segmentation: one takes each word as a
term unit, and the other is based on n-grams, in which the words
in a sentence are segmented into n-grams. For example,
Figure 4 shows an example of 4-grams. In the experiments, n is
set empirically.</p>
      </sec>
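      <p>The n-gram segmentation can be sketched as follows; a character-level version is shown, which is one plausible reading since the exact n-gram unit is not fully specified here:</p>

```python
def char_ngrams(sentence, n=4):
    """Slide a window of size n over the cleaned sentence."""
    text = "".join(sentence.split())  # drop blank spaces, as in pre-processing
    if len(text) >= n:
        return [text[i:i + n] for i in range(len(text) - n + 1)]
    return [text]  # sentence shorter than n: keep it whole

print(char_ngrams("rain forced delay", n=4))
```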
      <sec id="sec-7-1">
        <title>3.2.2 Parameter Tuning</title>
        <p>On the training corpus, the classifier is trained using the sklearn
GradientBoostingClassifier. The learning_rate (which shrinks the
contribution of each tree) is set to 1.0, the max_depth (the maximum
depth, which limits the number of nodes in each tree) is set to 1,
and the random_state (the seed used by the random number
generator) is set to 0. All the other parameters are set to their
default values except the parameter n_estimators (the number of
boosting stages to perform). The remaining settings, including the
word segmentation method, the pre-processing method, and the n
value of the n-grams, are set experimentally.</p>
        <p>We use cross validation to tune the parameter n_estimators.
The training corpus is randomly divided into two equal parts; one
is used as the training data and the other as the validation
data.</p>
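      <p>The training setup can be sketched with scikit-learn; a minimal sketch with the stated parameter values, in which the toy feature matrix and n_estimators=100 are illustrative assumptions (the tuned per-language values appear in Table 7):</p>

```python
from sklearn.ensemble import GradientBoostingClassifier

# Toy feature vectors (e.g. similarity scores) and labels standing in
# for the real DPIL features of Section 2.4.
X_train = [[0.9, 0.8], [0.85, 0.7], [0.1, 0.2], [0.15, 0.3]]
y_train = ["P", "P", "NP", "NP"]

clf = GradientBoostingClassifier(
    learning_rate=1.0,   # contribution of each tree
    max_depth=1,         # decision stumps as base learners
    random_state=0,      # seed for reproducibility
    n_estimators=100,    # tuned by cross validation in the paper
)
clf.fit(X_train, y_train)
print(clf.predict([[0.88, 0.75], [0.12, 0.25]]))
```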
      </sec>
    </sec>
    <sec id="sec-8">
      <title>3.3 Performance Measures</title>
      <p>
        In this evaluation, the experimental results are evaluated according to [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]:
1) TP: the sample is positive, and the predicted result is positive.
2) FP: the sample is negative, and the predicted result is positive.
3) FN: the sample is positive, and the predicted result is negative.
4) TN: the sample is negative, and the predicted result is negative.
According to the above definitions, Precision and Recall are defined as follows:
precision = TP / (TP + FP)    (5)
recall = TP / (TP + FN)    (6)
The main evaluation metrics adopted by DPIL are Accuracy and the F1
measure, where F1 = 2 · precision · recall / (precision + recall).
      </p>
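      <p>The metrics above can be computed directly from gold and predicted label lists; a minimal sketch for the binary Task 1 labels (the example labels are invented):</p>

```python
def prf1(gold, pred, positive="P"):
    """Precision, recall, and F1 for one class from parallel label lists."""
    tp = sum(1 for g, p in zip(gold, pred) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, pred) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, pred) if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = ["P", "P", "NP", "P", "NP"]
pred = ["P", "NP", "NP", "P", "P"]
print(prf1(gold, pred))
```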
    </sec>
    <sec id="sec-9">
      <title>3.4 Experimental Results</title>
      <sec id="sec-9-0">
        <title>3.4.1 Experimental results on sub corpora</title>
        <p>Table 4 shows the experimental results released by FIRE for the
Task 1 and Task 2 sub corpora.</p>
      </sec>
      <sec id="sec-9-0b">
        <title>3.4.2 Effect of word segmentation</title>
        <p>For word segmentation, we utilize two processing methods: one
segments words on spaces, and the other is based on n-grams. We
compare the two word segmentation methods in Table 5.</p>
      </sec>
      <sec id="sec-9-1">
        <title>3.4.3 Effects of pre-processing</title>
        <p>In our experiments, there are two types of pre-processing methods.
To investigate the contribution of each pre-processing method for
each language, we analyze the effects of pre-processing. Taking
4-gram word segmentation as an example, Table 6 gives the
experimental results, where "removing all" means removing the
punctuation, the numbers and the spaces, and "reserving *" means
reserving * and removing all the others. For example, "reserving
punctuation" means the punctuation is reserved while the numbers
and spaces are removed.</p>
        <p>According to the experimental results shown in Table 6, even
though we find few differences among removing punctuation,
numbers and spaces, we still adopt the best pre-processing method
on the test dataset.</p>
      </sec>
      <sec id="sec-9-2">
        <title>3.4.4 Effects of n-gram</title>
        <p>To analyze the effect of n, we carry out experiments from
1-gram to 10-gram, with Precision, Recall and the F1 measure as
evaluation indicators. The experimental results are shown in
Figure 5: (a) the experimental results on Task 1; (b) the
experimental results on Task 2. According to these results, 4-gram
achieves the best performance, so we set n = 4 for the DPIL 2016
test corpora.</p>
      </sec>
      <sec id="sec-9-3">
        <title>3.4.5 Effects of n_estimators</title>
        <p>The parameter n_estimators is the number of boosting iterations
performed when the classification model is trained. It is set
empirically. Figure 6 plots Accuracy and the F1 measure against
n_estimators on each training dataset (e.g. dpil-mal-train-Task1):
(a) Malayalam on Task 1; (b) Tamil on Task 1; (c) Hindi on Task 1;
(d) Punjabi on Task 1; (e) Malayalam on Task 2; (f) Tamil on Task 2;
(g) Hindi on Task 2; (h) Punjabi on Task 2.</p>
        <p>According to Figure 6, we obtain the value of the parameter
n_estimators for each language. The details, which are used on the
DPIL test datasets, are shown in Table 7.</p>
      </sec>
    </sec>
    <sec id="sec-10">
      <title>4. CONCLUSIONS</title>
      <p>We describe an approach to the Detecting Paraphrases in Indian
Languages problem that makes use of Gradient Tree Boosting.
Overall, the approach was very competitive and achieved the
highest Accuracy and F1 measure among all task participants.</p>
    </sec>
    <sec id="sec-11">
      <title>5. ACKNOWLEDGMENTS</title>
      <p>This work is supported by the Youth National Social Science Fund
of China (No. 14CTQ032), the National Natural Science Foundation
of China (No. 61370170), and the Research Project of the Heilongjiang
Provincial Department of Education (No. 12541677, 12541649).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Alzahrani</surname>
            ,
            <given-names>S. M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salim</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Abraham</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2012</year>
          .
          <article-title>Understanding plagiarism linguistic patterns, textual features, and detection methods</article-title>
          .
          <source>IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews)</source>
          ,
          <volume>42</volume>
          (
          <issue>2</issue>
          ),
          <fpage>133</fpage>
          -
          <lpage>149</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Friedman</surname>
            ,
            <given-names>J. H.</given-names>
          </string-name>
          ,
          <year>2001</year>
          .
          <article-title>Greedy function approximation: a gradient boosting machine</article-title>
          .
          <source>Annals of statistics</source>
          ,
          <volume>1189</volume>
          -
          <fpage>1232</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Friedman</surname>
            ,
            <given-names>J. H.</given-names>
          </string-name>
          ,
          <year>2002</year>
          .
          <article-title>Stochastic gradient boosting</article-title>
          .
          <source>Computational Statistics &amp; Data Analysis</source>
          ,
          <volume>38</volume>
          (
          <issue>4</issue>
          ),
          <fpage>367</fpage>
          -
          <lpage>378</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Banerjee</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Lavie</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <year>2005</year>
          .
          <article-title>METEOR: An automatic metric for MT evaluation with improved correlation with human judgments</article-title>
          .
          <source>Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization</source>
          ,
          <volume>29</volume>
          :
          <fpage>65</fpage>
          -
          <lpage>72</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Hang.</given-names>
          </string-name>
          ,
          <year>2012</year>
          .
          <article-title>Statistical learning methods</article-title>
          . Tsinghua University Press (in Chinese).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Anand</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            ,
            <surname>Kavirajan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            , and
            <surname>Soman</surname>
          </string-name>
          ,
          <string-name>
            <surname>K.P.</surname>
          </string-name>
          <year>2016</year>
          .
          <article-title>DPIL@FIRE2016: Overview of shared task on Detecting Paraphrases in Indian Languages</article-title>
          . Working notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, CEUR Workshop Proceedings, CEUR-WS.org.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>