<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>HIT2016@DPIL-FIRE2016: Detecting Paraphrases in Indian Languages based on Gradient Tree Boosting</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Leilei Kong*</string-name>
          <email>kongleilei1979@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zhenyuan Hao</string-name>
          <email>zhenyuan_hao@163.com</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kaisheng Chen</string-name>
          <email>kaishengchen1997@outlook.com</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Liuyang Tian</string-name>
          <email>tianliuyang2016@outlook.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zhongyuan Han</string-name>
          <email>Hanzhongyuan@gmail.com</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Haoliang Qi</string-name>
          <email>haoliang.qi@gmail.com</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>College of Information and Communication Engineering, Harbin Engineering University</institution>
          ,
          <addr-line>Harbin</addr-line>
          ,
          <country country="CN">China</country>
          ;
          <institution>School of Computer Science and Technology, Heilongjiang Institute of Technology</institution>
          ,
          <addr-line>Harbin, China; +86 451 88028910</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>College of Information and Communication Engineering, Harbin Engineering University</institution>
          ,
          <addr-line>Harbin, China; +86 451 88028910</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>School of Computer Science and Technology, Heilongjiang Institute of Technology</institution>
          ,
          <addr-line>Harbin, China; +86 451 88028910</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Detecting paraphrases is an important and challenging task with applications in paraphrase generation and extraction, machine translation, question answering, and plagiarism detection. Because a paraphrase expresses the meaning of a sentence in another sentence using different words, traditional methods based on lexical similarity are often ineffective. In this paper, we describe our strategy for Detecting Paraphrases in Indian Languages (DPIL), a shared task of the Forum for Information Retrieval Evaluation 2016. We formalize this task as a classification problem, and a supervised learning method based on Gradient Tree Boosting is utilized to classify the paraphrase types. Inspired by the METEOR evaluation metric for machine translation, METEOR-like features are used in the classifier. Evaluation shows the effectiveness of our approach, which achieved the highest Overall Score (0.77), the highest F1 measure for both Task 1 and Task 2 on Malayalam and Tamil, and the highest F1 measure on Punjabi Task 2 in the 2016 FIRE Detecting Paraphrases in Indian Languages task.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>CCS Concepts</title>
      <p>• Information systems➝Information retrieval</p>
    </sec>
    <sec id="sec-2">
      <title>1. INTRODUCTION</title>
      <p>Detecting paraphrases has attracted the attention of researchers
in recent years. It is widely used in paraphrase generation and
extraction, machine translation, question answering, and
plagiarism detection.</p>
      <p>
        In the task description of Detecting Paraphrases in Indian
Languages at the Forum for Information Retrieval Evaluation 2016
(FIRE 2016), a paraphrase is defined as "the same meaning of a
sentence is expressed in another sentence using different words".
The proposed task focuses on sentence-level paraphrase
identification for Indian languages (Tamil, Malayalam, Hindi and
Punjabi). Two subtasks are proposed by FIRE. In the first subtask,
given a pair of sentences from the newspaper domain, the goal
is to classify them as paraphrases (P) or not paraphrases (NP); in
the second, given two sentences from the newspaper domain,
the goal is to identify whether they are completely equivalent (E),
roughly equivalent (RE), or not equivalent (NE) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
Paraphrased sentences retain the semantic meaning
but are usually obfuscated by manipulating the text and changing
most of its appearance. Words in the original sentence are
replaced with synonyms/antonyms, and short phrases are inserted
to change the appearance, but not the idea, of the text (Alzahrani
et al., 2012). In addition, sentence reduction, combination,
restructuring, paraphrasing, concept generalization, and concept
specification are also used to paraphrase a sentence. All of these
operations make paraphrase identification difficult, because it
involves semantic similarity, lexical comprehension,
syntactic identification, morphological analysis, and so on.
Since the surface form of a paraphrased sentence may change
beyond recognition, methods relying only on term matching or a
single feature can become ineffective at detecting paraphrases.
More features should be integrated into the model. We therefore
consider a classification-based machine learning method to
address this problem.
      </p>
      <p>Intuitively, the former subtask can be viewed as a two-category
classification problem and the latter as a multi-category classification
problem. If we formalize the task of detecting paraphrases as a
classification problem, our objectives focus on answering the following
two questions: (1) which classification-based methods can effectively
be applied to the paraphrase detection problem, and (2) which
features should be used in the classifier.</p>
      <p>
        For the first question, we choose Gradient Tree Boosting to learn the classifier [
        <xref ref-type="bibr" rid="ref2 ref3">2,3</xref>
        ]. Regarding the second, inspired by the
METEOR evaluation metric for machine translation [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], we design
METEOR-like features for our classifier. Integrating some
classical similarity measures, we develop the feature set.
Using the training and testing corpora of Detecting Paraphrases in
Indian Languages proposed by FIRE, we rigorously evaluate
various aspects of our classification method for detecting
paraphrases. Experimental results show that the proposed method
can effectively classify paraphrase pairs.
      </p>
      <p>The rest of this paper is organized as follows. In Section 2, we analyze the problem of Detecting Paraphrases in Indian Languages, introduce the model we use, and describe the features the classifier uses. In Section 3, we report the experimental results and performance comparisons with the other detection methods. In the last section we conclude our study.</p>
    </sec>
    <sec id="sec-2b">
      <title>2. CLASSIFICATION FOR DPIL</title>
      <p>We now explore machine-learning methods for Detecting Paraphrases in Indian Languages. In this section, we first analyze the main issues of the task. Then a classification method based on boosting trees is proposed. Finally, we describe the features the classifier uses.</p>
    </sec>
    <sec id="sec-3">
      <title>2.1 Problem Analysis</title>
      <p>As discussed in the previous section, paraphrases are difficult to
detect. Traditional similarity measures, such as Cosine Distance, the
Jaccard Coefficient, and Dice Distance, may be ineffective for
paraphrases. Figure 1 exemplifies such paraphrase cases.
From Figure 1, we can see that two sentences in a paraphrasing
relationship differ in their surface form. Furthermore, we conducted
an analysis on 1000 randomly selected paraphrase pairs from the
Malayalam sub-corpus and from all four language corpora. Figure 2
displays the distribution with the Jaccard Coefficient and
METEOR-F1 as the y-coordinate. It is easy to see from Figure 2
that the Jaccard coefficient scores are all very low; the average
score is only 0.1332. Since there are few identical terms between
the two sentences, considering only term similarity may be
inadequate. Our analysis suggests that, to identify the relationship
between the sentences, more features should be considered.</p>
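      <p>The low-overlap observation above can be reproduced with a few lines of code; a minimal sketch of the Jaccard coefficient over unigram sets, with an invented English sentence pair standing in for the corpus sentences:</p>

```python
def jaccard(s, r):
    """Jaccard coefficient over unigram sets: |intersection| / |union|."""
    a, b = set(s.split()), set(r.split())
    if not a and not b:
        return 0.0
    return len(a.intersection(b)) / len(a.union(b))

# A paraphrase pair shares meaning but few surface terms,
# so the coefficient stays low (hypothetical example, not corpus data).
o = "the minister opened the new bridge on monday"
p = "a new crossing was inaugurated at the start of the week"
print(round(jaccard(o, p), 4))  # → 0.1333
```

Only 2 of the 15 distinct unigrams are shared here, which mirrors the low average Jaccard score (0.1332) observed on the corpus.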
    </sec>
    <sec id="sec-4">
      <title>2.2 Problem Definition</title>
      <p>According to the description of paraphrase detection, we formalize
the problem as follows. Denote a sentence pair as si = (oi, pi),
where oi is the original sentence and pi is the paraphrased sentence.
Note that for each pair (oi, pi) in the training data we can obtain its
label, which makes it possible to learn a classification model. Let
the training corpus be D = {(x1, y1), (x2, y2), ..., (xi, yi), ..., (xN, yN)},
where xi ∈ R^n is the feature vector of si, with
xi = (xi(1), xi(2), ..., xi(n))^T, i = 1, 2, ..., N. We use a function to obtain each
xi, defined as follows:</p>
      <p>xi = Φ(oi, pi)    (1)</p>
      <p>where Φ(oi, pi) is a mapping onto features that describes the
paraphrase relation between the i-th original sentence oi and the
paraphrased sentence pi.</p>
      <p>And yi is the label of xi, denoting the category of each xi. For
Task 1 we define yi ∈ {P, NP}, and for Task 2 we define yi ∈ {E,
RE, NE}.</p>
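      <p>The mapping of Eq. (1) can be sketched as follows; a minimal illustration in which the two concrete features are hypothetical stand-ins for the real feature set of Section 2.4:</p>

```python
def phi(o, p):
    """Map a sentence pair (o_i, p_i) to a feature vector x_i.
    The two features here are placeholders for the similarity-based
    and METEOR-like features described in Section 2.4."""
    so, sp = set(o.split()), set(p.split())
    overlap = len(so.intersection(sp)) / max(len(so.union(sp)), 1)
    len_ratio = min(len(so), len(sp)) / max(len(so), len(sp), 1)
    return [overlap, len_ratio]

# A training example is the feature vector plus its label y_i in {P, NP}.
o = "the match was postponed due to rain"
p = "rain forced the match to be postponed"
example = (phi(o, p), "P")
print(example)
```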
      <p>
        The framework of the learning problem is depicted in Figure 3.
Given D as training data, the learning system learns a conditional
probability P(Y|X) from the training data. Then, given a new input
xn+1, the classification system outputs the corresponding label
yn+1 according to the learned classifier.
      </p>
    </sec>
    <sec id="sec-4b">
      <title>2.3 Classification Model: Gradient Tree Boosting</title>
      <p>
        Boosting trees are among the best methods for improving the performance of statistical learning [
        <xref ref-type="bibr" rid="ref2 ref3">2,3</xref>
        ]. In this experiment, we use Gradient Tree Boosting as the
classification algorithm to learn the classifier. Gradient boosting is
typically used with decision trees (especially CART trees) of a
fixed size as base learners.
      </p>
    </sec>
    <sec id="sec-5">
      <title>2.4 Features</title>
      <p>Two groups of features, the similarity-based features and
the METEOR-like features, are utilized to define xi = Φ(oi, pi).
The similarity-based features capture the matching
degree of oi and pi, and the METEOR-like features describe
the semantic similarity. Specifically, the METEOR-like features are
inspired by METEOR, the evaluation metric used to assess the
quality of machine translation. Table 1 lists these features in detail.</p>
      <p>Cosine Similarity: CS(xi, yi) = (xi · yi) / (||xi|| · ||yi||), where xi · yi is the inner product of the two vectors and ||x|| represents the length of a vector.</p>
      <p>Dice Coefficient: DC(s, r) = 2 · common(s, r) / (len(s) + len(r)), the ratio of the number of shared terms to the total number of terms, where common(s, r) is the total number of common unigrams in s and r, and len(s) and len(r) are the total numbers of unigrams in s and r.</p>
      <p>Precision: P = common(s, r) / len(r).</p>
      <p>Recall: R = common(s, r) / len(s).</p>
      <p>F1 = 2PR / (R + P), combining precision and recall.</p>
      <p>Fmean = 10PR / (R + 9P), combining precision and recall with a higher weight on recall.</p>
      <p>Penalty = 0.5 · (len(chunks) / common(s, r))^3, where len(chunks) is the number of chunks of longer matches.</p>
      <p>Score = Fmean · (1 − Penalty), the overall METEOR score.</p>
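      <p>The Table 1 formulas can be sketched in code; a minimal unigram-level version, assuming whitespace tokenization, set-based unigram matching, and a greedy chunk count (all simplifications of the full METEOR definition):</p>

```python
def meteor_like(s, r):
    """METEOR-like features for a candidate r against a reference s."""
    s_tok, r_tok = s.split(), r.split()
    common = set(s_tok).intersection(set(r_tok))
    m = len(common)  # common(s, r): matched unigrams
    if m == 0:
        return {"P": 0.0, "R": 0.0, "F1": 0.0, "Fmean": 0.0, "Score": 0.0}
    prec = m / len(r_tok)                       # P = common(s, r) / len(r)
    rec = m / len(s_tok)                        # R = common(s, r) / len(s)
    f1 = 2 * prec * rec / (prec + rec)          # F1 = 2PR / (R + P)
    fmean = 10 * prec * rec / (rec + 9 * prec)  # Fmean = 10PR / (R + 9P)
    # Greedy chunk count: maximal runs of matched tokens in r.
    chunks, in_chunk = 0, False
    for tok in r_tok:
        if tok in common:
            if not in_chunk:
                chunks += 1
            in_chunk = True
        else:
            in_chunk = False
    penalty = 0.5 * (chunks / m) ** 3           # Penalty
    return {"P": prec, "R": rec, "F1": f1, "Fmean": fmean,
            "Score": fmean * (1 - penalty)}     # Score = Fmean * (1 - Penalty)
```

For example, meteor_like("a b c d", "a b x d") matches three unigrams in two chunks, giving P = R = 0.75 and a penalized overall score below Fmean.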
    </sec>
    <sec id="sec-6">
      <title>3. EXPERIMENTS</title>
      <sec id="sec-6-1">
        <title>3.1 Dataset</title>
        <p>The evaluation dataset is the Detecting Paraphrases in Indian
Languages (DPIL) corpus, which is mainly obtained from newspapers.
The details of this corpus can be found at
http://nlp.amrita.edu/dpil_cen/.</p>
        <p>The corpus is divided into two subsets: the Task1 set and the
Task2 set, and each subset covers four Indian languages: Tamil,
Malayalam, Hindi and Punjabi. The Task1 set contains 12400
samples, including 9200 training samples and 3200 test samples,
and the Task2 set contains 17650 samples, including 12700 training
samples and 4950 test samples. The statistics of the training and
testing data are shown in Table 2 and Table 3.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>3.2 Experimental Settings</title>
      <sec id="sec-7-0">
        <title>3.2.1 Pre-processing</title>
        <p>For each sentence pair in the training and test data, we first
remove numbers, punctuation and blank spaces. Then, we adopt
two types of word segmentation: one takes each word as a
term unit, and the other is based on n-grams, in which the words
in a sentence are segmented into n-grams. For example,
Figure 4 shows an example of 4-grams. In the experiments, n is
set empirically.</p>
      </sec>
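      <p>The n-gram segmentation can be sketched as follows; a character-level version is shown, which is one plausible reading since the exact n-gram unit is not fully specified here:</p>

```python
def char_ngrams(sentence, n=4):
    """Slide a window of size n over the cleaned sentence."""
    text = "".join(sentence.split())  # drop blank spaces, as in pre-processing
    if len(text) >= n:
        return [text[i:i + n] for i in range(len(text) - n + 1)]
    return [text]  # sentence shorter than n: keep it whole

print(char_ngrams("rain forced delay", n=4))
```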
      <sec id="sec-7-1">
        <title>3.2.2 Parameter Tuning</title>
        <p>On the training corpus, the classifier is trained using the sklearn
GradientBoostingClassifier. The learning_rate (which shrinks the
contribution of each tree) is set to 1.0, the max_depth (the maximum
depth, which limits the number of nodes in each tree) is set to 1,
and the random_state (the seed used by the random number
generator) is set to 0. All the other parameters are set to their
default values except the parameter n_estimators (the number of
boosting stages to perform). The remaining settings, including the
word segmentation method, the pre-processing method, and the n
value of the n-grams, are set experimentally.</p>
        <p>We use cross validation to tune the parameter n_estimators.
The training corpus is randomly divided into two equal parts; one
is used as the training data and the other as the validation
data.</p>
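      <p>The training setup can be sketched with scikit-learn; a minimal sketch with the stated parameter values, in which the toy feature matrix and n_estimators=100 are illustrative assumptions (the tuned per-language values appear in Table 7):</p>

```python
from sklearn.ensemble import GradientBoostingClassifier

# Toy feature vectors (e.g. similarity scores) and labels standing in
# for the real DPIL features of Section 2.4.
X_train = [[0.9, 0.8], [0.85, 0.7], [0.1, 0.2], [0.15, 0.3]]
y_train = ["P", "P", "NP", "NP"]

clf = GradientBoostingClassifier(
    learning_rate=1.0,   # contribution of each tree
    max_depth=1,         # decision stumps as base learners
    random_state=0,      # seed for reproducibility
    n_estimators=100,    # tuned by cross validation in the paper
)
clf.fit(X_train, y_train)
print(clf.predict([[0.88, 0.75], [0.12, 0.25]]))
```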
      </sec>
    </sec>
    <sec id="sec-8">
      <title>3.3 Performance Measures</title>
      <p>
        In this evaluation, the experimental results are evaluated according to [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]:
1) TP: the sample is positive, and the predicted result is positive.
2) FP: the sample is negative, and the predicted result is positive.
3) FN: the sample is positive, and the predicted result is negative.
4) TN: the sample is negative, and the predicted result is negative.
According to the above definitions, Precision and Recall are defined as follows:
precision = TP / (TP + FP)    (5)
recall = TP / (TP + FN)    (6)
The main evaluation metrics adopted by DPIL are Accuracy and the F1
measure, where F1 = 2 · precision · recall / (precision + recall).
      </p>
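      <p>The metrics above can be computed directly from gold and predicted label lists; a minimal sketch for the binary Task 1 labels (the example labels are invented):</p>

```python
def prf1(gold, pred, positive="P"):
    """Precision, recall, and F1 for one class from parallel label lists."""
    tp = sum(1 for g, p in zip(gold, pred) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, pred) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, pred) if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = ["P", "P", "NP", "P", "NP"]
pred = ["P", "NP", "NP", "P", "P"]
print(prf1(gold, pred))
```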
    </sec>
    <sec id="sec-9">
      <title>3.4 Experimental Results</title>
      <sec id="sec-9-0">
        <title>3.4.1 Experimental results on sub corpora</title>
        <p>Table 4 shows the experimental results released by FIRE for the
Task 1 and Task 2 sub corpora.</p>
      </sec>
      <sec id="sec-9-0b">
        <title>3.4.2 Effect of word segmentation</title>
        <p>For word segmentation, we utilize two processing methods: one
segments words on spaces, and the other is based on n-grams. We
compare the two word segmentation methods in Table 5.</p>
      </sec>
      <sec id="sec-9-1">
        <title>3.4.3 Effects of pre-processing</title>
        <p>In our experiments, there are two types of pre-processing methods.
To investigate the contribution of each pre-processing method for
each language, we analyze the effects of pre-processing. Taking
4-gram word segmentation as an example, Table 6 gives the
experimental results, where "removing all" means removing the
punctuation, the numbers and the spaces, and "reserving *" means
reserving * and removing all the others. For example, "reserving
punctuation" means the punctuation is reserved while the numbers
and spaces are removed.</p>
        <p>According to the experimental results shown in Table 6, even
though we find few differences among removing punctuation,
numbers and spaces, we still adopt the best pre-processing method
on the test dataset.</p>
      </sec>
      <sec id="sec-9-2">
        <title>3.4.4 Effects of n-gram</title>
        <p>To analyze the effect of n, we carry out experiments from
1-gram to 10-gram, with Precision, Recall and the F1 measure as
evaluation indicators. The experimental results are shown in
Figure 5: (a) the experimental results on Task 1; (b) the
experimental results on Task 2. According to these results, 4-gram
achieves the best performance, so we set n = 4 for the DPIL 2016
test corpora.</p>
      </sec>
      <sec id="sec-9-3">
        <title>3.4.5 Effects of n_estimators</title>
        <p>The parameter n_estimators is the number of boosting iterations
performed when the classification model is trained. It is set
empirically. Figure 6 plots Accuracy and the F1 measure against
n_estimators on each training dataset (e.g. dpil-mal-train-Task1):
(a) Malayalam on Task 1; (b) Tamil on Task 1; (c) Hindi on Task 1;
(d) Punjabi on Task 1; (e) Malayalam on Task 2; (f) Tamil on Task 2;
(g) Hindi on Task 2; (h) Punjabi on Task 2.</p>
        <p>According to Figure 6, we obtain the value of the parameter
n_estimators for each language. The details, which are used on the
DPIL test datasets, are shown in Table 7.</p>
      </sec>
    </sec>
    <sec id="sec-10">
      <title>4. CONCLUSIONS</title>
      <p>We describe an approach to the Detecting Paraphrases in Indian
Languages problem that makes use of Gradient Tree Boosting.
Overall, the approach was very competitive and achieved the
highest Accuracy and F1 measure among all task participants.</p>
    </sec>
    <sec id="sec-11">
      <title>5. ACKNOWLEDGMENTS</title>
      <p>This work is supported by the Youth National Social Science Fund
of China (No. 14CTQ032), the National Natural Science Foundation
of China (No. 61370170), and the Research Project of the Heilongjiang
Provincial Department of Education (No. 12541677, 12541649).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Alzahrani</surname>
            ,
            <given-names>S. M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salim</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Abraham</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2012</year>
          .
          <article-title>Understanding plagiarism linguistic patterns, textual features, and detection methods</article-title>
          .
          <source>IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews)</source>
          ,
          <volume>42</volume>
          (
          <issue>2</issue>
          ),
          <fpage>133</fpage>
          -
          <lpage>149</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Friedman</surname>
            ,
            <given-names>J. H.</given-names>
          </string-name>
          ,
          <year>2001</year>
          .
          <article-title>Greedy function approximation: a gradient boosting machine</article-title>
          .
          <source>Annals of statistics</source>
          ,
          <volume>1189</volume>
          -
          <fpage>1232</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Friedman</surname>
            ,
            <given-names>J. H.</given-names>
          </string-name>
          ,
          <year>2002</year>
          .
          <article-title>Stochastic gradient boosting</article-title>
          .
          <source>Computational Statistics &amp; Data Analysis</source>
          ,
          <volume>38</volume>
          (
          <issue>4</issue>
          ),
          <fpage>367</fpage>
          -
          <lpage>378</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Banerjee</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Lavie</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <year>2005</year>
          .
          <article-title>METEOR: An automatic metric for MT evaluation with improved correlation with human judgments</article-title>
          .
          <source>Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization</source>
          ,
          <volume>29</volume>
          :
          <fpage>65</fpage>
          -
          <lpage>72</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Hang.</given-names>
          </string-name>
          ,
          <year>2012</year>
          .
          <article-title>Statistical learning methods</article-title>
          . Tsinghua University Press (in Chinese).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Anand</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            ,
            <surname>Kavirajan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            , and
            <surname>Soman</surname>
          </string-name>
          ,
          <string-name>
            <surname>K.P.</surname>
          </string-name>
          <year>2016</year>
          .
          <article-title>DPIL@FIRE2016: Overview of shared task on Detecting Paraphrases in Indian Languages</article-title>
          . Working notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, CEUR Workshop Proceedings, CEUR-WS.org.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>