JU_NLP@DPIL-FIRE2016: Paraphrase Detection in Indian Languages - A Machine Learning Approach

Tanik Saikh, Sudip Kumar Naskar, Sivaji Bandyopadhyay
Computer Science and Engineering Department, Jadavpur University, Kolkata, India
tanik4u@gmail.com, {sudip.naskar, sbandyopadhyay}@cse.jdvu.ac.in

ABSTRACT
This paper presents a report on our participation in the shared task on "Detecting Paraphrases in Indian Languages (DPIL)" organized at the "Forum for Information Retrieval Evaluation (FIRE)" 2016, in both tasks (Task1 and Task2) defined in the shared task and in all four Indian languages (Tamil, Malayalam, Hindi and Punjabi). We used different similarity measures and machine translation evaluation metrics as features in a machine learning framework to take the paraphrase decision between a pair of text snippets. We obtained accuracies of 97.08%, 94.2%, 97.32% and 98.29% in Task1 and 86.68%, 77.37%, 84% and 98.77% in Task2 for Tamil, Malayalam, Hindi and Punjabi respectively on the corresponding training sets using 10-fold cross validation.

Keywords
Paraphrase detection; Similarity Measures; Machine Learning.

1. INTRODUCTION
A paraphrase is a restatement of a text expressed using other words. Alternatively, two texts, say text1 and text2, can be defined as paraphrases if they textually entail each other bi-directionally, i.e. if text1 entails text2 and text2 entails text1. Textual entailment and paraphrase relations between a pair of text snippets are therefore highly correlated. There are two different tasks related to paraphrases: paraphrase identification and paraphrase generation. In this shared task [19] the focus is on identifying sentence level paraphrases in four Indian languages, namely Tamil, Malayalam, Hindi and Punjabi. Two subtasks, Task1 and Task2, were defined. Given a pair of sentences from the newspaper domain, Task1 is to classify them as paraphrases (P) or not paraphrases (NP), and Task2 is to identify whether they are completely equivalent (E), not equivalent (NE) or roughly equivalent (RE). Task2 is thus similar to Task1 except that it is a ternary classification problem. Identifying paraphrases between a pair of sentences in Indian languages is not an easy task due to the scarcity of tools and resources for these languages. Semantic knowledge bases like WordNet [13] and BabelNet [10] are very useful resources for knowledge based approaches to detecting paraphrases and textual entailment, but Indian languages suffer from very poor coverage in this respect as well. Therefore, for this study we adopted lexical level analysis of the sentences, which is considered a shallow level of processing in any Natural Language Processing task.

For both tasks (i.e., Task1 and Task2) we made use of various kinds of lexical similarity measures, namely cosine similarity, unigram matching with respect to sentence1, unigram matching with respect to sentence2, Jaccard similarity [12], Dice coefficient [11], overlap and harmonic mean, together with the machine translation (MT) evaluation metrics BLEU and METEOR. The scores of these measures were used as feature values to build the models, and the models were used to train machine learning classifiers; Naïve Bayes, SVM and SMO were employed for this purpose. The work reported in [14] made use of the same kinds of features for taking textual entailment decisions between pairs of texts on the datasets released in the RTE-1, RTE-2 and RTE-3 shared tasks on recognizing textual entailment. In the present work we demonstrate that the same features and techniques are also effective for taking paraphrase decisions between two sentences.

2. DATA
The shared task on detecting paraphrases in Indian languages (DPIL) defined two subtasks, namely Task1 and Task2. Training datasets were provided for each of the subtasks in four Indian languages: Hindi, Punjabi, Tamil and Malayalam. The statistics of the training and test datasets are shown in Table 1 and Table 2 respectively.

Table 1. Statistics of the training sets (number of sentence pairs).

                     Task1                    Task2
Language     Total    P      NP       Total    E      NE     RE
Tamil        2500     1000   1500     3500     1000   1500   1000
Malayalam    2500     1000   1500     3500     1000   1500   1000
Hindi        2500     1000   1500     3500     1000   1500   1000
Punjabi      1700     700    1000     2200     700    1000   500

Table 2. Statistics of the test sets (number of sentence pairs).

Language     Task1    Task2
Tamil        900      1400
Malayalam    900      1400
Hindi        900      1400
Punjabi      500      750
3. Features
Features play a pivotal role in machine learning based frameworks; analysis of the features that take part in predicting the target class is therefore crucial. The features used in this study can be broadly divided into two categories: similarity based features and MT evaluation metric based features.

3.1 Similarity Based Features
For the present study, we considered different similarity based features: vector based (cosine similarity, Dice similarity), lexical based (unigram matching with respect to sentence1 and sentence2) and set based (Jaccard, overlap and harmonic), which are discussed below.

3.1.1 Cosine Similarity
Cosine similarity measures the similarity between two vectors of an inner product space as the cosine of the angle between them. The smaller the angle between the two vectors, the more similar they are.

3.1.2 Unigram Matching
Unigram (i.e., word) matches between the two sentences are taken into consideration. Two variations of unigram matching are considered: unigram matching with respect to sentence1 and with respect to sentence2. These are calculated as the number of unigrams matching between the two sentences normalized by the number of unigrams in sentence1 and in sentence2 respectively.

3.1.3 Jaccard
Jaccard similarity is a set based measure which quantifies the number of common elements between two sets. It is defined as

    J(A, B) = |A ∩ B| / |A ∪ B|

where A and B are two sets of elements.

3.1.4 Dice
Dice is a vector based similarity measure whose value lies between 0 and 1. It is calculated by the following equation

    Dice(A, B) = 2 · |A ∩ B| / (|A| + |B|)

where A and B are two sets.

3.1.5 Overlap
Overlap is a set based text similarity metric where a text is represented as a set and the set elements are words. It is similar to Dice, with the minor difference that it assumes a full match between two strings if one is a subset of the other. Its value lies in the range 0 to 1 and it is measured by the following equation.

    Overlap(A, B) = |A ∩ B| / min(|A|, |B|)

where A and B are two sets.

3.1.6 Harmonic
The harmonic measure is calculated by employing the following equation

    Harmonic(A, B) = |A ∩ B| · (|A| + |B|) / (2 · |A| · |B|)

where A and B are two sets.
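A minimal Python sketch of the similarity based features above is given below. Whitespace tokenisation is an assumption (the paper does not specify the preprocessing), and the Dice and harmonic formulas follow the reconstructions given above, so details may differ from the authors' implementation.

```python
# Sketch of the Section 3.1 similarity features for one sentence pair.
import math
from collections import Counter


def cosine(s1, s2):
    """Cosine of the angle between the word-count vectors of two sentences."""
    v1, v2 = Counter(s1.split()), Counter(s2.split())
    dot = sum(v1[w] * v2[w] for w in v1)
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0


def similarity_features(s1, s2):
    """Return the seven lexical features (LF) used in Model1."""
    a, b = set(s1.split()), set(s2.split())
    common = len(a & b)
    return {
        "cosine": cosine(s1, s2),
        "unigram_wrt_s1": common / len(a) if a else 0.0,
        "unigram_wrt_s2": common / len(b) if b else 0.0,
        "jaccard": common / len(a | b) if (a | b) else 0.0,
        "dice": 2 * common / (len(a) + len(b)) if (a or b) else 0.0,
        "overlap": common / min(len(a), len(b)) if (a and b) else 0.0,
        "harmonic": common * (len(a) + len(b)) / (2 * len(a) * len(b)) if (a and b) else 0.0,
    }


if __name__ == "__main__":
    print(similarity_features("the cat sat on the mat", "the cat is on the mat"))
```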
3.2 Machine Translation Evaluation Metrics
Machine translation (MT) evaluation metrics are generally used to measure the closeness between MT translation hypotheses and reference translations: the closer a translation hypothesis is to the reference translation, the better the translation quality. Several MT evaluation metrics are available, such as Word Error Rate (WER) [7], Position independent word Error Rate (PER) [8], BLEU [1], METEOR [2][3], Translation Edit Rate (TER) [4], NIST [6] and General Text Matcher (GTM) [5]. Among these, BLEU and METEOR are perhaps the most popular and widely used. In the present work, we used these two MT evaluation metrics as features to predict the paraphrase relation between a pair of sentences.

3.2.1 BLEU
BiLingual Evaluation Understudy (BLEU) is an algorithm for evaluating the quality of machine translated text. It compares the n-grams of the translation hypothesis with the n-grams of the reference translation(s) and counts the number of n-gram matches; it is essentially a modified n-gram precision measure. To avoid bias towards shorter translation candidates, BLEU uses a brevity penalty that penalizes candidate translations whose length differs significantly from that of the reference translation.

3.2.2 METEOR
Metric for Evaluation of Translation with Explicit ORdering (METEOR) calculates n-gram overlaps between a translation hypothesis and the reference translation(s). If multiple reference translations are available for an MT output, the translation is scored against each reference independently and the best scoring pair is used for evaluation. Given a pair of sentences to be compared, METEOR creates a word alignment between the two sentences. An alignment is a mapping between the sentences such that every word in each sentence maps to at most one word in the other sentence. This alignment is produced incrementally by a sequence of word mapping modules which consider exact matching, stem matching and synonym matching. Based on the number of mapped unigrams found between the two strings (m), the total number of unigrams in the translation (t) and the total number of unigrams in the reference (r), unigram precision is calculated as P = m/t and unigram recall as R = m/r. METEOR then computes an F score as a parameterized harmonic mean of P and R:

    Fmean = (P · R) / (α · P + (1 − α) · R)

where α is a constant. To address word ordering, METEOR calculates a reordering penalty based on how many chunks in the translation hypothesis need to be moved around to reproduce the reference text. Finally, the METEOR score for the alignment between the two sentences is calculated as

    score = (1 − Penalty) · Fmean
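For illustration, a compact sentence-level BLEU sketch follows, assuming uniform weights over 1- to 4-grams, a single reference and no smoothing. The paper does not state which BLEU and METEOR implementations were used, so this is a reference sketch rather than the authors' scorer.

```python
# Sentence-level BLEU: clipped n-gram precision with a brevity penalty.
import math
from collections import Counter


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def sentence_bleu(reference, hypothesis, max_n=4):
    ref, hyp = reference.split(), hypothesis.split()
    precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hyp, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clipped (modified) n-gram precision.
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = max(sum(hyp_counts.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: penalise hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * geo_mean


if __name__ == "__main__":
    print(sentence_bleu("the cat sat on the mat", "the cat is on the mat"))
```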
4. System Description
We extract the sentence pairs (say sentence1 and sentence2) from the XML format training dataset for each language. An example training sentence pair from the Hindi dataset is shown below.

    Sentence1: भारतीय मुस्लिमों की वजह से नहीं पनप सकता आईएस।
    Sentence2: भारत में कभी वर्चस्व कायम नहीं कर सकता आईएस।
    Tag: P

(Roughly: "IS cannot flourish because of Indian Muslims." / "IS can never establish dominance in India.")

In the above example the relation between sentence1 and sentence2 is paraphrase, which is tagged as "P" in the training dataset. We calculate the different similarity scores between the sentence pairs extracted from the training datasets released in the shared task. These scores are used as feature values to build the models for the four languages. Four models were prepared for every language by combining different features. Model1 considers only the lexical features (LF), i.e., cosine similarity, unigram matching with respect to sentence1, unigram matching with respect to sentence2, Jaccard, Dice, overlap and harmonic. Model2 considers LF and BLEU. LF and METEOR were considered for Model3. Model4 considers all the features, i.e., LF, BLEU and METEOR. These models were used to train three machine learning classifiers, namely Naïve Bayes [17], support vector machine (SVM) [16] and the sequential minimal optimization algorithm (SMO) [9]. The freely available Weka toolkit [15] (http://www.cs.waikato.ac.nz/ml/weka/) was used for this purpose; it provides Java implementations of various machine learning algorithms, including the classifiers used in our experiments. We carried out experiments based on 10-fold cross validation on the training sets, so that the system can predict the paraphrase class of an unseen sentence pair based on what it has learned. Finally, the classifier-model combinations producing the best results were applied to the corresponding test sets.
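A rough Python analog of this training setup is sketched below. The original experiments used Weka's Java implementations of Naïve Bayes, SVM and SMO; the scikit-learn classifiers and the random placeholder feature matrix here are illustrative stand-ins only.

```python
# Rough scikit-learn analog of the Weka setup described above.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score


def evaluate_models(X, y):
    """Report 10-fold cross-validation accuracy for the three classifier families."""
    classifiers = {
        "Naive Bayes": GaussianNB(),
        "SVM": SVC(kernel="linear"),
        # SVC is trained with an SMO-style solver (libsvm), loosely mirroring Weka's SMO.
        "SMO-style SVC": SVC(kernel="rbf"),
    }
    for name, clf in classifiers.items():
        scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
        print(f"{name}: mean accuracy = {scores.mean():.4f}")


if __name__ == "__main__":
    # In the real pipeline each row would hold the Model4 (LF + BLEU + METEOR)
    # feature values for one sentence pair, and y its P/NP (Task1) or E/RE/NE (Task2) tag.
    rng = np.random.default_rng(0)
    X = rng.random((200, 9))            # 7 lexical features + BLEU + METEOR
    y = rng.integers(0, 2, size=200)    # binary Task1-style labels
    evaluate_models(X, y)
```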
5. Results and Discussion
We took part in both tasks (Task1 and Task2) defined in the shared task, in all four languages (Hindi, Punjabi, Tamil and Malayalam). The accuracies obtained on the training sets (using 10-fold cross validation) for Task1 and Task2 with the different classifiers and models are shown in Table 3 and Table 4 respectively.

Table 3. Accuracies (%) on the training sets in Task1.

Language     Classifier     LF      LF+B    LF+M    LF+B+M
Hindi        Naïve Bayes    90.04   92.52   90.16   74.97
             SVM            90.36   92.52   90.36   82.85
             SMO            90.84   97.32   90.84   84
Tamil        Naïve Bayes    92.44   94.16   92.2    93.92
             SVM            93.24   94.44   93.24   94.24
             SMO            76.71   97.04   93.36   97.08
Malayalam    Naïve Bayes    83.48   88.36   83.56   87.88
             SVM            84.24   86.88   83.96   86.56
             SMO            84.68   94.2    84.72   94.2
Punjabi      Naïve Bayes    97.64   97.64   97.70   97.58
             SVM            97.76   97.82   97.70   97.7
             SMO            98.17   98.23   98.23   98.29

Table 4. Accuracies (%) on the training sets in Task2.

Language     Classifier     LF      LF+B    LF+M    LF+B+M
Hindi        Naïve Bayes    73.14   75.37   73.45   74.97
             SVM            82.6    82.82   82.71   82.85
             SMO            82.88   83.17   83.37   84
Tamil        Naïve Bayes    71.6    78.82   71.25   78.54
             SVM            76.57   77.77   76.54   77.77
             SMO            76.71   86.4    77.11   86.68
Malayalam    Naïve Bayes    65.02   72.65   65.54   72.22
             SVM            67.31   68.51   67.25   68.48
             SMO            68      77.25   68      77.37
Punjabi      Naïve Bayes    93.81   97.5    94.45   97.31
             SVM            96.22   98.04   97.09   98.04
             SMO            97.18   98.77   97.72   98.77

In Task1 the best accuracy for Hindi, 97.32%, was obtained by SMO with the LF+B model. Tamil achieved its highest accuracy of 97.08% with the LF+B+M model and SMO. Malayalam reached its highest accuracy of 94.2% with both the LF+B and LF+B+M models in SMO, and for Punjabi we obtained the highest accuracy of 98.29% with the LF+B+M model and SMO. Thus SMO provided the best results for all four languages on the training datasets.

In Task2, the SMO classifier with the LF+B+M model produced the best accuracies for all four languages: Hindi (84%), Tamil (86.68%), Malayalam (77.37%) and Punjabi (98.77%). For Punjabi the LF+B model (with SMO) also achieved the same highest accuracy as the LF+B+M model.

The features used in these experiments are independent of each other; Naïve Bayes makes the simplifying assumption that features are independent given the class [17]. SVMs are comparatively new machine learning approaches for solving two-class pattern recognition problems. In the field of NLP, SVMs have been employed for many tasks, including text classification, and are reported to obtain high accuracy without overfitting even with a large number of features [18]. Since SVMs perform better on binary classification problems, our results in Task1 (a binary classification problem) also outperform those obtained in Task2 (a ternary classification problem). SMO is essentially another way of training SVMs; it implements John Platt's sequential minimal optimization algorithm for training a support vector classifier, using heuristics to partition the training problem into smaller subproblems.

The official results of our submissions on the test sets, released by the task organizers, for the two subtasks in the four languages are reported in Table 5.

Table 5. Results on the test sets in Task1 and Task2.

Language     Task     Count    Accuracy    F1 / Macro F1 Measure
Hindi        Task1    900      0.8222      0.74
Hindi        Task2    1400     0.68571     0.6841
Malayalam    Task1    900      0.59        0.16
Malayalam    Task2    1400     0.42214     0.3078
Punjabi      Task1    500      0.942       0.94
Punjabi      Task2    750      0.88666     0.88664
Tamil        Task1    900      0.57555     0.09
Tamil        Task2    1400     0.55071     0.4319

6. Conclusions
This paper presented our submissions to the DPIL shared task organized at FIRE 2016. We took part in both subtasks in all four languages. Different lexical level similarity measures and two machine translation evaluation metrics, namely BLEU and METEOR, were employed as features capturing the similarity between pairs of sentences. Four different models were built by combining these features, and the models were used to train three classifiers, namely Naïve Bayes, SVM and SMO. We carried out experiments using 10-fold cross validation on the training sets. The SMO classifier produced the best results when all the features were combined.

7. Acknowledgements
The research work has received funding from the project "Development of Tree Bank in Indian Languages" funded by the Department of Electronics and Information Technology (DeitY), Ministry of Communication and Information Technology, Government of India.

8. REFERENCES
[1] Papineni, K., Roukos, S., Ward, T., and Zhu, W.J. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, PA, pp. 311–318.
[2] Banerjee, S. and Lavie, A. 2005. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Ann Arbor, Michigan, pp. 65–72.
[3] Lavie, A. and Agarwal, A. 2007. METEOR: An Automatic Metric for MT Evaluation with High Levels of Correlation with Human Judgments. In Proceedings of the Second ACL Workshop on Statistical Machine Translation, Prague, Czech Republic, pp. 228–231.
[4] Snover, M., Dorr, B., Schwartz, R., Micciulla, L., and Makhoul, J. 2006. A Study of Translation Edit Rate with Targeted Human Annotation. In Proceedings of the Association for Machine Translation in the Americas, Cambridge, Massachusetts, USA, pp. 223–231.
[5] Turian, J.P., Shen, L., and Melamed, I.D. 2003. Evaluation of Machine Translation and Its Evaluation. In Proceedings of MT Summit, New Orleans, Louisiana, pp. 386–393.
[6] Doddington, G. 2002. Automatic Evaluation of Machine Translation Quality Using N-gram Co-occurrence Statistics. In Proceedings of the Second International Conference on Human Language Technology Research, Morgan Kaufmann Publishers Inc., pp. 138–145.
[7] Vidal, E. 1997. Finite-State Speech-to-Speech Translation. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, Munich, Germany, pp. 111–114.
[8] Tillmann, C., Vogel, S., Ney, H., Sawaf, H., and Zubiaga, A. 1997. Accelerated DP Based Search for Statistical Translation. In Proceedings of the 5th European Conference on Speech Communication and Technology, Rhodes, Greece, pp. 2667–2670.
[9] Platt, J.C. 1999. Fast Training of Support Vector Machines Using Sequential Minimal Optimization. In Bernhard Schölkopf, Christopher J.C. Burges and Alexander J. Smola (eds.), Advances in Kernel Methods: Support Vector Learning. The MIT Press, Cambridge, MA, pp. 185–208.
[10] Navigli, R. and Ponzetto, S.P. 2012. BabelNet: The Automatic Construction, Evaluation and Application of a Wide-Coverage Multilingual Semantic Network. Artificial Intelligence 193, pp. 217–250.
[11] Dice, L. 1945. Measures of the Amount of Ecologic Association Between Species. Ecology, vol. 26, no. 3, pp. 297–302.
[12] Jaccard, P. 1901. Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bulletin de la Société Vaudoise des Sciences Naturelles 37, pp. 547–579.
[13] Fellbaum, C. (ed.). 1998. WordNet: An Electronic Lexical Database. Cambridge, MA: MIT Press.
[14] Saikh, T., Naskar, S.K., Giri, C., and Bandyopadhyay, S. 2015. Textual Entailment Using Different Similarity Metrics. In Computational Linguistics and Intelligent Text Processing, pp. 491–501.
[15] Witten, I.H. and Frank, E. 2005. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, San Francisco, 2nd edition.
[16] Vapnik, V.N. 1995. The Nature of Statistical Learning Theory. Springer.
[17] Rish, I. 2001. An Empirical Study of the Naive Bayes Classifier. In IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence, vol. 22, pp. 41–46.
[18] Ekbal, A. and Bandyopadhyay, S. 2008. Bengali Named Entity Recognition Using Support Vector Machine. In International Joint Conference on Natural Language Processing (IJCNLP), pp. 51–58.
[19] Anand Kumar, M., Singh, S., Kavirajan, B., and Soman, K.P. 2016. DPIL@FIRE2016: Overview of the Shared Task on Detecting Paraphrases in Indian Languages. Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, CEUR Workshop Proceedings, CEUR-WS.org.