 Anuj@DPIL-FIRE2016: A Novel Paraphrase Detection Method
        in Hindi Language using Machine Learning
                                                            Anuj Saini
                                                      Sapient Global Markets
                                                        Gurgaon, Haryana
                                                      asaini13@sapient.com

ABSTRACT

Every language admits several plausible interpretations. With the evolution of the web, smart devices and social media, it has become a challenging task to identify these syntactic and semantic ambiguities. In Natural Language Processing, two statements written with different words but carrying the same meaning are termed paraphrases. At FIRE 2016, we worked on the problem of detecting paraphrases for the shared task DPIL (Detecting Paraphrases in Indian Languages), specifically for the Hindi language. This paper proposes a novel approach to identify whether two statements are paraphrased or not, using several machine learning algorithms, namely Random Forest, Support Vector Machine, Gradient Boosting and Gaussian Naïve Bayes, on the given training data sets of the two subtasks. In cross-validation experiments, Random Forest leads the other methods in terms of F1-score. The experimental results show that the ensemble learning method gives better performance than the individual approaches for this classification problem. Paraphrase detection can be used in various applications such as question-answering systems, document clustering, machine translation, text summarization, plagiarism detection and many more.

CCS Concepts
• Theory of computation → Random Forest
• Computing methodologies → Natural Language Processing

Keywords
Paraphrase detection; Machine Learning; Natural Language Processing; Soundex; Semantic Similarity; Random Forest
1. INTRODUCTION

With the plethora of information generated by the web these days, it is challenging to understand the semantics of different languages, as each language has its own script and linguistic rules. The Hindi language is still at an early stage with respect to natural language processing and its applications [1]. A significant amount of research work has already been done for the English language, but there is a huge scope for Hindi. Paraphrases are sentences or phrases which convey the same meaning using different words [2]. Paraphrase detection is an important building block in the Natural Language Processing (NLP) pipeline [8]. Previously, many researchers have investigated ways of automatically detecting paraphrases in formal texts [3]. Various state-of-the-art paraphrase identification techniques have been summarized on the ACL wiki [4]. Our work is motivated by the shared task DPIL [11] organized at the Forum for Information Retrieval Evaluation (FIRE 2016), which defined two subtasks for the classification problem.

Subtask 1 is to classify two Hindi sentences into two classes: P (Paraphrased) or NP (Non-Paraphrased). Subtask 2 is to classify them into three classes: P (Paraphrased), SP (Semi-Paraphrased) or NP (Non-Paraphrased). For detecting paraphrases, it is very important to understand the language at an early stage: tokens, stems, stop words, phonetics, POS tags etc. need to be identified before two texts can be compared. Since machine learning algorithms work on numeric data, it is also important to convert the textual data into corresponding numbers, known as text vectors; here a vector is the numerical representation of the comparison of two sentences. We have generated a set of vectors for each data point using the first two steps of our proposed approach and trained the model in the third step to obtain the results.

The paper is organized as follows. Section 2 gives a description of the training dataset provided by the organizing committee. Section 3 presents the proposed approach for paraphrase detection. The experimental evaluation is carried out in Section 4. Section 5 concludes our research work, followed by the acknowledgment and references in Sections 6 and 7 respectively.

2. DATA SET DESCRIPTION

There were a total of 2500 data points in the training set with two classes, P and NP, to be used for classification in Subtask 1, whereas Subtask 2 had three classes and a total of 3500 data points. The distribution of classes and their counts for both tasks is given in Table 1.

        Table 1. Classes and their counts for Subtask 1 and Subtask 2

Class        Subtask 1    Subtask 2
P            1500         1500
NP           1000         1000
SP           -            1000

Each data point contains a pID (a unique identifier for the data point), two Hindi sentences and their final tagged class. An initial analysis of the data surfaced some noise, mainly in the NP class. A few examples of false negatives and false positives have been identified and are listed in Table 2, where the examples for class NP and class P denote false negatives and false positives respectively.
        Table 2. False negatives and false positives in the dataset

pID | Sentence 1 | Sentence 2 | Class
HIN2802 | रविशंकर ने कहा मुझे लगता है कि आईएसआईएस कोई पीस टॉक नहीं चाहता | मुझे लगता है कि आईएसआईएस कोई पीस टॉक नहीं चाहता: रविशंकर | NP
HIN2852 | पाक के फॉरेन सेक्रेटरी एजाज अहमद चौधरी मंगलवार को हार्ट ऑफ एशिया के ऑफिशियल्स की बैठक में भाग लेने के लिए भारत पहुंच रहे हैं | पाकिस्तान के विदेश सचिव एजाज अहमद चौधरी हैदराबाद हाउस में हार्ट ऑफ एशिया सम्मेलन में भाग लेंगे | NP
HIN2958 | इस किताब में कांग्रेस उपाध्यक्ष राहुल गांधी को करिश्माई नेता भी बताया गया है | इस किताब में कांग्रेस उपाध्यक्ष राहुल गांधी को करिश्माई नेता भी बताया गया और इसको लेकर लोकसभा में बुधवार को सत्तापक्ष और विपक्ष के बीच काफी नोकझोंक हुई | NP
HIN3032 | बॉलीवुड अभिनेत्री बिपाशा बसु ने शनिवार को एक निजी समारोह में करण सिंह ग्रोवर के साथ शादी रचाई | बॉलीवुड एक्ट्रेस बिपाशा बसु और करण सिंह ग्रोवर ने शनिवार को मुंबई में की शादी | NP
HIN0230 | मुद्राकोष में बढ़ेगा भारत का रुतबा | आईएमएफ के दस बड़े सदस्य देशों में भारत भी शामिल | P

We have accepted this noise present in the given training data without filtering out such misclassified records and have trained the model on all the data points.
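For illustration, a data point can be read into a simple record structure. The sketch below assumes the sentence pairs have been exported to a UTF-8 tab-separated file with columns pID, sentence1, sentence2 and class; this file layout is an assumption made for the example, not the official distribution format of the shared task.

    import csv

    def load_pairs(path):
        # Assumed TSV export with columns: pID, sentence1, sentence2, class
        pairs = []
        with open(path, encoding='utf-8') as handle:
            for pid, sent1, sent2, label in csv.reader(handle, delimiter='\t'):
                pairs.append({'pID': pid, 'sentence1': sent1,
                              'sentence2': sent2, 'class': label})
        return pairs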
3. THE PROPOSED METHOD

The proposed approach includes three major steps: text preprocessing, feature generation and classification modelling. Text preprocessing is done in several steps such as tokenization, stemming, soundex, stop word removal and synonym handling. Feature generation involves the creation of five new features which are used as input to the classifier for the classification of paraphrases. Classification modelling covers training the model using the four machine learning algorithms. All the steps are described below.

3.1 Text Preprocessing

For each data point, the preprocessing steps are as follows:

3.1.1 Text Encoding
We have encoded the sentences using standard UTF-8 encoding, which handles the script of the Hindi language.

3.1.2 Tokenization
We have tokenized the sentences into words using the NLTK library.

3.1.3 Phonetics Transformation
We have applied a custom set of rules for phonetics. The normalization of phonetics has been done using a Soundex-style algorithm [9]: it looks for specific characters and replaces them with their corresponding phonetic equivalents, for example न् → न or ज़ → ज.

3.1.4 Token Stemming
We have applied a further set of rules for stemming tokens into their basic form. A set of Hindi suffixes is removed to obtain the normalized Hindi word, for example:
[ा, ि, ी, ु, ू, े, ो], [कर, ाओ, िए, ाई, ाए, ने, नी, ना, ते, ीं, ती, ता, ां, ों, ें], [ाकर, ाइए, ाईं, ाया, ेगी, ेगा, ोगी, ोगे, ाने, ाना, ाते, ाती, ाता, तीं, ाओं, ाएं, ुओं, ुएं, ुआं]

3.1.5 Stop Words Filtering
We have removed irrelevant words from the sentences using a standard list of 164 Hindi stop words, for example [सारा, से, सो, संग, ही, हुआ, हुई, हुए].

3.1.6 Synonyms Expansion
We have used Hindi WordNet, an extensive lexical dictionary of the Hindi language with around 40K synsets, to fetch synonyms for Hindi words. It was developed by researchers at the Center for Indian Language Technology, Computer Science and Engineering Department, IIT Bombay, and we have downloaded it from the link given in [5].

All the text preprocessing steps are summarized with examples in Table 3.

                  Table 3. Text preprocessing steps

Preprocessing        Input                                            Preprocessed
Tokenization         कपिल शर्मा फोर्ब्स                                  [कपिल, शर्मा, फोर्ब्स]
Soundex              हजअर लोट                                          हज़अर लौट
Stemming             अनियमितताओं, दिल्ली                                अनियमित, दिल्ली
Stop words removal   काफी निराश था और ड्रेसिंग रूम में लोटते हुए रोने लगा   काफी निराश ड्रेसिंग रूम लोटते रोने लगा
Synonyms             इंडिया आखिर                                       भारत फाइनल
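The preprocessing steps above can be strung together roughly as follows. The rule sets in this sketch (phonetic map, suffix list, stop words, synonym map) are abbreviated placeholders rather than the full lists used in our system, and NLTK's word_tokenize is assumed to be available with its punkt models installed.

    from nltk.tokenize import word_tokenize

    # Abbreviated, illustrative rule sets; the production lists are larger (Sections 3.1.3-3.1.6).
    PHONETIC_RULES = {'ज़': 'ज', 'क़': 'क'}                      # Soundex-style character map
    SUFFIXES = ['ाकर', 'ाने', 'ता', 'ती', 'ने', 'ा', 'ी', 'े']    # longest suffixes tried first
    STOP_WORDS = {'से', 'ही', 'हुआ', 'हुई', 'हुए'}
    SYNONYMS = {'इंडिया': 'भारत'}                               # built from Hindi WordNet in our system

    def preprocess(sentence):
        tokens = word_tokenize(sentence)                  # 3.1.2 tokenization
        cleaned = []
        for tok in tokens:
            for src, dst in PHONETIC_RULES.items():       # 3.1.3 phonetic normalization
                tok = tok.replace(src, dst)
            for suffix in SUFFIXES:                       # 3.1.4 suffix stripping
                if tok.endswith(suffix) and len(tok) > len(suffix) + 1:
                    tok = tok[:-len(suffix)]
                    break
            if tok in STOP_WORDS:                         # 3.1.5 stop-word filtering
                continue
            cleaned.append(SYNONYMS.get(tok, tok))        # 3.1.6 synonym normalization
        return cleaned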
3.2 Feature Generation

After the preprocessing of the sentences, the following features are generated in the form of vectors to be passed as input to the classifier.

3.2.1 Common Tokens
The number of tokens common to the two sentences is used as a feature. These tokens are obtained by comparing the preprocessed tokens after stop word removal and taking their intersection:

          Tokens(sentence1) ∩ Tokens(sentence2)

3.2.2 Normalized Common Tokens
We normalize the common tokens from the first feature to compute the proportion of tokens shared by the two sentences. The value lies in the range 0 to 1: it is 0 when there are no common tokens between the two sentences and 1 when all tokens of the two sentences are exactly the same. Mathematically, it is calculated by dividing the number of common tokens by the number of unique tokens in both sentences:

          Common tokens / Unique tokens(sent1, sent2)
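These two features reduce to simple set operations over the preprocessed token lists, as in the following minimal sketch.

    def common_tokens(tokens1, tokens2):
        # Feature 1: tokens shared by the two preprocessed sentences
        return set(tokens1) & set(tokens2)

    def normalized_common_tokens(tokens1, tokens2):
        # Feature 2: shared tokens divided by the unique tokens of both sentences (0 to 1)
        union = set(tokens1) | set(tokens2)
        return len(common_tokens(tokens1, tokens2)) / len(union) if union else 0.0

    # e.g. normalized_common_tokens(['भारत', 'फाइनल'], ['भारत', 'जीत']) returns 1/3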

3.2.3 Common IDF Score
The sum of the IDF scores of the tokens common to the two sentences is used as a numeric similarity vector:

          Σ IDF score(common tokens)

IDF (Inverse Document Frequency) is the inverse of the document frequency and is used to identify the importance of a token in a given corpus. In its standard form it is computed as

          IDF(t) = log(N / df(t)),

where N is the total number of documents and df(t) is the number of documents containing the token t. Here a document is an individual sentence. IDF is first calculated for all tokens and kept for reference; then, during the feature generation process, the IDF of the common tokens is calculated and used as a feature.

3.2.4 Normalized Common IDF Score
Here the proportion of the IDF score of the common tokens between the two sentences is computed for normalization and used as a vector for the model. It gives a normalized common IDF score between 0 and 1 for the common tokens:

          IDF of common tokens / Total IDF of unique tokens(sent1, sent2)

3.2.5 Sentence Lengths
This feature denotes the number of tokens in sentence 1 and sentence 2, kept as two separate columns.
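A minimal sketch of the IDF-based features (Sections 3.2.3 and 3.2.4) and the length features (Section 3.2.5), treating every training sentence as one document, is given below.

    import math
    from collections import Counter

    def build_idf(corpus_tokens):
        # corpus_tokens: list of token lists, one per sentence (document)
        doc_freq = Counter()
        for tokens in corpus_tokens:
            doc_freq.update(set(tokens))
        n_docs = len(corpus_tokens)
        return {tok: math.log(n_docs / freq) for tok, freq in doc_freq.items()}

    def idf_features(tokens1, tokens2, idf):
        common = set(tokens1) & set(tokens2)
        union = set(tokens1) | set(tokens2)
        common_idf = sum(idf.get(t, 0.0) for t in common)            # 3.2.3 common IDF score
        total_idf = sum(idf.get(t, 0.0) for t in union)
        norm_idf = common_idf / total_idf if total_idf else 0.0      # 3.2.4 normalized common IDF
        return [common_idf, norm_idf, len(tokens1), len(tokens2)]    # plus 3.2.5 sentence lengths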

3.3 Classification Model

The features generated in Section 3.2 are used as training data to train the classifiers with the Python Scikit-learn library. We have used four popular machine learning algorithms, namely Random Forest, Support Vector Machine, Gradient Boosting and Gaussian Naïve Bayes, and compared their performance. Random Forest, being an ensemble learning method, outperforms the other individual methods; it is implemented by growing many classification trees and having them "vote" for a final decision according to a majority rule [6]. We have therefore focused on tuning its hyper-parameters to enhance the predictive ability of the model. The key parameters are as follows:

3.3.1 n_estimators
The number of trees to build before taking the vote for the final decision. The more trees, the better the performance, but this also increases the time complexity.

3.3.2 max_depth
The maximum depth of each tree, which needs to be tuned.

3.3.3 min_samples_leaf
The minimum number of samples (observations) required in a terminal node of a tree.

3.3.4 min_samples_split
The minimum number of samples (observations) needed in a node for it to be considered for splitting [7].

We have selected the best set of hyper-parameters for Random Forest using the grid search utility of Scikit-learn, which resulted in the values n_estimators = 500, max_depth = 10, min_samples_leaf = 4 and min_samples_split = 4, and trained on our training data. The overall training time for the model is less than 1 second on a quad-core machine with 8 GB of RAM.
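The tuning step can be sketched as follows, assuming Scikit-learn's RandomForestClassifier and GridSearchCV (sklearn.model_selection in recent versions); the candidate grids and the placeholder data are illustrative only.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    # Placeholder feature matrix and labels; in practice X holds the Section 3.2
    # feature columns for every sentence pair and y holds the P/NP (or P/SP/NP) tags.
    X = np.random.rand(200, 6)
    y = np.random.choice(['P', 'NP'], size=200)

    param_grid = {
        'n_estimators': [100, 300, 500],       # illustrative candidate values
        'max_depth': [5, 10, 20],
        'min_samples_leaf': [2, 4, 8],
        'min_samples_split': [2, 4, 8],
    }
    search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=10)
    search.fit(X, y)
    clf = search.best_estimator_   # our run settled on n_estimators=500, max_depth=10,
                                   # min_samples_leaf=4, min_samples_split=4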
4. EXPERIMENTAL RESULTS

We have used 10-fold cross-validation to compute the overall accuracy of the system. In this work three evaluation metrics have been considered: precision, recall and F1-score. We have calculated the values of these metrics for each of the four algorithms.
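A minimal sketch of this evaluation protocol on placeholder data, using Scikit-learn's cross-validation utilities, is given below.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_predict
    from sklearn.metrics import classification_report

    X = np.random.rand(200, 6)                      # placeholder Section 3.2 features
    y = np.random.choice(['P', 'NP'], size=200)     # placeholder labels
    clf = RandomForestClassifier(n_estimators=500, max_depth=10,
                                 min_samples_leaf=4, min_samples_split=4,
                                 random_state=0)
    predicted = cross_val_predict(clf, X, y, cv=10)       # 10-fold cross-validated predictions
    print(classification_report(y, predicted, digits=2))  # per-class precision, recall, F1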
For Subtask 1 we obtained an overall accuracy of 0.92, with a maximum F1-score of 0.94 for the Random Forest algorithm. The detailed per-class performance of the models for the two classes is given in Table 4.

                  Table 4. Subtask 1 scores summary

Algorithm              Class        Precision    Recall    F1-Score
Random Forest          P            0.94         0.93      0.94
                       NP           0.90         0.91      0.90
                       Avg/Total    0.92         0.92      0.92
Gradient Boosting      NP           0.93         0.91      0.92
                       P            0.87         0.89      0.88
                       Avg/Total    0.90         0.90      0.90
SVM                    NP           0.92         0.84      0.88
                       P            0.79         0.89      0.84
                       Avg/Total    0.87         0.86      0.86
Gaussian Naïve Bayes   NP           0.92         0.94      0.93
                       P            0.91         0.88      0.90
                       Avg/Total    0.92         0.92      0.92

Subtask 2 posed the same problem with three classes to predict. We have used a similar approach and the same feature set for training the model. With three classes and a larger training set of 3500 data points, we obtained an overall accuracy of 0.85 with 10-fold cross-validation and an F1-score of 0.91, which is again the maximum for the Random Forest algorithm. A detailed summary of the performance for Subtask 2 is given in Table 5.

                  Table 5. Subtask 2 scores summary

Algorithm              Class        Precision    Recall    F1-Score
Random Forest          NP           0.90         0.91      0.91
                       P            0.81         0.80      0.80
                       SP           0.83         0.82      0.82
                       Avg/Total    0.85         0.85      0.85
Gradient Boosting      NP           0.89         0.90      0.89
                       P            0.79         0.80      0.79
                       SP           0.84         0.81      0.83
                       Avg/Total    0.85         0.85      0.85
SVM                    NP           0.89         0.82      0.86
                       P            0.74         0.67      0.70
                       SP           0.68         0.82      0.74
                       Avg/Total    0.79         0.78      0.78
Gaussian Naïve Bayes   NP           0.87         0.93      0.90
                       P            0.68         0.73      0.71
                       SP           0.76         0.62      0.68
                       Avg/Total    0.78         0.79      0.78

The following figures give a summarized view of the performance of the various machine learning algorithms for both subtasks.

Figure 1: Subtask 1 Results Comparison (precision, recall and F1-score of the four algorithms).

Figure 2: Subtask 2 Results Comparison (precision, recall and F1-score of the four algorithms).
5. CONCLUSION

In this paper we have proposed a novel approach for the detection of Hindi paraphrases, which is a very important building block of semantic text analysis. Question-answering systems, document clustering, knowledge extraction, plagiarism detection, ontology building etc. are potential applications of paraphrase identification in NLP [10]. After comparing the four machine learning algorithms used in our model, Random Forest gives the best results, with an F1-score of 0.94 for Subtask 1 and 0.91 for Subtask 2, which could be further improved by using more robust phonetics and synonym replacements. One limitation of our work is that we have not removed outliers from the training data; removing them could slightly improve the system performance. In future work we will include part-of-speech tagging in the feature generation, as it plays an important role in paraphrase detection: nouns and verbs are key elements of paraphrasing.

6. ACKNOWLEDGMENTS

We would like to thank the organizers for conducting this shared task and for building the training data. We would also like to thank Sapient Corporation for giving us the opportunity to work in and explore the world of text analytics.

7. REFERENCES

[1] Sethi, N., Agrawal, P., Madaan, V., and Singh, S.K. July 2016. A Novel Approach to Paraphrase Hindi Sentences using Natural Language Processing. Indian Journal of Science and Technology, Vol 9(28). DOI: 10.17485/ijst/2016/v9i28/98374.

[2] Kumar, N. 2014. A Graph Based Automatic Plagiarism Detection Technique to Handle Artificial Word Reordering and Paraphrasing. In A. Gelbukh (Ed.): CICLing 2014, Part II, LNCS 8404, pp. 481-494, Springer-Verlag Berlin Heidelberg.

[3] Xu, W., Callison-Burch, C., and Dolan, W. B. 2015. SemEval-2015 Task 1: Paraphrase and Semantic Similarity in Twitter (PIT). In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 1-11, Denver, Colorado, June 4-5. Association for Computational Linguistics.

[4] ACL Wiki: Paraphrase Identification (State of the art). https://www.aclweb.org/aclwiki/index.php?title=Paraphrase_Identification_(State_of_the_art)

[5] Hindi WordNet download page, CFILT, IIT Bombay. http://www.cfilt.iitb.ac.in/wordnet/webhwn/downloaderInfo.php

[6] Zhang, W., Zeng, F., Wu, X., Zhang, X., and Jiang, R. 2009. A comparative study of ensemble learning approaches in the classification of breast cancer metastasis. In International Joint Conference on Bioinformatics, Systems Biology and Intelligent Computing.

[7] Banfield, R.E., Hall, L.O., Bowyer, K.W., and Kegelmeyer, W.P. January 2007. A Comparison of Decision Tree Ensemble Creation Techniques. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 29, No. 1.

[8] Shanmuga Sundaram, M., Anand Kumar, M., and Soman, K.P. 2015. AMRITA CEN@SemEval-2015: Paraphrase Detection for Twitter using Unsupervised Feature Learning with Recursive Autoencoders. SemEval-2015 (2015): 45.

[9] Mahalakshmi, S., Anand Kumar, M., and Soman, K.P. 2015. Paraphrase detection for Tamil language using deep learning algorithm. International Journal of Applied Engineering Research, 10(17), pp. 13929-13934.

[10] Socher, R., Huang, E. H., Pennington, J., Manning, C.D., and Ng, A. Y. 2011. Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In Advances in Neural Information Processing Systems (pp. 801-809).

[11] Anand Kumar, M., Singh, S., Kavirajan, B., and Soman, K. P. 2016. DPIL@FIRE2016: Overview of the shared task on Detecting Paraphrases in Indian Languages. In Working notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, CEUR Workshop Proceedings, CEUR-WS.org.