KEC@DPIL-FIRE2016: Detection of Paraphrases on Indian Languages

R. Thangarajan (Professor), S. V. Kogilavani (Assistant Professor (SrG)), A. Karthic (UG Student), S. Jawahar (UG Student)
Department of CSE, Kongu Engineering College, Erode, Tamilnadu
+919443014942, +919486153223, +919443840800, +918973744171
rt.cse@kongu.edu, kogilavani.sv@gmail.com, karthic1011@gmail.com, jawahar273@gmail.com

ABSTRACT
This paper presents a report on Detecting Paraphrases in Indian Languages (DPIL), in particular the Tamil language, by the team NLP@KEC of Kongu Engineering College. Automatic paraphrase detection is an intellectual task with immense applications such as plagiarism detection and new event detection. A paraphrase is the expression of a given fact in more than one way by means of different phrases. Paraphrase identification is a classic natural language processing task of the classification type. Though there are several algorithms for paraphrase identification, reflecting the semantic relations between the constituent parts of a sentence plays a very important role. In this paper we utilize sixteen different features to best represent the similarity between sentences. The proposed approach applies machine learning algorithms, Support Vector Machine and Maximum Entropy, to classify a given sentence pair into Paraphrase and Not-a-Paraphrase for task1, and into Paraphrase, Not-a-Paraphrase and Semi-Paraphrase for task2. The performance of these methods is measured with evaluation parameters such as accuracy, precision, recall, f-measure and macro f-measure. Our methodology obtained 2nd place in the DPIL evaluation track.

Keywords
Natural Language Processing; Paraphrase Identification; Machine Learning Approach; Support Vector Machine; Maximum Entropy; Shallow Parser.

1. INTRODUCTION
Variability of semantic expression is a fundamental phenomenon of natural language wherein the same meaning can be expressed by, or inferred from, different texts. Paraphrases are alternative ways to express or convey the same information. One can express a single event in many different ways in natural language sentences, depending on the command the writer or speaker has of the language in consideration. Identifying paraphrases is an important task that is used in information retrieval, question answering, text summarization and plagiarism detection. This work mainly focuses on the detection of paraphrases in the Tamil language. For example, the sentences in Table 1 express the same meaning; therefore, they are paraphrases.

Table 1 Sample Sentence Pair in the Tamil Language
கேரளமாநில்திு்ூி்ூட்மாணி்ே் கோயி்திுவிழாுவ்ேியு.
ூட்மாணி்ே்கோயி்திுவிழாகோலாேலமாேுவ்ேியு.

Our proposed system applies two supervised machine learning approaches, Support Vector Machine (SVM) and Maximum Entropy (ME), and learns classifiers based on sixteen features, such as lexical and POS-tagging features, in order to detect paraphrase sentences. The structure of the report is as follows: Section 2 describes the literature review, Section 3 gives the task description, Section 4 presents the overview of the proposed system, Section 5 presents the performance evaluation results and Section 6 concludes the work.

2. LITERATURE REVIEW
In this section, recent research carried out in the field of paraphrase identification in general, and paraphrase identification in the Tamil language in particular, is discussed. Many researchers on paraphrase identification make use of existing Natural Language Processing (NLP) tools to identify paraphrases. [2] exploits the NLP tools of a QA system to identify paraphrases.
[3-6] have employed lexical semantic similarity information based on resources such as WordNet [7]. Tamil paraphrase detection has also been attempted with a deep learning method [18].

A properly written paraphrase expresses the ideas of a fact or event in words and sentence structure inherent to the writer or speaker. It is similar to summarization, but a key difference is that a paraphrase includes both key points and sub-points; because it includes detailed information, it can sometimes be as long as the original source. A paraphrase typically explains or clarifies the text that is being paraphrased. This greatly adds to the difficulty of detecting paraphrases. Paraphrase detection is the task of determining whether two or more sentences represent the same meaning or not [1]. Paraphrase detection systems improve the performance of paraphrase generation by choosing the best sentence from a list of candidate paraphrases; plagiarism detection is another task which needs paraphrase identification. The ability to identify paraphrases, in which one sentence expresses the same meaning as another but with different words, has proven useful for a wide variety of natural language processing applications. Among the different approaches, supervised methods [8-9] are the most promising.

Most existing systems use thresholds to determine whether two sentences are similar and represent the same meaning, but the choice of an exact threshold depends on the training data. Machine Learning (ML) techniques have been applied to overcome the problems of setting such thresholds; the benefit of the ML approach lies in exploiting the morphological, syntactic and semantic features of a sentence. ME: the training data are used to set constraints on a conditional distribution [14].
Each constraint expresses a characteristic of the training data; these constraints are then applied to the test data. The results obtained from this analysis are compared using different performance evaluation measures.

In general, supervised and unsupervised machine learning techniques are both useful in paraphrase detection. In supervised learning, the dataset is labeled and a model is trained to obtain a reasonable output which helps in proper decision making. Unlike supervised learning, unsupervised learning does not need any labeled data; however, such data cannot be processed as easily. This report presents the impact of two supervised learning methods on the given dataset.

3. TASK DESCRIPTION
One of the most commonly used corpora for paraphrase detection is the Microsoft Research Paraphrase (MSRP) corpus [10], which contains 5,801 English sentence pairs from news articles, manually labeled with 67% paraphrases and 33% non-paraphrases. Since there are no annotated corpora or automated semantic interpretation systems available for Indian languages to date, creating benchmark data for paraphrases and utilizing that data in open shared-task competitions will motivate the research community towards further research in Indian languages [11]. We participated in the DPIL task, which focuses on sentence-level paraphrase identification for Indian languages (Tamil, Malayalam, Hindi and Punjabi). In this context, the task is divided into two subtasks.

3.1 Sub Task 1
Given a pair of sentences, the system is required to assess whether the two sentences carry the same meaning or not, and to classify them as Paraphrase (P) or Not Paraphrase (NP).
3.2 Sub Task 2
Given two sentences from the newspaper domain, the task is to identify whether they are completely equivalent (P), roughly equivalent (SP) or not equivalent (NP). This task is similar to subtask 1, but the main difference is the 3-point scale of paraphrase tags.

The training data set given to us is the News Corpus, which contains 2,500 Tamil sentence pairs for subtask1 and 3,500 Tamil sentence pairs for subtask2. The test set consisted of 900 Tamil sentence pairs for subtask1 and 1,400 Tamil sentence pairs for subtask2. Both the training and test corpora are available at the link specified in [12].

4. OVERVIEW OF PROPOSED SYSTEM
The subtask1 and subtask2 datasets are processed using a Tamil shallow parser. The processed results are in text format; for classification of sentences using the machine learning algorithms, the text values are converted into a numerical matrix, which is then given as input to SVM and ME for further classification.

Figure 1: Overview of the proposed system (subtask1/subtask2 datasets, shallow parsing, features file construction, classification using SVM and Maximum Entropy, results for subtask1 and subtask2)

4.1 SHALLOW PARSING
Shallow parsing, also called chunking or light parsing, is an analysis of a sentence which first identifies basic parts of sentences such as nouns, verbs and adjectives, and then links them to higher-order units that have discrete grammatical meanings. While the most elementary chunking algorithms simply link constituent parts on the basis of elementary search patterns, approaches that use ML techniques can take contextual information into account and thus compose chunks in such a way that they better reflect the semantic relations between the basic constituents. That is, these more advanced methods get around the problem that combinations of elementary constituents can have different higher-level meanings depending on the context of the sentence. The proposed system utilizes the Tamil shallow parser developed by IIIT Hyderabad [15]. For example, the following Tamil sentence
செய்கே்கோ்ேகள ஏவி இ்ு இ்கரா ுதிய ொதகை பகட்ு்ளு.

would be chunked as shown in Table 2.

Table 2 Chunking Result of the Sample Sentence
1.1 செய்கே் NN
1.2 கோ்ேகள NN
))
4.1 ுதிய JJ
))
5.2 . SYM

4.2 FEATURES FILE CONSTRUCTION
From the output of the shallow parsing process, a features file is constructed for both the training and test datasets of subtask1 and subtask2. The sixteen features are counts derived from the parser output, covering POS tags (NNP, NN, JJ, RB, VM), case markers (nom, acc, dat, gen, loc, soc, ben), tense (past, present, future) and a count feature. A sample features file is presented in Table 3.

Table 3 Sample Features File (Id, sixteen feature counts, paraphrase class)
TAM0001  0 10 2 0 2 0 2 3 0 0 2 2 11 0 6 6  P
TAM0002  0  8 1 0 2 0 0 1 0 0 1 1  9 0 2 4  P
TAM0003  0 12 1 2 3 0 0 0 0 5 0 0 19 0 5 6  P
TAM0004  0 15 2 0 1 0 3 2 0 3 2 0 20 0 3 4  P
TAM0005  0 11 0 1 0 0 0 0 0 0 1 3 16 0 1 7  P
TAM0006  0 13 3 2 1 0 0 0 0 1 0 0 17 0 7 3  P
TAM0007  0  6 0 0 1 0 1 3 1 0 0 0 14 0 2 2  P
TAM0008  0  5 1 1 1 0 0 3 0 2 1 0 12 0 2 7  P
TAM0009  0  8 0 1 0 0 1 0 0 3 1 1 12 0 2 4  P
TAM0010  0  7 0 0 2 1 0 4 0 0 0 1 15 0 3 7  P

4.3 TEXT CLASSIFICATION USING MACHINE LEARNING ALGORITHMS
When supervised machine learning algorithms are considered for classification, the input dataset is desired to be a labeled one. SVM: the given data are analyzed and decision boundaries are defined using hyperplanes; in the two-category case, the hyperplane separates the document vectors of one class from those of the other, and the separation is maintained to be as large as possible [13]. In multi-class sentence classification (subtask2), each sentence pair is classified with a label C, where C = {P, NP, SP}: P specifies that the sentence pair is a paraphrase, NP denotes that it is not a paraphrase, and SP specifies that it is a semi-paraphrase. In this study, two different supervised learning techniques, SVM and ME, are applied for classification.
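As a concrete illustration of this setup, the sketch below trains both classifiers on a few rows of sixteen feature counts in the Table 3 layout. It uses scikit-learn stand-ins, which are assumptions not named in the original work: SVC for the SVM, and LogisticRegression, which is a maximum-entropy classifier. The NP feature rows are invented for illustration, since Table 3 shows only P examples.

```python
# Hypothetical sketch: SVM and maximum-entropy (logistic regression)
# classifiers over 16-dimensional count features (Table 3 layout).
# scikit-learn is an assumed stand-in for the tools used in the paper.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Each row: sixteen POS/morphological counts for one sentence pair.
X_train = np.array([
    [0, 10, 2, 0, 2, 0, 2, 3, 0, 0, 2, 2, 11, 0, 6, 6],  # from Table 3
    [0,  8, 1, 0, 2, 0, 0, 1, 0, 0, 1, 1,  9, 0, 2, 4],  # from Table 3
    [0,  3, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,  2, 0, 1, 1],  # invented NP row
    [0,  2, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,  1, 0, 0, 2],  # invented NP row
])
y_train = ["P", "P", "NP", "NP"]  # subtask1: binary labels

svm_clf = SVC(kernel="linear").fit(X_train, y_train)
me_clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Classify an unseen feature vector with both models.
X_test = np.array([[0, 9, 1, 0, 2, 0, 1, 2, 0, 0, 1, 1, 10, 0, 4, 5]])
print(svm_clf.predict(X_test), me_clf.predict(X_test))
```

For subtask2, the same pipeline applies with the three-way label set {P, NP, SP}; both classifiers handle the multi-class case directly.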
Classification of sentences may be categorized into two types, i.e., binary sentence classification and multi-class sentence classification [16]. For the given dataset, in binary classification (subtask1), each sentence pair is classified with a label C, where C = {P, NP}: P denotes that the given sentence pair is a paraphrase and NP denotes that it is not a paraphrase.

4.3.1 SVM Classification Method
SVM is based on the structural risk minimization principle from computational learning theory. This method analyzes data and defines decision boundaries by means of hyper-planes. In a binary classification problem, the hyper-plane separates the vectors of one class from those of the other, and the separation between the classes is desired to be kept as large as possible. One property of SVMs is that their ability to learn can be independent of the dimensionality of the feature space: SVMs measure the complexity of hypotheses based on the margin with which they separate the data, not the number of features [17]. Since SVM requires input in the form of a vector of numbers, the constructed features file is given as input to the SVM. Table 5 and Table 6 present the confusion matrices for the subtask1 and subtask2 sentences.

Table 5 Confusion Matrix for Subtask1 Sentences
                   Actual Status (SVM)    Actual Status (ME)
Predicted status   NP     P               NP     P
NP                 409    129             420    120
P                  117    245             106    254

4.3.2 ME Classification Method
ME is a general technique for estimating probability distributions from data. The over-riding principle in maximum entropy is that when nothing is known, the distribution should be as uniform as possible, that is, have maximal entropy. Labeled training data are used to derive a set of constraints for the model that characterize the class-specific expectations for the distribution.
Constraints are represented as expected values of "features", where a feature is any real-valued function of an example. The improved iterative scaling algorithm finds the maximum entropy distribution that is consistent with the given constraints.

Due to the minimal assumptions that the ME classifier makes, it is regularly used when nothing is known about the prior distributions and when it is unsafe to make any such assumptions. Moreover, the ME classifier is used when conditional independence of the features cannot be assumed. This is particularly true in text classification problems, where the features are usually words, which obviously are not independent. The ME method requires more time to train compared to Naive Bayes, primarily due to the optimization problem that must be solved to estimate the parameters of the model. Nevertheless, after computing these parameters, the method provides robust results and is competitive in terms of CPU and memory consumption. In our text classification scenario, maximum entropy estimates the conditional distribution of the class label given a pair of sentences. The entire document is represented by a features file, and the labeled training data are used to estimate the expected value on a class-by-class basis.

Table 6 Confusion Matrix for Subtask2 Sentences
                   Actual Status (SVM)        Actual Status (ME)
Predicted status   NP     P      SP           NP     P      SP
NP                 499    47     125          514    43     116
P                  51     113    150          44     131    164
SP                 95     113    207          87     99     202

In the SVM method, out of 900 sentence pairs, 409 NP pairs are correctly classified as NP, while 117 NP pairs are misclassified as P. Similarly, 245 P pairs are correctly classified as P, while 129 P pairs are misclassified as NP. We believe that this misclassification is mainly due to the higher lexical similarity in false paraphrase pairs, which makes them hard to differentiate from true paraphrase pairs.
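As a sanity check on the reported numbers, reading Table 5 in the Table 4 layout (rows are predicted labels, columns are actual labels) and plugging the SVM counts into the Section 5 formulas reproduces the published scores; a minimal sketch in plain Python:

```python
# Subtask1 SVM counts from Table 5, read in the Table 4 layout:
# rows are predicted labels, columns are actual labels.
tn, fn = 409, 129   # predicted NP: actually NP, actually P
fp, tp = 117, 245   # predicted P:  actually NP, actually P

precision = tp / (tp + fp)                                  # equation (1)
recall = tp / (tp + fn)                                     # equation (2)
f_measure = 2 * precision * recall / (precision + recall)   # equation (3)
accuracy = (tp + tn) / (tp + tn + fp + fn)                  # equation (4)

print(round(precision, 2), round(recall, 2),
      round(f_measure, 2), round(accuracy, 2))
# -> 0.68 0.66 0.67 0.73, matching the SVM-P row of Table 7
#    and the SVM Sub Task1 accuracy in Table 9
```

The same reading applied to the ME columns of Table 5 yields the ME rows of Table 7 (precision 0.71, recall 0.68) and the ME Sub Task1 accuracy of 0.75 in Table 9, confirming that the tables are mutually consistent.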
In the ME method, out of 900 sentence pairs, 420 NP pairs are correctly classified as NP, while 106 NP pairs are misclassified as P. Similarly, 254 P pairs are correctly classified as P, while 120 P pairs are misclassified as NP. The ME system thus performs fairly well compared to the SVM method at identifying true paraphrase pairs, as shown in Table 5 for subtask1 and Table 6 for subtask2.

5. PERFORMANCE EVALUATION PARAMETERS AND RESULTS
The parameters used to evaluate the performance of a supervised machine learning algorithm are based on the elements of a matrix known as the confusion matrix or contingency table. From the classification point of view, the terms "True Positive" (TP), "False Positive" (FP), "True Negative" (TN) and "False Negative" (FN) are used to compare the class labels in this matrix, as shown in Table 4.

Table 4 Confusion Matrix Format
                                Actual Status
Predicted Status        Not a Paraphrase       Paraphrase
Not a Paraphrase        TN (True Negative)     FN (False Negative)
Paraphrase              FP (False Positive)    TP (True Positive)

In Table 4, True Positive counts the sentence pairs which are paraphrases and are also classified as paraphrases by the classifier, whereas False Positive counts pairs which are not paraphrases but which the classifier classifies as paraphrases. Similarly, True Negative counts the pairs which are not paraphrases and are also classified as not paraphrases, whereas False Negative counts pairs which are paraphrases but which the classifier classifies as not paraphrases.

Based on the values obtained from the confusion matrix, parameters such as "precision", "recall", "f-measure" and "accuracy" are computed to evaluate the performance of a classifier.

• Precision: It measures the exactness of the classifier result. It is the ratio of the number of pairs correctly labeled as paraphrase to the total number of pairs labeled as paraphrase. It can be calculated using equation 1.

Precision = TP / (TP + FP)    (1)

• Recall: It measures the completeness of the classifier result. It is the ratio of the number of pairs correctly labeled as paraphrase to the total number of pairs which are truly paraphrases. It can be calculated using equation 2.

Recall = TP / (TP + FN)    (2)

• F-Measure: It is the harmonic mean of precision and recall; it is useful when the system needs to be optimized towards either precision or recall, whichever has more influence on the final result. It can be calculated using equation 3.

F-Measure = 2 × (Precision × Recall) / (Precision + Recall)    (3)

• Accuracy: It is the most common measure of the classification process. It is the ratio of correctly classified sentence pairs to the total number of pairs. It can be calculated using equation 4.

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (4)

• F1 Macro Measure: It can be used to assess how the system performs overall across the classes; it is the unweighted average of the per-class F1 scores.

Tables 7 and 8 present the precision, recall and F-measure summary for subtask1 and subtask2 in identifying paraphrase and not-a-paraphrase sentences. In Table 7, SVM-NP stands for the SVM used to identify Not Paraphrase (NP) sentences and SVM-P stands for the SVM used to identify Paraphrase (P) sentences. Similarly, ME-NP stands for the ME classifier used to identify Not Paraphrase (NP) sentences and ME-P stands for the ME classifier used to identify Paraphrase (P) sentences. The results show that the precision, recall and F-measure values are high for Not Paraphrase identification by both the SVM and ME systems.

Table 7 Precision, Recall, F-Measure Summary for Subtask1
Method     Precision   Recall   F-Measure
SVM – NP   0.76        0.78     0.77
SVM – P    0.68        0.66     0.67
ME – NP    0.78        0.80     0.79
ME – P     0.71        0.68     0.69

In Table 8, SVM-SP stands for the SVM used to identify Semi-Paraphrase (SP) sentences and ME-SP stands for the ME classifier used to identify Semi-Paraphrase (SP) sentences. The results show that both systems identify sentences best in the order NP, SP and P.

Table 8 Precision, Recall, F-Measure Summary for Subtask2
Method     Precision   Recall   F-Measure
SVM – NP   0.74        0.77     0.75
SVM – P    0.36        0.41     0.38
SVM – SP   0.50        0.44     0.47
ME – NP    0.76        0.80     0.78
ME – P     0.39        0.48     0.43
ME – SP    0.52        0.42     0.46

The comparative analysis based on the results obtained using the proposed approaches is shown in Table 9. It can be seen that the accuracy obtained using the ME method is better than that of SVM, owing to the dependent features and also the high dimensionality and sparseness of the text data.

Table 9 Accuracy of Machine Learning Approaches
Method/Task       Accuracy   F1 Measure / F1 Macro Measure
SVM – Sub Task1   0.73       0.72
SVM – Sub Task2   0.59       0.53
ME – Sub Task1    0.75       0.74
ME – Sub Task2    0.61       0.56

6. CONCLUSION
This work makes an attempt to classify given sentence pairs as paraphrases or not using two supervised machine learning algorithms, SVM and ME. We utilize sixteen different semantic features to best represent the similarity between sentences. The two algorithms classify a given sentence pair into Paraphrase (P) and Not-a-Paraphrase (NP) for task1, and into Paraphrase (P), Not-a-Paraphrase (NP) and Semi-Paraphrase (SP) for task2. The accuracy and performance of these methods are measured on the basis of parameters such as accuracy, precision, recall, f-measure and macro f-measure. The results show that the ME method outperforms SVM in identifying paraphrases.

7. REFERENCES
[1] Qayyum, Z., and Altaf, W. 2012. Paraphrase Identification using Semantic Heuristic Features. Research Journal of Applied Sciences, Engineering and Technology, 4(22), pp. 4894-4904.
[2] Duclaye, F., Yvon, F., Collin, O., and Cedex, L. 2002. Using the Web as a Linguistic Resource for Learning Reformulations Automatically. In Proceedings of the Third International Conference on Language Resources and Evaluation, pp. 390-396.
[3] Finch, A., Hwang, Y., and Sumita, E. 2005. Using Machine Translation Evaluation Techniques to Determine Sentence-level Semantic Equivalence. In Proceedings of the Third International Workshop on Paraphrasing, pp. 17-24.
[4] Mihalcea, R., Corley, C., and Strapparava, C. 2006. Corpus-based and Knowledge-based Measures of Text Semantic Similarity. In Proceedings of the 21st National Conference on Artificial Intelligence, Vol. 1, pp. 775-780.
[5] Fernando, S., and Stevenson, M. 2008. A Semantic Similarity Approach to Paraphrase Detection. In Proceedings of the 11th Annual Research Colloquium of the UK Special Interest Group for Computational Linguistics, pp. 45-52.
[6] Malakasiotis, P. 2009. Paraphrase Recognition using Machine Learning to Combine Similarity Measures. In Proceedings of the ACL-IJCNLP 2009 Student Research Workshop, pp. 27-35.
[7] Miller, G. A. 1995. WordNet: A Lexical Database for English. Communications of the ACM, 38(11).
[8] Madnani, N., and Chodorow, M. 2012. Re-examining Machine Translation Metrics for Paraphrase Identification. In Proceedings of NAACL HLT 2012, pp. 182-190.
[9] Socher, R., Huang, E., and Manning, C. D. 2011. Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection. In Advances in Neural Information Processing Systems, pp. 1-9.
[10] Dolan, W. B., and Brockett, C. 2005. Automatically Constructing a Corpus of Sentential Paraphrases. In Proceedings of IWP 2005.
[11] Anand Kumar, M., Singh, S., Kavirajan, B., and Soman, K. P. 2016. DPIL@FIRE2016: Overview of the Shared Task on Detecting Paraphrases in Indian Languages. Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India.
[12] nlp.amrita.edu/dpil_cen/index.html#dataset
[13] Hsu, C. W., Chang, C. C., and Lin, C. J. 2003. A Practical Guide to Support Vector Classification. Technical report.
[14] Nigam, K., Lafferty, J., and McCallum, A. 1999. Using Maximum Entropy for Text Classification. In IJCAI-99 Workshop on Machine Learning for Information Filtering, pp. 61-67.
[15] http://ltrc.iiit.ac.in/analyzer/tamil/
[16] Tang, H., Tan, S., and Cheng, X. 2009. A Survey on Sentiment Detection of Reviews. Expert Systems with Applications, 36(7), pp. 10760-10773.
[17] Zhang, J., Zhao, H., Zhang, L., and Lu, B.-L. 2011. An Empirical Comparative Study on Two Large-scale Hierarchical Text Classification Approaches. International Journal of Computer Processing of Languages.
[18] Mahalakshmi, S., Anand Kumar, M., and Soman, K. P. 2015. Paraphrase Detection for Tamil Language using Deep Learning Algorithm. International Journal of Applied Engineering Research, 10(17), pp. 13929-13934.