CUSAT_NLP@DPIL-FIRE2016: Malayalam Paraphrase
                       Detection
                           Sindhu.L                                                      Sumam Mary Idicula
             Department of computer Science                                         Department of computer Science
                 College of Engineering                                                        CUSAT
                        Poonjar                                                                 Kochi
                sindhul.cep@gmail.com                                                    sumam@cusat.ac.in

ABSTRACT                                                               also judged. If there are no unpaired units orif all unpairedunits
This paper describes an approach for paraphrase detection in           are insignificant then a positive classificationis given. Comparison
Malayalam sentences developed as part of FIRE 2016 Shared              is done using a simple lexicalmatching technique.
Task on Paraphrase detection in Indian Languages. The task of          (Zhang and Patrick, 2005)proposed to create intermediate forms
paraphrasedetection is finding a sentence with the same meaning        of the sentences so that similartexts are transformed into thesame
of another sentence expressed using same or different words. This      surface representation.Next, simplelexical matching techniques
detection is done by a semantic approach which is language             are used to compare thetransformed text.(Mihalcea etal., 2006)
dependent. Individual words, their root forms and synonyms are         proposed word-to-word similarity measures anda word specificity
used in finding similarity between two given sentences. We             measure to estimate thesemantic similarity of the sentence pairs.
present an algorithm for paraphrase identification which makes
use of word similarity information derived fromCUSAT
Malayalam WordNet Padasrinkala.. The approach is evaluated
                                                                       3. PROPOSED SEMANTIC APPROACH
using the Malayalam corpus made available as part of of FIRE           The proposed task at FIRE 2016 is focused on sentence level
2016 Shared Task on Paraphrase detection in Malayalam.                 paraphrase identification for Indian languages (Malayalam). Sub
                                                                       Task 1: Given a pair of sentences from newspaper domain, the
                                                                       task is to classify them as paraphrases (P) or not paraphrases (NP).
CCS Concepts                                                           Sub Task 2: Given two sentences from newspaper domain, the
• Computing methodologies~Natural language processing                  task is to identify whether they are paraphrases (P), semi-
• Computing methodologies~Lexical semantics • Computing                paraphrase (SP) or not paraphrases (NP).
methodologies~Language          resources    • Computing               Our proposed semantic approach foridentifying theparaphrases
methodologies~Information extraction                                   comprisesof three phases – matching identical tokens, matching
                                                                       lemmas        and      matching    with      synonyms      replaced.
Keywords                                                               Similaritycomparison is performed at the sentence level using the
Paraphrase detection; semantic         matching;tokenization;POS       Jaccard, Containment, Overlap and Cosine similarity metrics and
tagging;lemmatization;corpus.                                          if thesimilarity score of a sentence pair is higher than a
                                                                       predetermined threshold, the pair ismarked as plagiarised.The
1. INTRODUCTION                                                        steps are illustrated in Figure 1.
Paraphrase is defined as the reuse of text or its meaning in another
sentence using the same or similar words or phrases. Paraphrase
detection is used to determine whethertwo texts (sentences) of         3.1 Tokenization
different lengths have the samemeaning. Such detection is used in      The two input sentences are broken down into individual words or
various natural language applicationssuch as plagiarism detection,     tokens and compared for similarity. Given two sentences S1 and
text summarisation, WSD, machine translation etc.Paraphrasing          S2, thetokens produced from S1 will be {W1,W2. . .WN}, where N
may be due to morphology based changes, lexicon-based changes,         is the number of words in the sentence S1.
syntax-based changes, discourse-based changes, semantics-based
changes etc. This approach to paraphrase detection comprises of        3.2 Lemmatization and POS tagging
pure lexical matching and also the similarity between sentences
                                                                       The individual words in the two input sentencesare reduced to
which use synonyms to convey the same meaning.
                                                                       their root form or lemmas using a suffix stripping
The outline for the rest of the paper is as follows. Section 2
                                                                       algorithm.Lemmatization is the technique of transforming words
describes some of the previous approaches to paraphrase
                                                                       into their dictionary base forms.
identification and their limitations. The approach proposed here is
described in Section 3. Section 4 gives a brief description of the     Suffix stripping algorithm:
Paraphrase Corpus which is used for evaluation. Section 5              The inflected words for similarity analysis are converted to a valid
presents the results of this evaluation. Conclusions and               root wordby means of suffix stripping along with some
suggestions for future work are presented in Section 6.                transformational rules. Each rule set consists of suffixes and their
                                                                       corresponding transformations that can generate the root word.
2. PREVIOUS APPROACHES                                                 This rule set is considers plurals and Vibhakthis in case of nouns
Purely lexical based matching techniques for paraphrase detection      and the different tense forms in case of verbs. Suffixes in
was used by (Clough et al., 2002; Qiu et al., 2006; Zhangand           Malayalam inflected word may range from a single character to a
Patrick, 2005).A two-phase process was used by (Qiu et al., 2006)      group of characters. So the algorithm starts stripping from the
where thecommon semantic units in each sentence are first              right side of the inflected word character wise. Each time a
identified and pairedoff. The significance of the other units are      character which is a valid suffix in the rule set is stripped,
corresponding transformations are done and the resulting word in      The similaritybetween two sentences is calculated using the
checked in the dictionary. If it is found the algorithm terminates.   containment similarity measure proposed by Clough and
Otherwise the procedure continues until a valid word is found.        Stevenson (2010) given in equation.
The root words are checked for correctness with the part of speech
tag.These lemmas are then compared for similarity.
                                                                                                                A B
3.3 Synonym replacement                                                               Scontainment ( A, B) 
For the remaining lemmas that are not matched, substitute                                                          A
synonyms   from     the CUSAT      Malayalam    wordnet-              A and B represent the sets of n-grams in the sentencesS1 and S2
PADASRINKALA. An example is given below                               respectively. The containmentmeasure calculates the intersecting
                                                                      n-grams but normalises them only with respectto the count of n-
                                                                      grams in the first sentence S1.
                    സമുദ്രം
WORD            :                                                         c) Overlap coefficient
                                                                      The overlap coefficient is also proposed by Clough and Stevenson
           സമുദ്രം , കടല് , ആഴി , അകൂപാരം ,                           (2010) .
           അപാംപതി ,        അപ്പതി ,  അബ്ധി ,
SYNONYMS :
           അര്ണ്ണവം ,       ഉരധി ,   ജലനിധി ,
                                                                                                             A B
           പാരാവാരം , സാഗരം                                                         Soverlap ( A, B) 
                                                                                                         min(  A  B 
                                                                      A and B are the unique n-grams contained in the sentence S1 and
                    Noun                                              sentence S2 respectively. The intersecting n-grams of both
POS             :
                                                                      sentences is dividedby the sentence with the smaller word count.


                                                                           d) Cosine Similarity
           sentence1                    sentence2                     The similaritybetween two sentences is calculated using the
                                                                      cosine similarity given in equation.

                                                                                                                    A B
                                                                                               Scos ine ( A, B) 
                                                                                                                    A B
                       Tokenization
                                                                      Sentences S1 and S2 are represented as vectors A and B
                                                                      respectively.


 Malayalam             Lemmatization             Synonym              Consider the example sentence pairs
  wordnet              & POS tagging            replacement           S1: മകളെ പീഡിപ്പിച്ച ദ്പതിയുളട കകരണ്ും
                                                                      പിതാവ്മുറിച്ചുമാറ്റി.
                                                                      S2:എട്ടുമാസം     ദ്പായമുള്ള       ളപണ്കുഞ്ഞിളന
                            Similarity
                                                                      പീഡിപ്പിച്ച      ദ്പതിയുളട         ഇരുകകകെും
                           computation
                                                                      കുട്ടിയുളട അച്ഛന് മുറിച്ചുമാറ്റി.

                           Similarity
                             report                                   From S1 and S2 we get
             Figure 1. Paraphrase detection method                    Direct matches: 3       ( പീഡിപ്പിച്ച         ,   ദ്പതിയുളട    ,
                                                                      മുറിച്ചുമാറ്റി )


3.4 Similarity computation                                            Lemma match: 0
The combined similarity obtained from direct word matches,
                                                                      Synonym match: 2 ( മകളെ↔ളപണ്കുഞ്ഞിളന
lemma matches and synonym match produces a score between 0
and 1 that indicates the similarity between sentences S1 and S2.                പിതാവ്↔ അച്ഛന് )
     a) Jaccard Similarity

                                             A B
                       Sjaccard ( A, B) 
                                             A B                    So the similarity or intersecting word count will be

      b)   Containment measure                                        Direct match + lemma match + synonym match
which is 3 + 0+ 2 = 5                                                            ൈാംഗ്ലൂര്ണ് ററായൽ.

If we find the overlap coefficient                                               ചലറേഴ്സിളന മുംകൈ ആറു

Overlap-similarity = 5/6 = 0.8                                                   വിക്കറ്റിന് റതാൽപ്പിച്ചു.
                                                                         2       സമുദ്രത്തിന്ളറ അടിത്തട്ടിലുള്ള
Similarly all other measures are calculated.
                                                                                 ളതരച്ചില്    വീണ്ും
Jaccard similarity = 0.5                                                         ആരംഭിക്ും.                                       NP
Containment similarity= 0.8                                                      ഒരു വര്ണ്ഷളമടുക്ും ളതരച്ചില്
                                                                                 പൂര്ണ്ത്തിയാകാന്.
Cosine similarity = 0.7
                                                                         3       രണ്ു വര്ണ്ഷമായി ഈസ്റ്റ്
4. PARAPHRASE CORPUS                                                             ളവള്ളിമാടുകുന്് ഭാഗത്ത്
There are no annotated corpora or benchmark data for paraphrases                 കനാലില് ളവള്ളളമത്തിയിട്ട്.                        SP
available for Indian languages till date..The data provided for this
shared task have been splitinto two training sets containing 2500                ഈസ്റ്റ് ളവള്ളിമാടുകുന്നില് കുടി
and 3500 examples respectively and two test sets containing 900                  ളവള്ളം വറ്റി.
pairs of sentences for task1 and 1400 pairs of sentences for task2.
The training data-set -1 contains 1000 sentencepairs that have
been marked by human judges as paraphrases and1500                     5. EXPERIMENTS
sentencepairs that have been marked as not paraphrases.
The training data-set -2 contains 1000 sentencepairs that have         The approach described in Section 3 was evaluatedagainst the
been marked as paraphrases , 1000 sentencepairs that have been         Paraphrase Corpus.All synonyms of Malayalam WordNet were
marked as semi-paraphrases and 1500 sentencepairs that have            considered when finding the similaritybetween words.
been marked as not paraphrases.This train/test partitionhas been       The training data was used to find the classificationthreshold
observed by all the approaches evaluatedhere.                          (paraphrase/semi-paraphrase/not-paraphrase) for the two tasks.
                                                                       Considering the four similarity measures, the following
                                                                       observations are made.
                                                                       Containment measure is useful in cases where thesuspicious text
                       Table 1. Training data                          is shorter than the source text. Overlap measure is useful in cases
                                                                       where the size of suspicious and source text varies. Jaccard
                               Number of Documents                     similarity values are less compared to the Cosine value. Hence
       Sets                                                            only the Cosine value is considered for setting the threshold.
                                        Semi                Not
                 Paraphrase                                            Accuracy, precision, recall and F measurewere evaluated for the
                                     paraphrase         paraphrase
                                                                       test corpus:These are defined as follows:
      Set-1            1000              0                1500                                          TP  TN
                                                                                  accuraccy 
      Set-2            1000            1000               1500                                     TP  TN  FP  FN


                                                                       where TP are true positives, TN are true negatives,FN are false
                           Table 2. Test data                          negatives and FP are false positives.

                                                                                                            TP
               Sets                    Number of Documents                                precision 
                                                                                                          TP  FP
                                                                                                           TP
              Task-1                              900                                         recall 
                                                                                                         TP  FN
              Task-2                            1400
                                                                                               2x precision x recall
                                                                                         F
                                                                                                precision + recall
       Table 3. Examples of sentences from Train dataset               Results for the semantic similarity approach on the test data
                                                                       areshown in Table3.
  id      Sentence pair                                      Tag                        Table3. Results on test data
  1       ററായൽ ചലറേഴ്സിളന ആറു                                          Task            No.       of     Accuracy          F-measure
                                                                 P                      sentences
          വിക്കറ്റിന് തകർത്ത് മുംകൈ
                                                                        Task-1          900              0.76              0.75
          വീണ്ും വിജയവഴിയിൽ.                                            Task-2          1400             0.52              0.51
                                                                          theAssociation for Computational Linguistics (ACL-02),,
                                                                          Pennsylvania, PA,pages 152–159.
                                                                      [3] Qiu      Long,       Min-Yen      Kan,    and    Tat-Seng
6. CONCLUSION AND FUTURE WORK                                             Chua.,2006,Paraphrase       recognition via   dissimilarity
This paper presented an approach to the problemof paraphrase
                                                                          significanceclassification., In Proceedings of the 2006
detection in Malayalam language. Paraphrase has been
                                                                          Conferenceon Empirical Methods in Natural Language
identifiedbased on the tokens and its synonyms that are common
                                                                          Processing, , Sydney, Australia, July.Association for
thathas been taken as attribute for checking paraphrase. Thewords
                                                                          computational Linguistics,pages 18-26.
are checked against Malayalam Wordnet. Bycalculating the token
matching ,lemma match and synonymtoken matching andfixing             [4] Zhang. Y and Jon Patrick.,2005, Paraphrase identification by
an appropriate threshold value, the given sentence can be                 text canonicalization, In Proceedings of Australasian
classified as paraphrase, semi-paraphrase         sentence or not         Language     Technology     Workshop      2005,     Sydney,
paraphrase.                                                               Australia,pages 160-166.
From the obtained values of Accuracy and F-measure, we                [5] Mihalcea.R, Courtney Corley, and Carlo Strapparava.,
consider combining the similarity approaches in future to improve         2006,Corpus-based and Knowledge-based Measures of Text
the efficiency of the system. Also, the accuracy of this method can       Semantic Similarity, In Proceedings of the American
be further enhanced by including a spell-checker and correcting           Association for Artificial Intelligence (AAAI ).
misspelled words before similarity checking.
                                                                      [6] Sundaram, Mahalakshmi Shanmuga, Anand Kumar M, and
7. REFERENCES                                                             Soman Kotti Padannayil,"AMRITA CEN@ SemEval-
                                                                          2015: Paraphrase Detection for Twitter using Unsupervised
                                                                          Feature Learning with Recursive Autoencoders." SemEval-
[1] AnandKumar, M., Singh, S., Kavirajan, B., and Soman, K                2015.
    .P.2016.DPIL@FIRE2016: Overview of shared task on
    Detecting Paraphrases in Indian Languages. Working notes          [7] Mahalakshmi, S., Anand Kumar, M., Soman, K.P,
    of FIRE 2016 - Forum for Information Retrieval Evaluation,            2015, Paraphrase detection for Tamil language using deep
    Kolkata,India, December7-10,CEUR                Workshop              learning algorithm, International Journal of Applied
    Proceedings,CEUR-WS.org                                               Engineering Research, 10 (17), pp. 13929-13934.

[2] Clough Paul, Robert Gaizauskas, Scott Piao, and
    YorickWilks.,2002,METER:         MEasuring        TExt
    Reuse.InProceedings of the 40th Anniversary Meeting for