<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>CUSAT_NLP@DPIL-FIRE2016: Malayalam Paraphrase Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sindhu.L</string-name>
          <email>sindhul.cep@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sumam Mary Idicula</string-name>
          <email>sumam@cusat.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science</institution>
          ,
          <addr-line>CUSAT, Kochi</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science, College of Engineering</institution>
          ,
          <addr-line>Poonjar</addr-line>
        </aff>
      </contrib-group>
      <fpage>152</fpage>
      <lpage>159</lpage>
      <abstract>
        <p>This paper describes an approach to paraphrase detection in Malayalam sentences, developed as part of the FIRE 2016 Shared Task on Paraphrase Detection in Indian Languages. The task of paraphrase detection is to determine whether a sentence has the same meaning as another sentence, expressed using the same or different words. Detection is done by a semantic approach that is language dependent. Individual words, their root forms and their synonyms are used to find the similarity between two given sentences. We present an algorithm for paraphrase identification that makes use of word similarity information derived from the CUSAT Malayalam WordNet Padasrinkala. The approach is evaluated using the Malayalam corpus made available as part of the FIRE 2016 Shared Task on Paraphrase Detection in Malayalam.</p>
        <p>CCS Concepts: • Computing methodologies~Natural language processing • Computing methodologies~Lexical semantics • Computing methodologies~Language resources • Computing methodologies~Information extraction</p>
        <p>Keywords: Paraphrase detection; semantic matching; tokenization; POS tagging; lemmatization; corpus.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>Paraphrase is defined as the reuse of text or its meaning in another
sentence using the same or similar words or phrases. Paraphrase
detection is used to determine whether two texts (sentences),
possibly of different lengths, have the same meaning. Such detection is used in
various natural language applications such as plagiarism detection,
text summarisation, word sense disambiguation (WSD) and machine translation. Paraphrasing
may be due to morphology-based changes, lexicon-based changes,
syntax-based changes, discourse-based changes, semantics-based
changes and so on. The approach to paraphrase detection presented here combines
pure lexical matching with the similarity between sentences
that use synonyms to convey the same meaning.</p>
      <p>The outline for the rest of the paper is as follows. Section 2
describes some of the previous approaches to paraphrase
identification and their limitations. The approach proposed here is
described in Section 3. Section 4 gives a brief description of the
Paraphrase Corpus which is used for evaluation. Section 5
presents the results of this evaluation. Conclusions and
suggestions for future work are presented in Section 6.</p>
    </sec>
    <sec id="sec-2">
      <title>2. PREVIOUS APPROACHES</title>
      <p>
        Purely lexical matching techniques for paraphrase detection
were used by
        <xref ref-type="bibr" rid="ref2 ref3 ref4">(Clough et al., 2002; Qiu et al., 2006; Zhang and
Patrick, 2005)</xref>
        . A two-phase process was used by
        <xref ref-type="bibr" rid="ref3">(Qiu et al., 2006)</xref>
        , in which the common semantic units in each sentence are first
identified and paired off. The significance of the remaining units is
then judged. If there are no unpaired units, or if all unpaired units
are insignificant, a positive classification is given. Comparison
is done using a simple lexical matching technique.
        <xref ref-type="bibr" rid="ref4">(Zhang and Patrick, 2005)</xref>
        proposed creating intermediate forms
of the sentences so that similar texts are transformed into the same
surface representation. Simple lexical matching techniques
are then used to compare the transformed texts.
        <xref ref-type="bibr" rid="ref5">(Mihalcea et al., 2006)</xref>
        proposed word-to-word similarity measures and a word specificity
measure to estimate the semantic similarity of sentence pairs.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. PROPOSED SEMANTIC APPROACH</title>
      <p>The task proposed at FIRE 2016 focuses on sentence-level
paraphrase identification for Indian languages (here, Malayalam). Sub
Task 1: Given a pair of sentences from the newspaper domain, the
task is to classify them as paraphrases (P) or not paraphrases (NP).
Sub Task 2: Given two sentences from the newspaper domain, the
task is to identify whether they are paraphrases (P),
semi-paraphrases (SP) or not paraphrases (NP).</p>
      <p>Our proposed semantic approach for identifying paraphrases
comprises three phases: matching identical tokens, matching
lemmas, and matching with synonyms replaced.
Similarity comparison is performed at the sentence level using the
Jaccard, Containment, Overlap and Cosine similarity metrics. If
the similarity score of a sentence pair is higher than a
predetermined threshold, the pair is marked as a paraphrase. The
steps are illustrated in Figure 1.</p>
    </sec>
    <sec id="sec-4">
      <title>3.1 Tokenization</title>
      <p>The two input sentences are broken down into individual words, or
tokens, which are then compared for similarity. Given two sentences S1 and
S2, the tokens produced from S1 will be {W1, W2, ..., WN}, where N
is the number of words in the sentence S1.</p>
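      <p>A minimal sketch of this step is given below. It assumes whitespace-delimited words and strips punctuation from token edges; the paper does not specify its tokenizer, so the function name <monospace>tokenize</monospace> and the punctuation set are illustrative assumptions, and an English sentence stands in for Malayalam input.</p>
      <preformat>
```python
import string

# Assumption: tokens are whitespace-delimited, and ASCII punctuation plus a few
# common quotation marks and the danda are stripped from token edges.
# This is an illustrative stand-in, not the authors' tokenizer.
EDGE_PUNCT = string.punctuation + "\u0964\u2018\u2019\u201c\u201d"


def tokenize(sentence):
    """Split a sentence into word tokens W1..WN, dropping empty tokens."""
    tokens = []
    for raw in sentence.split():
        word = raw.strip(EDGE_PUNCT)
        if word:
            tokens.append(word)
    return tokens


s1_tokens = tokenize("The cat sat on the mat.")
print(len(s1_tokens), s1_tokens)
```
      </preformat>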
    </sec>
    <sec id="sec-5">
      <title>3.2 Lemmatization and POS tagging</title>
      <p>Lemmatization is the technique of transforming words
into their dictionary base forms. The individual words in the two input sentences are reduced to
their root forms, or lemmas, using a suffix stripping
algorithm.</p>
      <sec id="sec-5-1">
        <title>Suffix stripping algorithm:</title>
        <p>The inflected words used in similarity analysis are converted to valid
root words by means of suffix stripping combined with some
transformational rules. Each rule set consists of suffixes and their
corresponding transformations that can generate the root word.
The rule set covers plurals and Vibhakthis (case inflections) for nouns
and the different tense forms for verbs. Suffixes in
a Malayalam inflected word may range from a single character to a
group of characters, so the algorithm strips the
inflected word character by character from the right. Each time the
stripped characters form a valid suffix in the rule set,
the corresponding transformations are applied and the resulting word is
checked in the dictionary. If it is found, the algorithm terminates.
Otherwise the procedure continues until a valid word is found.
The root words are checked for correctness against the part-of-speech
tag. These lemmas are then compared for similarity.</p>
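        <p>The loop described above can be sketched as follows. The rule set and dictionary here are hypothetical toy data using English-like forms in place of Malayalam inflections, the function name <monospace>lemmatize</monospace> is an assumption, and the part-of-speech check is omitted, so this is only an outline of the idea rather than the authors' implementation.</p>
        <preformat>
```python
# Hypothetical toy rule set: suffix -> list of replacement strings.
# English-like forms stand in for Malayalam plural, case and tense suffixes.
SUFFIX_RULES = {
    "s": [""],
    "es": [""],
    "ies": ["y"],
    "ed": ["", "e"],
    "ing": ["", "e"],
}

# Hypothetical dictionary of valid root words.
DICTIONARY = {"city", "walk", "bake", "box"}


def lemmatize(word, rules=SUFFIX_RULES, dictionary=DICTIONARY):
    """Strip progressively longer suffixes from the right until a root is found."""
    if word in dictionary:
        return word
    # i is the number of characters stripped from the right end of the word.
    for i in range(1, len(word)):
        stem, suffix = word[:-i], word[-i:]
        # Only suffixes present in the rule set trigger a transformation.
        for replacement in rules.get(suffix, []):
            candidate = stem + replacement
            if candidate in dictionary:
                return candidate  # valid root found: terminate
    return word  # no rule produced a valid root; keep the surface form


for w in ["cities", "walked", "baking", "boxes"]:
    print(w, lemmatize(w))
```
        </preformat>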
      </sec>
    </sec>
    <sec id="sec-6">
      <title>3.3 Synonym replacement</title>
      <p>For the remaining lemmas that are not matched, synonyms are
substituted from the CUSAT Malayalam
WordNet PADASRINKALA. An example is given below.</p>
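      <p>The substitution step might be sketched as follows. This is an illustrative toy and not the paper's own example: the synonym table is a hypothetical stand-in for a WordNet lookup, English words replace Malayalam lemmas, and the one-to-one pairing strategy is an assumption.</p>
      <preformat>
```python
# Hypothetical stand-in for a WordNet lookup: lemma -> set of synonyms.
SYNONYMS = {
    "big": {"large", "huge"},
    "large": {"big", "huge"},
    "car": {"automobile"},
    "automobile": {"car"},
}


def synonym_matches(unmatched_a, unmatched_b, synonyms=SYNONYMS):
    """Pair each unmatched lemma of S1 with at most one synonymous lemma of S2."""
    remaining_b = list(unmatched_b)
    pairs = []
    for lemma in unmatched_a:
        candidates = synonyms.get(lemma, set())
        for other in remaining_b:
            # Accept the pair if either lemma lists the other as a synonym.
            if other in candidates or lemma in synonyms.get(other, set()):
                pairs.append((lemma, other))
                remaining_b.remove(other)  # each S2 lemma is used only once
                break
    return pairs


print(synonym_matches(["big", "car", "red"], ["automobile", "large"]))
```
      </preformat>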
    </sec>
    <sec id="sec-7">
      <title>3.4 Similarity computation</title>
      <p>The combined similarity obtained from direct word matches,
lemma matches and synonym matches produces a score between 0
and 1 that indicates the similarity between sentences S1 and S2.</p>
      <p>a) Jaccard similarity</p>
      <p>
        <disp-formula>
          <tex-math>S_{\mathrm{jaccard}}(A, B) = \frac{|A \cap B|}{|A \cup B|}</tex-math>
        </disp-formula>
      </p>
      <p>b) Containment measure. The similarity between two sentences is calculated using the
containment similarity measure proposed by Clough and Stevenson (2010):</p>
      <p>
        <disp-formula>
          <tex-math>S_{\mathrm{containment}}(A, B) = \frac{|A \cap B|}{|A|}</tex-math>
        </disp-formula>
      </p>
      <p>A and B represent the sets of n-grams in the sentences S1 and S2
respectively. The containment measure counts the intersecting
n-grams but normalises them only with respect to the count of
n-grams in the first sentence S1.</p>
      <p>c) Overlap coefficient. The overlap coefficient is also proposed by Clough and Stevenson (2010):</p>
      <p>
        <disp-formula>
          <tex-math>S_{\mathrm{overlap}}(A, B) = \frac{|A \cap B|}{\min(|A|, |B|)}</tex-math>
        </disp-formula>
      </p>
      <p>A and B are the unique n-grams contained in sentence S1 and
sentence S2 respectively. The number of intersecting n-grams of both
sentences is divided by the size of the smaller of the two sets.</p>
      <p>d) Cosine similarity. The similarity between two sentences is calculated using the
cosine similarity:</p>
      <p>
        <disp-formula>
          <tex-math>S_{\mathrm{cosine}}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|}</tex-math>
        </disp-formula>
      </p>
      <p>Sentences S1 and S2 are represented as vectors A and B
respectively.</p>
      <p>Consider the example sentence pairs
പ്ദതിയുളട
കകരുണ്ം</p>
      <p>Direct match + lemma match + synonym match</p>
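      <p>The four measures can be sketched in a few lines. The sketch assumes word n-grams (unigrams by default) for the set-based measures and term-frequency vectors for the cosine measure; the paper does not state its n-gram size or vector weighting, so these choices, like the function names, are assumptions.</p>
      <preformat>
```python
import math
from collections import Counter


def ngrams(tokens, n=1):
    """Return the set of word n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a.union(b)
    return len(a.intersection(b)) / len(union) if union else 0.0


def containment(a, b):
    """Size of the intersection normalised by the first set only."""
    return len(a.intersection(b)) / len(a) if a else 0.0


def overlap(a, b):
    """Size of the intersection normalised by the smaller set."""
    smaller = min(len(a), len(b))
    return len(a.intersection(b)) / smaller if smaller else 0.0


def cosine(tokens_a, tokens_b):
    """Cosine of the term-frequency vectors of two token lists (assumed weighting)."""
    va, vb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


s1 = ["a", "b", "c", "d"]
s2 = ["b", "c", "e"]
A, B = ngrams(s1), ngrams(s2)
print(jaccard(A, B), containment(A, B), overlap(A, B), cosine(s1, s2))
```
      </preformat>
      <p>Note that the containment measure is asymmetric: swapping S1 and S2 changes the denominator, whereas the other three measures are symmetric.</p>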
    </sec>
    <sec id="sec-8">
      <title>4. PARAPHRASE CORPUS</title>
      <p>To date, no annotated corpora or benchmark data for paraphrases
have been available for Indian languages. The data provided for this
shared task have been split into two training sets containing 2500
and 3500 examples respectively, and two test sets containing 900
pairs of sentences for Task 1 and 1400 pairs of sentences for Task 2.
Training data set 1 contains 1000 sentence pairs that have
been marked by human judges as paraphrases and 1500
sentence pairs that have been marked as not paraphrases.
Training data set 2 contains 1000 sentence pairs that have
been marked as paraphrases, 1000 sentence pairs that have been
marked as semi-paraphrases and 1500 sentence pairs that have
been marked as not paraphrases. This train/test partition has been
observed by all the approaches evaluated here.</p>
    </sec>
    <sec id="sec-9">
      <title>5. EXPERIMENTS</title>
      <p>The approach described in Section 3 was evaluated against the
Paraphrase Corpus. All synonyms in the Malayalam WordNet were
considered when finding the similarity between words.
The training data was used to find the classification thresholds
(paraphrase/semi-paraphrase/not-paraphrase) for the two tasks.
Considering the four similarity measures, the following
observations are made.</p>
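      <p>One simple way to derive such a threshold from training data is a grid search that keeps the cut-off with the highest training accuracy. The paper does not describe its selection procedure, so the sketch below is an assumption; the function names, the candidate grid and the training scores are all made up for illustration.</p>
      <preformat>
```python
def classify(score, threshold):
    """Task 1 decision: paraphrase (P) if the score reaches the threshold, else NP."""
    return "P" if score >= threshold else "NP"


def best_threshold(scores, labels, candidates=None):
    """Return the candidate threshold with the highest training accuracy."""
    if candidates is None:
        candidates = [i / 100 for i in range(101)]  # 0.00, 0.01, ..., 1.00

    def accuracy(t):
        hits = sum(classify(s, t) == y for s, y in zip(scores, labels))
        return hits / len(labels)

    # max() keeps the first (lowest) threshold among ties.
    return max(candidates, key=accuracy)


# Made-up similarity scores and gold labels, for illustration only.
train_scores = [0.91, 0.75, 0.62, 0.40, 0.33, 0.10]
train_labels = ["P", "P", "P", "NP", "NP", "NP"]

threshold = best_threshold(train_scores, train_labels)
print(threshold, classify(0.7, threshold), classify(0.2, threshold))
```
      </preformat>
      <p>For the three-way task, the same idea would need two cut-offs separating not-paraphrase, semi-paraphrase and paraphrase.</p>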
      <p>The containment measure is useful in cases where the suspicious text
is shorter than the source text. The overlap measure is useful in cases
where the sizes of the suspicious and source texts vary. Jaccard
similarity values are lower than the corresponding cosine values. Hence
only the cosine value is considered when setting the threshold.
Accuracy, precision, recall and F-measure were evaluated on the
test corpus. These are defined as follows:</p>
      <p>
        <disp-formula>
          <tex-math>\mathrm{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}</tex-math>
        </disp-formula>
      </p>
      <p>where TP are true positives, TN are true negatives, FN are false
negatives and FP are false positives.</p>
      <p>
        <disp-formula>
          <tex-math>\mathrm{precision} = \frac{TP}{TP + FP}</tex-math>
        </disp-formula>
      </p>
      <p>
        <disp-formula>
          <tex-math>\mathrm{recall} = \frac{TP}{TP + FN}</tex-math>
        </disp-formula>
      </p>
      <p>
        <disp-formula>
          <tex-math>F = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}</tex-math>
        </disp-formula>
      </p>
      <p>Results for the semantic similarity approach on the test data
are shown in Table 3.</p>
      <p>Table 3. Results on test data (rows: Task-1 and Task-2).</p>
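      <p>The four evaluation measures can be computed directly from the confusion counts. The sketch below treats one label as the positive class; the function name <monospace>evaluate</monospace> and the label lists are made up for illustration and are not results from the shared task.</p>
      <preformat>
```python
def evaluate(gold, predicted, positive="P"):
    """Return (accuracy, precision, recall, F) for one positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    tn = sum(g != positive and p != positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)

    # Guard every ratio against an empty denominator.
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return accuracy, precision, recall, f_measure


# Made-up gold and predicted labels, for illustration only.
gold = ["P", "P", "NP", "NP", "P"]
pred = ["P", "NP", "NP", "P", "P"]
print(evaluate(gold, pred))
```
      </preformat>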
    </sec>
    <sec id="sec-10">
      <title>6. CONCLUSION AND FUTURE WORK</title>
      <p>This paper presented an approach to the problem of paraphrase
detection in the Malayalam language. Paraphrases are
identified on the basis of the tokens, and their synonyms, that the
two sentences have in common. The words
are checked against the Malayalam WordNet. By calculating the token
matches, lemma matches and synonym matches, and fixing
an appropriate threshold value, a given sentence pair can be
classified as paraphrase, semi-paraphrase or not
paraphrase.</p>
      <p>Given the accuracy and F-measure values obtained, we
plan to combine the similarity measures in future work to improve
the performance of the system. The accuracy of this method could
also be enhanced by including a spell-checker and correcting
misspelled words before similarity checking.</p>
    </sec>
    <sec id="sec-11">
      <title>7. REFERENCES</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>AnandKumar</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Singh</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kavirajan</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Soman</surname>
            ,
            <given-names>K. P.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>DPIL@FIRE2016: Overview of shared task on Detecting Paraphrases in Indian Languages</article-title>
          .
          <source>Working notes of FIRE 2016 - Forum for Information Retrieval Evaluation</source>
          , Kolkata, India, December 7-10, CEUR Workshop Proceedings, CEUR-WS.org
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Clough</surname>
            <given-names>Paul</given-names>
          </string-name>
          , Robert Gaizauskas,
          <string-name>
            <given-names>Scott</given-names>
            <surname>Piao</surname>
          </string-name>
          , and Yorick Wilks,
          <year>2002</year>
          ,
          <article-title>METER: MEasuring TExt Reuse</article-title>
          .
          <source>In Proceedings of the 40th Anniversary Meeting for</source>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Qiu</surname>
            <given-names>Long</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Min-Yen</given-names>
            <surname>Kan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Tat-Seng</given-names>
            <surname>Chua</surname>
          </string-name>
          ,
          <year>2006</year>
          ,
          <article-title>Paraphrase recognition via dissimilarity significance classification</article-title>
          ,
          <source>In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing</source>
          , Sydney, Australia, July.
          <publisher-name>Association for Computational Linguistics</publisher-name>
          , pages
          <fpage>18</fpage>
          -
          <lpage>26</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          and
          <string-name>
            <given-names>Jon</given-names>
            <surname>Patrick</surname>
          </string-name>
          ,
          <year>2005</year>
          ,
          <article-title>Paraphrase identification by text canonicalization</article-title>
          ,
          <source>In Proceedings of Australasian Language Technology Workshop 2005</source>
          , Sydney, Australia, pages
          <fpage>160</fpage>
          -
          <lpage>166</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Mihalcea</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Courtney</given-names>
            <surname>Corley</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Carlo</given-names>
            <surname>Strapparava</surname>
          </string-name>
          ,
          <year>2006</year>
          ,
          <article-title>Corpus-based and Knowledge-based Measures of Text Semantic Similarity</article-title>
          ,
          <source>In Proceedings of the American Association for Artificial Intelligence (AAAI)</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Sundaram</surname>
            ,
            <given-names>Mahalakshmi Shanmuga</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Anand Kumar</surname>
            <given-names>M.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Soman</surname>
            <given-names>Kotti Padannayil</given-names>
          </string-name>
          ,
          <article-title>AMRITA CEN@SemEval2015: Paraphrase Detection for Twitter using Unsupervised Feature Learning with Recursive Autoencoders</article-title>
          .
          <source>SemEval2015</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Mahalakshmi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Anand Kumar</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soman</surname>
            ,
            <given-names>K. P.</given-names>
          </string-name>
          ,
          <year>2015</year>
          ,
          <article-title>Paraphrase detection for Tamil language using deep learning algorithm</article-title>
          ,
          <source>International Journal of Applied Engineering Research</source>
          ,
          <volume>10</volume>
          (
          <issue>17</issue>
          ), pp.
          <fpage>13929</fpage>
          -
          <lpage>13934</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>