CUSAT_NLP@DPIL-FIRE2016: Malayalam Paraphrase Detection Sindhu.L Sumam Mary Idicula Department of computer Science Department of computer Science College of Engineering CUSAT Poonjar Kochi sindhul.cep@gmail.com sumam@cusat.ac.in ABSTRACT also judged. If there are no unpaired units orif all unpairedunits This paper describes an approach for paraphrase detection in are insignificant then a positive classificationis given. Comparison Malayalam sentences developed as part of FIRE 2016 Shared is done using a simple lexicalmatching technique. Task on Paraphrase detection in Indian Languages. The task of (Zhang and Patrick, 2005)proposed to create intermediate forms paraphrasedetection is finding a sentence with the same meaning of the sentences so that similartexts are transformed into thesame of another sentence expressed using same or different words. This surface representation.Next, simplelexical matching techniques detection is done by a semantic approach which is language are used to compare thetransformed text.(Mihalcea etal., 2006) dependent. Individual words, their root forms and synonyms are proposed word-to-word similarity measures anda word specificity used in finding similarity between two given sentences. We measure to estimate thesemantic similarity of the sentence pairs. present an algorithm for paraphrase identification which makes use of word similarity information derived fromCUSAT Malayalam WordNet Padasrinkala.. The approach is evaluated 3. PROPOSED SEMANTIC APPROACH using the Malayalam corpus made available as part of of FIRE The proposed task at FIRE 2016 is focused on sentence level 2016 Shared Task on Paraphrase detection in Malayalam. paraphrase identification for Indian languages (Malayalam). Sub Task 1: Given a pair of sentences from newspaper domain, the task is to classify them as paraphrases (P) or not paraphrases (NP). CCS Concepts Sub Task 2: Given two sentences from newspaper domain, the • Computing methodologies~Natural language processing task is to identify whether they are paraphrases (P), semi- • Computing methodologies~Lexical semantics • Computing paraphrase (SP) or not paraphrases (NP). methodologies~Language resources • Computing Our proposed semantic approach foridentifying theparaphrases methodologies~Information extraction comprisesof three phases – matching identical tokens, matching lemmas and matching with synonyms replaced. Keywords Similaritycomparison is performed at the sentence level using the Paraphrase detection; semantic matching;tokenization;POS Jaccard, Containment, Overlap and Cosine similarity metrics and tagging;lemmatization;corpus. if thesimilarity score of a sentence pair is higher than a predetermined threshold, the pair ismarked as plagiarised.The 1. INTRODUCTION steps are illustrated in Figure 1. Paraphrase is defined as the reuse of text or its meaning in another sentence using the same or similar words or phrases. Paraphrase detection is used to determine whethertwo texts (sentences) of 3.1 Tokenization different lengths have the samemeaning. Such detection is used in The two input sentences are broken down into individual words or various natural language applicationssuch as plagiarism detection, tokens and compared for similarity. Given two sentences S1 and text summarisation, WSD, machine translation etc.Paraphrasing S2, thetokens produced from S1 will be {W1,W2. . .WN}, where N may be due to morphology based changes, lexicon-based changes, is the number of words in the sentence S1. syntax-based changes, discourse-based changes, semantics-based changes etc. This approach to paraphrase detection comprises of 3.2 Lemmatization and POS tagging pure lexical matching and also the similarity between sentences The individual words in the two input sentencesare reduced to which use synonyms to convey the same meaning. their root form or lemmas using a suffix stripping The outline for the rest of the paper is as follows. Section 2 algorithm.Lemmatization is the technique of transforming words describes some of the previous approaches to paraphrase into their dictionary base forms. identification and their limitations. The approach proposed here is described in Section 3. Section 4 gives a brief description of the Suffix stripping algorithm: Paraphrase Corpus which is used for evaluation. Section 5 The inflected words for similarity analysis are converted to a valid presents the results of this evaluation. Conclusions and root wordby means of suffix stripping along with some suggestions for future work are presented in Section 6. transformational rules. Each rule set consists of suffixes and their corresponding transformations that can generate the root word. 2. PREVIOUS APPROACHES This rule set is considers plurals and Vibhakthis in case of nouns Purely lexical based matching techniques for paraphrase detection and the different tense forms in case of verbs. Suffixes in was used by (Clough et al., 2002; Qiu et al., 2006; Zhangand Malayalam inflected word may range from a single character to a Patrick, 2005).A two-phase process was used by (Qiu et al., 2006) group of characters. So the algorithm starts stripping from the where thecommon semantic units in each sentence are first right side of the inflected word character wise. Each time a identified and pairedoff. The significance of the other units are character which is a valid suffix in the rule set is stripped, corresponding transformations are done and the resulting word in The similaritybetween two sentences is calculated using the checked in the dictionary. If it is found the algorithm terminates. containment similarity measure proposed by Clough and Otherwise the procedure continues until a valid word is found. Stevenson (2010) given in equation. The root words are checked for correctness with the part of speech tag.These lemmas are then compared for similarity.  A B 3.3 Synonym replacement Scontainment ( A, B)  For the remaining lemmas that are not matched, substitute  A synonyms from the CUSAT Malayalam wordnet- A and B represent the sets of n-grams in the sentencesS1 and S2 PADASRINKALA. An example is given below respectively. The containmentmeasure calculates the intersecting n-grams but normalises them only with respectto the count of n- grams in the first sentence S1. സമുദ്രം WORD : c) Overlap coefficient The overlap coefficient is also proposed by Clough and Stevenson സമുദ്രം , കടല് , ആഴി , അകൂപാരം , (2010) . അപാംപതി , അപ്പതി , അബ്ധി , SYNONYMS : അര്ണ്ണവം , ഉരധി , ജലനിധി ,  A B പാരാവാരം , സാഗരം Soverlap ( A, B)  min(  A  B  A and B are the unique n-grams contained in the sentence S1 and Noun sentence S2 respectively. The intersecting n-grams of both POS : sentences is dividedby the sentence with the smaller word count. d) Cosine Similarity sentence1 sentence2 The similaritybetween two sentences is calculated using the cosine similarity given in equation. A B Scos ine ( A, B)  A B Tokenization Sentences S1 and S2 are represented as vectors A and B respectively. Malayalam Lemmatization Synonym Consider the example sentence pairs wordnet & POS tagging replacement S1: മകളെ പീഡിപ്പിച്ച ദ്പതിയുളട കകരണ്ും പിതാവ്മുറിച്ചുമാറ്റി. S2:എട്ടുമാസം ദ്പായമുള്ള ളപണ്കുഞ്ഞിളന Similarity പീഡിപ്പിച്ച ദ്പതിയുളട ഇരുകകകെും computation കുട്ടിയുളട അച്ഛന് മുറിച്ചുമാറ്റി. Similarity report From S1 and S2 we get Figure 1. Paraphrase detection method Direct matches: 3 ( പീഡിപ്പിച്ച , ദ്പതിയുളട , മുറിച്ചുമാറ്റി ) 3.4 Similarity computation Lemma match: 0 The combined similarity obtained from direct word matches, Synonym match: 2 ( മകളെ↔ളപണ്കുഞ്ഞിളന lemma matches and synonym match produces a score between 0 and 1 that indicates the similarity between sentences S1 and S2. പിതാവ്↔ അച്ഛന് ) a) Jaccard Similarity  A B Sjaccard ( A, B)   A B So the similarity or intersecting word count will be b) Containment measure Direct match + lemma match + synonym match which is 3 + 0+ 2 = 5 ൈാംഗ്ലൂര്ണ് ററായൽ. If we find the overlap coefficient ചലറേഴ്സിളന മുംകൈ ആറു Overlap-similarity = 5/6 = 0.8 വിക്കറ്റിന് റതാൽപ്പിച്ചു. 2 സമുദ്രത്തിന്ളറ അടിത്തട്ടിലുള്ള Similarly all other measures are calculated. ളതരച്ചില് വീണ്ും Jaccard similarity = 0.5 ആരംഭിക്ും. NP Containment similarity= 0.8 ഒരു വര്ണ്ഷളമടുക്ും ളതരച്ചില് പൂര്ണ്ത്തിയാകാന്. Cosine similarity = 0.7 3 രണ്ു വര്ണ്ഷമായി ഈസ്റ്റ് 4. PARAPHRASE CORPUS ളവള്ളിമാടുകുന്് ഭാഗത്ത് There are no annotated corpora or benchmark data for paraphrases കനാലില് ളവള്ളളമത്തിയിട്ട്. SP available for Indian languages till date..The data provided for this shared task have been splitinto two training sets containing 2500 ഈസ്റ്റ് ളവള്ളിമാടുകുന്നില് കുടി and 3500 examples respectively and two test sets containing 900 ളവള്ളം വറ്റി. pairs of sentences for task1 and 1400 pairs of sentences for task2. The training data-set -1 contains 1000 sentencepairs that have been marked by human judges as paraphrases and1500 5. EXPERIMENTS sentencepairs that have been marked as not paraphrases. The training data-set -2 contains 1000 sentencepairs that have The approach described in Section 3 was evaluatedagainst the been marked as paraphrases , 1000 sentencepairs that have been Paraphrase Corpus.All synonyms of Malayalam WordNet were marked as semi-paraphrases and 1500 sentencepairs that have considered when finding the similaritybetween words. been marked as not paraphrases.This train/test partitionhas been The training data was used to find the classificationthreshold observed by all the approaches evaluatedhere. (paraphrase/semi-paraphrase/not-paraphrase) for the two tasks. Considering the four similarity measures, the following observations are made. Containment measure is useful in cases where thesuspicious text Table 1. Training data is shorter than the source text. Overlap measure is useful in cases where the size of suspicious and source text varies. Jaccard Number of Documents similarity values are less compared to the Cosine value. Hence Sets only the Cosine value is considered for setting the threshold. Semi Not Paraphrase Accuracy, precision, recall and F measurewere evaluated for the paraphrase paraphrase test corpus:These are defined as follows: Set-1 1000 0 1500 TP  TN accuraccy  Set-2 1000 1000 1500 TP  TN  FP  FN where TP are true positives, TN are true negatives,FN are false Table 2. Test data negatives and FP are false positives. TP Sets Number of Documents precision  TP  FP TP Task-1 900 recall  TP  FN Task-2 1400 2x precision x recall F precision + recall Table 3. Examples of sentences from Train dataset Results for the semantic similarity approach on the test data areshown in Table3. id Sentence pair Tag Table3. Results on test data 1 ററായൽ ചലറേഴ്സിളന ആറു Task No. of Accuracy F-measure P sentences വിക്കറ്റിന് തകർത്ത് മുംകൈ Task-1 900 0.76 0.75 വീണ്ും വിജയവഴിയിൽ. Task-2 1400 0.52 0.51 theAssociation for Computational Linguistics (ACL-02),, Pennsylvania, PA,pages 152–159. [3] Qiu Long, Min-Yen Kan, and Tat-Seng 6. CONCLUSION AND FUTURE WORK Chua.,2006,Paraphrase recognition via dissimilarity This paper presented an approach to the problemof paraphrase significanceclassification., In Proceedings of the 2006 detection in Malayalam language. Paraphrase has been Conferenceon Empirical Methods in Natural Language identifiedbased on the tokens and its synonyms that are common Processing, , Sydney, Australia, July.Association for thathas been taken as attribute for checking paraphrase. Thewords computational Linguistics,pages 18-26. are checked against Malayalam Wordnet. Bycalculating the token matching ,lemma match and synonymtoken matching andfixing [4] Zhang. Y and Jon Patrick.,2005, Paraphrase identification by an appropriate threshold value, the given sentence can be text canonicalization, In Proceedings of Australasian classified as paraphrase, semi-paraphrase sentence or not Language Technology Workshop 2005, Sydney, paraphrase. Australia,pages 160-166. From the obtained values of Accuracy and F-measure, we [5] Mihalcea.R, Courtney Corley, and Carlo Strapparava., consider combining the similarity approaches in future to improve 2006,Corpus-based and Knowledge-based Measures of Text the efficiency of the system. Also, the accuracy of this method can Semantic Similarity, In Proceedings of the American be further enhanced by including a spell-checker and correcting Association for Artificial Intelligence (AAAI ). misspelled words before similarity checking. [6] Sundaram, Mahalakshmi Shanmuga, Anand Kumar M, and 7. REFERENCES Soman Kotti Padannayil,"AMRITA CEN@ SemEval- 2015: Paraphrase Detection for Twitter using Unsupervised Feature Learning with Recursive Autoencoders." SemEval- [1] AnandKumar, M., Singh, S., Kavirajan, B., and Soman, K 2015. .P.2016.DPIL@FIRE2016: Overview of shared task on Detecting Paraphrases in Indian Languages. Working notes [7] Mahalakshmi, S., Anand Kumar, M., Soman, K.P, of FIRE 2016 - Forum for Information Retrieval Evaluation, 2015, Paraphrase detection for Tamil language using deep Kolkata,India, December7-10,CEUR Workshop learning algorithm, International Journal of Applied Proceedings,CEUR-WS.org Engineering Research, 10 (17), pp. 13929-13934. [2] Clough Paul, Robert Gaizauskas, Scott Piao, and YorickWilks.,2002,METER: MEasuring TExt Reuse.InProceedings of the 40th Anniversary Meeting for