
DPIL@FIRE 2016: Overview of Shared Task on Detecting Paraphrases in Indian Languages (DPIL)

CCS Concepts

Paraphrase detection; Semantic analysis; Indian languages; DPIL Corpora

Anand Kumar M, Shivkaran Singh, Kavirajan B, Soman K P
Center for Computational Engineering and Networking (CEN), Amrita School of Engineering, Coimbatore, Amrita Vishwa Vidyapeetham, Amrita University

This paper presents an overview of the shared task "Detecting Paraphrases in Indian Languages" (DPIL) conducted at FIRE 2016. Given a pair of sentences in the same language, participants are asked to detect the semantic equivalence between the sentences. The shared task is proposed for four Indian languages, namely Tamil, Malayalam, Hindi and Punjabi. The dataset created for the shared task has been made available online and is the first open-source paraphrase detection corpus for Indian languages.

1. INTRODUCTION

A paraphrase can be defined as “the same meaning of a sentence is expressed in another sentence using different words”.

Paraphrases can be identified, generated or extracted. The proposed task focuses on sentence-level paraphrase identification for Indian languages (Tamil, Malayalam, Hindi and Punjabi). Identifying paraphrases in Indian languages is difficult because it requires both evaluating the semantic similarity of the underlying content and understanding the morphological variations of the language. Paraphrase identification is strongly connected with the generation and extraction of paraphrases. A paraphrase identification system improves the performance of a paraphrase generation system by choosing the best paraphrase from the list of candidates it generates. Paraphrase identification is also used to validate paraphrase extraction systems and machine translation systems. In question answering systems, paraphrase identification plays a vital role in matching the questions asked by the user to the original questions in order to choose the best answer.

Automatic short-answer grading is another interesting application that needs semantic similarity to assign grades to short answers. Plagiarism detection also relies on paraphrase identification to detect sentences that are paraphrases of other sentences.

One of the most commonly used corpora for paraphrase detection is the MSRP corpus [1], which contains 5,801 English sentence pairs from news articles, manually labeled with 67% paraphrases and 33% non-paraphrases. Since no annotated corpora or automated semantic interpretation systems have been available for Indian languages to date, creating benchmark data for paraphrases and using that data in open shared task competitions will motivate further research in Indian languages.

Details about the task and dataset can be found on the website¹ of the shared task. Related tasks and corpora are reviewed in Section 2. The subtasks and evaluation metrics are described in Section 3, and paraphrase corpus creation and statistics are explored in Section 4. System descriptions of the participants and the results are presented in Section 5. We discuss the findings from the results in Section 6.

2. RELATED TASKS AND CORPORA

In SemEval-2015², a shared task on Paraphrase and Semantic Similarity in Twitter (PIT) [2] was conducted with the English Twitter Paraphrase Corpus [3]. The task had two sentence-level subtasks: a paraphrase identification task and a semantic textual similarity task. The same dataset was used for both subtasks, but the annotation and evaluation differed. ParaPhraser [4] is a freely available, manually annotated corpus of Russian sentence pairs, which was used in the recently organized shared task on paraphrase detection for the Russian language [whit pap]. That task had two subtasks. The first was a three-class classification: given a pair of sentences, predict whether they are precise paraphrases, near paraphrases or non-paraphrases. The second was a binary classification: given a pair of sentences, predict whether they are paraphrases or non-paraphrases. The Microsoft Research Paraphrase (MSRP) corpus is a well-known, manually annotated corpus consisting of 5,801 sentence pairs in English. The PAN plagiarism corpus 2010 (Paraphrase for Plagiarism, P4P) is used for the evaluation of automatic plagiarism detection algorithms. The corpus [5] is manually annotated with the paraphrase phenomena it contains and is composed of 847 source-plagiarism pairs in English. A complete summary of existing paraphrase corpora and of the linguistic phenomena of paraphrasing is given in [6]. In [7], the issue of text plagiarism for the Hindi language using English documents is addressed. For Tamil, paraphrase detection using deep learning techniques is applied in [8]. For Malayalam, paraphrase identification using fingerprinting [9] and statistical similarity [10] has been performed.

¹ http://nlp.amrita.edu/dpil_cen/
² http://alt.qcri.org/semeval2015/

[Table 1: Example sentence pairs with their paraphrase tags, including a Tamil pair tagged P. Surviving English glosses: “Since independence 11 male athletes have been to Olympics”; “planning is to send a satellite per month”.]

3.1 Task description

Subtask 1: Given a pair of sentences from the newspaper domain, the shared task is to classify them as paraphrases (P) or not paraphrases (NP).

Subtask 2: Given a pair of sentences from the newspaper domain, the shared task is to identify whether they are paraphrases (P), semi-paraphrases (SP) or not paraphrases (NP).

Subtask 2 was similar to Subtask 1 except for the three-point scale of paraphrase tags. This makes the shared task even more challenging.

3.2 Evaluation metrics

The evaluation metrics used for Subtask 1 and Subtask 2 were slightly different because of the uniqueness of the tasks. To evaluate Subtask 1, we used accuracy and F-score values. Accuracy is the fraction of sentence pairs that are tagged correctly. Given the precision $\mathrm{Prec}_P$ and the recall $\mathrm{Rec}_P$, the F1-score can be calculated as:

$$F_{1,P} = \frac{2 \cdot \mathrm{Prec}_P \cdot \mathrm{Rec}_P}{\mathrm{Prec}_P + \mathrm{Rec}_P} \qquad (2)$$

The subscript refers to the paraphrase (P) class for Subtask 1. Similarly, $\mathrm{Prec}_{NP}$ and $\mathrm{Rec}_{NP}$ for the non-paraphrase class could be calculated.

To evaluate runs for Subtask 2 we used the macro-averaged F1-score and accuracy. Since it is a multiclass classification task, the micro-averaged F1-score and accuracy give identical scores.
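The accuracy and F1 measures used to evaluate the runs can be sketched in a few lines of Python. This is a minimal illustration and not the organizers' evaluation script: the function names and the six-pair toy data are assumptions made for the example, and the macro and micro averages follow their standard definitions. It also shows why micro-averaged F1 adds nothing beyond accuracy when every pair receives exactly one tag.

```python
def accuracy(gold, pred):
    """Fraction of sentence pairs whose predicted tag equals the gold tag."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def counts(gold, pred, label):
    """True positives, false positives and false negatives for one class."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    return tp, fp, fn


def f1_from_counts(tp, fp, fn):
    """F1 = 2 * precision * recall / (precision + recall), 0.0 if undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def class_f1(gold, pred, label):
    """F1-score of a single class, e.g. the paraphrase (P) class."""
    return f1_from_counts(*counts(gold, pred, label))


def macro_f1(gold, pred, labels):
    """Unweighted mean of the per-class F1-scores."""
    return sum(class_f1(gold, pred, c) for c in labels) / len(labels)


def micro_f1(gold, pred, labels):
    """F1 computed from counts pooled over all classes."""
    totals = [sum(col) for col in zip(*(counts(gold, pred, c) for c in labels))]
    return f1_from_counts(*totals)


# Hypothetical toy run for the three-class setting (not DPIL data).
gold = ["P", "P", "SP", "NP", "NP", "SP"]
pred = ["P", "SP", "SP", "NP", "P", "SP"]
labels = ["P", "SP", "NP"]

print("accuracy:", accuracy(gold, pred))
print("macro F1:", macro_f1(gold, pred, labels))
print("micro F1:", micro_f1(gold, pred, labels))
```

In the single-label setting every wrong prediction counts once as a false positive (for the predicted class) and once as a false negative (for the gold class), so the pooled precision and recall both reduce to accuracy. Macro averaging, in contrast, weights the three classes equally regardless of their size.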

4. PARAPHRASE CORPUS FOR INDIAN LANGUAGES

A paraphrase is a linguistic phenomenon. It has many applications in language teaching as well as in computational linguistics. Linguistically, paraphrases are defined in terms of meaning. According to Meaning-Text Theory [11], if one or more syntactic constructions retain semantic equivalence, they are regarded as paraphrases. The degree to which the source text and its paraphrased version are interchangeable marks the range of semantic similarity between them. Paraphrasing is a fine-grained mechanism for shaping various language models. Linguistic units such as synonyms, semi-synonyms, figurative meanings and metaphors are considered the basic elements of paraphrasing. Paraphrasing is closely related to synonymy.

Paraphrasing is found not only at the lexical level but also at other linguistic levels, such as the phrasal and sentential levels. Different levels of paraphrasing reveal the diverse forms of paraphrases and their semantic relationship to the source text. In paraphrase typologies, lexical paraphrasing is the most popular form of paraphrasing found in the literature. For example, if the source text is “The two ships were acquired by the navy after the war”, then possible paraphrased versions are “The two ships were conquered by the navy after the war” and “The two ships were won by the navy after the war”. Even more paraphrases are possible for the given sentence. Here the source verb ‘acquire’ is paraphrased with its exact synonyms, and the source and the paraphrases share the same syntactic structure. Paraphrases of this type are the best examples of exact paraphrases. Other common paraphrase types include approximate paraphrases, sentential-level paraphrases, addition of extra linguistic units, and changes in word order.

The shared task on Detecting Paraphrases in Indian Languages (DPIL) required participants to identify sentential paraphrases in four Indian languages, namely Hindi, Tamil, Malayalam and Punjabi. Corpus creation for these languages started with collecting news articles from various web-based news sources. The collected data was then cleaned of noise and informal content. Apart from cleaning, some sentences required spelling corrections and text transformations. After removing all the irregularities, the dataset was annotated according to the paraphrase phenomenon (Paraphrase, Semi-Paraphrase, Non-Paraphrase) present in each sentence pair. The annotation tags used were P, SP and NP, corresponding to Paraphrase, Semi-Paraphrase and Non-Paraphrase. These annotations were done by language experts for each language. The annotated files were proofread by a linguistic expert and then again by a language expert (two-step proofreading). Finally, the proofread dataset was converted to Extensible Markup Language (XML) format.

4.1 Corpora statistics

The paraphrase corpus was analysed for parameters such as the number of sentence pairs for each class (P, NP and SP), the average number of words per sentence per task, and the overall vocabulary size. The number of sentence pairs in the training and testing phases for each subtask is given in Table 2. The average number of words per sentence, along with the average pair length, for Subtask 1 and Subtask 2 is given in Table 3 and Table 4. The overall vocabulary size (Subtask 1 and Subtask 2) for training as well as testing for all the languages is shown as a line chart in Figure 1. Notably, the vocabulary size for Hindi and Punjabi is smaller than for Tamil and Malayalam. This is because, like other Dravidian languages (Kannada and Telugu), Tamil and Malayalam are agglutinative in nature. Due to this phenomenon, Dravidian languages end up having more unique words and hence a larger vocabulary.

[Figure 1: Vocabulary size for Tamil, Hindi, Malayalam and Punjabi; series: Train-Task1, Train-Task2, Test-Task1, Test-Task2.]

5. SYSTEM DESCRIPTION AND RESULTS

A total of 35 teams registered for the shared task and, of those, 11 teams successfully submitted their runs. A brief description of the methodologies used by each team is given in the following subsection.

5.1 Participants' System Description

Brief descriptions of the techniques used by the teams that submitted runs for the shared task are as follows:

ANUJ: This team participated only in the Hindi language. They pre-processed the sentences using a stemmer, Soundex and a synonym handler. They then extracted features using overlapping words and normalized IDF scores. Finally, a Random Forest classifier was used for classification.

ASE: This team participated only in the Hindi language. They extracted features using POS tags and stemming information. A semantic similarity metric was employed, which extracts word synonyms from WordNet to check whether the compared words are synonyms. Finally, a decision tree classifier was used to detect the paraphrases.

BITS_PILANI: This team participated only in the Hindi language. They attempted paraphrase detection with different classifiers and finally used Logistic Regression for Subtask 1 and Random Forest for Subtask 2.

CUSAT-TEAM: This team participated only in the Malayalam language. They stemmed the words, calculated sentence vectors using a Bag of Words model and computed the similarity score between sentences. Finally, they set a threshold for determining the appropriate class.

CUSAT_NLP: This team participated only in the Malayalam language. They used identical tokens, matching lemmas and synonyms for finding the similarity between sentences. They also utilized an in-house Malayalam WordNet to replace the synonyms. Finally, the similarity score was compared against a fixed threshold to identify the exact class.

HIT2016: This team participated in all four languages. Cosine Distance, Jaccard Coefficient, Dice Distance and METEOR features were used, and classification was done with a Gradient Boosting Tree. They experimented with various aspects of the classification method for detecting paraphrases.

JU_NLP: This team competed in all four languages. They used similarity-based features, word-overlap features and scores from machine translation evaluation metrics to compute similarity scores between pairs of sentences. They tried three different classifiers, namely Naïve Bayes, SVM and SMO.

KEC@NLP: This team participated only in the Tamil language. They used an existing Tamil shallow parser to extract morphological features and used Support Vector Machine and Maximum Entropy classifiers for classifying paraphrases.

KS_JU: This team participated in all four languages. They used different lexical and semantic-level (word embedding) similarity measures for computing features and used a multinomial logistic regression model as the classifier.

NLP-NITMZ: This team also participated in all four languages. They used features based on Jaccard Similarity, length-normalized Edit Distance and Cosine Similarity. Finally, this feature set was used to train a Probabilistic Neural Network (PNN) to detect the paraphrases.

5.2 Overall Results

As announced during the shared task, we are giving the Sarwan award to the top performers in each language. The name of the top performing team in each language is given in Table 5. The overall results of all the participating teams can be seen in Table 6. For presentation purposes we have truncated the evaluation measures (Precision, Recall and Accuracy) to two digits⁴.

6. DISCUSSIONS

Out of the 11 teams that submitted their runs, 10 teams successfully submitted their working notes. Four teams participated in all four languages and the rest of the teams (3 Hindi, 2 Malayalam and 1 Tamil) participated in only one language. Two out of the ten teams used a threshold-based method to detect the paraphrases; the remaining teams used machine learning based approaches. The different types of feature sets used by the participating teams are illustrated in Table 7. Most of the teams used common similarity-based features such as cosine and Jaccard, and only two teams used the machine translation evaluation metrics BLEU and METEOR as features. Very few teams used synonym replacement and WordNet features. For the Tamil language, team KEC@NLP used morphological information as features for the machine learning based classifier.
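The surface-similarity features named in this section (cosine, Jaccard, Dice, length-normalized edit distance) and the threshold-based decision used by two teams can be sketched as follows. This is a generic illustration, not the code of any participating team: whitespace tokenization, the function names and the threshold values 0.4 and 0.75 are assumptions chosen for the example, and the teams' actual pre-processing (stemming, synonym replacement) is omitted. The two sentences are the lexical paraphrase example from Section 4.

```python
import math
from collections import Counter


def tokens(sentence):
    """Lower-cased whitespace tokens (a simplifying assumption)."""
    return sentence.lower().split()


def jaccard(a, b):
    """|A & B| / |A | B| over the sets of word types."""
    sa, sb = set(tokens(a)), set(tokens(b))
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def dice(a, b):
    """2 * |A & B| / (|A| + |B|) over the sets of word types."""
    sa, sb = set(tokens(a)), set(tokens(b))
    return 2 * len(sa & sb) / (len(sa) + len(sb)) if sa or sb else 0.0


def cosine(a, b):
    """Cosine of the angle between bag-of-words count vectors."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[w] * cb[w] for w in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def edit_similarity(a, b):
    """1 - Levenshtein distance / length of the longer string (characters)."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        cur = [i]
        for j, ch_b in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                    # deletion
                           cur[j - 1] + 1,                 # insertion
                           prev[j - 1] + (ch_a != ch_b)))  # substitution
        prev = cur
    longest = max(len(a), len(b))
    return 1.0 - prev[-1] / longest if longest else 1.0


def tag_from_score(score, low=0.4, high=0.75):
    """Map a similarity score to P / SP / NP with two hypothetical thresholds."""
    if score >= high:
        return "P"
    if score >= low:
        return "SP"
    return "NP"


s1 = "The two ships were acquired by the navy after the war"
s2 = "The two ships were won by the navy after the war"

features = [jaccard(s1, s2), dice(s1, s2), cosine(s1, s2), edit_similarity(s1, s2)]
print(features)
print(tag_from_score(jaccard(s1, s2)))
```

In a machine learning setup the `features` list would be passed to a classifier instead of a fixed threshold. For Subtask 1 a single threshold separating P from NP would play the role of the two thresholds used here.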

The KS_JU team created word2vec embeddings with the help of additional in-house unlabeled data and derived semantic similarity measures from them, which were used as features in the classifier.

The top performing team for three of the languages (HIT-2016) used character n-gram based features and experimented with different n-gram sizes.
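A character n-gram feature of the kind mentioned above can be sketched as below. This is only an illustration of the general idea, not HIT-2016's feature set: the choice of Jaccard overlap, the n-gram sizes 2 to 4 and the function names are assumptions. Character n-grams can match inflected or agglutinated word forms that share a stem even when the full tokens differ, which is one plausible reason they suit morphologically rich languages.

```python
def char_ngrams(text, n):
    """Set of all contiguous character substrings of length n."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def ngram_overlap(a, b, n):
    """Jaccard overlap between the character n-gram sets of two strings."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    union = ga | gb
    return len(ga & gb) / len(union) if union else 0.0


def ngram_features(a, b, sizes=(2, 3, 4)):
    """One overlap score per n-gram size, usable as a feature vector."""
    return [ngram_overlap(a, b, n) for n in sizes]


# Two related word forms share no token but share character n-grams.
print(ngram_features("navy", "naval"))
```

Python strings are sequences of Unicode code points, so in Tamil, Malayalam, Hindi or Punjabi text a consonant and its dependent vowel sign count as separate characters. A real system might first segment the text into grapheme clusters, which this sketch does not attempt.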

We calculated F1-measure and accuracy for evaluating the submissions of the teams. The accuracy on Task-2 is lower than the accuracy on Task-1 due to the complexity of the task. In general, the accuracy obtained by runs submitted for Tamil and Malayalam is lower than the accuracy obtained for Hindi and Punjabi. This is due to the agglutinative nature of the Dravidian languages.

⁴ It does not affect the result of the participating teams.
Due to some formatting issues, this participant re-submitted the system after the deadline.
This participant did not submit the working notes.

7. CONCLUSIONS AND FUTURE SCOPE

In this overview paper, we explained the paraphrase corpus details and evaluation results of Subtask 1 and Subtask 2 of the Detecting Paraphrases in Indian Languages (DPIL) shared task held at the 8th Forum for Information Retrieval Evaluation (FIRE) - 2016. A total of 35 teams registered, of which 11 teams submitted their runs successfully. The corpus developed for the shared task is the first publicly available paraphrase detection corpus for Indian languages. Detecting paraphrases and semantic similarity in Indian languages is a challenging task because the morphological variations and the semantic relations in Indian languages are more crucial to understand. Discrepancies may be found in the manually annotated paraphrase corpus; feedback for revising the annotations is welcome and appreciated. Our detailed experimental analysis provides fundamental insights into the performance of paraphrase identification in Indian languages.

Overall, HIT-2016 (HeiLongJiang Institute of Technology) took first place in Tamil, Malayalam and Punjabi, and Anuj (Sapient Global Markets) took first place in Hindi. As future work, we plan to extend the task to analyze the performance of cross-genre and cross-lingual paraphrase detection for more Indian languages. Detecting paraphrases in social media content in Indian languages, plagiarism detection, and the use of paraphrases in machine translation evaluation are also interesting areas for further study.

8. ACKNOWLEDGEMENT

First, we would like to thank the FIRE 2016 organizers for giving us the opportunity to organize the shared task on Detecting Paraphrases in Indian Languages (DPIL). We would like to extend our gratitude to the advisory committee members Prof. Ramanan, RelAgent Pvt. Ltd, and Prof. Rajendran S, Computational Engineering and Networking (CEN), for actively supporting us throughout the track. We would like to thank our PG students at CEN for helping us in creating the paraphrase corpora.

[1] Dolan, W.B. and Brockett, C., 2005, October. Automatically constructing a corpus of sentential paraphrases. In Proc. of IWP.

[2] Xu, W., Callison-Burch, C. and Dolan, W.B., 2015. SemEval-2015 Task 1: Paraphrase and semantic similarity in Twitter (PIT). Proceedings of SemEval.

[3] Xu, W., Ritter, A., Callison-Burch, C., Dolan, W.B. and Ji, Y., 2014. Extracting lexically divergent paraphrases from Twitter. Transactions of the Association for Computational Linguistics, 2, pp. 435-448.

[4] Pronoza, E., Yagunova, E. and Pronoza, A., 2016. Construction of a Russian paraphrase corpus: unsupervised paraphrase extraction. In Information Retrieval (pp. 146-157). Springer International Publishing.

[5] Potthast, M., Stein, B., Barrón-Cedeño, A. and Rosso, P., 2010, August. An evaluation framework for plagiarism detection. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters (pp. 997-1005). Association for Computational Linguistics.

[6] Rus, V., Banjade, R. and Lintean, M.C., 2014. On Paraphrase Identification Corpora. In LREC (pp. 2422-2429).

[7] Kothwal, R. and Varma, V., 2013. Cross lingual text reuse detection based on keyphrase extraction and similarity measures. In Multilingual Information Access in South Asian Languages (pp. 71-78). Springer Berlin Heidelberg.

[8] Mahalakshmi, S., Anand Kumar, M. and Soman, K.P., 2015. Paraphrase detection for Tamil language using Deep learning algorithm. International Journal of Applied Engineering Research, 10(17), pp. 13929-13934.

[9] Idicula, S.M., 2015, December. Fingerprinting based detection system for identifying plagiarism in Malayalam text documents. In 2015 International Conference on Computing and Network Communications (CoCoNet) (pp. 553-558). IEEE.

[10] Mathew, D. and Idicula, S.M., 2013, December. Paraphrase identification of Malayalam sentences - an experience. In 2013 Fifth International Conference on Advanced Computing (ICoAC) (pp. 376-382). IEEE.

[11] Kahane, S., 2003. The meaning-text theory. Dependency and Valency. An International Handbook of Contemporary Research, 1, pp. 546-570.