=Paper=
{{Paper
|id=Vol-1737/T6-1
|storemode=property
|title= DPIL@FIRE2016: Overview of the Shared task on Detecting Paraphrases in Indian language
|pdfUrl=https://ceur-ws.org/Vol-1737/T6-1.pdf
|volume=Vol-1737
|authors=Anand Kumar M,Shivkaran Singh,Kavirajan B,Soman K P
|dblpUrl=https://dblp.org/rec/conf/fire/MSBP16
}}
== DPIL@FIRE2016: Overview of the Shared Task on Detecting Paraphrases in Indian Languages ==
Anand Kumar M, Shivkaran Singh, Kavirajan B, Soman K P
Center for Computational Engineering and Networking (CEN), Amrita School of Engineering, Coimbatore, Amrita Vishwa Vidyapeetham, Amrita University
m_anandkumar@cb.amrita.edu, kp_soman@amrita.edu

ABSTRACT
This paper presents an overview of the shared task "Detecting Paraphrases in Indian Languages" (DPIL) conducted at FIRE 2016. Given a pair of sentences in the same language, participants were asked to detect the semantic equivalence between the sentences. The shared task was proposed for four Indian languages, namely Tamil, Malayalam, Hindi and Punjabi. The dataset created for the shared task has been made available online, and it is the first open-source paraphrase detection corpus for Indian languages.

CCS Concepts
Computing methodologies → Artificial intelligence → Natural language processing → Language resources
Computing methodologies → Artificial intelligence → Natural language processing → Lexical semantics

Keywords
Paraphrase detection; semantic analysis; Indian languages; DPIL corpora

1. INTRODUCTION
A paraphrase can be defined as "the same meaning of a sentence expressed in another sentence using different words". Paraphrases can be identified, generated or extracted. The proposed task focuses on sentence-level paraphrase identification for Indian languages (Tamil, Malayalam, Hindi and Punjabi). Identifying paraphrases in Indian languages is difficult because both evaluating the semantic similarity of the underlying content and understanding the morphological variations of the language are critical. Paraphrase identification is strongly connected with paraphrase generation and extraction: identification systems improve the performance of paraphrase generation by choosing the best candidate from the list produced by a generation system, and they are also used to validate paraphrase extraction systems and machine translation systems. In question answering systems, paraphrase identification plays a vital role in matching the question asked by the user against previously asked questions in order to choose the best answer. Automatic short-answer grading is another interesting application that needs semantic similarity to assign grades to short answers. Plagiarism detection likewise relies on paraphrase identification to detect sentences that are paraphrases of other sentences.

One of the most commonly used corpora for paraphrase detection is the MSRP corpus [1], which contains 5,801 English sentence pairs from news articles, manually labeled with 67% paraphrases and 33% non-paraphrases. Since there are no annotated corpora or automated semantic interpretation systems available for Indian languages to date, creating benchmark data for paraphrases and using that data in open shared-task competitions will motivate the research community to pursue further research in Indian languages.

Details about the task and dataset can be found on the shared task website¹. Related tasks and corpora are reviewed in Section 2, the subtasks and evaluation metrics are described in Section 3, paraphrase corpus creation and statistics are explored in Section 4, system descriptions of the participants and result analyses are given in Section 5, and the findings from the results are discussed in Section 6.

2. RELATED TASKS AND CORPORA
In SemEval-2015², a shared task on Paraphrase and Semantic Similarity In Twitter (PIT) [2] was conducted with the English Twitter Paraphrase Corpus [3]. The task had two sentence-level subtasks, a paraphrase identification task and a semantic textual similarity task; the same dataset was used for both, differing only in annotation and evaluation. ParaPhraser [4] is a freely available, manually annotated corpus of Russian sentence pairs, used in a recently organized shared task on paraphrase detection for the Russian language. That task also had two subtasks: a three-class classification (given a pair of sentences, predict whether they are precise paraphrases, near paraphrases or non-paraphrases) and a binary classification (predict whether they are paraphrases or non-paraphrases).

The Microsoft Research Paraphrase (MSRP) corpus is a well-known, manually annotated corpus consisting of 5,801 paraphrase pairs in English. The PAN plagiarism corpus 2010 (Paraphrase for Plagiarism, P4P) is used for the evaluation of automatic plagiarism detection algorithms; the corpus [5] is manually annotated with the paraphrase phenomena the pairs contain and is composed of 847 source-plagiarism pairs in English. A complete summary of existing paraphrase corpora and of the linguistic phenomena underlying paraphrases is given in [6]. In [7], the issue of text plagiarism in Hindi using English source documents is addressed. For Tamil, paraphrase detection using deep learning techniques is applied in [8]. For Malayalam, paraphrase identification using fingerprinting [9] and statistical similarity [10] has been performed.

¹ http://nlp.amrita.edu/dpil_cen/
² http://alt.qcri.org/semeval2015/
Table 1. Examples of sentence pairs for Hindi and Tamil

Hindi
  P:  मृतका निशा तीि भाई-बहिों में सबसे बडी थी। [The deceased Nisha was the eldest of three siblings.]
      तीि भाई-बहिों में सबसे बडी थी मृतका निशा। [Out of three siblings, the deceased Nisha was the eldest.]
  SP: उपमंत्री की बेनसक सैलरी 10 हजार से बढ़कर 35 हजार हो गई है। [The basic salary of the deputy minister has increased from 10k to 35k.]
      उपमंत्री की बेनसक सैलरी 35 हजार हो गई है। [The basic salary of the deputy minister is 35k.]
  NP: नजमिानटिक में दीपा 4th पोनजशि पर रहीथीं। [Deepa came at 4th position in gymnastics.]
      11 भारतीय पुरुष नजमिाटि आजादी के बाद से ओललंनपक में जाचुकेहैं। [Since independence, 11 Indian male gymnasts have been to the Olympics.]

Tamil
  P:  புதுச்சேரியில் 84 ேதவீத வரக்குப்பதிவு [84 percent voting in Puducherry]
      புதுச்சேரி ேட்டேபப சதர்தலில் 84 ேதவீத ஓட்டுப்பதிவரனது [The Puducherry assembly elections recorded 84 percent of the vote]
  SP: அப்துல்கலரம் கனபவ நிபைசவற்றும் வபகயில் மரதம் ஒரு சேயற்பகசகரள் அனுப்ப திட்டம் [In order to fulfill Abdul Kalam's dream, there is a plan to send a satellite every month]
      ஒரு சேயற்பகசகரபை அனுப்ப சவண்டும் என்பது அப்துல்கலரமின் கனவு [Abdul Kalam's dream was to send a satellite]
  NP: அபைகைில் இருந்தும் ேிபலகள், ஓவியங்கள் கிபடத்தன [Statues and paintings were found in the rooms]
      மூன்று நரட்கள் நடத்தப்பட்ட சேரதபனயில் சமரத்தம் 71 கற்ேிபலகள் மீட்கப்பட்டுள்ைன [A total of 71 stone statues were recovered in a three-day raid]

3. TASK DESCRIPTION & EVALUATION METRICS

3.1 Task description
The shared task on Detecting Paraphrases in Indian Languages (DPIL) comprised two subtasks:

Subtask 1: Given a pair of sentences from the newspaper domain, classify them as paraphrases (P) or non-paraphrases (NP).

Subtask 2: Given a pair of sentences from the newspaper domain, identify whether they are paraphrases (P), semi-paraphrases (SP) or non-paraphrases (NP).

Subtask 2 is identical to Subtask 1 except for its three-point paraphrase scale, which makes the task even more challenging.

Table 1 gives examples of paraphrase, semi-paraphrase and non-paraphrase pairs for Hindi and Tamil; P, SP and NP are the tags used for Paraphrase, Semi-Paraphrase and Non-Paraphrase. An English translation of each sentence is provided so that non-native speakers can follow the meaning. As the examples show, paraphrased sentence pairs contain the same information, semi-paraphrased pairs lack some of the information, and non-paraphrases convey entirely different information.

3.2 Evaluation metrics
The evaluation metrics for Subtask 1 and Subtask 2 differ slightly because of the nature of the tasks. Runs for Subtask 1 were evaluated with accuracy (1) and F1-score (2), calculated as follows:

    Accuracy = (number of correct instances) / (total number of instances)                  (1)

    Precision_P = (number of correct paraphrases) / (number of detected paraphrases)
    Recall_P    = (number of correct paraphrases) / (number of reference paraphrases)

    F1-score_P = (2 × Precision_P × Recall_P) / (Precision_P + Recall_P)                    (2)

The subscript P refers to the paraphrase (P) class in Subtask 1; accuracy and F1-score for the non-paraphrase class can be calculated analogously. To evaluate runs for Subtask 2 we used accuracy, micro-F score and macro-F score. Since Subtask 2 is a multi-class classification task, accuracy and the micro-F measure give identical scores. The macro-F score (3) is computed as:

    Macro-P  = (Precision_P + Precision_NP + Precision_SP) / (number of classes)
    Macro-R  = (Recall_P + Recall_NP + Recall_SP) / (number of classes)

    Macro-F1 score = (2 × Macro-P × Macro-R) / (Macro-P + Macro-R)                          (3)

where Macro-P and Macro-R are the macro-averaged precision and recall.
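To make the scoring concrete, the following minimal Python sketch implements the metrics of equations (1)-(3); the function names and the toy tag sequences are ours, not part of the official evaluation script.

    def accuracy(gold, pred):
        """Eq. (1): fraction of sentence pairs whose predicted tag matches the gold tag."""
        return sum(g == p for g, p in zip(gold, pred)) / len(gold)

    def prf_for_class(gold, pred, cls):
        """Per-class precision, recall and F1, e.g. cls='P' for Eq. (2)."""
        detected = sum(p == cls for p in pred)                    # pairs the system tagged cls
        reference = sum(g == cls for g in gold)                   # pairs annotated as cls
        correct = sum(g == p == cls for g, p in zip(gold, pred))
        precision = correct / detected if detected else 0.0
        recall = correct / reference if reference else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f1

    def macro_f1(gold, pred, classes=("P", "SP", "NP")):
        """Eq. (3): average precision and recall over the classes, then combine."""
        scores = [prf_for_class(gold, pred, c) for c in classes]
        macro_p = sum(p for p, _, _ in scores) / len(classes)
        macro_r = sum(r for _, r, _ in scores) / len(classes)
        return 2 * macro_p * macro_r / (macro_p + macro_r) if macro_p + macro_r else 0.0

    # toy Subtask 2 run
    gold = ["P", "SP", "NP", "P"]
    pred = ["P", "NP", "NP", "SP"]
    print(accuracy(gold, pred), macro_f1(gold, pred))   # 0.5 0.5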
4. PARAPHRASE CORPUS FOR INDIAN LANGUAGES
A paraphrase is a linguistic phenomenon with many applications in language teaching as well as in computational linguistics. Linguistically, paraphrases are defined in terms of meaning: according to Meaning-Text Theory [11], syntactic constructions that retain semantic equivalence are regarded as paraphrases, and the degree to which meaning is preserved between a source text and its paraphrased version marks the range of semantic similarity between them. Paraphrase is a fine-grained mechanism for shaping various language models. Different linguistic units, such as synonyms, near-synonyms, figurative meanings and metaphors, are the basic elements of paraphrasing, which is closely related to synonymy. Paraphrasing occurs not only at the lexical level but also at other linguistic levels, such as the phrasal and sentential levels; the different levels reveal the diverse forms of paraphrases and their semantic relationship to the source text. Among paraphrase typologies, lexical paraphrasing is the most common form found in the literature. For example, for the source sentence "The two ships were acquired by the navy after the war", possible paraphrases include "The two ships were conquered by the navy after the war" and "The two ships were won by the navy after the war", and many more are possible. Here the source verb 'acquire' is replaced by exact synonyms while the syntactic structure stays the same; such pairs are the clearest examples of exact paraphrases. Other common paraphrase typologies include approximate paraphrases, sentential-level paraphrases, adding extra linguistic units, changing word order, and so on.

The shared task on Detecting Paraphrases in Indian Languages (DPIL)³ required participants to identify sentential paraphrases in four Indian languages, namely Hindi, Tamil, Malayalam and Punjabi. Corpus creation for these languages started with collecting news articles from various web-based news sources. The collected data were cleaned of noise and informal content; some sentences also required spelling corrections and text transformations. After removing these irregularities, the dataset was annotated according to the paraphrase phenomenon (Paraphrase, Non-Paraphrase, Semi-Paraphrase) present in each sentence pair, using the tags P, SP and NP. The annotations were made by language experts for each language, and the annotated files were proofread by a linguistic expert and then again by a language expert (two-step proofreading). Finally, the proofread dataset was converted to Extensible Markup Language (XML) format.

³ http://nlp.amrita.edu/dpil_cen/
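For participants, consuming the released files amounts to reading sentence pairs and tags out of that XML. The sketch below shows one plausible way to do so in Python; the element and attribute names (pair, s1, s2, tag) are illustrative assumptions, not the documented schema of the DPIL release.

    import xml.etree.ElementTree as ET

    def load_pairs(path):
        """Read (sentence1, sentence2, tag) triples from a DPIL-style XML file.

        Assumes a layout like <corpus><pair tag="P"><s1>...</s1><s2>...</s2></pair>...</corpus>;
        check the released data for the actual element and attribute names.
        """
        root = ET.parse(path).getroot()
        pairs = []
        for pair in root.iter("pair"):          # one element per sentence pair
            s1 = pair.findtext("s1", default="")
            s2 = pair.findtext("s2", default="")
            tag = pair.get("tag")               # "P", "SP" or "NP"
            pairs.append((s1, s2, tag))
        return pairs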
4.1 Corpora statistics
The paraphrase corpus was further analysed for parameters such as the number of sentence pairs per class (P, NP and SP), the average number of words per sentence per task, and the overall vocabulary size. The numbers of sentence pairs in the training and testing phases of each subtask are given in Table 2.

Table 2. Sentence pairs in Subtask 1 & Subtask 2
    Language     Subtask 1 Train   Subtask 1 Test   Subtask 2 Train   Subtask 2 Test
    Tamil        2500              900              3500              1400
    Malayalam    2500              900              3500              1400
    Hindi        2500              900              3500              1400
    Punjabi      1700              500              2200              750

The average number of words per sentence, along with the average pair length, is given in Table 3 for Subtask 1 and Table 4 for Subtask 2.

Table 3. Average number of words per sentence, Subtask 1
    Language     Sentence 1   Sentence 2   Pair
    Hindi        16.058       16.376       16.217
    Tamil        11.092       12.044       11.568
    Malayalam     9.253        9.035        9.144
    Punjabi      19.485       19.582       19.534

Table 4. Average number of words per sentence, Subtask 2
    Language     Sentence 1   Sentence 2   Pair
    Hindi        17.780       16.480       17.130
    Tamil        11.097       11.777       11.437
    Malayalam     9.414        8.449        8.932
    Punjabi      20.994       19.699       20.347

The overall vocabulary size (Subtask 1 & Subtask 2, training as well as testing, for all languages) is shown as a line chart in Figure 1. Notably, the vocabulary sizes for Hindi and Punjabi are smaller than those for Tamil and Malayalam. Like the other Dravidian languages (Kannada and Telugu), Tamil and Malayalam are agglutinative, so they end up with more unique word forms and hence a larger vocabulary.

[Figure 1. Overall vocabulary size: unique words (0 to 35,000) per language (Tamil, Hindi, Malayalam, Punjabi) for Train-Task1, Train-Task2, Test-Task1 and Test-Task2.]
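The per-language figures in Tables 3-4 and Figure 1 are straightforward to reproduce once the pairs are loaded. Here is a minimal sketch under the assumption of whitespace tokenization; the official counts may use a different tokenizer.

    def word_count(sentence):
        # whitespace tokenization -- an assumption; the organizers' tokenizer may differ
        return len(sentence.split())

    def average_lengths(pairs):
        """Average words per sentence and per pair, as in Tables 3 and 4."""
        n = len(pairs)
        avg_s1 = sum(word_count(s1) for s1, s2, tag in pairs) / n
        avg_s2 = sum(word_count(s2) for s1, s2, tag in pairs) / n
        return avg_s1, avg_s2, (avg_s1 + avg_s2) / 2

    def vocabulary_size(pairs):
        """Number of unique word forms, the quantity plotted in Figure 1."""
        vocab = set()
        for s1, s2, tag in pairs:
            vocab.update(s1.split())
            vocab.update(s2.split())
        return len(vocab)

Counting surface forms this way inflates the vocabulary for agglutinative Tamil and Malayalam exactly as the chart shows, since each stem appears under many suffixed variants.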
5. SYSTEM DESCRIPTION AND RESULTS
A total of 35 teams registered for the shared task and, of those, 11 teams successfully submitted runs. A brief description of the methodology used by each team follows.

5.1 Participant system descriptions
ANUJ: Participated in Hindi only. The sentences are pre-processed with a stemmer, Soundex and a synonym handler; features are then extracted from overlapping words and normalized IDF scores, and a Random Forest classifier performs the classification.

ASE: Participated in Hindi only. Features are extracted using POS tags and stemming information, and a semantic similarity metric consults WordNet to check whether compared words are synonyms. A decision tree classifier detects the paraphrases.

BITS_PILANI: Participated in Hindi only. The team experimented with different classifiers, finally using Logistic Regression for Subtask 1 and Random Forest for Subtask 2.

CUSAT-TEAM: Participated in Malayalam only. Words are stemmed, sentence vectors are built with a bag-of-words model, a similarity score between the sentences is computed, and a threshold determines the class.

CUSAT_NLP: Participated in Malayalam only. Identical tokens, matching lemmas and synonyms are used to measure the similarity between sentences, with an in-house Malayalam WordNet used for synonym replacement; the similarity score is compared against a fixed threshold to identify the class.

HIT2016: Participated in all four languages. Cosine distance, Jaccard coefficient, Dice distance and METEOR features are classified with a Gradient Boosting Tree; the team experimented with various aspects of the classification method.

JU_NLP: Participated in all four languages. Similarity-based features, word-overlap features and scores from machine translation evaluation metrics measure the similarity between sentence pairs; three classifiers were tried, namely Naïve Bayes, SVM and SMO.

KEC@NLP: Participated in Tamil only. An existing Tamil shallow parser extracts morphological features, and Support Vector Machine and Maximum Entropy classifiers detect the paraphrases.

KS_JU: Participated in all four languages. Lexical- and semantic-level (word embedding) similarity measures provide the features for a multinomial logistic regression classifier.

NLP-NITMZ: Participated in all four languages. Features based on Jaccard similarity, length-normalized edit distance and cosine similarity are trained with a Probabilistic Neural Network (PNN) to detect the paraphrases.

5.2 Overall results
As announced during the shared task, the Sarwan award is given to the top performer in each language. The top performing team for each language is given in Table 5, and the overall results of all participating teams in Table 6. For presentation, the evaluation measures (precision, recall and accuracy) are truncated to two digits⁴.

Table 5. Top performers in each language
    Rank      Punjabi         Hindi           Malayalam       Tamil
    First*    0.932 (HIT)     0.907 (Anuj)    0.785 (HIT)     0.776 (HIT)
    Second    0.922 (JU_KS)   0.896 (HIT)     0.729 (JU_KS)   0.741 (KEC)
    Third     0.913 (JU)      0.876 (JU_KS)   0.713 (NIT-MZ)  0.727 (NIT-MZ)

⁴ This does not affect the results of the participating teams.
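Most of the submitted systems share the same shape: compute a handful of surface-similarity features per sentence pair and feed them to an off-the-shelf classifier. The sketch below is a generic reconstruction of that pipeline (closest in spirit to NLP-NITMZ's Jaccard and cosine features and KS_JU's multinomial logistic regression), not any team's actual code; all names are ours.

    from sklearn.linear_model import LogisticRegression

    def jaccard(a, b):
        """Jaccard similarity between two token sets."""
        return len(a & b) / len(a | b) if a | b else 0.0

    def pair_features(s1, s2):
        t1, t2 = set(s1.split()), set(s2.split())
        overlap = len(t1 & t2)                                   # raw word overlap
        len_ratio = min(len(t1), len(t2)) / max(len(t1), len(t2)) if t1 and t2 else 0.0
        return [jaccard(t1, t2), overlap, len_ratio]

    def train_classifier(train_pairs):
        """train_pairs: (sentence1, sentence2, tag) triples, e.g. from load_pairs()."""
        X = [pair_features(s1, s2) for s1, s2, _ in train_pairs]
        y = [tag for _, _, tag in train_pairs]
        return LogisticRegression(max_iter=1000).fit(X, y)       # handles P/SP/NP natively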
They used existing Tamil Shallow parser to extract the 4 It does not affect the result of the participating teams Table 6. Overall result for Subtask 1 & Subtask 2 Subtask 1 Subtask 2 Team Name Language F1 F1 Precision Recall Accuracy Precision Recall Accuracy Score Score Anuj Hindi 0.95 0.90 0.9200 0.91 0.90 0.90 0.9014 0.90 ASE Hindi 0.41 0.35 0.3588 0.34 0.35 0.35 0.3542 0.35 ASE Hindi 0.82 0.97 0.8922 0.89 0.68 0.67 0.6660 0.67 BITS-PILANI Hindi 0.91 0.90 0.8977 0.89 0.72 0.72 0.7171 0.71 CUSAT NLP Malayalam 0.83 0.72 0.7622 0.75 0.52 0.52 0.5207 0.51 CUSATTEAM Malayalam 0.79 0.88 0.8044 0.76 0.51 0.50 0.5085 0.46 DAVPBI Punjabi 0.95 0.92 0.9380 0.94 0.77 0.76 0.7466 0.73 HIT2016 Hindi 0.97 0.84 0.8966 0.89 0.90 0.90 0.9000 0.89 HIT2016 Malayalam 0.84 0.87 0.8377 0.81 0.74 0.75 0.7485 0.74 HIT2016 Punjabi 0.95 0.94 0.9440 0.94 0.95 0.95 0.9226 0.92 HIT2016 Tamil 0.82 0.87 0.8211 0.79 0.75 0.75 0.7550 0.73 JU-NLP Hindi 0.75 0.99 0.8222 0.74 0.68 0.68 0.6857 0.68 JU-NLP Malayalam 0.58 0.99 0.5900 0.16 0.42 0.42 0.4221 0.30 JU-NLP Punjabi 0.95 0.94 0.9420 0.94 0.91 0.91 0.8866 0.88 JU-NLP Tamil 0.57 1.00 0.5755 0.09 0.55 0.55 0.5507 0.43 KS_JU Hindi 0.94 0.89 0.9066 0.90 0.85 0.85 0.8521 0.84 KS_JU Malayalam 0.83 0.82 0.8100 0.79 0.66 0.66 0.6614 0.65 KS_JU Punjabi 0.95 0.94 0.9460 0.95 0.92 0.92 0.8960 0.89 KS_JU Tamil 0.79 0.85 0.7888 0.75 0.67 0.67 0.6735 0.66 NLP@KEC Tamil 0.82 0.87 0.8233 0.79 0.68 0.68 0.6857 0.66 NLP-NITMZ Hindi 0.92 0.92 0.9155 0.91 0.78 0.78 0.7857 0.76 NLP-NITMZ Malayalam 0.8 0.94 0.8344 0.79 0.62 0.62 0.6243 0.60 NLP-NITMZ Punjabi 0.95 0.94 0.9420 0.94 0.83 0.83 0.8120 0.80 NLP-NITMZ Tamil 0.8 0.92 0.8333 0.79 0.66 0.66 0.6571 0.63 Table 7. various Features used by the participants BITS- CUSAT CUSAT NLP- Features Anuj ASE HIT2016 JU-NLP KS_JU NLP@KEC PILANI NLP TEAM NITMZ POS Stem/Lemma Stopwords Word Overlap Synonym Cosine Jaccord Levinstin METEOR/BLEU Others IDF Soundex WordNet BoW N-gram Dice word2vec Morph Log Reg/ Gradient Multi- Random Maximum Classifier J 48 Random Threshold Threshold Tree SMO nomial Log Prob NN Forest Entropy Forest Boosting Reg Due to some formatting issues, this participant re-submitted the system after deadline. This participant didn’t submitted the working notes. 7. CONCLUSIONS AND FUTURE SCOPE [7] Kothwal, R. and Varma, V., 2013. Cross lingual text reuse In this overview paper, we explained the paraphrase corpus details detection based on keyphrase extraction and similarity and evaluation results of subtask-1 and subtask-2 of Detecting measures. In Multilingual Information Access in South Asian Paraphrases in Indian Languages (DPIL) shared task held at the Languages (pp. 71-78). Springer Berlin Heidelberg. 8th Forum for Information Retrieval (FIRE) Conference - 2016. A [8] Mahalakshmi, S., Anand Kumar, M., Soman, K.P., 2015. total number of 35 teams registered in which 11 teams submitted Paraphrase detection for Tamil language using Deep learning their runs successfully. The corpora developed for the shared task algorithm. International journal of Applied Engineering is the first publicly available paraphrase detection corpora for Research, 10 (17), pp. 13929-13934 Indian languages. Detecting paraphrases and semantic similarity [9] Idicula, S.M., 2015, December. Fingerprinting based in Indian languages is a challenging task because the detection system for identifying plagiarism in Malayalam morphological variations and the semantic relations in Indian text documents. 
7. CONCLUSIONS AND FUTURE SCOPE
In this overview paper, we described the paraphrase corpus and the evaluation results of Subtask 1 and Subtask 2 of the Detecting Paraphrases in Indian Languages (DPIL) shared task held at the 8th Forum for Information Retrieval Evaluation (FIRE) conference, 2016. A total of 35 teams registered, of which 11 submitted their runs successfully. The corpora developed for the shared task are the first publicly available paraphrase detection corpora for Indian languages. Detecting paraphrases and semantic similarity in Indian languages is challenging because the morphological variations and semantic relations in these languages are harder to capture. Discrepancies may remain in the manually annotated paraphrase corpus, and feedback for revising the annotations is welcome and appreciated. Our detailed analysis of the experiments provides fundamental insights into the performance of paraphrase identification in Indian languages. Overall, HIT2016 (Heilongjiang Institute of Technology) took first place in Tamil, Malayalam and Punjabi, and Anuj (Sapient Global Markets) took first place in Hindi. As future work, we plan to extend the task to analyze performance on cross-genre and cross-lingual paraphrases for more Indian languages. Detecting paraphrases in social media content in Indian languages, plagiarism detection, and the use of paraphrases in machine translation evaluation are also interesting areas for further study.

8. ACKNOWLEDGEMENT
First, we would like to thank the FIRE 2016 organizers for giving us the opportunity to organize the shared task on Detecting Paraphrases in Indian Languages (DPIL). We extend our gratitude to the advisory committee members, Prof. Ramanan (RelAgent Pvt. Ltd) and Prof. Rajendran S (Computational Engineering and Networking, CEN), for actively supporting us throughout the track. We also thank our PG students at CEN for helping us create the paraphrase corpora.

9. REFERENCES
[1] Dolan, W.B. and Brockett, C., 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of IWP.
[2] Xu, W., Callison-Burch, C. and Dolan, W.B., 2015. SemEval-2015 Task 1: Paraphrase and semantic similarity in Twitter (PIT). In Proceedings of SemEval.
[3] Xu, W., Ritter, A., Callison-Burch, C., Dolan, W.B. and Ji, Y., 2014. Extracting lexically divergent paraphrases from Twitter. Transactions of the Association for Computational Linguistics, 2, pp. 435-448.
[4] Pronoza, E., Yagunova, E. and Pronoza, A., 2016. Construction of a Russian paraphrase corpus: unsupervised paraphrase extraction. In Information Retrieval (pp. 146-157). Springer International Publishing.
[5] Potthast, M., Stein, B., Barrón-Cedeño, A. and Rosso, P., 2010. An evaluation framework for plagiarism detection. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters (pp. 997-1005). Association for Computational Linguistics.
[6] Rus, V., Banjade, R. and Lintean, M.C., 2014. On paraphrase identification corpora. In LREC (pp. 2422-2429).
[7] Kothwal, R. and Varma, V., 2013. Cross lingual text reuse detection based on keyphrase extraction and similarity measures. In Multilingual Information Access in South Asian Languages (pp. 71-78). Springer Berlin Heidelberg.
[8] Mahalakshmi, S., Anand Kumar, M. and Soman, K.P., 2015. Paraphrase detection for Tamil language using deep learning algorithm. International Journal of Applied Engineering Research, 10(17), pp. 13929-13934.
[9] Idicula, S.M., 2015. Fingerprinting based detection system for identifying plagiarism in Malayalam text documents. In 2015 International Conference on Computing and Network Communications (CoCoNet) (pp. 553-558). IEEE.
[10] Mathew, D. and Idicula, S.M., 2013. Paraphrase identification of Malayalam sentences: an experience. In 2013 Fifth International Conference on Advanced Computing (ICoAC) (pp. 376-382). IEEE.
[11] Kahane, S., 2003. The Meaning-Text Theory. In Dependency and Valency: An International Handbook of Contemporary Research, 1, pp. 546-570.