CEUR Workshop Proceedings Vol-1737, paper T6-1: https://ceur-ws.org/Vol-1737/T6-1.pdf
  DPIL@FIRE 2016: Overview of Shared Task on Detecting
        Paraphrases in Indian Languages (DPIL)
          Anand Kumar M, Shivkaran Singh                                             Kavirajan B, Soman K P
  Center for Computational Engineering and Networking                  Center for Computational Engineering and Networking
                         (CEN)                                                                (CEN)
       Amrita School of Engineering, Coimbatore                            Amrita School of Engineering, Coimbatore
              Amrita Vishwa Vidyapeetham                                           Amrita Vishwa Vidyapeetham
                    Amrita University                                                    Amrita University
            m_anandkumar@cb.amrita.edu                                                kp_soman@amrita.edu

ABSTRACT
This paper presents an overview of the shared task "Detecting Paraphrases in Indian Languages" (DPIL) conducted at FIRE 2016. Given a pair of sentences in the same language, participants were asked to detect the semantic equivalence between the sentences. The shared task covered four Indian languages, namely Tamil, Malayalam, Hindi and Punjabi. The dataset created for the shared task has been made available online, and it is the first open-source paraphrase detection corpus for Indian languages.

CCS Concepts
Computing methodologies → Artificial intelligence → Natural language processing → Language resources
Computing methodologies → Artificial intelligence → Natural language processing → Lexical semantics

Keywords
Paraphrase detection; Semantic analysis; Indian languages; DPIL Corpora

1. INTRODUCTION
A paraphrase can be defined as "the same meaning of a sentence expressed in another sentence using different words". Paraphrases can be identified, generated or extracted. The proposed task focuses on sentence-level paraphrase identification for Indian languages (Tamil, Malayalam, Hindi and Punjabi). Identifying paraphrases in Indian languages is difficult because evaluating the semantic similarity of the underlying content and understanding the morphological variations of the language are both critical. Paraphrase identification is strongly connected with the generation and extraction of paraphrases. Paraphrase identification systems improve the performance of paraphrase generation by choosing the best candidate from the list produced by a generation system. Paraphrase identification is also used to validate paraphrase extraction systems and machine translation systems. In question answering systems, paraphrase identification plays a vital role in matching the question asked by the user against the original questions in order to choose the best answer. Automatic short-answer grading is another interesting application, which needs semantic similarity to assign grades to short answers. Plagiarism detection likewise relies on paraphrase identification to detect sentences that are paraphrases of other sentences.

One of the most commonly used corpora for paraphrase detection is the MSRP corpus [1], which contains 5,801 English sentence pairs from news articles manually labeled with 67% paraphrases and 33% non-paraphrases. Since no annotated corpora or automated semantic interpretation systems have been available for Indian languages to date, creating benchmark data for paraphrases and using that data in open shared-task competitions will motivate the research community towards further research in Indian languages.

Details about the task and dataset can be found on the website1 of the shared task. Related tasks and corpora are reviewed in Section 2, the subtasks and evaluation metrics are described in Section 3, paraphrase corpus creation and statistics are explored in Section 4, system descriptions of the participants and result analyses are presented in Section 5, and we discuss the findings from the results in Section 6.

2. RELATED TASKS AND CORPORA
In SemEval-20152, a shared task on Paraphrase and Semantic Similarity In Twitter (PIT) [2] was conducted with the English Twitter Paraphrase Corpus [3]. The task had two sentence-level subtasks: a paraphrase identification task and a semantic textual similarity task. The same dataset was used for both subtasks, but they differ in annotation and evaluation. ParaPhraser [4] is a freely available, manually annotated corpus of Russian sentence pairs, which was used in the recently organized shared task on paraphrase detection for the Russian language [whit pap]. That task also had two subtasks: a three-class classification (given a pair of sentences, predict whether they are precise paraphrases, near paraphrases or non-paraphrases) and a binary classification (given a pair of sentences, predict whether they are paraphrases or non-paraphrases). The Microsoft Research Paraphrase (MSRP) corpus is a well-known, manually annotated corpus consisting of 5,801 paraphrase pairs in the English language. The PAN plagiarism corpus 2010 (Paraphrase for Plagiarism, P4P) is used for the evaluation of automatic plagiarism detection algorithms; the corpus [5] is manually annotated with the paraphrase phenomena it contains and is composed of 847 source-plagiarism pairs in English. A complete summary of existing paraphrase corpora and the linguistic phenomena underlying paraphrases is given in [6]. In [7], the issue of text plagiarism for the Hindi language using English documents is addressed. For the Tamil language, paraphrase detection using deep learning techniques is applied in [8]. For Malayalam, paraphrase identification using fingerprinting [9] and statistical similarity [10] has been performed.

1 http://nlp.amrita.edu/dpil_cen/
2 http://alt.qcri.org/semeval2015/
                    Table 1. Examples for Hindi and Tamil language

Hindi
  P (Paraphrase):
    मृतका निशा तीि भाई-बहिों में सबसे बडी थी।
    [The deceased Nisha was eldest of three siblings]
    तीि भाई-बहिों में सबसे बडी थी मृतका निशा।
    [Out of three siblings, deceased Nisha was the eldest]
  SP (Semi-Paraphrase):
    उपमंत्री की बेनसक सैलरी 10 हजार से बढ़कर 35 हजार हो गई है।
    [The basic salary of deputy minister is increased from 10k to 35k]
    उपमंत्री की बेनसक सैलरी 35 हजार हो गई है।
    [The basic salary of deputy minister is 35k]
  NP (Non-Paraphrase):
    नजमिानटिक में दीपा 4th पोनजशि पर रहीथीं।
    [Deepa came at 4th position in gymnastics]
    11 भारतीय पुरुष नजमिाटि आजादी के बाद से ओललंनपक में जाचुकेहैं।
    [Since independence 11 male athletes have been to Olympics]

Tamil
  P (Paraphrase):
    புதுச்சேரியில் 84 ேதவீத வரக்குப்பதிவு
    [84 percent voting in Puducherry]
    புதுச்சேரி ேட்டேபப சதர்தலில் 84 ேதவீத ஓட்டுப்பதிவரனது
    [Puducherry assembly elections recorded 84 percent of the vote]
  SP (Semi-Paraphrase):
    அப்துல்கலரம் கனபவ நிபைசவற்றும் வபகயில் மரதம் ஒரு சேயற்பகசகரள் அனுப்ப திட்டம்
    [In order to fulfill Abdul Kalam’s dream, planning is to send a satellite per month]
    ஒரு சேயற்பகசகரபை அனுப்ப சவண்டும் என்பது அப்துல்கலரமின் கனவு
    [Abdul Kalam's dream was to send a satellite]
  NP (Non-Paraphrase):
    அபைகைில் இருந்தும் ேிபலகள், ஓவியங்கள் கிபடத்தன
    [Statues and paintings were found from the rooms]
    மூன்று நரட்கள் நடத்தப்பட்ட சேரதபனயில் சமரத்தம் 71 கற்ேிபலகள் மீட்கப்பட்டுள்ைன
    [A total of 71 stone statues have been recovered in a three day raid]
3. TASK DESCRIPTION & EVALUATION METRIC

3.1 Task description
There were two subtasks under the shared task on Detecting Paraphrases in Indian Languages (DPIL). The descriptions of the subtasks are:

Subtask 1: Given a pair of sentences from the newspaper domain, classify them as paraphrases (P) or not paraphrases (NP).

Subtask 2: Given a pair of sentences from the newspaper domain, identify whether they are paraphrases (P), semi-paraphrases (SP) or not paraphrases (NP).

Subtask 2 is similar to subtask 1 except for the three-point scale of paraphrase tags, which makes the shared task even more challenging.

Table 1 includes examples of Paraphrase, Semi-Paraphrase, and Non-Paraphrase pairs for Hindi and Tamil, where P, SP and NP are the tags used for Paraphrase, Semi-Paraphrase, and Non-Paraphrase. An English translation of each sentence pair is given so that non-native speakers can follow the meaning. It can be seen that paraphrased sentence pairs contain the same information, semi-paraphrased pairs lack some additional information, and non-paraphrases convey entirely different information.

3.2 Evaluation metrics
The evaluation metrics used for subtask 1 and subtask 2 were slightly different because of the uniqueness of the tasks. To evaluate runs for subtask 1, we used accuracy and F1-score. Accuracy (1) and F1-score (2) for subtask 1 were calculated as follows:

    Accuracy = Number of correct instances / Total number of instances    (1)

    Precision_P = Number of correct paraphrases / Number of detected paraphrases

    Recall_P = Number of correct paraphrases / Number of reference paraphrases

Subsequently, the F1-score can be calculated as:

    F1-score_P = (2 × Precision_P × Recall_P) / (Precision_P + Recall_P)    (2)

The subscript P refers to the paraphrase (P) class for subtask 1. The accuracy and F1-score for the non-paraphrase class can be calculated similarly.

To evaluate runs for subtask 2, we used accuracy, micro-F score and macro-F score. Since it is a multiclass classification task, accuracy and micro-F measure give identical scores. The macro-F score (3) is computed as:

    Macro-P = (Precision_P + Precision_NP + Precision_SP) / Number of classes

    Macro-R = (Recall_P + Recall_NP + Recall_SP) / Number of classes

    Macro-F1 score = (2 × Macro-P × Macro-R) / (Macro-P + Macro-R)    (3)

where Macro-P and Macro-R are the macro precision and macro recall, which are used to calculate the Macro-F1 score.
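The evaluation formulas above can be sketched in Python. This is a minimal illustration of Equations (1)-(3), not the task's official scorer; the function and variable names are ours:

```python
def accuracy(gold, pred):
    """Fraction of correctly labelled sentence pairs (Eq. 1)."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def precision_recall_f1(gold, pred, cls):
    """Precision, recall and F1 for one tag, e.g. cls='P' (Eq. 2)."""
    detected = sum(p == cls for p in pred)                    # detected paraphrases
    reference = sum(g == cls for g in gold)                   # reference paraphrases
    correct = sum(g == p == cls for g, p in zip(gold, pred))  # correct paraphrases
    precision = correct / detected if detected else 0.0
    recall = correct / reference if reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def macro_f1(gold, pred, classes=("P", "SP", "NP")):
    """Macro-averaged F1 over the three subtask 2 tags (Eq. 3)."""
    scores = [precision_recall_f1(gold, pred, c) for c in classes]
    macro_p = sum(s[0] for s in scores) / len(classes)
    macro_r = sum(s[1] for s in scores) / len(classes)
    return 2 * macro_p * macro_r / (macro_p + macro_r) if macro_p + macro_r else 0.0
```

For subtask 1 only the tags P and NP occur, so `classes=("P", "NP")` would be passed to `macro_f1`.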

4. PARAPHRASE CORPUS FOR INDIAN LANGUAGES
A paraphrase is a linguistic phenomenon with many applications in language teaching as well as in computational linguistics. Linguistically, paraphrases are defined in terms of meaning: according to Meaning-Text Theory [11], syntactic constructions that retain semantic equivalence are regarded as paraphrases. The degree to which the source text and its paraphrased version can be exchanged marks the range of semantic closeness between them. A paraphrase is a fine mechanism for shaping various language models. Different linguistic units such as synonyms, semi-synonyms, figurative meanings and metaphors are the basic elements of paraphrasing, and paraphrasing is closely related to synonymy. Paraphrasing is found not only at the lexical level but also at other linguistic levels, such as the phrasal and sentential levels. Different levels of paraphrasing disclose the diversified forms of paraphrases and their semantic relationship to the source text. Among paraphrase typologies, lexical paraphrasing is the most popular form found in the literature. For example, if the source text is "The two ships were acquired by the navy after the war", possible paraphrased versions are "The two ships were conquered by the navy after the war" and "The two ships were won by the navy after the war"; even more paraphrases are possible for this sentence. Here the source verb 'acquire' is paraphrased with its exact synonyms, and the source and paraphrases show the same syntactic structure. Such pairs are the best examples of exact paraphrases. Other common paraphrase typologies are approximate paraphrases, sentential-level paraphrases, adding extra linguistic units, changing the word order, etc.

The shared task on Detecting Paraphrases in Indian Languages (DPIL)3 required participants to identify sentential paraphrases in four Indian languages, namely Hindi, Tamil, Malayalam, and Punjabi. Corpus creation for these languages started with collecting news articles from various web-based news sources. The collected dataset was then cleaned of noise and informal content; apart from cleaning, some sentences required spelling corrections and text transformations. After removing all irregularities, the dataset was annotated according to the paraphrase phenomena (Paraphrase, Non-Paraphrase, Semi-Paraphrase) present in each sentence pair, using the tags P, SP and NP. These annotations were done by language experts for each language. The annotated files were proofread first by a linguistic expert and then again by a language expert (two-step proofreading). Finally, the proofread dataset was converted to Extensible Markup Language (XML) format.

4.1 Corpora statistics
The paraphrase corpus was further analysed for parameters such as the number of sentence pairs per class (P, NP, and SP), the average number of words per sentence per task, and the overall vocabulary size. The statistics for the number of sentence pairs in the training and testing phases of each subtask are given in Table 2.

          Table 2. Statistics for sentence pairs in Subtask 1 & 2

    Language     Subtask 1 Train   Subtask 1 Test   Subtask 2 Train   Subtask 2 Test
    Tamil             2500              900              3500             1400
    Malayalam         2500              900              3500             1400
    Hindi             2500              900              3500             1400
    Punjabi           1700              500              2200              750

The average number of words per sentence, along with the average pair length, for subtask 1 and subtask 2 is given in Table 3 and Table 4.

    Table 3. Average number of words per sentence for Subtask 1

    Language     Sentence 1   Sentence 2    Pair
    Hindi          16.058       16.376     16.217
    Tamil          11.092       12.044     11.568
    Malayalam       9.253        9.035      9.144
    Punjabi        19.485       19.582     19.534

    Table 4. Average number of words per sentence for Subtask 2

    Language     Sentence 1   Sentence 2    Pair
    Hindi          17.78        16.48      17.130
    Tamil          11.097       11.777     11.437
    Malayalam       9.414        8.449      8.932
    Punjabi        20.994       19.699     20.347

The overall vocabulary size (Subtask 1 & Subtask 2) for training as well as testing for all the languages is shown as a line chart in Figure 1. Notably, the vocabulary size for Hindi and Punjabi is smaller than for Tamil and Malayalam. This is because, like other Dravidian languages (Kannada and Telugu), Tamil and Malayalam are agglutinative in nature; due to this phenomenon, Dravidian languages end up having more unique words and hence a larger vocabulary.

5. SYSTEM DESCRIPTION AND RESULTS
A total of 35 teams registered for the organized shared task and, of those, 11 teams successfully submitted their runs. A brief description of the methodology used by each team is given in the following subsection.

3 http://nlp.amrita.edu/dpil_cen/
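The corpus statistics reported in Section 4.1, namely the average words per sentence in Tables 3 and 4 and the vocabulary sizes plotted in Figure 1, can be reproduced with a short sketch like the following. Whitespace tokenisation is our assumption; the paper does not state how words were counted:

```python
def corpus_statistics(sentence_pairs):
    """Average words per sentence (each side and pair average) and
    overall vocabulary size for a list of (sentence1, sentence2) pairs."""
    vocab = set()
    words1 = words2 = 0
    for s1, s2 in sentence_pairs:
        t1, t2 = s1.split(), s2.split()   # assumed whitespace tokenisation
        words1 += len(t1)
        words2 += len(t2)
        vocab.update(t1, t2)
    n = len(sentence_pairs)
    avg1, avg2 = words1 / n, words2 / n
    return {"sentence1": avg1, "sentence2": avg2,
            "pair": (avg1 + avg2) / 2, "vocabulary": len(vocab)}
```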
[Figure 1. Overall vocabulary size: line chart of the vocabulary sizes (0 to 35,000) for Tamil, Hindi, Malayalam and Punjabi across Train-Task1, Train-Task2, Test-Task1 and Test-Task2.]

5.1 Participants System Description
A brief description of the techniques used by all the teams that submitted runs for the shared task follows:

ANUJ: This team participated only for the Hindi language. They pre-processed the sentences using a stemmer, soundex and a synonym handler. After that, they extracted features using overlapping words and normalized IDF scores. Finally, a Random Forest classifier was used for classification.

ASE: This team participated only for the Hindi language. They extracted features using POS tags and stemming information. A semantic similarity metric was employed, which extracts word synonyms from WordNet to check whether the compared words are synonyms. Finally, a decision tree classifier was used to detect the paraphrases.

BITS_PILANI: This team participated for the Hindi language only. They attempted paraphrase detection with different classifiers and finally used Logistic Regression for Subtask 1 and Random Forest for Subtask 2.

CUSAT-TEAM: This team participated only for the Malayalam language. They stemmed the words, calculated sentence vectors using a Bag-of-Words model, and computed the similarity score between sentences. Finally, they set a threshold for determining the appropriate class.

CUSAT_NLP: This team participated only in the Malayalam language. They used identical tokens, matching lemmas and synonyms for finding the similarity between sentences. They also utilized an in-house Malayalam WordNet to replace the synonyms. Finally, the similarity score is compared against a fixed threshold to identify the exact class.

HIT2016: This team participated in all four languages. Cosine distance, Jaccard coefficient, Dice distance and METEOR features were used, and classification was done with a Gradient Boosting Tree. They experimented with various aspects of the classification method for detecting paraphrases.

JU_NLP: This team competed in all four languages. They used similarity-based features, word-overlap features and scores from machine translation evaluation metrics to compute similarity scores between sentence pairs. They tried three different classifiers, namely Naïve Bayes, SVM and SMO.

KEC@NLP: This team participated in the Tamil language only. They used an existing Tamil shallow parser to extract morphological features, and utilized Support Vector Machine and Maximum Entropy classifiers for classifying paraphrases.

KS_JU: This team participated in all four languages. They used different lexical and semantic-level (word embedding) similarity measures for computing features, and used a multinomial logistic regression model as the classifier.

NLP-NITMZ: This team also participated in all four languages. They used features based on Jaccard similarity, length-normalized edit distance and cosine similarity. Finally, this feature set was used to train a Probabilistic Neural Network (PNN) to detect the paraphrases.

5.2 Overall Results
As announced during the shared task, we are giving the Sarwan award to the top performer in each language. The name of the top-performing team in each language is given in Table 5. The overall results of all the participating teams can be seen in Table 6. For presentation purposes we have truncated the evaluation measures (Precision, Recall, and Accuracy) to two digits4.

            Table 5. Top performers in each language

    Rank     Punjabi          Hindi            Malayalam        Tamil
    First*   0.932 (HIT)      0.907 (Anuj)     0.785 (HIT)      0.776 (HIT)
    Second   0.922 (JU_KS)    0.896 (HIT)      0.729 (JU_KS)    0.741 (KEC)
    Third    0.913 (JU)       0.876 (JU_KS)    0.713 (NIT-MZ)   0.727 (NIT-MZ)

6. DISCUSSIONS
Out of the 11 teams that submitted runs, 10 teams successfully submitted their working notes. Four teams participated in all four languages, and the rest of the teams (3 for Hindi, 2 for Malayalam and 1 for Tamil) participated in only one language. Two out of ten teams used a threshold-based method to detect the paraphrases; the remaining teams used machine-learning-based approaches. The different types of feature sets used by the participating teams are illustrated in Table 7. Most of the teams used common similarity-based features such as cosine and Jaccard, and only two teams used the machine translation evaluation metrics BLEU and METEOR as features. Very few teams used synonym replacement and WordNet features. For the Tamil language, team KEC@NLP fed morphological information as features to a machine-learning classifier. The KS_JU team created word2vec embeddings with the help of additional in-house unlabeled data and derived semantic similarity features, which were used as features in the classifier. HIT2016, the top-performing team for three of the languages, used character n-gram based features and experimented with results for different n-gram sizes.

We calculated F1-measure and accuracy for evaluating the submissions of the teams. The accuracy for Task 2 is comparably lower than the accuracy for Task 1 due to the complexity of the task. In general, the accuracy obtained by the runs submitted for Tamil and Malayalam is low compared to the accuracy obtained for Hindi and Punjabi. This is due to the agglutinative nature of the Dravidian languages.

4 It does not affect the result of the participating teams.
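The character n-gram features used by the top-ranked HIT2016 runs can be approximated with a sketch like the one below. This is our illustration of the general idea, not the team's implementation; character n-grams help with agglutinative languages because inflected forms of a word still share long character substrings:

```python
def char_ngrams(text, n=3):
    """Set of character n-grams of a sentence (spaces are kept, so
    word boundaries also contribute n-grams)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def ngram_overlap(s1, s2, n=3):
    """Jaccard overlap of the two character n-gram sets; a high overlap
    suggests a paraphrase even when surface word forms differ."""
    g1, g2 = char_ngrams(s1, n), char_ngrams(s2, n)
    if not g1 or not g2:
        return 0.0
    return len(g1 & g2) / len(g1 | g2)
```

Overlap scores computed for several values of n could then be fed to a classifier such as the Gradient Boosting Tree used by HIT2016.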
                                               Table 6. Overall result for Subtask 1 & Subtask 2
                                                             Subtask 1                                         Subtask 2
     Team Name         Language                                                F1                                                 F1
                                        Precision     Recall       Accuracy                 Precision   Recall       Accuracy
                                                                               Score                                              Score
     Anuj              Hindi            0.95          0.90         0.9200      0.91         0.90        0.90         0.9014       0.90
     ASE               Hindi            0.41          0.35         0.3588      0.34         0.35        0.35         0.3542       0.35
     ASE              Hindi            0.82          0.97         0.8922      0.89         0.68        0.67         0.6660       0.67
     BITS-PILANI       Hindi            0.91          0.90         0.8977      0.89         0.72        0.72         0.7171       0.71
     CUSAT NLP      Malayalam     0.83     0.72     0.7622     0.75     0.52     0.52     0.5207     0.51
     CUSATTEAM      Malayalam     0.79     0.88     0.8044     0.76     0.51     0.50     0.5085     0.46
     DAVPBI         Punjabi       0.95     0.92     0.9380     0.94     0.77     0.76     0.7466     0.73
     HIT2016        Hindi         0.97     0.84     0.8966     0.89     0.90     0.90     0.9000     0.89
     HIT2016        Malayalam     0.84     0.87     0.8377     0.81     0.74     0.75     0.7485     0.74
     HIT2016        Punjabi       0.95     0.94     0.9440     0.94     0.95     0.95     0.9226     0.92
     HIT2016        Tamil         0.82     0.87     0.8211     0.79     0.75     0.75     0.7550     0.73
     JU-NLP         Hindi         0.75     0.99     0.8222     0.74     0.68     0.68     0.6857     0.68
     JU-NLP         Malayalam     0.58     0.99     0.5900     0.16     0.42     0.42     0.4221     0.30
     JU-NLP         Punjabi       0.95     0.94     0.9420     0.94     0.91     0.91     0.8866     0.88
     JU-NLP         Tamil         0.57     1.00     0.5755     0.09     0.55     0.55     0.5507     0.43
     KS_JU          Hindi         0.94     0.89     0.9066     0.90     0.85     0.85     0.8521     0.84
     KS_JU          Malayalam     0.83     0.82     0.8100     0.79     0.66     0.66     0.6614     0.65
     KS_JU          Punjabi       0.95     0.94     0.9460     0.95     0.92     0.92     0.8960     0.89
     KS_JU          Tamil         0.79     0.85     0.7888     0.75     0.67     0.67     0.6735     0.66
     NLP@KEC        Tamil         0.82     0.87     0.8233     0.79     0.68     0.68     0.6857     0.66
     NLP-NITMZ      Hindi         0.92     0.92     0.9155     0.91     0.78     0.78     0.7857     0.76
     NLP-NITMZ      Malayalam     0.80     0.94     0.8344     0.79     0.62     0.62     0.6243     0.60
     NLP-NITMZ      Punjabi       0.95     0.94     0.9420     0.94     0.83     0.83     0.8120     0.80
     NLP-NITMZ      Tamil         0.80     0.92     0.8333     0.79     0.66     0.66     0.6571     0.63
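Because subtask-2 is a multi-class problem, per-class F1 scores are usually combined by macro-averaging, the unweighted mean of the per-class F1 values. The sketch below is a generic illustration of that computation, not the official DPIL scorer; the label set P (paraphrase), SP (semi-paraphrase), and NP (non-paraphrase) and the example data are assumptions for the sake of the example.

```python
def macro_f1(gold, pred):
    """Macro-averaged F1: unweighted mean of per-class F1 scores."""
    labels = set(gold) | set(pred)
    scores = []
    for lab in labels:
        tp = sum(g == lab and p == lab for g, p in zip(gold, pred))
        fp = sum(p == lab and g != lab for g, p in zip(gold, pred))
        fn = sum(g == lab and p != lab for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / len(scores)

# Hypothetical gold/system labels for a three-class paraphrase task.
gold = ["P", "NP", "SP", "P", "NP"]
pred = ["P", "NP", "P", "P", "SP"]
print(round(macro_f1(gold, pred), 4))  # → 0.4889
```

Macro-averaging weights every class equally, so a system that ignores a rare class (here SP) is penalized even when its overall accuracy looks reasonable.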


                                               Table 7. Various features and classifiers used by the participants

   The lexical and semantic features compared across teams were POS, stem/lemma, stopwords, word overlap, synonym,
   cosine similarity, the Jaccard coefficient, Levenshtein distance, and METEOR/BLEU. The additional resources and
   classifiers used by each team were:

     Team            Other features     Classifier
     Anuj            IDF                Random Forest
     ASE             -                  J48
     BITS-PILANI     Soundex            Log Reg / Random Forest
     CUSAT NLP       WordNet            Threshold
     CUSATTEAM       BoW                Threshold
     HIT2016         N-gram             Gradient Tree Boosting
     JU-NLP          Dice               SMO
     KS_JU           word2vec           Multinomial Log Reg
     NLP@KEC         Morph              Maximum Entropy
     NLP-NITMZ       -                  Prob NN



 Due to some formatting issues, this participant re-submitted the system after the deadline.

 This participant did not submit the working notes.
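Several of the surface-similarity features listed in Table 7 (word overlap/Jaccard, cosine, and Levenshtein) can be sketched in a few lines. The snippet below is a minimal whitespace-tokenized illustration of these standard measures, not any participant's actual implementation.

```python
from collections import Counter
from math import sqrt

def jaccard(s1, s2):
    """Jaccard coefficient over the two sentences' token sets."""
    a, b = set(s1.split()), set(s2.split())
    return len(a & b) / len(a | b)

def cosine(s1, s2):
    """Cosine similarity over bag-of-words count vectors."""
    c1, c2 = Counter(s1.split()), Counter(s2.split())
    dot = sum(c1[w] * c2[w] for w in c1)
    n1 = sqrt(sum(v * v for v in c1.values()))
    n2 = sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2)

def levenshtein(a, b):
    """Character-level edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

# Hypothetical sentence pair, for illustration only.
pair = ("the game was cancelled due to rain",
        "rain forced the game to be cancelled")
print(jaccard(*pair), cosine(*pair), levenshtein(*pair))
```

For morphologically rich Indian languages, such token-level measures are usually computed after stemming or lemmatization (see the Stem/Lemma row of Table 7), since inflectional variants would otherwise count as mismatches.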
7. CONCLUSIONS AND FUTURE SCOPE
In this overview paper, we described the paraphrase corpus and the evaluation results of subtask-1 and subtask-2 of the Detecting Paraphrases in Indian Languages (DPIL) shared task, held at the 8th Forum for Information Retrieval Evaluation (FIRE) Conference, 2016. A total of 35 teams registered, of which 11 submitted their runs successfully. The corpora developed for the shared task are the first publicly available paraphrase detection corpora for Indian languages. Detecting paraphrases and semantic similarity in Indian languages is a challenging task because their rich morphological variation and semantic relations are difficult to capture. Discrepancies may remain in the manually annotated paraphrase corpus; feedback for revising the annotations is welcome and appreciated. Our detailed experimental analysis provides fundamental insights into the performance of paraphrase identification in Indian languages. Overall, HIT2016 (HeiLongJiang Institute of Technology) obtained first place for Tamil, Malayalam, and Punjabi, and Anuj (Sapient Global Markets) obtained first place for Hindi. As future work, we plan to extend the task to analyze the performance of cross-genre and cross-lingual paraphrase detection for more Indian languages. Detecting paraphrases in social media content in Indian languages, plagiarism detection, and the use of paraphrases in machine translation evaluation are also interesting areas for further study.

8. ACKNOWLEDGEMENT
First, we would like to thank the FIRE 2016 organizers for giving us the opportunity to organize the shared task on Detecting Paraphrases in Indian Languages (DPIL). We extend our gratitude to the advisory committee members, Prof. Ramanan, RelAgent Pvt. Ltd., and Prof. Rajendran S, Computational Engineering and Networking (CEN), for actively supporting us throughout the track. We also thank our PG students at CEN for helping us create the paraphrase corpora.

9. REFERENCES
[1] Dolan, W.B. and Brockett, C., 2005, October. Automatically constructing a corpus of sentential paraphrases. In Proc. of IWP.
[2] Xu, W., Callison-Burch, C. and Dolan, W.B., 2015. SemEval-2015 Task 1: Paraphrase and semantic similarity in Twitter (PIT). In Proceedings of SemEval.
[3] Xu, W., Ritter, A., Callison-Burch, C., Dolan, W.B. and Ji, Y., 2014. Extracting lexically divergent paraphrases from Twitter. Transactions of the Association for Computational Linguistics, 2, pp. 435-448.
[4] Pronoza, E., Yagunova, E. and Pronoza, A., 2016. Construction of a Russian paraphrase corpus: unsupervised paraphrase extraction. In Information Retrieval (pp. 146-157). Springer International Publishing.
[5] Potthast, M., Stein, B., Barrón-Cedeño, A. and Rosso, P., 2010, August. An evaluation framework for plagiarism detection. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters (pp. 997-1005). Association for Computational Linguistics.
[6] Rus, V., Banjade, R. and Lintean, M.C., 2014. On paraphrase identification corpora. In LREC (pp. 2422-2429).
[7] Kothwal, R. and Varma, V., 2013. Cross lingual text reuse detection based on keyphrase extraction and similarity measures. In Multilingual Information Access in South Asian Languages (pp. 71-78). Springer Berlin Heidelberg.
[8] Mahalakshmi, S., Anand Kumar, M. and Soman, K.P., 2015. Paraphrase detection for Tamil language using deep learning algorithm. International Journal of Applied Engineering Research, 10(17), pp. 13929-13934.
[9] Idicula, S.M., 2015, December. Fingerprinting based detection system for identifying plagiarism in Malayalam text documents. In 2015 International Conference on Computing and Network Communications (CoCoNet) (pp. 553-558). IEEE.
[10] Mathew, D. and Idicula, S.M., 2013, December. Paraphrase identification of Malayalam sentences: an experience. In 2013 Fifth International Conference on Advanced Computing (ICoAC) (pp. 376-382). IEEE.
[11] Kahane, S., 2003. The meaning-text theory. In Dependency and Valency: An International Handbook of Contemporary Research, 1, pp. 546-570.