<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>DPIL@FIRE 2016: Overview of Shared Task on Detecting Paraphrases in Indian Languages (DPIL)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anand Kumar M</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shivkaran Singh</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kavirajan B</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Soman K P</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Center for Computational Engineering and Networking (CEN), Amrita School of Engineering, Coimbatore, Amrita Vishwa Vidyapeetham, Amrita University</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Center for Computational Engineering and Networking (CEN), Amrita School of Engineering, Coimbatore, Amrita Vishwa Vidyapeetham, Amrita University</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents an overview of the shared task "Detecting Paraphrases in Indian Languages" (DPIL) conducted at FIRE 2016. Given a pair of sentences in the same language, participants are asked to detect the semantic equivalence between the sentences. The shared task is proposed for four Indian languages, namely Tamil, Malayalam, Hindi and Punjabi. The dataset created for the shared task has been made available online, and it is the first open-source paraphrase detection corpus for Indian languages.</p>
      </abstract>
      <kwd-group>
        <kwd>Paraphrase detection</kwd>
        <kwd>Semantic analysis</kwd>
        <kwd>Indian languages</kwd>
        <kwd>DPIL Corpora</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>1. INTRODUCTION</p>
      <p>A paraphrase can be defined as “the same meaning of a sentence
expressed in another sentence using different words”.</p>
      <p>Paraphrases can be identified, generated or extracted. The
proposed task focuses on sentence-level paraphrase
identification for Indian languages (Tamil, Malayalam, Hindi and
Punjabi). Identifying paraphrases in Indian languages is difficult
because it requires both evaluating the semantic similarity of the
underlying content and understanding the morphological variations of the
language. Paraphrase identification is strongly
connected with the generation and extraction of paraphrases. A
paraphrase identification system improves the performance of a
paraphrase generation system by choosing the best paraphrase
candidate from the list of generated candidates.
Paraphrase identification is also used to
validate paraphrase extraction systems and machine
translation systems. In question answering systems, paraphrase
identification plays a vital role in matching the questions asked by
the user to the original questions in order to choose the best answer.</p>
      <p>Automatic short-answer grading is another interesting application
that needs semantic similarity to assign grades to short
answers. Plagiarism detection is another task that needs
paraphrase identification to detect sentences that
are paraphrases of other sentences.</p>
      <p>
        One of the most commonly used corpora for paraphrase
detection is the MSRP corpus [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], which contains 5,801 English
sentence pairs from news articles, manually labeled as 67%
paraphrases and 33% non-paraphrases. Since no
annotated corpora or automated semantic interpretation systems
are available for Indian languages to date, creating benchmark data
for paraphrases and using that data in open shared task
competitions will motivate the research community towards further
research in Indian languages.
      </p>
      <p>Details about the task and dataset can be found on the website<sup>1</sup> of
the shared task. The subtasks and evaluation
metrics are described in Section 3, paraphrase corpus creation and
statistics in Section 4, and the participants' systems
and results in Section 5. We discuss
the findings from the results in Section 6.</p>
      <p>2. RELATED TASKS AND CORPORA</p>
      <p>
        In SemEval-2015<sup>2</sup>, a shared task on Paraphrase and Semantic
Similarity in Twitter (PIT) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] was conducted with the English
Twitter Paraphrase Corpus [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The task had two sentence-level
sub-tasks: a paraphrase identification task and a semantic textual
similarity task. The same dataset was used for both sub-tasks, but
they differed in annotation and evaluation. ParaPhraser [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] is a freely available, manually annotated corpus of Russian sentence pairs,
which was used in the recently organized shared task on paraphrase
detection for the Russian language [whit pap]. That task had two
subtasks. One was a three-class classification: given a pair of
sentences, predict whether they are precise paraphrases, near
paraphrases or non-paraphrases. The other was a binary
classification: given a pair of sentences, predict whether they are
paraphrases or non-paraphrases. The Microsoft Research Paraphrase
(MSRP) corpus is a well-known, manually
annotated corpus that consists of 5,801 sentence pairs in
English. The PAN plagiarism corpus 2010 (Paraphrase for
Plagiarism, P4P) is used for the evaluation of automatic
plagiarism detection algorithms. The corpus [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] is manually
annotated with the paraphrase phenomena it contains. It is
composed of 847 source-plagiarism pairs in English. A
complete summary of existing paraphrase corpora and the linguistic
phenomena of paraphrasing is given in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. In [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], the issue of
text plagiarism for the Hindi language using English documents is
addressed. For Tamil, paraphrase detection using deep
learning techniques is applied in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. For Malayalam, paraphrase
identification using fingerprinting [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and statistical similarity
[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] has been performed.
      </p>
      <p><sup>1</sup> http://nlp.amrita.edu/dpil_cen/</p>
      <p><sup>2</sup> http://alt.qcri.org/semeval2015/</p>
      <p>[Table: example sentence pairs with English glosses, including a Tamil pair tagged SP; the table body is not recoverable.]</p>
      <p>3.1 Task description</p>
      <p>Subtask 1: Given a pair of sentences from the newspaper domain, the task is to classify them as paraphrases (P) or not paraphrases (NP).</p>
      <p>Subtask 2: Given a pair of sentences from the newspaper domain, the task is to identify whether they are paraphrases (P), semi-paraphrases (SP) or not paraphrases (NP). Subtask 2 was similar to subtask 1 except for the 3-point scale tag in paraphrases. This makes the shared task even more challenging.</p>
      <p>3.2 Evaluation metrics</p>
      <p>The evaluation metrics used for subtask 1 and subtask 2 were slightly different because of the uniqueness of the tasks. To evaluate runs for subtask 1, we used accuracy and F-score values. With TP, FP and FN denoting the numbers of true positives, false positives and false negatives, precision, recall and F1-score can be calculated as:</p>
      <disp-formula><tex-math>\mathrm{Precision}_{P} = \frac{TP_{P}}{TP_{P} + FP_{P}}, \qquad \mathrm{Recall}_{P} = \frac{TP_{P}}{TP_{P} + FN_{P}}</tex-math></disp-formula>
      <disp-formula><tex-math>\mathrm{F1\text{-}score}_{P} = \frac{2 \cdot \mathrm{Precision}_{P} \cdot \mathrm{Recall}_{P}}{\mathrm{Precision}_{P} + \mathrm{Recall}_{P}}</tex-math></disp-formula>
      <p>The subscript P refers to the paraphrase (P) class for subtask 1. Precision and recall for the non-paraphrase class can be calculated similarly. To evaluate runs for subtask 2, we used accuracy and macro-averaged F1-score, since it is a multiclass classification task in which micro-averaged F1-score and accuracy give identical scores.</p>
      <p>
4. PARAPHRASE CORPUS FOR INDIAN LANGUAGES</p>
      <p>
A paraphrase is a linguistic phenomenon with many applications
in language teaching as well as computational
linguistics. Linguistically, paraphrases are defined in terms of
meaning. According to Meaning-Text Theory [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], if one or more
syntactic constructions retain semantic equivalence, they are
regarded as paraphrases. The degree to which the source text and
its paraphrased version are interchangeable marks
the range of semantic similarity between them. Paraphrasing is a
very fine mechanism for shaping various language models. Different
linguistic units such as synonyms, semi-synonyms, figurative
meanings and metaphors are considered the basic elements of
paraphrasing. Paraphrasing is closely related to synonymy.
      </p>
      <p>Paraphrasing is found not only at the lexical level but also at other
linguistic levels, such as the phrasal and sentential levels. The different
levels of paraphrasing reveal the diverse forms of
paraphrases and their semantic relationship to the source text. Among
paraphrase typologies, lexical paraphrasing is the most common
form of paraphrasing found in the literature. For example, if the
source text is “The two ships were acquired by the navy after the
war”, then possible paraphrased versions are “The two ships were
conquered by the navy after the war” and “The two ships were
won by the navy after the war”. Even more paraphrases are
possible for the given sentence. Here the source verb ‘acquire’ is
paraphrased with its exact synonyms, and the source and the paraphrases
show the same syntactic structure. Paraphrases of this type
are the best examples of exact paraphrases. Some
other common paraphrase types are approximate
paraphrases, sentential-level paraphrases, adding extra linguistic
units, changing the word order, etc.</p>
      <p>The shared task on Detecting Paraphrases in Indian
Languages (DPIL)<sup>3</sup> required participants to identify sentential
paraphrases in four Indian languages, namely Hindi, Tamil,
Malayalam and Punjabi. Corpus creation for these
languages started with collecting news articles from
various web-based news sources. The collected dataset was
then cleaned of noise and informal information. Apart
from cleaning, some sentences required spelling corrections and
text transformations. After all irregularities were removed, the
dataset was annotated according to the paraphrase phenomenon
(Paraphrase, Non-Paraphrase, Semi-Paraphrase) present in each
sentence pair. The annotation tags used were P, SP and NP,
corresponding to Paraphrase, Semi-Paraphrase and
Non-Paraphrase. These annotations were made by language experts for
each language. The annotated files were then proofread by a
linguistic expert and again by a language expert (two-step
proofreading). Finally, the annotated dataset proofread by the
linguistic expert was converted to Extensible Markup Language
(XML) format.
      </p>
      <p>4.1 Corpora statistics</p>
      <p>The paraphrase corpus was further analysed for parameters
such as the number of sentence pairs in each class (P, NP and SP),
the average number of words per sentence per task, and the overall
vocabulary size. The number of sentence pairs in the
training and testing phases for each subtask is given in Table 2.
The average number of words per sentence, along with the average
pair length, for subtask 1 and subtask 2 is given in Table 3 and Table 4.
The overall vocabulary size (subtask 1 and subtask 2) for training
as well as testing for all the languages is shown as a line
chart in Figure 1. Notably, the vocabulary size for Hindi and Punjabi
is smaller than for Tamil and Malayalam. This is because, like
other Dravidian languages (Kannada and Telugu), Tamil and
Malayalam are agglutinative in nature. As a result,
Dravidian languages end up having more unique words and hence
a larger vocabulary.
      </p>
      <p>5. SYSTEM DESCRIPTION AND RESULTS</p>
      <p>A total of 35 teams registered for the shared task and,
of those, 11 teams successfully submitted their runs. A brief
description of the methodology used by each team is given in
the following subsection.
      </p>
      <p>[Figure 1: line chart of overall vocabulary size for Tamil, Hindi, Malayalam and Punjabi, with series Train-Task1, Train-Task2, Test-Task1 and Test-Task2.]</p>
      <p>5.1 Participants' System Description</p>
      <p>
The techniques used by the teams that
submitted runs for the shared task are briefly described below.</p>
      <p>ANUJ: This team participated only in the Hindi language. They
pre-processed the sentences using a stemmer, Soundex and a synonym
handler. They then extracted features using overlapping
words and normalized IDF scores. Finally, a Random Forest
classifier was used for classification.</p>
      <p>ASE: This team participated only in the Hindi language. They
extracted features using POS tags and stemming information.
A semantic similarity metric was employed, which extracts word
synonyms from WordNet to check whether the compared words
are synonyms. Finally, a decision tree classifier was used to detect the
paraphrases.</p>
      <p>BITS_PILANI: This team participated only in the Hindi language.
They attempted paraphrase detection with different classifiers and
finally used Logistic Regression for subtask 1 and Random Forest
for subtask 2.</p>
      <p>CUSAT-TEAM: This team participated only in the Malayalam
language. They stemmed the words, calculated sentence
vectors using a Bag of Words model, and computed the similarity score
between sentences. Finally, they set a threshold for determining
the appropriate class.</p>
      <p>CUSAT_NLP: This team participated only in the Malayalam
language. They used identical tokens, matching lemmas and
synonyms to find the similarity between sentences. They also
used an in-house Malayalam WordNet to replace synonyms.
Finally, the similarity score was compared against a fixed threshold
to identify the class.</p>
      <p>HIT2016: This team participated in all four languages. Cosine
Distance, Jaccard Coefficient, Dice Distance and METEOR
features were used, and classification was done with Gradient
Boosting Trees. They experimented with various aspects of the
classification method for detecting paraphrases.</p>
      <p>JU_NLP: This team competed in all four languages. They
used similarity-based features, word overlap features and
scores from machine translation evaluation metrics to compute
the similarity scores between pairs of sentences. They tried
three different classifiers, namely Naïve Bayes, SVM and SMO.</p>
      <p>KEC@NLP: This team participated only in the Tamil language.
They used an existing Tamil shallow parser to extract
morphological features and used Support Vector Machine and
Maximum Entropy classifiers for classifying paraphrases.</p>
      <p>KS_JU: This team participated in all four languages. They
used different lexical and semantic-level (word embedding)
similarity measures for computing features and used a multinomial
logistic regression model as the classifier.</p>
      <p>NLP-NITMZ: This team also participated in all four
languages. They used features based on Jaccard Similarity, length-normalized
Edit Distance and Cosine Similarity. Finally, this
feature set was used to train a Probabilistic Neural Network (PNN)
to detect the paraphrases.</p>
      <p>5.2 Overall Results</p>
      <p>
As announced during the shared task, a Sarwan
award is given to the top performer in each language. The
top-performing team in each language is given in Table 5. The overall
results of all the participating teams can be seen in Table 6. For
presentation purposes, we have truncated the evaluation measures
(Precision, Recall and Accuracy) to two digits<sup>4</sup>.
      </p>
      <p>6. DISCUSSIONS</p>
      <p>
Out of the 11 teams that submitted runs, 10 teams
submitted working notes. Four teams
participated in all four languages, and the rest of the teams
(3 in Hindi, 2 in Malayalam and 1 in Tamil) participated in only one
language. Two of the ten teams used a threshold-based method
to detect paraphrases; the remaining teams used machine
learning based approaches. The different types of feature sets used
by the participating teams are listed in Table 7. Most
teams used common similarity-based features such as cosine and
Jaccard similarity, and only two teams used the machine translation
evaluation metrics BLEU and METEOR as features. Very few
teams used synonym replacement and WordNet features. For
Tamil, team KEC@NLP used morphological
information as features for the machine learning based classifier.</p>
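      <p>The cosine and Jaccard features mentioned above can be illustrated with a minimal Python sketch. It assumes whitespace tokenization and lower-casing, and the function names (jaccard, cosine, similarity_features) are illustrative; it is not taken from any participant's system.</p>

```python
import math
from collections import Counter


def jaccard(tokens_a, tokens_b):
    """Jaccard coefficient between the token sets of two sentences."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a.union(set_b)
    if not union:
        return 0.0
    return len(set_a.intersection(set_b)) / len(union)


def cosine(tokens_a, tokens_b):
    """Cosine similarity between bag-of-words count vectors."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    # Counter returns 0 for missing tokens, so only shared tokens contribute.
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_features(sentence_a, sentence_b):
    """Two-element feature vector for a sentence pair (assumed layout)."""
    tokens_a = sentence_a.lower().split()
    tokens_b = sentence_b.lower().split()
    return [jaccard(tokens_a, tokens_b), cosine(tokens_a, tokens_b)]
```

      <p>Such a vector could either be compared against a threshold or passed to a classifier, corresponding to the two families of approaches described above.</p>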
      <p>Team KS_JU created word2vec embeddings with the help of
additional in-house unlabeled data and derived semantic
similarity features, which were used as features in the classifier.</p>
      <p>The top-performing team for three languages (HIT-2016) used
character n-gram based features and experimented with
different n-gram sizes.</p>
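      <p>As a rough illustration of character n-gram features, the sketch below extracts n-gram sets for several sizes and measures their overlap. The choice of a Jaccard-style overlap, the default sizes and the function names are assumptions made for this example; the exact feature definitions used by the team are not given here. Python strings are indexed by code point, so combining vowel signs in Indic scripts count as separate characters in this sketch.</p>

```python
def char_ngrams(text, n):
    """Return the set of character n-grams of a sentence (spaces included)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def char_ngram_overlap(sentence_a, sentence_b, n=3):
    """Share of character n-grams common to both sentences (Jaccard-style)."""
    grams_a = char_ngrams(sentence_a, n)
    grams_b = char_ngrams(sentence_b, n)
    union = grams_a.union(grams_b)
    if not union:
        return 0.0
    return len(grams_a.intersection(grams_b)) / len(union)


def ngram_feature_vector(sentence_a, sentence_b, sizes=(2, 3, 4)):
    """One overlap score per n-gram size; the sizes are hypothetical defaults."""
    return [char_ngram_overlap(sentence_a, sentence_b, n) for n in sizes]
```

      <p>Because character n-grams do not depend on word segmentation, features of this kind may be less sensitive to the rich suffixation of agglutinative languages than whole-word overlap.</p>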
      <p>We calculated F1-measure and accuracy to evaluate the
teams' submissions. Accuracy on subtask 2 is
lower than on subtask 1 because of the greater complexity of
the task. In general, the accuracy obtained by runs submitted for
Tamil and Malayalam is lower than the
accuracy obtained for Hindi and Punjabi. This is due to
the agglutinative nature of the Dravidian languages.
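As a minimal sketch of these measures, assuming that gold and predicted tags are given as parallel lists and that F1 for the three-class subtask is macro-averaged over the P, SP and NP classes (an assumption made for illustration, not a description of the organizers' scoring script), accuracy and F1 could be computed as follows:

```python
def accuracy(gold, predicted):
    """Fraction of sentence pairs whose predicted tag matches the gold tag."""
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)


def f1_for_class(gold, predicted, label):
    """F1-score of a single class from its TP, FP and FN counts."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, predicted) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, predicted) if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, predicted, labels=("P", "SP", "NP")):
    """Unweighted mean of per-class F1 (assumed averaging scheme)."""
    return sum(f1_for_class(gold, predicted, c) for c in labels) / len(labels)
```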
      </p>
      <p><sup>4</sup> This does not affect the results of the participating teams.</p>
      <p>Due to some formatting issues, this participant re-submitted the system after the deadline.</p>
      <p>This participant did not submit the working notes.</p>
      <p>7. CONCLUSIONS AND FUTURE SCOPE</p>
      <p>
In this overview paper, we described the paraphrase corpus
and the evaluation results of subtask 1 and subtask 2 of the Detecting
Paraphrases in Indian Languages (DPIL) shared task held at the
8th Forum for Information Retrieval Evaluation (FIRE) conference, 2016. A
total of 35 teams registered, of which 11 teams submitted
their runs successfully. The corpus developed for the shared task
is the first publicly available paraphrase detection corpus for
Indian languages. Detecting paraphrases and semantic similarity
in Indian languages is challenging because understanding the
morphological variations and semantic relations in these
languages is crucial. Discrepancies may be
found in the manually annotated paraphrase corpus; feedback that
helps us revise the annotations is welcome and appreciated. Our detailed
experimental analysis provides fundamental insights into the
performance of paraphrase identification in Indian languages.</p>
      <p>Overall, HIT-2016 (HeiLongJiang Institute of Technology) took
first place in Tamil, Malayalam and Punjabi, and
Anuj (Sapient Global Markets) took first place in Hindi. As
future work, we plan to extend the task to analyze
performance on cross-genre and cross-lingual paraphrases for
more Indian languages. Detecting paraphrases in social media
content in Indian languages, plagiarism detection and the use of
paraphrases in machine translation evaluation are also
interesting areas for further study.</p>
      <p>8. ACKNOWLEDGEMENT</p>
      <p>
First, we would like to thank the FIRE 2016 organizers for giving us
the opportunity to organize the shared task on Detecting
Paraphrases in Indian Languages (DPIL). We would like to
extend our gratitude to the advisory committee members, Prof.
Ramanan, RelAgent Pvt. Ltd, and Prof. Rajendran S,
Computational Engineering and Networking (CEN), for actively
supporting us throughout the track. We would also like to thank our
PG students at CEN for helping us create the paraphrase
corpora.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Dolan</surname>
            ,
            <given-names>W.B.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Brockett</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <year>2005</year>
          , October.
          <article-title>Automatically constructing a corpus of sentential paraphrases</article-title>
          .
          <source>In Proc. of IWP.</source>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Callison-Burch</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Dolan</surname>
            ,
            <given-names>W.B.</given-names>
          </string-name>
          ,
          <year>2015</year>
          .
          <article-title>SemEval-2015 Task 1: Paraphrase and semantic similarity in Twitter (PIT)</article-title>
          .
          <source>Proceedings of SemEval.</source>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ritter</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Callison-Burch</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dolan</surname>
            ,
            <given-names>W.B.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Ji</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <year>2014</year>
          .
          <article-title>Extracting lexically divergent paraphrases from Twitter</article-title>
          .
          <source>Transactions of the Association for Computational Linguistics</source>
          ,
          <volume>2</volume>
          , pp.
          <fpage>435</fpage>
          -
          <lpage>448</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Pronoza</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yagunova</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Pronoza</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <year>2016</year>
          .
          <article-title>Construction of a Russian paraphrase corpus: unsupervised paraphrase extraction</article-title>
          .
          <source>In Information Retrieval</source>
          (pp.
          <fpage>146</fpage>
          -
          <lpage>157</lpage>
          ). Springer International Publishing.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barrón-Cedeño</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <year>2010</year>
          , August.
          <article-title>An evaluation framework for plagiarism detection</article-title>
          .
          <source>In Proceedings of the 23rd international conference on computational linguistics: Posters</source>
          (pp.
          <fpage>997</fpage>
          -
          <lpage>1005</lpage>
          ).
          Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Rus</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Banjade</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Lintean</surname>
            ,
            <given-names>M.C.</given-names>
          </string-name>
          ,
          <year>2014</year>
          .
          <article-title>On Paraphrase Identification Corpora</article-title>
          . In LREC (pp.
          <fpage>2422</fpage>
          -
          <lpage>2429</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Kothwal</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Varma</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <year>2013</year>
          .
          <article-title>Cross lingual text reuse detection based on keyphrase extraction and similarity measures</article-title>
          .
          <source>In Multilingual Information Access in South Asian Languages</source>
          (pp.
          <fpage>71</fpage>
          -
          <lpage>78</lpage>
          ). Springer Berlin Heidelberg.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Mahalakshmi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Anand Kumar</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Soman</surname>
            ,
            <given-names>K.P.</given-names>
          </string-name>
          ,
          <year>2015</year>
          .
          <article-title>Paraphrase detection for Tamil language using Deep learning algorithm</article-title>
          .
          <source>International journal of Applied Engineering Research</source>
          ,
          <volume>10</volume>
          (
          <issue>17</issue>
          ), pp.
          <fpage>13929</fpage>
          -
          <lpage>13934</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Idicula</surname>
            ,
            <given-names>S.M.</given-names>
          </string-name>
          ,
          <year>2015</year>
          , December.
          <article-title>Fingerprinting based detection system for identifying plagiarism in Malayalam text documents</article-title>
          .
          <source>In 2015 International Conference on Computing and Network Communications (CoCoNet)</source>
          (pp.
          <fpage>553</fpage>
          -
          <lpage>558</lpage>
          ). IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Mathew</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Idicula</surname>
            ,
            <given-names>S.M.</given-names>
          </string-name>
          ,
          <year>2013</year>
          , December.
          <article-title>Paraphrase identification of malayalam sentences-an experience</article-title>
          .
          <source>In 2013 Fifth International Conference on Advanced Computing (ICoAC)</source>
          (pp.
          <fpage>376</fpage>
          -
          <lpage>382</lpage>
          ). IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Kahane</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <year>2003</year>
          .
          <article-title>The meaning-text theory</article-title>
          .
          <source>Dependency and Valency. An International Handbook of Contemporary Research</source>
          ,
          <volume>1</volume>
          , pp.
          <fpage>546</fpage>
          -
          <lpage>570</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>