CUSAT TEAM@DPIL-FIRE2016: Detecting Paraphrase in
               Indian Languages-Malayalam

                               Manju K                                         Sumam Mary Idicula
                      Research Scholar,                                        Head of Department,
              Department of Computer science and                        Department of Computer science and
                         Engineering,                                              Engineering,
                Cochin University of Science and                          Cochin University of Science and
                       Technology, India.                                        Technology, India.
                        manju@mec.ac.in                                        sumam@cusat.ac.in

ABSTRACT
This paper describes the work done as part of the shared task on
Detecting Paraphrases in Indian Languages (DPIL) at the Forum for
Information Retrieval Evaluation (FIRE 2016). Paraphrase identification
is the task of deciding whether two given text fragments have the same
meaning. Our detection system targets the Malayalam language and makes
use of the cosine similarity measure, an existing state-of-the-art
method for determining the similarity between sentences. The
experiments were carried out on the standard data set, and the results
show that the system performs comparably to methods employing more
sophisticated procedures.

CCS Concepts
•Information Processing → Similarity Measures; •Natural Language
Processing → Paraphrase Identification; •Text Mining → Text
Summarization;

Keywords
Paraphrase; Cosine similarity; text tagging

1.   INTRODUCTION
   Paraphrases are alternate ways of conveying the same information. In
natural languages, a single event can be expressed in different ways
that convey the same information. Paraphrase identification, the
ability to determine whether two formally distinct strings are similar
or not, has applications in various NLP tasks such as information
retrieval, question answering, plagiarism detection, text mining and
automatic summarization. In its simplest form, paraphrase
identification uses lexical matching to compare sentences.
   For a sentence pair to be selected as a paraphrase, both sentences
should describe the same event and contain the same information about
it. However, there are instances where the concept behind the
sentences is difficult to identify, even for humans.
   The rest of the paper is organized as follows: Section 2 discusses
related work in the area of paraphrase detection. Section 3 presents
the task description. Section 4 describes the data set provided by the
DPIL task organizers [2]. Section 5 explains the methodology used, and
Section 6 gives the results and evaluation. Section 7 presents the
conclusion and possible future improvements.

2.   RELATED WORKS
   Paraphrase identification is significant in many areas of natural
language processing. Paraphrase identification techniques are mainly
classified into statistical and semantic methods. Statistical methods
measure the similarity between sentences using only the statistical
information in the sentences, whereas semantic methods make use of
word meanings. A comparison of statistical and semantic similarity
measures [1], tested on the same data set, reported that the
performance of the two kinds of measure is comparable. One of the most
commonly used corpora for paraphrase detection is the MSRP corpus [3],
which contains 5,801 English sentence pairs from news articles,
manually labelled with 67% paraphrases and 33% non-paraphrases. Since
no annotated corpora or automated semantic interpretation systems have
been available for Indian languages to date, the initiative taken as
part of the open shared task competition is commendable and of great
help to the research community. The automatic plagiarism detection
framework for Malayalam documents [5] uses Jaccard similarity to
determine the relation between sentences.
   The proposed method implements paraphrase identification for the
Malayalam language using similarity measures [4].

3.   TASK DESCRIPTION
   The task focuses on sentence-level paraphrase identification for
four Indian languages: Tamil, Malayalam, Hindi and Punjabi. The
proposed method considers only Malayalam. Malayalam is one of the 22
scheduled languages of India. It is the official language of the state
of Kerala and of the Union territories of Lakshadweep and Puducherry.
Malayalam belongs to the Dravidian language family and is spoken by
approximately 33 million people. The task is divided into two sub
tasks: sub task 1 is to classify a given pair of sentences as
paraphrase or non paraphrase, and in sub task 2 the sentences are
classified on a 3 point scale as completely equivalent (P), roughly
equivalent (SP) or not equivalent (NP).

4.   DATA SET
   The shared task challenge provided data for four languages: Tamil,
Malayalam, Hindi and Punjabi. We were provided with 2500 sentence
pairs for sub task 1 and 3500 sentence pairs for sub task 2 as
training data, and 900 sentence pairs for sub task 1 and 1500 sentence
pairs for sub
task 2 as test data. The data set available was in XML format and was
taken from prominent newspapers.

5.   SYSTEM DESCRIPTION
   The data was given in XML format, and the file was processed to
extract each pair of sentences for paraphrase detection. The cosine
similarity measure was used for paraphrase identification, and the two
sentences in each pair were treated as two documents $D_1$ and $D_2$,
each containing only one sentence. The overall architecture of the
system is shown in Fig 1. $D_1$ and $D_2$ are subjected to
tokenization and stop word removal. A lookup table was used for stop
word removal. Due to the agglutinative nature of the language, the
same word can appear with different inflections in the sentences. To
eliminate these inflections, stemming was performed. Even though
literature on stemming for the Malayalam language is available, there
is no full-fledged tool that could be used directly in this work. We
custom-tailored the Silpa Stemmer [6] by the Swathanthra Malayalam
Computing group for our purpose. The stemmer removes the longest
matching suffix from each word, with proper replacement, to get the
base word.

            Figure 1: System Architecture

   The words remaining in the sentences after preprocessing form the
bag of words (vocabulary) for the vector representation of the
sentences. The sentence vector is formulated using the bag-of-words
model to extract frequency information of the words in the sentence.
The size of the vector is the size of the vocabulary, and the value at
each vector index i is the count of word i in the sentence. This is
the Term Frequency (TF) vector. To determine the importance of each
word with respect to the two documents, its Inverse Document Frequency
(IDF) is also calculated according to equation (1).

$$Idf_t = \log \frac{N}{N_t} \qquad (1)$$

where $N$ is the total number of sentences in a document D (here
$N = 2$) and $N_t$ is the number of sentences in which the term t
occurs. The sentence vector is computed according to equation (2).

$$S_i = Tf_{t,i} \ast Idf_t \qquad (2)$$

where $Tf_{t,i}$ is the frequency of term t in sentence $S_i$ and
$Idf_t$ indicates how important the term t is. Using equation (3), the
similarity between the documents is computed, where $D_1$ contains the
first sentence and $D_2$ contains the second sentence in the pair.

$$Sim(D_1, D_2) = \frac{D_1 \ast D_2}{\sqrt{D_1^2} \ast \sqrt{D_2^2}} \qquad (3)$$

The similarity score is a value between 0 and 1.
   A threshold was set to separate the classes Paraphrase, Semi
Paraphrase and Non Paraphrase. Through experiments on the training
data given for task 1 and task 2, a threshold of 0.4 was set for
Paraphrase and 0.3 for Semi Paraphrase; any value below that was
labelled Non Paraphrase.

6.   RESULTS AND EVALUATION
   The proposed system was evaluated on the data set provided by the
open shared task. Fig 2 shows the similarity scores obtained for the 3
classes of sentence pairs.

            Figure 2: Similarity Score Obtained

   The accuracy and F-score of this method of paraphrase
identification are tabulated in Table 1 for subtask 1 and subtask 2.

                       Table 1: Results
   Language     SubTask1               SubTask2
                Accuracy   F1 Score    Accuracy   F1 Score
   Malayalam    0.80444    0.76        0.50857    0.46576

7.   CONCLUSION
   This paper discussed how cosine similarity can be used for
paraphrase identification. The morphological richness and
agglutinative nature of the language demand stemming of the sentence
pairs before paraphrase scoring. The accuracy of the preprocessing
phase plays a significant role in the paraphrase identification
system. The performance of the system can be improved by considering
semantic similarity using WordNet in addition to statistical measures.
An ensemble of different similarity scores may improve the accuracy of
the system. The vague demarcation between semi paraphrase and non
paraphrase is a challenge in this type of work.

8.   REFERENCES
[1] S. S. Abraham and S. M. Idicula. Comparison of
    statistical and semantic similarity techniques for
    paraphrase identification. In 2012 International
    Conference on Data Science & Engineering (ICDSE).
[2] M. Anand Kumar, S. Shivkaran, B. Kavirajan, and
    K. P. Soman. DPIL@FIRE2016: Overview of shared
    task on detecting paraphrases in Indian languages. In
    Working notes of FIRE 2016 - Forum for Information
    Retrieval Evaluation, Kolkata, India, December 7-10,
    2016, CEUR Workshop Proceedings. CEUR-WS.org,
    2016.
[3] W. B. Dolan and C. Brockett. Automatically
    constructing a corpus of sentential paraphrases. In
    Proc. of IWP, 2005.
[4] S. Fernando and M. Stevenson. A semantic similarity
    approach to paraphrase detection. In Proceedings of the
    11th Annual Research Colloquium of the UK Special
    Interest Group for Computational Linguistics, pages
    45–52. Citeseer, 2008.
[5] L. Sindhu, B. B. Thomas, and S. M. Idicula.
    Automated plagiarism detection system for Malayalam
    text documents. International Journal of Computer
    Applications, 106(15), 2014.
[6] S. Thottungal.
    Silpastemmer: http://libindic.org/stemmer.
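
As an illustration of the scoring procedure of Section 5, the following
Python sketch computes TF-IDF vectors for one sentence pair, their
cosine similarity, and the thresholded class label. It is not the
implementation used in the paper: all function names are hypothetical,
tokenization is plain whitespace splitting, and stop word removal and
stemming are assumed to have been applied to the input already. With
N = 2, equation (1) gives log(2/2) = 0 for every term that occurs in
both sentences, which would remove shared terms from the dot product;
the sketch therefore adds 1 to the IDF, which is an assumption and not
something stated in the paper. Treating the 0.4 and 0.3 thresholds as
inclusive lower bounds is likewise an assumption.

```python
import math
from collections import Counter


def tfidf_vectors(tokens1, tokens2):
    """Build TF-IDF vectors for two token lists over their joint vocabulary."""
    vocab = sorted(set(tokens1) | set(tokens2))
    tf1, tf2 = Counter(tokens1), Counter(tokens2)
    n_docs = 2  # each sentence of the pair is treated as one document
    vec1, vec2 = [], []
    for term in vocab:
        # Number of sentences (1 or 2) in which the term occurs.
        n_t = (term in tf1) + (term in tf2)
        # Assumption: +1 smoothing so that shared terms keep a non-zero weight.
        idf = 1.0 + math.log(n_docs / n_t)
        vec1.append(tf1[term] * idf)
        vec2.append(tf2[term] * idf)
    return vec1, vec2


def cosine_similarity(vec1, vec2):
    """Cosine of the angle between two equal-length vectors (0.0 if one is zero)."""
    dot = sum(a * b for a, b in zip(vec1, vec2))
    norm1 = math.sqrt(sum(a * a for a in vec1))
    norm2 = math.sqrt(sum(b * b for b in vec2))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return dot / (norm1 * norm2)


def score_pair(sentence1, sentence2):
    """Similarity score for a pair of preprocessed sentences."""
    vec1, vec2 = tfidf_vectors(sentence1.split(), sentence2.split())
    return cosine_similarity(vec1, vec2)


def classify(score, p_threshold=0.4, sp_threshold=0.3):
    """Map a similarity score to P, SP or NP using the thresholds of Section 5."""
    if score >= p_threshold:
        return "P"
    if score >= sp_threshold:
        return "SP"
    return "NP"
```

Calling classify(score_pair(s1, s2)) on two preprocessed sentences
returns one of the labels P, SP or NP. Identical sentences score 1 and
sentences with no word in common score 0, so the score stays within
the range stated in Section 5.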