=Paper= {{Paper |id=Vol-1587/T2-2 |storemode=property |title=AmritaCEN_NLP @ FIRE 2015 Language Identification for Indian Languages in Social Media Text |pdfUrl=https://ceur-ws.org/Vol-1587/T2-2.pdf |volume=Vol-1587 |authors=Rahul Venkatesh Kumar,Anand Kumar M, Soman KP |dblpUrl=https://dblp.org/rec/conf/fire/KumarKS15 }} ==AmritaCEN_NLP @ FIRE 2015 Language Identification for Indian Languages in Social Media Text== https://ceur-ws.org/Vol-1587/T2-2.pdf
AmritaCEN_NLP @ FIRE 2015 Language Identification
     for Indian Languages in Social Media Text

Rahul Venkatesh Kumar RM, Anand Kumar M, Soman KP

Centre for Excellence in Computational Engineering and Networking
Amrita Vishwa Vidyapeetham, Coimbatore, India

rahulvks@gmail.com, m_anandkumar@cb.amrita.edu, kp_soman@amrita.edu



ABSTRACT

The growth of social media content, such as Twitter and Facebook messages and blog posts, has created many new opportunities for language technology. User-generated content such as tweets and blogs in most languages is written in Roman script because of distinct social culture and technology. Some users write in their own language script or in mixed script. The primary challenge in processing such short messages is identifying the languages. Language identification is therefore not restricted to a single language but extends to multiple languages. The task is to label the words with the following categories: L1, L2, Named Entities, Mixed, Punctuation and Others. This paper presents the AmritaCEN_NLP team's participation in the FIRE 2015 Shared Task on Mixed Script Information Retrieval, Subtask 1: Query Word Labeling, which covers language identification of each word in the text together with the Named Entities, Mixed, Punctuation and Others tags, using sequence-level query labelling with a Support Vector Machine.

CCS Concepts
• Theory of computation~Support vector machines
• Computing methodologies~Natural language processing
• Information systems~Information extraction
• Human-centered computing~Social tagging systems

Keywords
Language Identification, Support Vector Machine (SVM), Information retrieval, Mixed Script, Short Message.

1.INTRODUCTION

This paper describes our system for the FIRE 2015 Shared Task on Query Word Labeling on Mixed Script Information Retrieval. With the rapid growth of the internet in the current period, webpages are no longer limited to English, and social media content in other languages is increasing rapidly [1]. Nowadays webpages can be found in every popular non-English language, including Indian languages. On social media, users generally write their native languages in Romanized form to express their thoughts [2][3]. To handle this multilingual text processing problem, we need to label each token with its corresponding language. The idea of Multi-Script IR was first introduced by P Gupta, Kalika Bali, R E Banchs, M Choudhury and P Rosso at the 2014 SIGIR conference [2]. This task addresses the problem of language identification in code-mixed queries. The task focuses on sentence-level language identification in code-mixed queries in English and any of 8 Indian languages (L): Hindi, Bengali, Tamil, Gujarati, Marathi, Kannada, Telugu and Malayalam. Our language identification system uses a Support Vector Machine for word-level classification.

2.RELATED WORKS

The problem of language identification has been researched for half a century (Gold, 1967), and code switching for several decades. However, there has been less work on automatic language identification for mixed-script analysis in social media websites and forums. Research showed that the predominant language used on Twitter and Facebook in their earlier days was English [4][5]. With the worldwide growth of social media, people started to write in their own language with the help of Roman script. The number of people who use mixed script in social media communication has increased tremendously. According to one report, 45% of Facebook users use mixed script, 40% use English for communicating and 15% use their native language [6][7]. Identifying the language of social media content and analysing it is essential for extracting information, which can further be used to aid search engines and to monitor online behavior so as to ensure security [8][9]. A few years back, documents were written in only a single language. With the emergence of social media, documents these days are written in mixed script [10].


3.DATA SET DESCRIPTION

In the training data set, each input query is paired with its annotated labels. The queries are written in Roman script. The input queries and the annotated set are provided as part of the Subtask. The training data consists of an annotation file and an input file, each with 2908 sentences (54,088 tokens). The test data contains 792 sentences (11,999 tokens). Tokens that are person names, locations, organizations or abbreviations come under the NER label. Tokens constructed from two parts, each coming from a different language, are labelled as MIX. Emoticons, hashtags and punctuation are labelled as X. Foreign-language tokens are labelled as O. No extra data set is used in this task. An input query may contain a mixture of one or two languages, named entities, mixed tokens, punctuation and others. Table 1 contains the counts for mixed, punctuation and others along with the overall token count. The per-language token counts are given in Table 2. Named entities have nine different tags, and the total counts of the NER tokens are given in Table 3.
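As a rough illustration of how the tag categories described above relate to the counts in Table 1, the following sketch groups raw tags into the five categories and counts them. The lowercase language codes follow the sample output shown later in the paper; the exact code list (including both `ka` and `kn` for Kannada), the function names and the rule that any unrecognised tag falls into "Others" are assumptions made for this example, not part of the shared-task specification.

```python
from collections import Counter

# Assumed language codes (lowercased); the paper uses both "Ka" and "kn" for Kannada.
LANGUAGE_TAGS = {"en", "bn", "gu", "hi", "ka", "kn", "ml", "mr", "ta", "te"}


def tag_category(tag):
    """Map a raw tag to one of the five categories listed in Table 1."""
    t = tag.strip()
    if t.lower() in LANGUAGE_TAGS:
        return "Language"
    if t.upper().startswith("NE"):      # NE, NE_P, NE_L, ...
        return "Named Entities"
    if t.upper() == "MIX":
        return "Mixed"
    if t.upper() == "X":
        return "Punctuations"
    return "Others"                      # assumption: everything else, including "O"


def count_categories(tags):
    """Count how many tags fall into each category."""
    return Counter(tag_category(t) for t in tags)


if __name__ == "__main__":
    sample = ["NE", "en", "ta", "ta", "en", "ta", "x"]
    print(count_categories(sample))
```

Run over the full annotation file, a count of this kind would give the per-category totals; the figures in Table 1 themselves come from the original data and are not reproduced by this sketch.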
Table 1: Tag set and Total Count.

    Tag                      Count
    Language Token Count     41,515
    Named Entities           2,391
    Mixed (Mix)              70
    Punctuations (X)         7710
    Others (O)               11
    Total Tag Count          54,088

Table 2: Language Data Training Set count.

    Language      Token Count
    Tamil         3169
    English       18017
    Hindi         4615
    Bengali       3556
    Gujarati      890
    Marathi       1960
    Kannada       1674
    Telugu        6474
    Malayalam     1160

Table 3: Named Entities Training Set count.

    Named Entities    Token Count
    NE                2028
    NE_P              257
    NE_L              29
    NE_O              22
    NE_PA             7
    NE_LA             1
    NE_X              38
    NE_XA             5
    NE_OA             24

4.METHODOLOGY AND FEATURE DESCRIPTION

We participated in the Query Word Labeling task, which is described briefly as follows. Suppose that q: w₁ w₂ w₃ is a query written in Roman script. The words w₁, w₂, etc. could be standard English words or transliterated from another language L = {Bengali (Bn), Gujarati (Gu), Hindi (Hi), Kannada (Ka), Malayalam (Ml), Marathi (Mr), Tamil (Ta), Telugu (Te)}. The task is to label each word as En or L. For query word labeling we used a Support Vector Machine classifier to predict whether a particular word belongs to an Indian language or to English. As the training corpus is large, the words from the corpus are taken as features. As a preprocessing step, the raw input data, taken as a sequence of tokens, is annotated with the corresponding tags. This annotated data set is given as input to the machine, from which the features are extracted. Various features are used for better language labelling. The three prefixes and suffixes of the current word, the length of the current token and the position of the current word are taken as features. Punctuation, comma, colon/semicolon, dot and words starting with '@' and '#' are taken as binary features. This set of features is mainly used to identify Indian languages; it constitutes checks on token endings for the presence of certain characters. Along with the features, the machine also learns from the training data set, which is already labelled. For the test data, the same preprocessing step is carried out. The annotated test data is given as input to the Support Vector Machine classifier and the classified output is taken. Sample output is given in Table 4.

Table 4: Input query with desired output.

    Input Query:  And ibruna meet maadid kushinu aythu !!!
    Output:       And\en ibruna\kn meet\en madid\kn kushinu\kn aythu\kn !\X

    Input Query:  Dhoni risk edutha gumbala risk edukanam !
    Output:       Dhoni\NE Risk\en edutha\ta gumbala\ta risk\en edukanam\ta !\x
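The per-token features described in Section 4 could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name and feature keys are hypothetical, "three prefixes and suffixes" is read here as the prefixes and suffixes of length 1 to 3, words are lowercased, and the position is taken as the 0-based index of the token within its query. The paper does not name the SVM toolkit or its input format, so the resulting feature dictionaries would still have to be converted into whatever representation that toolkit expects.

```python
def extract_features(tokens, index):
    """Build an illustrative feature dict for tokens[index] within one query."""
    word = tokens[index]
    features = {
        "word": word.lower(),                 # the word itself (lowercasing is an assumption)
        "length": len(word),                  # length of the current token
        "position": index,                    # 0-based position in the query (assumption)
        # Binary features on punctuation and special leading characters.
        "is_punct": bool(word) and all(not ch.isalnum() for ch in word),
        "has_comma": "," in word,
        "has_colon_or_semicolon": ":" in word or ";" in word,
        "has_dot": "." in word,
        "starts_with_at": word.startswith("@"),
        "starts_with_hash": word.startswith("#"),
    }
    # Prefixes and suffixes of length 1..3 (one reading of "three prefixes and suffixes").
    for n in (1, 2, 3):
        features["prefix%d" % n] = word[:n].lower()
        features["suffix%d" % n] = word[-n:].lower()
    return features


if __name__ == "__main__":
    query = "Dhoni risk edutha gumbala risk edukanam !".split()
    for i in range(len(query)):
        print(query[i], extract_features(query, i))
```

For words shorter than three characters the longer prefixes and suffixes simply equal the whole word, which keeps the feature set the same size for every token.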



5.PROPOSED SYSTEM
The queries and their corresponding tags are provided as the training data of the shared task, and these are annotated as a preprocessing step. The flow of the proposed system is illustrated in Fig 1. Features are extracted from the annotated data. The extracted features, together with the sequence of lines, are given as input to the Support Vector Machine classifier, which creates a model file. The test data and the model file are then given to the classifier and the output is extracted. The output is further processed so that each utterance id is properly paired with the test data.

Fig 1: Proposed system flow diagram.

6.RESULT AND CONCLUSION
In this paper, we described our system for Subtask 1 in FIRE 2015 - Query Word Labelling on Mixed Script Information Retrieval. Query word labelling is very useful in search engines. We used an SVM classifier to identify languages, punctuation, NEs, Mixed and Other tokens. The SVM uses a set of features that gives reasonable accuracy for mixed-language queries and other tags. In the proposed language identification system, the word sequences are divided into tokens, which are used to train the SVM classifier, and the system is evaluated against the given test data. The system is evaluated separately for each tag in the language pair, Mixed and Named Entities using Recall, Precision and F1-Score. More attention is required on Mixed Script and NEs. As future work, words from a language dictionary and words represented as distributed vectors can also be included as features, which is expected to improve the accuracy of the system.

Overall scores for the tag set are given in Table 5.

Table 5: Summary of Scores.

    Mixes Accuracy         8.3333
    NEs Accuracy           36.3964
    Token Accuracy         76.6231
    Utterance Accuracy     16.9182
    Average F-Measure      0.682876
    Weighted F-Measure     0.766462

7.REFERENCES
[1] Irshad Ahmad Bhat, Vandan Mujadia, Aniruddha Tammewar (IIT-H). IIT-H System Submission for FIRE2014 Shared Task on Transliterated Search.

[2] P Gupta, Kalika Bali, R E Banchs, M Choudhury, P Rosso. Query Expansion for Mixed-Script Information Retrieval. In Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval, 2014.

[3] Dinesh Kumar Prabhakar, Sukomal Pal (Indian School of Mines). ISM@FIRE-2014: Shared task on Transliterated Search. FIRE 2014.

[4] Channa Bankapur, Adithya Abraham Philip, Saimadhav A Heblikar (PES University). Query Word Labeling using Supervised Machine Learning: Shared task report by PESIT team. 2014.

[5] Utsab Barman, Amitava Das, Joachim Wagner and Jennifer Foster (CNGL Center for Global Intelligent Content, National Center for Language Identification). Code Mixing: A Challenge for Language Identification in the Language of Social Media. 2014.

[6] Induja, Indu M, P.C Reghu Raj. Text Based Language Identification System for Indian Languages Following Devanagari. International Journal of Engineering Research & Technology (IJERT), 2014. ISSN: 2278-0181.

[7] Abinaya N, Neethu John, Dr. M. Anand Kumar and Dr. K.P. Soman (Amrita University). AMRITA@FIRE-2014: Named Entity Recognition for Indian Languages. FIRE 2014.

[8] Kalika Bali, Yogarshi Vyas, Monojit Choudhury (Microsoft India and University of Maryland). POS Tagging of English-Hindi Code-Mixed Social Media Content. Proceedings of the 2014 EMNLP, pages 974-979, October 25-29, 2014.

[9] Supriya Anand, Bangalore, India. FIRE-2015 Language identification for transliterated forms of Indian Languages queries.

[10] Anupam Jamatia, Amitava Das. Part-of-Speech Tagging System for Indian Social Media Text on Twitter. Proceedings of the Workshop on Language Technologies For Indian Social Media (SOCIAL-INDIA), pages 21-28.

[11] Yogarshi Vyas, Spandana Gella, Jatin Sharma, Kalika Bali, Monojit Choudhury. POS Tagging of English-Hindi Code-Mixed Social Media Content. Conference on Empirical Methods in Natural Language Processing (EMNLP) 2014, pages 974-979.

