=Paper=
{{Paper
|id=Vol-1587/T2-2
|storemode=property
|title=AmritaCEN_NLP @ FIRE 2015 Language Identification for Indian Languages in Social Media Text
|pdfUrl=https://ceur-ws.org/Vol-1587/T2-2.pdf
|volume=Vol-1587
|authors=Rahul Venkatesh Kumar,Anand Kumar M, Soman KP
|dblpUrl=https://dblp.org/rec/conf/fire/KumarKS15
}}
==AmritaCEN_NLP @ FIRE 2015 Language Identification for Indian Languages in Social Media Text==
Rahul Venkatesh Kumar RM, Anand Kumar M, Soman KP

Centre for Excellence in Computational Engineering and Networking, Amrita Vishwa Vidyapeetham, Coimbatore, India

rahulvks@gmail.com, m_anandkumar@cb.amrita.edu, kp_soman@amrita.edu

===ABSTRACT===
The growth of social media content such as Twitter and Facebook messages and blog posts has created many new opportunities for language technology. User-generated content such as tweets and blogs is, in most languages, written in Roman script because of distinct social culture and technology; some users write in their own language's script or in mixed script. The primary challenge in processing such short messages is identifying the languages involved, so language identification is not restricted to a single language but must handle multiple languages. The task is to label each word with one of the following categories: L1, L2, Named Entities, Mixed, Punctuation and Others. This paper presents the AmritaCEN_NLP team's participation in the FIRE 2015 Shared Task on Mixed Script Information Retrieval, Subtask 1: Query Word Labeling, which covers language identification of each word in the text together with the Named Entities, Mixed, Punctuation and Others labels. Our system uses sequence-level query labelling with a Support Vector Machine.

===CCS Concepts===
* Theory of computation~Support vector machines
* Computing methodologies~Natural language processing
* Information systems~Information extraction
* Human-centered computing~Social tagging systems

===Keywords===
Language Identification, Support Vector Machine (SVM), Information retrieval, Mixed Script, Short Message.

===1. INTRODUCTION===
This paper describes our system for the FIRE 2015 Shared Task on Query Word Labeling in Mixed Script Information Retrieval. With the rapid growth of the internet, webpages are no longer limited to English, and social media content in other languages is increasing rapidly [1]. Nowadays webpages can be found in every popular non-English language, including Indian languages. In social media, users generally write their native languages in Romanized form to express their thoughts [2][3]. To handle this multilingual text processing problem, each token needs to be labelled with its corresponding language. The idea of Multi-Script IR was first introduced by P Gupta, Kalika Bali, R E Banchs, M Choudhury and P Rosso at the 2013 SIGIR conference [3]. This task addresses the problem of language identification in code-mixed queries. The task focuses on sentence-level language identification in code-mixed queries in English and any of 8 Indian languages (L): Hindi, Bengali, Tamil, Gujarati, Marathi, Kannada, Telugu and Malayalam. Our language identification system uses a Support Vector Machine for word-level classification.

===2. RELATED WORKS===
The problem of language identification has been researched for half a century (Gold, 1967), and code switching for several decades. However, there has been less work on automatic language identification for mixed-script analysis in social media websites and forums. Research showed that the predominant language used on Twitter and Facebook in their earlier days was English [4][5]. With the worldwide growth of social media, people started to write in their own language with the help of Roman script. The number of people using mixed script in social media communication has increased tremendously. According to a report, 45% of users use mixed script on Facebook, 40% use English for communicating, and 15% use their native language [6][7]. Identifying the language of social media content and analysing it is essential for extracting information, which can in turn be used to aid search engines and to monitor online behavior for security purposes [8][9]. A few years ago, documents were written in a single language only. With the emergence of social media, documents are now written in mixed script [10].

===3. DATA SET DESCRIPTION===
In the training data set, each input query is annotated with its labels. The queries are written in Roman script. The input queries and the annotated set are provided as part of the subtask. The training data consists of an input file and an annotation file, each with 2,908 sentences (54,088 tokens). The test data contains 792 sentences (11,999 tokens). Tokens that are person names, locations, organizations or abbreviations fall under the NER label. Tokens constructed from two parts, each coming from a different language, are labelled MIX. Emoticons, hashes and punctuation are labelled X. Foreign languages are labelled O. No additional data set is used in this task. An input query may contain a mixture of one or two languages, named entities, mixed tokens, punctuation and others. Table 1 contains the counts for mixed, punctuation and others together with the overall token count. The token count per language is given in Table 2. Named entities have nine different tags, and the counts of the NER tokens are given in Table 3.

{| class="wikitable"
|+ Table 1: Tag set and Total Count.
! Tag set !! Count
|-
| Language Token Count
| 41,515
|-
| Named Entities
| 2,391
|-
| Mixed (Mix)
| 70
|-
| Punctuations (X)
| 7,710
|-
| Others (O)
| 11
|-
| Total Tag Count
| 54,088
|}

{| class="wikitable"
|+ Table 2: Language Data Training Set count.
! Language !! Token Count
|-
| Tamil
| 3169
|-
| English
| 18017
|-
| Hindi
| 4615
|-
| Bengali
| 3556
|-
| Gujarati
| 890
|-
| Marathi
| 1960
|-
| Kannada
| 1674
|-
| Telugu
| 6474
|-
| Malayalam
| 1160
|}

{| class="wikitable"
|+ Table 3: Named Entities Training Set count.
! Named Entities !! Token Count
|-
| NE
| 2028
|-
| NE_P
| 257
|-
| NE_L
| 29
|-
| NE_O
| 22
|-
| NE_PA
| 7
|-
| NE_LA
| 1
|-
| NE_X
| 38
|-
| NE_XA
| 5
|-
| NE_OA
| 24
|}

===4. METHODOLOGY AND FEATURE DESCRIPTION===
We participated in the Query Word Labeling task, which can be described briefly as follows. Suppose that q: w₁ w₂ w₃ is a query written in Roman script. The words w₁, w₂, etc. could be standard English words or words transliterated from another language L = {Bengali (Bn), Gujarati (Gu), Hindi (Hi), Kannada (Ka), Malayalam (Ml), Marathi (Mr), Tamil (Ta), Telugu (Te)}. The task is to label each word as En or L.

For query word labeling we used a Support Vector Machine classifier to predict the language of each word, which belongs either to an Indian language or to English. As the training corpus is large, the words of the corpus themselves are taken as features. As a preprocessing step, the raw input data is arranged as one token per line within each sequence and annotated with the corresponding tag. This annotated data set is the input from which the features are extracted.

Various features are used to improve language labelling. The three prefixes and suffixes of the current word, the length of the current token and the position of the current word are taken as features. Punctuation, comma, colon/semicolon, dot, and words starting with '@' and '#' are taken as binary features. This feature set has been used mainly to identify Indian languages; the features amount to checks on token endings for the presence of certain characters. Along with these features, the machine also learns from the labelled training data. The same preprocessing step is carried out on the test data. The preprocessed test data is given as input to the Support Vector Machine classifier and the classified output is collected. Sample output is given in Table 4.

{| class="wikitable"
|+ Table 4: Input query with desired output.
! Input Query !! Output
|-
| And ibruna meet maadid kushinu aythu !!!
| And\en ibruna\kn meet\en madid\kn kushinu\kn aythu\kn !\X
|-
| Dhoni risk edutha gumbala risk edukanam !
| Dhoni\NE Risk\en edutha\ta gumbala\ta risk\en edukanam\ta !\x
|}

===5. PROPOSED SYSTEM===
The queries and their corresponding tags are given as the training data of the shared task, and these are annotated as a preprocessing step. The flow of the proposed system is illustrated in Fig 1. Features are extracted from the annotated data. The sequences of lines, together with the extracted features, are given as input to the Support Vector Machine classifier, which creates a model file. The test data and the model file are then given to the classifier and the output is extracted. Finally, the output is post-processed so that each utterance id is properly paired with the test data.

Fig 1: Proposed system flow diagram.

===6. RESULT AND CONCLUSION===
In this paper, we described our system for Subtask 1 in FIRE 2015, Query Word Labelling in Mixed Script Information Retrieval. Query word labelling is very useful in search engines. We used an SVM classifier to identify languages, punctuation, NEs, mixed tokens and others. The SVM uses a set of features that gives reasonable accuracy for mixed-language queries and the other tags. In the proposed language identification system, the word sequences are divided into tokens, which are used to train the SVM classifier, and the system is evaluated against the given test data. The system is evaluated separately for each tag in the language pair, Mixed and Named Entities, using Recall, Precision and F1-Score. More attention is required on mixed-script tokens and NEs. As future work, words from a language dictionary and distributed word vectors can also be included as features, which is expected to improve the accuracy of the system. Overall scores for the tag set are given in Table 5.

{| class="wikitable"
|+ Table 5: Summary of Scores.
! Measure !! Score
|-
| Mixes Accuracy
| 8.3333
|-
| NEs Accuracy
| 36.3964
|-
| Token Accuracy
| 76.6231
|-
| Utterance Accuracy
| 16.9182
|-
| Average F-Measure
| 0.682876
|-
| Weighted F-Measure
| 0.766462
|}

===7. REFERENCES===
[1] Irshad Ahmad Bhat (IIT-H), Vandan Mujadia (IIT-H), Aniruddha Tammewar (IIT-H). IIT-H System Submission for FIRE2014 Shared Task on Transliterated Search.

[2] P Gupta, Kalika Bali, R E Banchs, M Choudhury, P Rosso. Query Expansion for Mixed-Script Information Retrieval. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, 2014.

[3] Dinesh Kumar Prabhakar, Sukomal Pal (Indian School of Mines). ISM@FIRE-2014: Shared Task on Transliterated Search. FIRE 2014.

[4] Channa Bankapur, Adithya Abraham Philip, Saimadhav A Heblikar (PES University). Query Word Labeling using Supervised Machine Learning: Shared Task Report by PESIT Team. 2014.

[5] Utsab Barman, Amitava Das, Joachim Wagner and Jennifer Foster (CNGL Center for Global Intelligent Content, National Center for Language Identification). Code Mixing: A Challenge for Language Identification in the Language of Social Media. 2014.

[6] Induja, Indu M, P.C Reghu Raj. Text Based Language Identification System for Indian Languages Following Devanagari. International Journal of Engineering Research & Technology (IJERT), 2014. ISSN: 2278-0181.

[7] Abinaya N, Neethu John, M. Anand Kumar and K.P. Soman (Amrita University). AMRITA@FIRE-2014: Named Entity Recognition for Indian Languages. FIRE 2014.

[8] Kalika Bali, Yogarshi Vyas, Monojit Choudhury (Microsoft India and University of Maryland). POS Tagging of English-Hindi Code-Mixed Social Media Content. Proceedings of the 2014 EMNLP, pages 974-979, October 25-29, 2014.

[9] Supriya Anand, Bangalore, India. FIRE-2015 Language Identification for Transliterated Forms of Indian Languages Queries.

[10] Anupam Jamatia, Amitava Das. Part-of-Speech Tagging System for Indian Social Media Text on Twitter. Proceedings of the Workshop on Language Technologies for Indian Social Media (SOCIAL-INDIA), pages 21-28.

[11] Yogarshi Vyas, Spandana Gella, Jatin Sharma, Kalika Bali, Monojit Choudhury. POS Tagging of English-Hindi Code-Mixed Social Media Content. Conference on Empirical Methods in Natural Language Processing (EMNLP) 2014, pages 974-979.
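
The word-level features listed in the methodology section (prefixes and suffixes of the current word, token length, token position, and binary checks for punctuation, comma, colon/semicolon, dot and a leading '@' or '#') can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names <code>token_features</code> and <code>query_features</code>, the feature keys, and the reading of "three prefixes and suffixes" as character affixes of length 1 to 3 are assumptions made only for this example.

```python
import string


def token_features(tokens, i):
    """Build a feature dict for tokens[i].

    Hypothetical helper for illustration; the paper does not give its
    exact feature encoding.
    """
    tok = tokens[i]
    feats = {
        "length": len(tok),
        "position": i,  # 0-based index of the token within the query
        # True only when every character of the token is ASCII punctuation
        "is_punct": bool(tok) and all(ch in string.punctuation for ch in tok),
        "has_comma": "," in tok,
        "has_colon_or_semicolon": ":" in tok or ";" in tok,
        "has_dot": "." in tok,
        "starts_with_at": tok.startswith("@"),
        "starts_with_hash": tok.startswith("#"),
    }
    # Assumption: "three prefixes and suffixes" means affixes of length 1, 2, 3.
    for n in (1, 2, 3):
        if len(tok) >= n:
            feats[f"prefix{n}"] = tok[:n].lower()
            feats[f"suffix{n}"] = tok[-n:].lower()
    return feats


def query_features(query):
    """Split a Romanized query on whitespace and featurize every token."""
    tokens = query.split()
    return [token_features(tokens, i) for i in range(len(tokens))]


if __name__ == "__main__":
    query = "Dhoni risk edutha gumbala risk edukanam !"
    for tok, feats in zip(query.split(), query_features(query)):
        print(tok, feats)
```

Feature dictionaries of this kind would still need to be converted to numeric vectors before being passed to an SVM toolkit; the paper does not name the toolkit it used, so that step is left out here.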