=Paper=
{{Paper
|id=Vol-1587/T2-7
|storemode=property
|title=NELIS - Named Entity and Language Identification System: Shared Task System Description
|pdfUrl=https://ceur-ws.org/Vol-1587/T2-7.pdf
|volume=Vol-1587
|authors=Rampreeth Ethiraj,Sampath Shanmugam,Gowri Srinivasa,Navneet Sinha
|dblpUrl=https://dblp.org/rec/conf/fire/EthirajSSS15
}}
==NELIS - Named Entity and Language Identification System: Shared Task System Description==
Rampreeth Ethiraj, Sampath Shanmugam, Gowri Srinivasa
PES Center for Pattern Recognition, PESIT Bangalore South Campus, Bengaluru, Karnataka, India
ethirajrampreeth@gmail.com, sampath_shanmugam@outlook.com, gsrinivasa@pes.edu

Navneet Sinha
Rochester Institute of Technology, Rochester, New York, USA
navneet.sinha27@gmail.com

ABSTRACT
This paper proposes a simple and elegant solution for language identification and named entity (NE) recognition at the word level, as part of Subtask-1: Query Word Labeling of FIRE 2015. Given any query q: w1 w2 w3 ... wn in Roman script, the task calls for labeling each word of the query as English (En) or a member of L, where L = {Bengali (Bn), Gujarati (Gu), Hindi (Hi), Kannada (Kn), Malayalam (Ml), Marathi (Mr), Tamil (Ta), Telugu (Te)}. The approach presented in this paper combines a dictionary lookup with a Naive Bayes classifier trained over character n-grams. We also devise an algorithm to resolve ambiguities between languages for any given word in a query. Our system achieved f-measure scores of 85-90% for four languages and 74-80% for another four.

Keywords
Language Identification, N-grams, Naive Bayes classifier

1. INTRODUCTION
India's linguistic heritage is one of the richest in the world, and the country has been called a "Museum of Languages". India is a multi-language, multi-script country with 22 official languages, a large number of which are written using indigenous scripts. However, websites and user-generated content such as tweets and blogs in these languages are often written in Roman script [1] due to various social, cultural and technological reasons. This paper presents an approach to analyze a sentence written in En and a transliterated language from L, where L = {Bn, Gu, Hi, Kn, Ml, Mr, Ta, Te}, adopting the Roman script, from sources such as tweets, blogs and user-generated messages, and to pin down the language every word belongs to.

The philosophy of this approach was partially inspired by how humans identify the languages of words. First, if a word is part of their vocabulary, they know its language. If the word is unfamiliar, they tend to make a guess based on the structure of the word. Finally, if they are given a sentence and have managed to decode the language of a few words, they can make a fairly accurate guess about the language of the unknown words as well. A close analogy can be drawn between the above and the approach suggested in this paper: the human language vocabulary corresponds to the language dictionaries, and the guess based on the features of the word is performed by a Naive Bayes classifier using character n-grams as features. A logical method for disambiguation is also suggested in this paper.

2. DATASETS
The core of the system was building strong dictionaries for each language. The word lists used to compile the dictionaries are listed in Table 1.

Table 1. Primary sources used to prepare dictionaries.
Class                  | Source
Bn, Hi, Gu             | FIRE 2013 Dataset [2]
En                     | Mieliestronk's word list (http://www.mieliestronk.com/wordlist.html) + FIRE 2013 Dataset
MIX                    | FIRE 2015 Dataset
NE                     | FIRE 2015 Dataset
Bn, Gu, Kn, Ml, Ta, Te | List of the most frequently used English words [3], translated and transliterated

The most frequently used words in En were translated into their respective Indian language equivalents using Google's online translation service (https://translate.google.com/). However, the translated words were all in their native scripts and had to be transliterated into their Roman equivalents. The process of phonetically representing the words of a language in a non-native script is called transliteration [4]. Baraha Software (http://www.baraha.com) was used to transliterate these words into their Roman script equivalents.

While this sufficed for En and Hi, the data collected was not enough for accurate classification of the other languages. Thus, in addition to these word lists, data had to be mined from other sources to account for various spelling variations [5] and to capture the commonly used words of each language. These secondary sources include song lyrics, common SMS messages and 'learn to speak' websites found online. Even shorthand notations of various words were effectively captured from these sources. For example, in Gu, 'che' is also sometimes spelt as '6e'.

We manually extracted language words in Roman form from these secondary sources, cleaned them and keyed them into the dictionaries. Table 2 lists these secondary sources. Comprehensive dictionaries were hence manually formed for each language; Table 3 lists their final sizes.

Table 2. Secondary sources used to prepare dictionaries.
Song lyrics (Bn, Gu, Kn, Ml, Mr, Ta, Te):
  Kn, Mr, Te: http://www.hindilyrics.net/
  Gu: http://songslyricsever.blogspot.com/p/blog-page_9289.html
  Ml: http://www.malayalamsonglyrics.net
  Bn: http://www.lyricsbangla.com
  Ta: http://www.paadalvarigal.com/
SMS messages and 'learn to speak' websites (Bn, Gu, Kn, Ml, Mr, Ta, Te):
  http://www.funbull.com/sms/sms-jokes.asp
  http://www.omniglot.com/language/phrases/langs.html
Commonly used SMS abbreviations (X):
  http://www.connexin.net/internet-acronyms.html
Common names of people, places, organizations and brands (NE):
  https://bitbucket.org/happyalu/corpus_indian_names/downloads
  http://simhanaidu.blogspot.in/2013/01/text-list-of-indian-cities-alphabetical.html
  http://www.elections.in/political-parties-in-india/
  http://business.mapsofindia.com/top-brands-india/

Table 3. Final sizes of all language dictionaries.
Language | Dictionary size (in words)
En       | 97271
Hi       | 26094
Ta       | 23992
Te       | 25472
Bn       | 19573
Mr       | 10564
Gu       | 20729
Ml       | 22219
Kn       | 32479

Organization of dictionaries: each language dictionary was divided into sub-dictionaries based on the starting character, sorted alphabetically, to speed up the process of dictionary lookup. For example, all tokens of a language that start with 'a' are grouped together.
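As a purely illustrative sketch (not the authors' code), the first-character grouping and lookup described above could be organized in Python roughly as follows; the sample words and helper names are assumptions made for the example.

```python
from collections import defaultdict

def build_subdictionaries(words):
    """Group one language's word list into sub-dictionaries keyed by starting character."""
    subdicts = defaultdict(set)
    for word in words:
        word = word.strip().lower()
        if word:
            subdicts[word[0]].add(word)
    return subdicts

def in_dictionary(token, subdicts):
    """Look a token up only in the sub-dictionary of its starting character."""
    token = token.lower()
    return bool(token) and token in subdicts.get(token[0], set())

# Illustrative usage with a handful of Hi words (placeholders, not the real dictionary).
hi = build_subdictionaries(["aap", "antim", "kahan", "kyun", "praan", "yatra"])
print(in_dictionary("antim", hi))   # True
print(in_dictionary("yaava", hi))   # False
```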
3. APPROACH
Problem statement: given a query in Roman script, the task is to label each word as En or as a member of L. Two assumptions are made:
1. The words of a single query usually come from one or two languages, and very rarely from three.
2. In the case of mixed-language queries, one of the languages is either En or Hi.

The approach is divided into two parts: Section 3.1 explains the process of classifying tokens, while Section 3.2 elaborates on the process of disambiguation. Figure 1 depicts the overall process.

3.1 Classification of Tokens
The system built to demonstrate this approach was written entirely in Python, using the NLTK package (http://www.nltk.org) for processing and classification. The test file provided consisted of utterances (sentences or queries). The system read the input file utterance by utterance, and each utterance was tagged token (word) by token, sequentially. Section 3.1.1 explains the tagging of X tokens with regular expressions, and Section 3.1.2 explains the tagging of language tokens. At the end of the process, an annotated output file was generated.

3.1.1 Regular Expression Based Tagging
Regular expressions were used to match X tokens [6]. Table 4 shows the expressions used and their class. The X dictionary was also referenced in case none of the expressions matched the token.

Table 4. Regular expressions used to tag X.
Regular Expression | Class
r'[\.\=\:\;\,\#\@\(\)\`\~\$\*\!\?\"\+\-\\\/\|\{\}\[\]\_\<\>\%\&]+' | X
r'[0-9]+' | X
r'[a-zA-Z]+[\@]+[a-zA-Z\.]*' | X
r'http+' | X
r'www.[A-Za-z0-9]+.com' | X
r'[A-Za-z0-9]+.com' | X
r'[0-9]+[tT][hH]' | X
r'[0-9]*[1]+[sS][tT]' | X
r'[0-9]*[2][nN][dD]' | X
r'[0-9]*[3][rR][dD]' | X
r'[^a-zA-z]', with token length = 1 | X
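As an illustrative sketch only, the regular-expression pass could look roughly like the following in Python; a prefix match (re.match), the subset of patterns shown and the tiny fallback X dictionary are assumptions made for the example, not the authors' exact implementation.

```python
import re

# A few of the patterns from Table 4 (not the complete list).
X_PATTERNS = [
    r'[\.\=\:\;\,\#\@\(\)\`\~\$\*\!\?\"\+\-\\\/\|\{\}\[\]\_\<\>\%\&]+',
    r'[0-9]+',
    r'[a-zA-Z]+[\@]+[a-zA-Z\.]*',
    r'http+',
    r'www\.[A-Za-z0-9]+\.com',
    r'[0-9]+[tT][hH]',
]

X_DICTIONARY = {"lol", "omg", "brb"}  # placeholder SMS abbreviations, not the real X dictionary

def tag_x(token):
    """Return 'X' if the token matches an X pattern or the X dictionary, else None."""
    for pattern in X_PATTERNS:
        if re.match(pattern, token):
            return "X"
    if token.lower() in X_DICTIONARY:
        return "X"
    return None

print(tag_x("26th"))   # X (ordinal)
print(tag_x("yatra"))  # None (left for language tagging)
```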
3.1.2 Language Tagging
To tag language tokens, a combination of dictionary lookup and a Naive Bayes classifier was used. The subsections below explain the process; the two techniques were applied sequentially.

3.1.2.1 Dictionary Lookup and Tagging
The dictionaries of all languages were looked up for every token that had not already been tagged as X, MIX or NE. Three cases could arise:

Case 1: The token belongs to exactly one language. It is tagged with that language.
Case 2: The token belongs to more than one language. It is tagged as ambiguous, along with the set of languages causing the ambiguity.
Case 3: The token is not found in any of the language dictionaries. The Naive Bayes classifier is used to guess the language, as explained in Section 3.1.2.2.

After all tokens had been tagged by the dictionary, the number of occurrences of each language tag was aggregated. This count is used later while trying to resolve ambiguity.

The overall pipeline (Figure 1) is: read the input file; tokenize utterance[i], for i = 1 to n; tag X tokens of utterance[i] with regular expressions and the X dictionary; tag MIX and NE tokens with their respective dictionaries; tag language tokens by performing language dictionary lookups, tagging a token as ambiguous if it is present in multiple language dictionaries; tag all remaining untagged tokens using the Naive Bayes classifier; resolve ambiguity using the algorithm specified in Section 3.2; write the output file and repeat until i = n.
Figure 1. Overall process of tagging, from input to output.

3.1.2.2 Naive Bayes Classifier and Tagging
An inherently multiclass Naive Bayes classifier from the NLTK package was trained specifically for language identification. Each language l in L is a class. During training, the frequencies of co-occurrences of character n-grams in the language dictionaries prepared in Section 2 were analyzed. An n-gram is an n-character slice of a longer string [7]. A frequency distribution of character 2-grams, 3-grams, 4-grams and 5-grams was studied and used to train the classifier. A token t is assigned the language

  lang = argmax_{l in L} P(t | l) P(l) / P(t)

where lang is the language of a given token, t is the token and l is a language in L.

Those tokens that were not tagged after the dictionary lookup were tagged by the Naive Bayes classifier. After all tokens had been tagged by the classifier, the number of occurrences of each language tag was again aggregated, but this time the count for each language was multiplied by a specific weight based on the accuracy of the classifier for that particular language. These values were added to the counts previously computed for each language during the dictionary lookups of Section 3.1.2.1.
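For illustration, character n-gram features can be fed to NLTK's Naive Bayes classifier roughly as follows. The tiny training list and the boolean feature encoding are assumptions made for this sketch; the real system was trained on the full dictionaries of Section 2.

```python
import nltk

def char_ngram_features(word, n_values=(2, 3, 4, 5)):
    """Boolean presence features for the character n-grams of a word."""
    word = word.lower()
    features = {}
    for n in n_values:
        for i in range(len(word) - n + 1):
            features["ngram=" + word[i:i + n]] = True
    return features

# Placeholder (word, language) pairs; the real system used the dictionaries of Section 2.
train_words = [("namaskara", "Kn"), ("yaava", "Kn"), ("bartini", "Kn"),
               ("kahan", "Hi"), ("kyun", "Hi"), ("tumhara", "Hi"),
               ("hello", "En"), ("world", "En"), ("language", "En")]
train_set = [(char_ngram_features(w), lang) for w, lang in train_words]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print(classifier.classify(char_ngram_features("namaskara")))  # most likely 'Kn'
```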
3.2 Further Processing and Disambiguation
Disambiguating words that belong to multiple languages tends to be a challenge unless the context of the utterance is known. For bilingual utterances, based on observation of the training set, we concluded that it is more probable for En to be one of the two languages.

To begin the process, yet another count is performed, this time exclusively over the ambiguous tokens: the number of occurrences of each language is computed and multiplied by a weight. Let this weight be size_l for any given language l in L, where size_l is the size of the dictionary of language l divided by the total size of the language dictionaries; En was not taken into account while computing the total sum, because of the large size of its dictionary. These newly computed scores were added to the scores computed previously in Section 3.1.2.2 for each language and were used to determine the language(s) of the utterance; the language with the maximum score is ranked highest.

The challenge, then, was to identify either a single language or a pair of languages for each utterance. This was done by identifying the most frequently occurring Indian language, say lang, in the utterance, together with the count of En in the utterance, as computed previously. The steps involved in resolving ambiguity in an utterance are as follows.

Step 1: All unambiguous tokens that belonged to neither lang nor En were converted to lang. This decision was made given the strength of the En dictionary: the probability of a new word belonging to En, given that it is not in the En dictionary, is low.
Step 2: All ambiguous tokens where the ambiguity was between En and another language or set of languages, and lang was absent, were converted to En.
Step 3: All ambiguous tokens where the ambiguity was between lang and another language or set of languages were converted to lang.
Step 4: For all ambiguous tokens that were not disambiguated in the previous steps, the following rule was applied: if the token is not the first token in the utterance and the previous token is a language token, the token takes the language of the previous token; otherwise, if the next token is a language token, the current token takes that language; otherwise, it is tagged as En.

This scheme works by identifying the overall language(s) of the utterance and then narrowing down to the language of the individual token for disambiguation.
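For illustration only, the four disambiguation steps could be coded roughly as below. The token representation, the tag names and the 'AMB' marker are assumptions made for the sketch; the utterance-level scoring that selects lang is assumed to have been done beforehand.

```python
def resolve_ambiguity(tokens, lang):
    """tokens: list of dicts with keys 'word', 'tag' and, for tag 'AMB', 'candidates'
    (the set of languages causing the ambiguity). 'tag' is a language code such as
    'En', 'Hi', 'Gu', or one of 'AMB', 'X', 'MIX', 'NE'.
    lang: the most frequently occurring Indian language of the utterance."""
    non_language = {"AMB", "X", "MIX", "NE", None}

    def is_language(tag):
        return tag not in non_language

    for i, tok in enumerate(tokens):
        tag = tok["tag"]
        if tag in ("X", "MIX", "NE"):
            continue
        if tag != "AMB":
            # Step 1: unambiguous tokens belonging to neither lang nor En become lang.
            if tag not in ("En", lang):
                tok["tag"] = lang
            continue
        candidates = tok["candidates"]
        if "En" in candidates and lang not in candidates:
            tok["tag"] = "En"            # Step 2
        elif lang in candidates:
            tok["tag"] = lang            # Step 3
        else:
            # Step 4: copy a neighbouring language tag, otherwise default to En.
            prev_tag = tokens[i - 1]["tag"] if i > 0 else None
            next_tag = tokens[i + 1]["tag"] if i + 1 < len(tokens) else None
            if is_language(prev_tag):
                tok["tag"] = prev_tag
            elif is_language(next_tag):
                tok["tag"] = next_tag
            else:
                tok["tag"] = "En"
    return tokens

# Illustrative usage on a toy utterance.
utterance = [{"word": "praan", "tag": "Hi"},
             {"word": "ni", "tag": "AMB", "candidates": {"Gu", "Hi"}},
             {"word": "antim", "tag": "Hi"}]
print([t["tag"] for t in resolve_ambiguity(utterance, "Hi")])  # ['Hi', 'Hi', 'Hi']
```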
4. RESULTS
A single run was submitted for the subtask; the results are summarized in Table 5 and Table 6.

Table 5. Summary of the scores obtained for each class.
Class | Strict Precision | Strict Recall | Strict F-measure
MIX   | 0     | 0     | 0
NE    | 0.645 | 0.326 | 0.433
X     | 0.952 | 0.941 | 0.947
Bn    | 0.795 | 0.921 | 0.853
En    | 0.898 | 0.852 | 0.874
Gu    | 0.270 | 0.490 | 0.349
Hi    | 0.713 | 0.841 | 0.771
Kn    | 0.937 | 0.814 | 0.871
Ml    | 0.675 | 0.830 | 0.744
Mr    | 0.808 | 0.774 | 0.791
Ta    | 0.912 | 0.872 | 0.891
Te    | 0.774 | 0.778 | 0.777

Table 6. Summary of the overall scores obtained.
Measure             | Run-1
Tokens Accuracy     | 82.715
Utterances Accuracy | 26.389
Average F-measure   | 0.692
Weighted F-measure  | 0.829

5. ERROR ANALYSIS
The system yields promising results for word-level language identification and named entity recognition: Bn, En, Kn and Ta all have f-measures above 85%, and the remaining languages, with the exception of Gu, have f-measures above 74%.

Errors introduced during translation and transliteration must be accounted for. The accuracy for Gu was comparatively low. Upon detailed analysis, it was observed that various spelling variations could not be accounted for, either in the dictionaries or during training. Also, much ambiguity existed between Hi and Gu; because Hi words occur more frequently, the system is biased towards Hi in such ambiguous situations. This made it particularly difficult to identify Gu correctly in short utterances. For example, in the query 'praan ni antim yatra' from the training set, praan, antim and yatra are all Hi words too.

The system also fails to tag MIX words in the test dataset, because MIX tokens are present in specific language dictionaries in the training data. For example, 'account-la' (where 'account' is En and 'la' is Ta) is present in the Ta dictionary. This explains the low scores for MIX.

6. CONCLUSION AND FUTURE SCOPE
In this paper, we presented a brief synopsis of a methodology to classify query words into their respective languages. The methodology couples a dictionary lookup with a Naive Bayes classifier to accomplish the task. Using word-level n-grams as features for the Naive Bayes classifier could be experimented with, and a new approach to identify and tag MIX tokens will have to be devised. Furthermore, the accuracy for Gu, and the overall accuracy of the system, can be improved by devising a new technique to handle the ambiguity between Hi and Gu.

7. REFERENCES
[1] Umair Z. Ahmed, Kalika Bali, Monojit Choudhury, Sowmya VB. Challenges in Designing Input Method Editors for Indian Languages: The Role of Word-Origin and Context. In Proceedings of the WTIM, pages 1-9, 2011.
[2] FIRE 2013 Dataset. Datasets for FIRE 2013. URL: http://cse.iitkgp.ac.in/resgrp/cnerg/qa/fire13translit/index.html. Last accessed: October 5, 2015.
[3] first20hours. google-10000-english/20k.txt. URL: https://github.com/first20hours/google-10000-english/blob/master/20k.txt. Last accessed: October 5, 2015.
[4] Kevin Knight, Jonathan Graehl. Machine Transliteration. Computational Linguistics, pages 599-612, 1998.
[5] Royal Denzil Sequiera, Shashank S. Rao, Shambavi B R. Word-Level Language Identification and Back Transliteration of Romanized Text: A Shared Task Report by BMSCE. Shared Task System Description in MSRI FIRE Working Notes, 2014.
[6] Navneet Sinha, Gowri Srinivasa. Hindi-English Language Identification, Named Entity Recognition and Back Transliteration: Shared Task System Description. Shared Task System Description in MSRI FIRE Working Notes, 2014.
[7] William B. Cavnar, John M. Trenkle. N-Gram-Based Text Categorization. In Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval, pages 161-169, 1994.