=Paper=
{{Paper
|id=Vol-1587/T2-8
|storemode=property
|title=Adaptive Voting in Multiple Classifier Systems for Word Level Language Identification
|pdfUrl=https://ceur-ws.org/Vol-1587/T2-8.pdf
|volume=Vol-1587
|authors=Soumik Mandal,Somnath Banerjee,Sudip Kumar Naskar,Paolo Rosso,Sivaji Bandyopadhyay
|dblpUrl=https://dblp.org/rec/conf/fire/MandalBNRB15
}}
==Adaptive Voting in Multiple Classifier Systems for Word Level Language Identification==
Adaptive Voting in Multiple Classifier Systems for Word Level Language Identification

Soumik Mandal (Jadavpur University, India) mandal.soumik@gmail.com
Somnath Banerjee (Jadavpur University, India) sb.cse.ju@gmail.com
Sudip Kumar Naskar (Jadavpur University, India) sudip.naskar@cse.jdvu.ac.in
Paolo Rosso (UPV, Spain) prosso@dsic.upv.es
Sivaji Bandyopadhyay (Jadavpur University, India) sivaji_cse_ju@yahoo.com

ABSTRACT

In social media communication, code switching has become quite a common phenomenon, especially among multilingual speakers. Automatic language identification becomes both a necessary and a challenging task in such an environment. In this work, we describe a CRF based system with a voting approach for word-level labeling of code-mixed queries, developed as part of our participation in the shared task on Mixed Script Information Retrieval at the Forum for Information Retrieval Evaluation (FIRE) in 2015. Our method uses character n-grams, simple lexical features and special character features, and can therefore easily be replicated across languages. The performance of the system was evaluated against the test sets provided by the FIRE 2015 shared task on mixed script information retrieval. Experimental results show encouraging performance across the language pairs.

CCS Concepts

•Computer systems organization → Embedded systems; Redundancy; Robotics; •Networks → Network reliability;

Keywords

ACM proceedings; LaTeX; text tagging

1. INTRODUCTION

Though South and South East Asian languages have their own indigenous scripts, these languages are mostly written using the Roman script in social media such as tweets, blogs, etc., due to various socio-cultural and technological reasons. The use of the Roman script for such languages presents serious challenges to understanding, search and language identification. The abundant use of the Roman script on the Web, both in documents and in the user queries that search those documents, needs to be addressed. Although language identification at the document level is a well-studied natural language processing problem [4], the problem of labeling the language of individual words within a multilingual document was addressed in [10], [8], which proposed language identification at the word level in mixed language documents instead of sentence level identification. Recently, the language identification problem in code-mixed data was revisited in the First Workshop on Computational Approaches to Code Switching at EMNLP-2014, where it was noted that fine-grained language identification between more than one language is still very challenging and error prone when the spans of text are small. Unsupervised and supervised approaches were investigated for the detection of four language pairs, Spanish-English, Modern Standard Arabic and Arabic dialects, Chinese-English and Nepalese-English, at the word level in code-switching data. The results of the task revealed that language identification in code-switching is still far from solved and warrants further natural language processing research. Shared tasks on language identification have been organized at FIRE since 2013, and various attempts [6],[7],[1],[3],[5],[9] were made to address the language identification task.

2. TASK DEFINITION

A query or utterance q: <w1 w2 w3 ... wn> is written in the Roman script. The words or tokens w1, w2, w3, etc. could be standard English (en) words or transliterated from any of the eight Indian languages under consideration in this subtask, namely Bengali (bn), Hindi (hi), Gujrati (gu), Kannada (kn), Malayalam (ml), Marathi (mr), Tamil (ta) and Telugu (te). The main objective of the task is to perform word-level language identification (WLL), i.e. to label each token with a single tag belonging to one of the five categories shown in Table 1. Though some of the categories also have finer subcategories, the identification of such subcategories is not mandatory.

Table 1: Tagset of different categories

  Category      Possible Tags                          Subcategory
  Language      en, bn, gu, hi, kn, ml, mr, ta or te   -
  Named Entity  NE                                     Person (NE P), Location (NE L), Organization (NE O), Abbreviation (NE PA, NE LA), Inflectional form (NE-Ls, where Ls is the language of the suffix) or none of the above (NE X)
  Mixed         MIX                                    MIX Lr Ls, where Lr and Ls are the root and suffix languages respectively
  Punctuation   X                                      -
  Others        O                                      -

3. DATA

This section describes the training and test datasets that were provided to the task participants by the task organizers. The training dataset was provided as a set of sentences with a tag for each token of each sentence. It consists of 2908 utterances, whereas the test dataset contains 792. Apart from the dataset provided by the task organizers, we did not use any external dataset or resources to either train or fine-tune our system.

An empirical study on the development data reveals the following facts: a) the average length of the tokens is greater than 5, and b) the majority of the tokens belong to the English language.

4. SYSTEM DESCRIPTION

Our word identification process involves three steps. First, multiple classifiers developed using CRF are applied independently. Then a voting approach is employed over the outputs of the classifiers applied in the first step. Finally, we employ a classifier which deals with the NE and MIX tags. We also handle the conflict situations that come up through voting (discussed in Section 4.2).

4.1 WLL Classification Features

We developed nine classifiers in total. Eight different IL(N) classification models, where N = BN, GU, HI, KN, ML, MR, TA, TE, were built for the eight Indian languages (ILs), namely the BN-classifier, GU-classifier, HI-classifier, KN-classifier, ML-classifier, MR-classifier, TA-classifier and TE-classifier. While training an IL(N) classifier, tokens of the type NE, MIX, Others and all other ILs were assigned the R tag. The output of an IL(N) classifier can thus be one of four tags: i) N, ii) X (for punctuation), iii) en (for English) and iv) R (for any IL except N, and for NE, MIX and Others). Apart from the eight IL(N) classifiers, we trained another classifier (the ALL-classifier) using all the tags present in the supplied training dataset. The ALL-classifier deals with the NE, MIX and Others tokens and also serves as a tie breaker (discussed in Section 4.2).

In this work, Conditional Random Fields (CRF) were employed to build all of the classifier models. We used the CRF++ toolkit, a simple, customizable, open source implementation of CRF. All nine classifiers used the same set of features, listed in the following subsections.
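To make the training setup concrete, below is a minimal sketch of how one IL(N) model could be built with the CRF++ toolkit. The column layout, the choice of feature columns and the file names are illustrative assumptions on our part; the paper does not publish its exact CRF++ templates.

```python
# Minimal sketch (our illustration, not the authors' released configuration)
# of training one IL(N) classifier, e.g. the HI-classifier, with CRF++.
# CRF++ reads a whitespace-separated column file, one token per line, with
# the gold tag in the last column and a blank line between utterances.
import subprocess

# Hypothetical rows: token, has_symbol, is_link, has_digit, has_suffix, tag.
rows = [
    ("doctor", "0", "0", "0", "0", "en"),
    ("saaf",   "0", "0", "0", "0", "hi"),
    ("n8",     "0", "0", "1", "0", "R"),
    ("?",      "1", "0", "0", "0", "X"),
]
with open("train.data", "w") as f:
    for row in rows:
        f.write(" ".join(row) + "\n")
    f.write("\n")  # blank line ends the utterance

# Unigram templates over the feature columns; %x[0,c] reads column c of the
# current token.  We omit the tag-bigram template ("B") since the paper
# states the system labels tokens without contextual information.
with open("template", "w") as f:
    f.write("U00:%x[0,0]\nU01:%x[0,1]\nU02:%x[0,2]\nU03:%x[0,3]\nU04:%x[0,4]\n")

subprocess.run(["crf_learn", "template", "train.data", "hi.model"], check=True)
subprocess.run(["crf_test", "-m", "hi.model", "test.data"], check=True)
```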
4.1.1 Character n-grams

Recent studies [6],[8] have shown that character n-gram features can produce reasonable success in the language identification problem. Therefore, following them, we also used character n-grams as features in our system. Keeping the average token length of the training set in mind, we decided to consider n-grams up to 6-grams. Besides the n-grams, the entire token was also used as a feature. However, due to the fixed length feature vector constraint, we capped the maximum token length at 10 characters for generating the character n-grams: if the length of a particular token is greater than 10, only the first 10 characters of that token are used to generate the n-grams and the remaining characters are ignored. Thus, irrespective of the token length, the system always generates a total of 46 n-gram features: the token itself, 10 unigrams, 9 bigrams, 8 trigrams, 7 four-grams, 6 five-grams and 5 six-grams.
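The n-gram extraction above is easy to express in code. Below is a minimal sketch; padding short tokens with a placeholder character is our assumption, since the paper fixes the feature count at 46 but does not say how tokens shorter than 10 characters are handled.

```python
def char_ngram_features(token: str, max_len: int = 10, max_n: int = 6):
    """Generate the fixed-size n-gram feature vector of Section 4.1.1:
    the token itself plus all 1- to 6-grams of its first 10 characters.
    Padding short tokens with '_' is our assumption, to keep 46 slots."""
    s = token[:max_len].ljust(max_len, "_")    # truncate or pad to length 10
    feats = [token]                            # the whole token as a feature
    for n in range(1, max_n + 1):              # 1-grams ... 6-grams
        feats.extend(s[i:i + n] for i in range(max_len - n + 1))
    return feats                               # 1 + 10+9+8+7+6+5 = 46 features

assert len(char_ngram_features("identification")) == 46
```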
4.1.2 Symbol character

A token might start with a symbol, e.g. #aapstorm, @timesnow, or contain symbols within, e.g. a***a, bari-r. Sometimes the entire token consists of a symbol, e.g. ", ?. We therefore used a binary feature:

  has_symbol(token) = 1 if the token contains a symbol, 0 otherwise

4.1.3 Links

This binary feature marks whether a token is a link, i.e. whether it starts with "http://", "https://" or "www.":

  is_link(token) = 1 if the token is a link, 0 otherwise

4.1.4 Presence of Digit

In chat dialogue, digits in a word often mean something different from their traditional use; for example, 'n8' could mean 'night', and '2' could mean 'to' or 'too'. We also found that in most such cases the word contains a numerical digit in a single position. Therefore, our system uses the presence of a digit in an alphanumeric word as a binary feature:

  has_digit(token) = 1 if the token contains a numerical digit, 0 otherwise

4.1.5 Word suffix

It is an established fact that language dependent features increase the accuracy of language identification systems for that particular language. Fixed length suffix features have also been studied recently and were successfully used by [2] in a Bangla named entity recognition task. Following these findings, we created a small set of the most frequent suffixes of the en words present in the training dataset using our own automated suffix extractor algorithm. The most frequent en suffixes extracted in this way were -ed, -ly, -'s, -'t, -'ll and -'ing, and the presence of each of these suffixes was marked as a binary feature in the classifiers:

  has_suffix(token) = 1 if the token has the suffix, 0 otherwise
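These four indicator features reduce to simple string tests, sketched below; the exact symbol inventory used by the system is not specified in the paper, so treating every non-alphanumeric character as a symbol is our assumption.

```python
import re

EN_SUFFIXES = ("ed", "ly", "'s", "'t", "'ll", "'ing")  # from Section 4.1.5

def has_symbol(token: str) -> int:
    # 1 if the token contains any non-alphanumeric character (e.g. #, @, *, -);
    # which characters count as symbols is our assumption.
    return int(bool(re.search(r"[^A-Za-z0-9]", token)))

def is_link(token: str) -> int:
    # 1 if the token starts like a URL, as defined in Section 4.1.3.
    return int(token.startswith(("http://", "https://", "www.")))

def has_digit(token: str) -> int:
    # 1 if the token contains a numerical digit (e.g. 'n8', '2').
    return int(any(ch.isdigit() for ch in token))

def has_suffix(token: str, suffix: str) -> int:
    # One binary feature per frequent English suffix.
    return int(token.lower().endswith(suffix))

# e.g. has_symbol("#aapstorm") == 1, is_link("www.example.com") == 1,
#      has_digit("n8") == 1, has_suffix("quickly", "ly") == 1
```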
4.2 Voting Approach

Once the outputs of all the classifiers are gathered, a voting mechanism is applied to decide the final label of each token. The voting approach is based on the rules listed below.

4.2.1 No conflict situation

This case is straightforward: there is no conflict between the outputs of the eight IL(N) classifiers for a token, i.e. all the IL(N) classifiers agree on its tag.

Rule 1: This rule applies only to the EN and X tags. If the output of all the classifiers for a particular token is the same and is either EN or X, then that tag is chosen as the final tag for the token. For example, the token #aapsweep is labeled as X by all eight classifiers, so its final tag becomes X.

  #aapsweep X X X X X X X X ⇒ X

Rule 2: If all the tags are the same but other than EN or X, then we take the output of the ALL-classifier for the token as the final tag. This situation only occurs when all eight IL(N) classifiers identify the token as R. In the following example, the token 'saaf' is marked as R by all eight IL(N) classifiers. Since the label generated by the ALL-classifier for 'saaf' is HI, the final tag of the token becomes HI.

  saaf R R R R R R R R ⇒ HI

4.2.2 Conflict between two tags

In this scenario, the outputs of all the classifiers for a given token are limited to two tags. Based on the tags involved, this situation is further divided into the subcategories discussed below.

Rule 3: If the conflict is between R and any other language tag including EN, then the tag other than R is selected as the final tag of the token. In the following example, the token doctor is marked as either R or EN by the language classifiers. Therefore, the final tag of doctor is EN.

  doctor R R R R R R EN EN ⇒ EN

Rule 4: If the classifiers differ between two tags other than R, then the votes in support of each of the two tags are counted, and the tag with the maximum votes is assigned as the final tag of the token. In the following example, the number of votes in favor of the EN tag for the token take is greater than the number of votes supporting BN.

  take BN EN EN EN EN EN EN EN ⇒ EN

4.2.3 Conflict between three tags

Rule 5: If the conflict involves a) R, b) EN or X and c) any of the eight Indian language tags, then we first replace all the R tags with the Indian language tag involved in the conflict, thus reducing the three-tag conflict to a two-tag conflict, and then apply Rule 4 to decide the final tag. For example:

  ore BN EN R R R R R R
  ⇓
  ore BN EN BN BN BN BN BN BN ⇒ BN

Rule 6: If the conflict involves three tags and none of them is R, then simple majority voting is applied to choose the final tag.

4.2.4 Conflict between more than three tags

Rule 7: If there is disagreement among more than three tags for a single token, the final label of that token is decided by the ALL-classifier. Such cases are very rare.

4.3 Handling NE and MIX tags

Since we did not include any feature specifically to handle the NE or MIX tokens, we depend entirely on the ALL-classifier to mark them. If a token is marked as NE by the ALL-classifier, the final tag of the token becomes NE, irrespective of the outputs of the eight language classifiers for the same token. The same procedure is applied to mark MIX tokens.
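The seven rules and the NE/MIX override can be combined into a single decision function. The sketch below follows the rule order of Section 4.2; falling back to the ALL-classifier in edge cases the paper does not enumerate is our assumption.

```python
from collections import Counter

def final_tag(il_tags, all_tag):
    """Decide the final label of one token from the eight IL(N) outputs
    (values such as 'BN', 'EN', 'X' or 'R') and the ALL-classifier output.
    The rule order follows Section 4.2; falling back to the ALL-classifier
    on unlisted edge cases is our assumption."""
    # Section 4.3: NE and MIX are decided by the ALL-classifier alone.
    if all_tag in ("NE", "MIX"):
        return all_tag

    distinct = set(il_tags)
    if len(distinct) == 1:                      # no conflict
        tag = next(iter(distinct))
        if tag in ("EN", "X"):
            return tag                          # Rule 1: unanimous EN or X
        return all_tag                          # Rule 2: unanimous R

    if len(distinct) == 2:
        if "R" in distinct:                     # Rule 3: prefer the non-R tag
            return next(iter(distinct - {"R"}))
        return Counter(il_tags).most_common(1)[0][0]  # Rule 4: majority

    if len(distinct) == 3:
        indic = distinct - {"R", "EN", "X"}
        if "R" in distinct and indic:
            # Rule 5: rewrite R as the Indian language tag, then apply Rule 4.
            rep = next(iter(indic))
            il_tags = [rep if t == "R" else t for t in il_tags]
        return Counter(il_tags).most_common(1)[0][0]  # Rule 6: majority

    return all_tag                              # Rule 7: more than three tags

# e.g. final_tag(["R"] * 8, "HI") == "HI"                  (Rule 2)
#      final_tag(["BN"] + ["EN"] * 7, "EN") == "EN"        (Rule 4)
#      final_tag(["BN", "EN"] + ["R"] * 6, "BN") == "BN"   (Rule 5)
```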
5. RESULT AND ERROR ANALYSIS

Table 2 presents the results obtained by our language identification system for the categories other than Language. As the table shows, our system achieved its best accuracy of 0.9293 for the punctuation category, whereas the results for the MIX category are too low to report: out of 24 MIX-tagged tokens only 2 are correct (precision and recall values for the MIX category were not provided by the organizers). Even for NEs the accuracy is low at 0.4136 compared to the punctuation category; still, according to the task organizers it is the best score obtained in the NE category among all the teams that participated in the subtask. It is to be noted that our system did not mark any token with the O category.

Table 2: Token level accuracy category-wise

  Category      Precision  Recall  F-measure
  Punctuation   0.8883     0.9742  0.9293
  Named Entity  0.3316     0.5494  0.4136
  Mix           -          -       -

In the Language category, the maximum F-measure, 0.7838, is achieved for bn tokens, while en tokens have the highest precision (0.9506). The accuracy is quite low for the Gujrati, Malayalam and Kannada languages (shown in Table 3). On examining the results, we observed that this is due to the smaller number of tokens present in the development set for these three languages. For example, the number of gu tokens in the development set is only 890, which is very few compared to the 17957 en tokens.

Table 3: Token level language accuracy

  Category   Precision  Recall  F-measure
  Bengali    0.75       0.8208  0.7838
  English    0.9506     0.6147  0.7466
  Gujrati    0.1622     0.3704  0.2256
  Hindi      0.5        0.8186  0.6208
  Kannada    0.2876     0.7713  0.419
  Malayalam  0.1991     0.6667  0.3067
  Marathi    0.5815     0.7586  0.6584
  Tamil      0.7514     0.7757  0.7633
  Telugu     0.3473     0.657   0.4544

Overall, our system achieved an accuracy (weighted F-measure) of 0.700373312. Out of 11999 tokens in the test set, 8582 tokens were marked correctly. However, as our system did not consider any contextual information, the accuracy achieved at the utterance level was, expectedly, very low at 0.128788: only on 102 occasions were all the tokens of an entire utterance labeled with the correct tags. A more detailed analysis of the results can be done once the gold standard data is shared by the task organizers.
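For reference, the per-category scores in Tables 2 and 3 follow the standard token-level precision/recall/F-measure definitions. The sketch below shows that computation; the official evaluation was run by the organizers' script, so this is only an approximation of their procedure.

```python
from collections import Counter

def per_category_prf(gold, pred):
    """Token-level precision, recall and F-measure per tag, under the
    standard definitions (assumed to match the organizers' script):
    P = correct/predicted, R = correct/gold, F = 2PR/(P+R)."""
    correct, gold_n, pred_n = Counter(), Counter(gold), Counter(pred)
    for g, p in zip(gold, pred):
        if g == p:
            correct[g] += 1
    scores = {}
    for tag in gold_n | pred_n:
        p = correct[tag] / pred_n[tag] if pred_n[tag] else 0.0
        r = correct[tag] / gold_n[tag] if gold_n[tag] else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        scores[tag] = (p, r, f)
    return scores

def token_accuracy(gold, pred):
    # e.g. 8582 correctly marked tokens out of 11999 on the test set.
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)
```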
6. CONCLUSIONS AND FUTURE WORK

In this paper, we have presented a brief overview of our hybrid approach to the automatic WLL identification problem. We observed that voting over the outputs of multiple classifiers provides better results than a single classifier system. For our participation in the Query Word Labeling subtask, we submitted two runs: Run1 using the system described above, and Run2 using only the ALL-classifier without any voting mechanism. The obtained results confirm that the overall accuracy of Run1 is more than 10% higher than that of Run2.

As future work, we would like to explore more sophisticated features to handle the NE and O tags and better post-processing heuristics for handling MIX tags in the WLL identification task, and to improve the performance of the system by using context modelling. We also plan to incorporate more language specific features to improve the accuracy of the system.

7. ACKNOWLEDGMENTS

We acknowledge the support of the Department of Electronics and Information Technology (DeitY), Government of India, through the project "CLIA System Phase III". The research work of the second last author was carried out in the framework of the WIQ-EI IRSES project (Grant No. 269180) within the FP7 Marie Curie programme, the DIANA-APPLICATIONS (TIN2012-38603-C02-01) project and the VLC/CAMPUS Microcluster on Multimodal Interaction in Intelligent Systems.

8. REFERENCES

[1] S. Banerjee, A. Kuila, A. Roy, S. Naskar, P. Rosso, and S. Bandyopadhyay. A hybrid approach for transliterated word-level language identification: CRF with post-processing heuristics. In FIRE. ACM Digital Publishing, 2014.
[2] S. Banerjee, S. Naskar, and S. Bandyopadhyay. Bengali named entity recognition using margin infused relaxed algorithm. In TSD, pages 125-132. Springer International Publishing, 2014.
[3] U. Barman, J. Wagner, G. Chrupala, and J. Foster. Identification of languages and encodings in a multilingual document. page 127. EMNLP, 2014.
[4] K. R. Beesley. Language identifier: A computer program for automatic natural-language identification of on-line text. pages 47-54. ATA, 1988.
[5] M. Carpuat. Mixed-language and code-switching in the Canadian Hansard. page 107. EMNLP, 2014.
[6] G. Chittaranjan, Y. Vyas, K. Bali, and M. Choudhury. Word-level language identification using CRF: Code-switching shared task report of MSR India system. pages 73-79. EMNLP, 2014.
[7] A. Das and B. Gambäck. Code-mixing in social media text: the last language identification frontier? Traitement Automatique des Langues (TAL): Special Issue on Social Networks and NLP, 54(3):41-64, 2014.
[8] B. King and S. Abney. Labeling the languages of words in mixed-language documents using weakly supervised methods. pages 1110-1119. NAACL-HLT, 2013.
[9] C. Lignos and M. Marcus. Toward web-scale analysis of codeswitching. In Annual Meeting of the Linguistic Society of America, 2013.
[10] A. K. Singh and J. Gorla. Identification of languages and encodings in a multilingual document. In ACL-SIGWAC's Web As Corpus 3, page 95. Presses univ. de Louvain, 2007.