Adaptive Voting in Multiple Classifier Systems for Word Level Language Identification

Soumik Mandal, Jadavpur University, India (mandal.soumik@gmail.com)
Somnath Banerjee, Jadavpur University, India (sb.cse.ju@gmail.com)
Sudip Kumar Naskar, Jadavpur University, India (sudip.naskar@cse.jdvu.ac.in)
Paolo Rosso, UPV, Spain (prosso@dsic.upv.es)
Sivaji Bandyopadhyay, Jadavpur University, India (sivaji_cse_ju@yahoo.com)

ABSTRACT

In social media communication, code switching has become quite a common phenomenon, especially among multilingual speakers. Automatic language identification becomes both a necessary and a challenging task in such an environment. In this work, we describe a CRF-based system with a voting approach for word-level labeling of code-mixed queries, developed as part of our participation in the shared task on Mixed Script Information Retrieval at the Forum for Information Retrieval Evaluation (FIRE) in 2015. Our method uses character n-grams, simple lexical features and special character features, and can therefore easily be replicated across languages. The performance of the system was evaluated against the test sets provided by the FIRE 2015 shared task on mixed script information retrieval. Experimental results show encouraging performance across the language pairs.

CCS Concepts

•Computer systems organization → Embedded systems; Redundancy; Robotics; •Networks → Network reliability;

Keywords

ACM proceedings; LaTeX; text tagging

1. INTRODUCTION

Though South and South East Asian languages have their own indigenous scripts, these languages are mostly written using Roman script in social media such as tweets, blogs, etc., due to various socio-cultural and technological reasons. The use of Roman script for such languages presents serious challenges to understanding, search and language identification. The abundant use of Roman script on the Web, both in documents and in the user queries used to search those documents, therefore needs to be addressed. Although language identification at the document level is a well-studied natural language processing problem [4], the problem of labeling the language of individual words within a multilingual document was addressed in [10], [8], which proposed language identification at the word level in mixed-language documents instead of at the sentence level. Recently, the language identification problem in code-mixed data was revisited at the First Workshop on Computational Approaches to Code Switching at EMNLP-2014. It was observed there that fine-grained language identification involving more than one language is still very challenging and error prone when the spans of text are small. Unsupervised and supervised approaches were investigated for word-level detection in code-switching data for four language pairs: Spanish-English, Modern Standard Arabic and Arabic dialects, Chinese-English and Nepali-English. The results of the task revealed that language identification in code-switching is still far from solved and warrants further natural language processing research. Shared tasks on language identification have been organized at FIRE since 2013, and various attempts [6],[7],[1],[3],[5],[9] have been made to address the language identification task.

2. TASK DEFINITION

A query or utterance q = <w1 w2 w3 ... wn> is written in Roman script. The words or tokens w1, w2, w3, etc. can be standard English (en) words or transliterations from any of the eight Indian languages under consideration in this subtask, namely Bengali (bn), Hindi (hi), Gujarati (gu), Kannada (kn), Malayalam (ml), Marathi (mr), Tamil (ta) and Telugu (te). The main objective of the task is to perform word-level language identification (WLL), i.e. to label each token with a single tag belonging to one of the five categories shown in Table 1. Though some of the categories also have finer subcategories, identifying these subcategories is not mandatory.

3. DATA

This section describes the training and test datasets provided to the task participants by the task organizers. The training dataset was supplied as a set of sentences with a tag for each token of each sentence. It consists of 2908 utterances, whereas the test dataset contains 792. Apart from the dataset provided by the task organizers, we did not use any external dataset or resources to train or fine-tune our system.

An empirical study of the development data reveals the following facts: a) the average token length is greater than 5, and b) the majority of the tokens belong to the English language.


Table 1: Tagset of different categories

  Category       Possible Tags                           Subcategory
  Language       en, bn, gu, hi, kn, ml, mr, ta or te    -
  Named Entity   NE                                      Person (NE_P), Location (NE_L), Organization (NE_O),
                                                         Abbreviation (NE_PA, NE_LA), Inflectional form (NE_Ls,
                                                         where Ls is the language of the suffix), or none of the
                                                         above (NE_X)
  Mixed          MIX                                     MIX_Lr_Ls, where Lr and Ls are the root and suffix
                                                         languages respectively
  Punctuation    X                                       -
  Others         O                                       -


4. SYSTEM DESCRIPTION

Our word-level identification process involves three steps. First, we independently apply multiple classifiers, all developed using CRF. Then, a voting approach is employed over the outputs of the classifiers from the first step. Finally, we employ a classifier that deals with the NE and MIX tags. We also resolve the conflict situations that arise during voting (discussed in Section 4.2).
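Before detailing each component, a minimal bird's-eye sketch of the pipeline may help. The classifier objects, their predict() method and the final_tag helper (sketched after Section 4.3) are our placeholders for illustration; they are not part of the original system description.

# Sketch of the three-step pipeline. The classifier objects and their
# predict() interface are placeholders for the CRF models of Section 4.1;
# final_tag refers to the voting/override sketch given after Section 4.3.

def label_utterance(tokens, il_classifiers, all_classifier):
    # Step 1: each CRF labels the whole utterance, giving one tag sequence
    # per classifier (eight IL(N) sequences plus the ALL-classifier one).
    il_outputs = [clf.predict(tokens) for clf in il_classifiers]
    all_output = all_classifier.predict(tokens)
    # Steps 2 and 3: per token, vote over the eight IL(N) outputs, then let
    # the ALL-classifier override NE and MIX decisions.
    return [final_tag([seq[i] for seq in il_outputs], all_output[i])
            for i in range(len(tokens))]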

4.1 WLL Classification Features

We developed nine classifiers in total. Eight different IL(N) classification models, where N = BN, GU, HI, KN, ML, MR, TA, TE, were built for the eight Indian languages (ILs), namely the BN-classifier, GU-classifier, HI-classifier, KN-classifier, ML-classifier, MR-classifier, TA-classifier and TE-classifier. While training an IL(N) classifier, tokens of the types NE, MIX, Others and all ILs other than N were assigned the R tag. The output of an IL(N) classifier can thus be one of four tags: i) N, ii) X (for punctuation), iii) en (for English) and iv) R (for any IL other than N, as well as for NE, MIX and Others tokens). Apart from the eight IL(N) classifiers, we trained another classifier (the ALL-classifier) using all the tags present in the supplied training dataset. The ALL-classifier deals with the NE, MIX and Others tokens and also serves as a tie breaker (discussed in Section 4.2).

In this work, Conditional Random Fields (CRF) were employed to build all of the classifier models. We used the CRF++ toolkit, a simple, customizable, open source implementation of CRF. All nine classifiers used the same set of features, listed in the following subsections.
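The paper gives no implementation details for this step, so the following is a minimal sketch of how the training data for one IL(N) classifier could be prepared, assuming the standard CRF++ column format; the helper names are ours, not the paper's.

# Sketch: building the training data for one IL(N) classifier. Assumes the
# standard CRF++ data format (one "token<TAB>tag" pair per line, utterances
# separated by blank lines).

KEEP_TAGS = {"en", "X"}   # tags an IL(N) classifier predicts directly

def relabel_for_il(tag: str, target_il: str) -> str:
    """Collapse NE, MIX, Others and all non-target ILs into the rest tag R."""
    return tag if tag == target_il or tag in KEEP_TAGS else "R"

def write_il_training_file(utterances, target_il, path):
    """utterances: one list of (token, tag) pairs per utterance."""
    with open(path, "w", encoding="utf-8") as out:
        for utterance in utterances:
            for token, tag in utterance:
                out.write(f"{token}\t{relabel_for_il(tag, target_il)}\n")
            out.write("\n")   # a blank line terminates an utterance for CRF++

# Each model is then trained and applied with the CRF++ command-line tools:
#   crf_learn template.txt train_bn.data model_bn
#   crf_test  -m model_bn test.data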
4.1.1 Character n-grams

Recent studies [6],[8] have shown that character n-gram features can produce reasonable success in the language identification problem. Following them, we also used character n-grams as features in our system. Keeping in mind the average token length of the training set, we decided to consider n-grams of up to length 6. Besides the n-grams, the entire token was also used as a feature. However, due to the fixed-length feature vector constraint, we capped the token length at 10 characters when generating the character n-grams: if a token is longer than 10 characters, only its first 10 characters are used to generate the n-grams and the remaining characters are ignored. Thus, irrespective of the token length, the system always generates a total of 46 n-gram features, i.e. the token itself, 10 unigrams, 9 bigrams, 8 trigrams, 7 four-grams, 6 five-grams and 5 six-grams.
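To make the fixed-length scheme concrete, a short sketch follows. The padding symbol is our assumption; the paper fixes only the total of 46 features.

# Sketch of the fixed-length character n-gram features (Section 4.1.1):
# the token is truncated (or padded) to 10 characters, and the full token
# plus all 1- to 6-grams are emitted: 1 + 10 + 9 + 8 + 7 + 6 + 5 = 46
# features. The padding symbol "_" is our assumption.

MAX_LEN, MAX_N, PAD = 10, 6, "_"

def char_ngram_features(token: str) -> list:
    features = [token]                              # the entire token
    chars = token[:MAX_LEN].ljust(MAX_LEN, PAD)     # cut or pad to 10 chars
    for n in range(1, MAX_N + 1):                   # unigrams .. six-grams
        features.extend(chars[i:i + n] for i in range(MAX_LEN - n + 1))
    return features

assert len(char_ngram_features("doctor")) == 46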
4.1.2 Symbol character

A token might start with a symbol, e.g. #aapstorm, @timesnow, or contain such symbols within, e.g. a***a, bari-r. Sometimes the entire token consists of symbols, e.g. ", ?.

  has_symbol(token) = 1 if the token contains a symbol, and 0 otherwise.

4.1.3 Links

This is a binary feature: if a token is a link, i.e. if it starts with "http://", "https://" or "www.", the value is set to 1, otherwise it is set to 0.

  is_link(token) = 1 if the token is a link, and 0 otherwise.

4.1.4 Presence of Digit

In chat dialogue, digits in a word are often used differently from their traditional use. For example, 'n8' can mean 'night', and '2' can mean 'to' or 'too'. We also found that in most such cases the word contains a numerical digit in a single position. Therefore, our system uses the presence of a digit in an alphanumeric word as a binary feature.

  has_digit(token) = 1 if the token contains a numerical digit, and 0 otherwise.

4.1.5 Word suffix

It is an established fact that any language-dependent feature increases the accuracy of language identification systems for that particular language. Fixed-length suffix features have also been studied recently and were successfully used by [2] in the Bangla named entity recognition task. Following these observations, we created a small set of the most frequent suffixes of the en words present in the training dataset, using our own automated suffix extractor algorithm. The most frequent en suffixes extracted in this way were -ed, -ly, -'s, -'t, -'ll and -ing, and the presence of these suffixes was encoded as a binary feature in the classifiers, i.e.

  has_suffix(token) = 1 if the token ends with one of the suffixes, and 0 otherwise.
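The four binary features of Sections 4.1.2 to 4.1.5 reduce to simple surface checks; the sketch below is one possible reading. In particular, treating any non-alphanumeric character as a "symbol" is our assumption; the paper only gives examples such as #, @, * and -.

# Sketch of the binary surface features of Sections 4.1.2-4.1.5.
import re

EN_SUFFIXES = ("ed", "ly", "'s", "'t", "'ll", "ing")   # from Section 4.1.5

def has_symbol(token):
    return int(bool(re.search(r"[^A-Za-z0-9]", token)))    # e.g. '@timesnow'

def is_link(token):
    return int(token.startswith(("http://", "https://", "www.")))

def has_digit(token):
    return int(any(ch.isdigit() for ch in token))           # e.g. 'n8', '2'

def has_suffix(token):
    return int(token.endswith(EN_SUFFIXES))                 # e.g. 'played'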

4.2 Voting Approach

Once the outputs of all the classifiers are gathered, a voting mechanism is applied to decide the final label of each token. The voting approach is based on the rules listed below.

4.2.1 No conflict situation

This case is straightforward: there is no conflict between the outputs of the eight IL(N) classifiers for a given token, i.e. all the IL(N) classifiers agree on the tag of that token.
Rule 1: This rule applies only to the en and X tags. If the outputs of all the classifiers for a particular token are the same and equal to either en or X, then that tag is chosen as the final tag of the token. For example, the token #aapsweep is labeled X by all eight classifiers, so its final tag becomes X.

  #aapsweep: X X X X X X X X ⇒ X

Rule 2: If all the tags are the same but different from en and X, then we take the output of the ALL-classifier as the final tag of the token. This situation only occurs when all eight IL(N) classifiers identify the token as R. In the following example, the token 'saaf' is marked R by all eight IL(N) classifiers. Since the label generated by the ALL-classifier for 'saaf' is hi, the final tag of the token becomes hi.

  saaf: R R R R R R R R ⇒ hi
4.2.2 Conflict between two tags

In this scenario, the outputs of all the classifiers for a given token are limited to two tags. Based on the tags involved, this situation is further divided into the subcases below.

Rule 3: If the conflict is between R and any other language tag, including en, then the non-R tag is selected as the final tag of the token. In the following example, the token doctor is marked either R or en by the language classifiers; therefore, the final tag of doctor is en.

  doctor: R R R R R R en en ⇒ en

Rule 4: If the classifiers disagree between two tags other than R, then the votes in support of each of the two tags are counted, and the tag with the maximum number of votes is assigned as the final tag of the token. In the following example, the number of votes in favor of en for the token take is greater than the number of votes supporting bn.

  take: bn en en en en en en en ⇒ en

4.2.3 Conflict between three tags

Rule 5: If the conflict involves a) R, b) en or X, and c) one of the eight Indian language tags, then we first replace all the R tags with the Indian language tag involved in the conflict, thereby reducing the three-tag conflict to a two-tag conflict; Rule 4 is then applied to decide the final tag. For example:

  ore: bn en R R R R R R
       ⇓
  ore: bn en bn bn bn bn bn bn ⇒ bn

Rule 6: If the conflict involves three tags and none of them is R, then simple majority voting is applied to choose the final tag.
4.2.4 Conflict between more than three tags

Rule 7: If there is disagreement among more than three tags for a single token, the final label of that token is decided by the ALL-classifier. Such cases are very rare.
4.3 Handling NE and MIX tags

Since we did not include any feature specifically to handle NE or MIX tokens, we depend entirely on the ALL-classifier to mark them. If a token is marked NE by the ALL-classifier, the final tag of the token becomes NE, irrespective of the outputs of the eight language classifiers for that token. The same procedure is applied to mark MIX tokens.
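To illustrate Sections 4.2 and 4.3 together, the following sketch combines the seven voting rules with the ALL-classifier override. The function names are ours, and breaking a Rule 4 tie via the ALL-classifier is our reading of its "tie breaker" role mentioned in Section 4.1, not something the paper states explicitly.

# Sketch of the voting rules (Section 4.2) plus the NE/MIX override
# (Section 4.3). il_tags: the 8 IL(N) outputs for one token; all_tag: the
# ALL-classifier output for the same token.
from collections import Counter

def final_tag(il_tags, all_tag):
    # Section 4.3: the ALL-classifier alone decides NE and MIX tokens.
    if all_tag in ("NE", "MIX"):
        return all_tag

    distinct = set(il_tags)
    if len(distinct) == 1:                          # no conflict (4.2.1)
        tag = il_tags[0]
        return tag if tag in ("en", "X") else all_tag    # Rules 1 and 2

    if len(distinct) == 2:                          # two tags (4.2.2)
        if "R" in distinct:
            return (distinct - {"R"}).pop()         # Rule 3: prefer non-R
        (t1, c1), (t2, c2) = Counter(il_tags).most_common(2)
        return t1 if c1 > c2 else all_tag           # Rule 4; tie -> ALL

    if len(distinct) == 3:                          # three tags (4.2.3)
        ils = [t for t in distinct if t not in ("R", "en", "X")]
        if "R" in distinct and len(ils) == 1:       # Rule 5: R becomes the IL
            resolved = [ils[0] if t == "R" else t for t in il_tags]
            return final_tag(resolved, all_tag)     # now a two-tag conflict
        return Counter(il_tags).most_common(1)[0][0]     # Rule 6: majority

    return all_tag                                  # Rule 7: > 3 distinct tags

# Examples from the paper:
assert final_tag(["R"] * 8, "hi") == "hi"                  # 'saaf'
assert final_tag(["bn"] + ["en"] * 7, "en") == "en"        # 'take'
assert final_tag(["bn", "en"] + ["R"] * 6, "bn") == "bn"   # 'ore'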


5. RESULT AND ERROR ANALYSIS

Table 2 presents the results obtained by our language identification system for the categories other than Language. As the table shows, our system achieved its best accuracy, 0.9293, on the punctuation category, whereas the results for the MIX category are too low to report: out of 24 MIX-tagged tokens, only 2 are correct (precision and recall values for the MIX category were not provided by the organizers). The accuracy on NEs, 0.4136, is also low compared to that of the punctuation category; still, according to the task organizers, it is the best score obtained in the NE category among all the teams that participated in the subtask. Notably, our system did not assign the O category to any token.

Table 2: Token level accuracy category-wise

  Category       Precision   Recall   F-measure
  Punctuation    0.8883      0.9742   0.9293
  Named Entity   0.3316      0.5494   0.4136
  Mix            -           -        -

In the Language category, the highest F-measure, 0.7838, is achieved for bn tokens, while en tokens have the highest precision (0.9506). The accuracy is quite low for the Gujarati, Malayalam and Kannada languages (shown in Table 3). We examined the results and observed that this is due to the small number of tokens of these three languages in the development set. For example, the development set contains only 890 gu tokens, which is very few compared to the 17957 en tokens.

Table 3: Token level language accuracy

  Language    Precision   Recall   F-measure
  Bengali     0.75        0.8208   0.7838
  English     0.9506      0.6147   0.7466
  Gujarati    0.1622      0.3704   0.2256
  Hindi       0.5         0.8186   0.6208
  Kannada     0.2876      0.7713   0.419
  Malayalam   0.1991      0.6667   0.3067
  Marathi     0.5815      0.7586   0.6584
  Tamil       0.7514      0.7757   0.7633
  Telugu      0.3473      0.657    0.4544

Overall, our system achieved an accuracy (weighted F-measure) of 0.700373312. Out of 11999 tokens in the test set, 8582 were marked correctly. However, as our system did not consider any contextual information, the accuracy achieved at the utterance level was, as expected, very low at 0.128788: only on 102 occasions were all the tokens of an entire utterance labeled with the correct tags. A more detailed analysis of the results can be carried out once the gold standard data is shared by the task organizers.
6. CONCLUSIONS AND FUTURE WORK

In this paper, we presented a brief overview of our hybrid approach to the automatic WLL identification problem. We observed that voting over the outputs of multiple classifiers provides better results than using a single classifier system. For our participation in the Query Word Labeling subtask, we submitted two runs: Run1, using the system described above, and Run2, using only the ALL-classifier without any voting mechanism. The obtained results confirm that the overall accuracy of Run1 is more than 10% higher than that of Run2.

As future work, we would like to explore more sophisticated features to handle the NE and O tags, as well as better post-processing heuristics for handling MIX tags in the WLL identification task, and to improve the performance of the system by using context modelling. We also plan to incorporate more language-specific features to improve the accuracy of the system.

7. ACKNOWLEDGMENTS

We acknowledge the support of the Department of Electronics and Information Technology (DeitY), Government of India, through the project "CLIA System Phase III". The research work of the second last author was carried out in the framework of the WIQ-EI IRSES project (Grant No. 269180) within the FP7 Marie Curie programme, the DIANA-APPLICATIONS project (TIN2012-38603-C02-01) and the VLC/CAMPUS Microcluster on Multimodal Interaction in Intelligent Systems.

8. REFERENCES

[1] S. Banerjee, A. Kuila, A. Roy, S. K. Naskar, P. Rosso, and S. Bandyopadhyay. A hybrid approach for transliterated word-level language identification: CRF with post-processing heuristics. In FIRE. ACM Digital Publishing, 2014.
[2] S. Banerjee, S. Naskar, and S. Bandyopadhyay. Bengali named entity recognition using margin infused relaxed algorithm. In TSD, pages 125-132. Springer International Publishing, 2014.
[3] U. Barman, J. Wagner, G. Chrupała, and J. Foster. DCU-UVT: Word-level language classification with code-mixed data. page 127. EMNLP, 2014.
[4] K. R. Beesley. Language identifier: A computer program for automatic natural-language identification of on-line text. pages 47-54. ATA, 1988.
[5] M. Carpuat. Mixed-language and code-switching in the Canadian Hansard. page 107. EMNLP, 2014.
[6] G. Chittaranjan, Y. Vyas, K. Bali, and M. Choudhury. Word-level language identification using CRF: Code-switching shared task report of MSR India system. pages 73-79. EMNLP, 2014.
[7] A. Das and B. Gambäck. Code-mixing in social media text: The last language identification frontier? Traitement Automatique des Langues (TAL): Special Issue on Social Networks and NLP, 54(3):41-64, 2014.
[8] B. King and S. Abney. Labeling the languages of words in mixed-language documents using weakly supervised methods. pages 1110-1119. NAACL-HLT, 2013.
[9] C. Lignos and M. Marcus. Toward web-scale analysis of codeswitching. In Annual Meeting of the Linguistic Society of America, 2013.
[10] A. K. Singh and J. Gorla. Identification of languages and encodings in a multilingual document. In ACL-SIGWAC's Web As Corpus v3, page 95. Presses univ. de Louvain, 2007.

