=Paper= {{Paper |id=Vol-1737/T3-6 |storemode=property |title= Ensemble Classifier based approach for Code-Mixed Cross-Script Question Classification |pdfUrl=https://ceur-ws.org/Vol-1737/T3-6.pdf |volume=Vol-1737 |authors=Debjyoti Bhattacharjee,Paheli Bhattacharya |dblpUrl=https://dblp.org/rec/conf/fire/BhattacharjeeB16 }} == Ensemble Classifier based approach for Code-Mixed Cross-Script Question Classification== https://ceur-ws.org/Vol-1737/T3-6.pdf
Ensemble Classifier based approach for Code-Mixed Cross-Script Question Classification

[Team IINTU]

Debjyoti Bhattacharjee
School of Computer Science and Engineering
Nanyang Technological University, Singapore
debjyoti001@ntu.edu.sg

Paheli Bhattacharya
Dept. of Computer Science and Engineering
Indian Institute of Technology Kharagpur, West Bengal, India
paheli@iitkgp.ac.in

ABSTRACT

With the increasing popularity of social media, people post updates that aid other users in finding answers to their questions. Most user-generated data on social media are in code-mixed or multi-script form, where words are represented phonetically in a non-native script. We address the problem of Question Classification on social-media data. We propose an ensemble classifier based approach towards question classification when the questions are written in mixed script, specifically, the Roman script for the Bengali language. We separately train Random Forests, One-vs-Rest and k-NN classifiers and then build an ensemble classifier that combines the best of the three. We achieve an accuracy of approximately 82%, suggesting that the method works well for the task.

CCS Concepts

•Information systems → Question answering; •Computing methodologies → Machine learning; Cross-validation;

Keywords

Mixed Script Information Retrieval; Question Answering System; Question classification

1. INTRODUCTION

With the increase in popularity of the Web, users from all over the world now opt to write in their native language instead of English. A large number of South and South-East Asian languages are written in a transliterated form (phonetically representing words in a non-native script) using the Roman script. Such texts are said to be written in Mixed Script. Since there are font-encoding issues in using the original script (for example, Devanagari for Hindi), people tend to transliterate, i.e., phonetically represent the words of the original language using the Roman script. To define Mixed Script Information Retrieval formally [6], we consider a set of natural languages L = {l1, l2, ..., ln} and a set of scripts S = {s1, s2, ..., sn} such that si is the native script for the language li. Given a word w, we represent it as a two-tuple ⟨li, sj⟩ to imply that w is in language li and written using script sj. When i = j, we say that the word is written in its native script; otherwise, it has been transliterated into another script sj. In practice, when textual content is a mixture of words from various languages or scripts or both, it is called Multi-Script (MS) or Code-Mixed. For instance, "Kharagpur theke Howrah cab fare koto?" (gloss: What is the cab fare from Kharagpur to Howrah?) has words from a single script (Roman) but from two different languages - English (cab, fare) and Bengali (theke, koto). The words in Bengali have been transliterated into the Roman script. Intuitively, this is a very easy form of writing, and people who are not as well-versed in English as in their native language tend to use it for conversing on social media. With the rise in popularity of social media, people constantly post updates from their daily lives ranging from, but not limited to, sports and score updates and travel updates to food, hotel, transport and movie reviews, as well as user feedback for Customer Support Systems through tweets and blogs. Although Question Answering (QA) is a well-addressed research problem with systems providing reasonable accuracy, QA on social-media text in mixed script is a challenging problem, mainly because there is no standardization of spellings for words written in a non-native script. For instance, the Bengali word "ekhon" (meaning, now) may have multiple spellings - "akhan", "ekhon", "ekhan", "akon" etc. Categorizing a question into a specific set of classes and then dealing with each class separately is an efficient method for QA systems. This is called Question Classification. Question classification aids in reducing the number of candidate answers and can also be used in effectively determining answer selection strategies [8]. In this paper, we deal with the problem of question classification for multi-script or code-mixed data. We experiment with three machine learning based classifiers - Random Forests, One-vs-Rest and k-NN - and then build an ensemble of these classifiers to achieve a higher accuracy.

2. RELATED WORK

Jamatia et al. [7] experiment with code-mixed English-Hindi social-media text for Part-of-Speech tagging. They use both coarse- and fine-grained tagsets for the task. Four machine learning algorithms are compared - Conditional Random Fields, Sequential Minimal Optimization, Naïve Bayes and Random Forests - with the highest accuracy reported for the Random Forest based classifier. Information Retrieval on multi-script data has also been looked into [6]. Recent works on question classification include a machine learning based approach to question classification [8]. A hierarchical classifier is first used to classify the question into coarse-grained classes and then into fine-grained classes. The feature space consisted of primitive features such as POS tags, chunks and named entities, and also complex features such as conjunctive n-gram features
and relational features. Question-Answering corpus acquisition using social-media content and question acquisition with human involvement have been reported in [2]. In FIRE 2015, the Transliterated Search track introduced three subtasks — language labelling of words in code-mixed text fragments; ad-hoc retrieval of Hindi film lyrics, movie reviews and astrology documents; and transliterated question answering, where the documents as well as the questions were in Bangla script or Roman-transliterated Bangla [4].

3. TASK DESCRIPTION

Question Answering systems are a classic application of natural language processing, where the retrieval task has to find a concise and accurate answer to a given question. Question classification is one of the subtasks of a QA system, required to determine the type of the answer corresponding to a question.

The Code-Mixed Cross-Script Question Classification task can be described as follows. Given a question Q written in Romanized Bengali, which can contain English words and phrases, and a set C = {c1, c2, ..., cn} of question classes, the task is to classify the question Q into one of these predefined classes.

Example:
Question: airport theke howrah station distance koto ?
Question Class: DIST

3.1 Dataset description

The training dataset consists of 330 questions and each question is assigned to a single question class. There are 9 question classes in all and the number of questions in each class is shown in Table 1. The minimum and maximum number of words in a question are 2 and 11 respectively, while each question on average has 5.3 words.

Table 1: Dataset classes and #Q per class
Class   #Q
DIST    24
LOC     26
MISC     5
MNY     26
NUM     45
OBJ     21
ORG     67
PER     55
TEMP    61
Total  330

4. PROPOSED APPROACH

To build a classifier that assigns the questions to the specified classes, we created a vector representation of each question, which is used as input to the classifier. We considered the top 2000 most frequently occurring words in the supplied training dataset as features. Each question is represented as a 2000-element binary vector, with element ei = 1 if the ith most frequent word is present in the question, and 0 otherwise.

We used three separate classifiers, namely Random Forests (RF), One-vs-Rest (OvR) and k-Nearest Neighbour (k-NN), followed by building an ensemble classifier using these three classifiers for the classification task.

In k-NN classification, a sample is classified by a majority vote of its neighbours, the object being assigned to the most common class among its k nearest neighbours (k a positive integer). k-NN is a lazy learning method which defers computation until classification is performed, and is one of the simplest classifiers.

The One-vs-Rest strategy fits one classifier per class, with each classifier trained for its class against all the other classes. The approach allows gaining information about each class by inspecting the classifier trained for that class. In OvR, each classifier is trained with the entire data set, while in RF, samples drawn from the original data set are used for training.

A Random Forest is an ensemble learning method that can be used for classification [3]. A Random Forest fits a number of decision trees on various sub-samples of the dataset, with the samples drawn from the original dataset with or without replacement. Random Forests overcome the problem of decision trees overfitting their training set [5].

Using the above three classifiers, we built an ensemble classifier (EC). The ensemble classifier takes the output labels of the individual classifiers and gives the majority label as output; if no majority exists, one of the predicted labels is chosen at random. Each of the individual classifiers is trained on a subset of the original training dataset, obtained by sampling with replacement.

In the following section, we describe the details of the implementation of the classifiers and the obtained results.

5. EXPERIMENTAL SETUP AND RESULTS

We implemented the proposed approach using Python 3 and used the scikit-learn toolkit for the classifiers. The following instantiations were used for the first three classifiers, which are available in scikit-learn. We implemented the ensemble classifier on our own.

rf = RandomForestClassifier(n_estimators=100)
ovr = OneVsRestClassifier(LinearSVC(random_state=0))
clf = neighbors.KNeighborsClassifier(30, weights='uniform')

We split the labelled data set into two parts — training set (90%) and validation set (10%). The RF classifier performed the best, followed by EC, OvR and k-NN in decreasing order of classification accuracy. Thereafter, we used these trained classifiers for classifying the test data set. During classification, we marked the samples for which all four classifiers predicted the same label. We used these samples, in addition to the original labelled data set, for retraining the classifiers.

The results on the test data set for the classifiers RF, EC and OvR were submitted as final runs and are summarized in Table 2 and Table 3. The classification results of the k-NN classifier were not submitted as a run and hence the accuracy of its results is not available.

6. CONCLUSION AND FUTURE WORK

In this paper we have addressed the problem of question classification for Bengali-English code-mixed social-media data. We have experimented with three machine learning based classifiers - Random Forests, One-vs-Rest and k-NN - and then built an ensemble of these classifiers to achieve the best results. The method is scalable to other code-mixed languages mainly because it does not perform any language
or script-based feature engineering.

We would like to experiment with other multi-script data where more than two languages have been mixed. We aim to apply other machine learning algorithms with more linguistic and syntactic features.

Table 2: Results of individual classes
Classifier  Class  I   IC  P     R     F-1
EC          PER    24  20  0.83  0.74  0.78
RF                 25  21  0.84  0.77  0.80
OvR                23  19  0.82  0.70  0.76
EC          LOC    26  21  0.80  0.91  0.85
RF                 26  22  0.84  0.95  0.89
OvR                26  21  0.80  0.91  0.85
EC          ORG    36  19  0.52  0.79  0.63
RF                 34  19  0.55  0.79  0.65
OvR                40  19  0.47  0.79  0.59
EC          NUM    30  26  0.86  1     0.92
RF                 29  26  0.89  1     0.94
OvR                29  26  0.89  1     0.94
EC          TEMP   25  25  1     1     1
RF                 25  25  1     1     1
OvR                25  25  1     1     1
EC          MONEY  16  13  0.81  0.81  0.81
RF                 16  13  0.81  0.81  0.81
OvR                12  12  1     0.75  0.85
EC          DIST   20  20  1     0.95  0.97
RF                 20  20  1     0.95  0.97
OvR                22  21  0.95  1     0.97
EC          OBJ     3   3  1     0.3   0.46
RF                  5   4  0.8   0.4   0.53
OvR                 3   3  1     0.3   0.46
EC          MISC    0   0  NA    NA    NA
RF                  0   0  NA    NA    NA
OvR                 0   0  NA    NA    NA

Table 3: Overall Results
Classifier  Correct  Incorrect  Accuracy
EC          147      33         81.66
RF          150      30         83.33
OvR         146      34         81.11

7. REFERENCES
[1] S. Banerjee, K. Chakma, S. K. Naskar, A. Das, P. Rosso, S. Bandyopadhyay, and M. Choudhury. Overview of the Mixed Script Information Retrieval (MSIR) at FIRE. In Working notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings. CEUR-WS.org, 2016.
[2] S. Banerjee, S. K. Naskar, P. Rosso, and S. Bandyopadhyay. The first cross-script code-mixed question answering corpus. In Modelling, Learning and Mining for Cross/Multilinguality Workshop, 38th European Conference on Information Retrieval (ECIR), pages 56-65, 2016.
[3] Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1-127, 2009.
[4] M. Choudhury, P. Gupta, P. Rosso, S. Kumar, S. Banerjee, S. K. Naskar, S. Bandyopadhyay, G. Chittaranjan, A. Das, and K. Chakma. Overview of FIRE-2015 shared task on mixed script information retrieval.
[5] J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical Learning, volume 1. Springer Series in Statistics. Springer, Berlin, 2001.
[6] P. Gupta, K. Bali, R. E. Banchs, M. Choudhury, and P. Rosso. Query expansion for mixed-script information retrieval. In The 37th Annual ACM SIGIR Conference, pages 677-686, 2014.
[7] A. Jamatia, B. Gambäck, and A. Das. Part-of-speech tagging for code-mixed English-Hindi Twitter and Facebook chat messages. In 10th Recent Advances in Natural Language Processing (RANLP), pages 239-248, 2015.
[8] X. Li and D. Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics - Volume 1, pages 1-7. Association for Computational Linguistics, 2002.
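
To make the pipeline of Sections 4 and 5 concrete, the sketch below shows the binary bag-of-words vectorization and the majority-vote ensemble in plain Python. This is a minimal illustration under the paper's description, not the authors' implementation: the function names are ours, and in practice the individual label predictions would come from the scikit-learn RF, OvR and k-NN classifiers instantiated in Section 5.

```python
import random
from collections import Counter

def build_vocabulary(questions, size=2000):
    """Rank words by frequency over the training questions and keep the top `size`."""
    counts = Counter(w for q in questions for w in q.lower().split())
    return [w for w, _ in counts.most_common(size)]

def vectorize(question, vocabulary):
    """Binary feature vector: element i is 1 iff the i-th most frequent
    training word occurs in the question, 0 otherwise (Section 4)."""
    words = set(question.lower().split())
    return [1 if w in words else 0 for w in vocabulary]

def ensemble_predict(labels, rng=random):
    """Majority vote over the individual classifiers' output labels;
    when no label has a majority, one of the tied labels is chosen at random."""
    counts = Counter(labels).most_common()
    _, best_count = counts[0]
    tied = [label for label, c in counts if c == best_count]
    return tied[0] if len(tied) == 1 else rng.choice(tied)
```

For example, `ensemble_predict(["DIST", "DIST", "LOC"])` returns "DIST", while a three-way disagreement is resolved by a random choice, matching the tie-breaking behaviour described in Section 4.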