Code Mixed Cross Script Question Classification

Anuj Saini
Sapient Global Markets, Gurgaon, Haryana
asaini13@sapient.com

ABSTRACT

With the growth of our society, one of the most affected aspects of our routine life is language. We tend to mix more than one language in our conversations, and mixing a regional language with English is a very common practice. This mixing of languages is referred to as code mixing, where different linguistic constituents such as phrases, proper nouns, and morphemes are combined to produce a code mixed script. With the exponential growth of social media, we use code mixed cross script more and more in our conversations on Facebook, WhatsApp, or Twitter. At the same time, this language should be understood by automated question answering systems, one of the most important applications of AI; yet while the trend is towards code mixed languages, current work is built around a single language. At FIRE 2016, as part of the shared task CMCS (Code Mixed Cross Script Question Classification), we worked on the problem of classifying a code mixed question into 9 given classes. The shared task is focused on Indian regional languages; we worked on Bengali-English code mixed cross script question classification. As the script used in the training data is English only, all Bengali text was also written in English script. We used machine learning for question classification, with an ensemble-based Random Forest algorithm. Since this is a code-mixed script, traditional NLP components may not work well, so we built a custom solution with our own set of features for classification.

CCS Concepts

• Theory of computation → Random Forest
• Computing methodologies → Natural Language Processing

Keywords

Question Answering; Machine Learning; Code-Mixing; Code-Switching; Classification; Random Forest; TFIDF; Stemming; Natural Language Processing.

1. INTRODUCTION

According to the Census of India of 2001, India has a total of 122 major languages and 1599 other languages [1]. With the advancement of technology and social media, languages are now getting mixed with one another. A big chunk of the population of India is bilingual or even trilingual, with Hindi and English, along with the Dravidian languages, being the most spoken languages. Code mixing is defined as the mixing of more than one language syntactically, wherein linguistic constituents such as syntax, morphology, and words/phrases are mixed [2][3].

Social media platforms such as WhatsApp and Facebook [8] are widely used on mobile phones and smart devices, where communication is very often a mix of English and a regional language, even though the autocorrect provided by most of these devices supports English only. So it is very common, as well as convenient, for users to communicate in code mixed script, which is easy and fast to use. Eventually, English mixed with a regional language has become an integral part of communication for most social media users. There are many words which people use in English only; in "aapka mobile number kya hai", hardly anyone knows a corresponding Hindi word for "mobile". Even Google supports code mixed queries and is able to translate them and return results.

However, next generation systems such as automated question answering systems still mostly cater to one language, such as English or French. One of the first building blocks for any question answering system is to understand and classify the question. A question could be of many types, such as When, Where, What, or Why, so it is important to classify what kind of answer a question is looking for. This shared task is about classifying a question into one of these classes.

This paper describes our approach for subtask 1 of the shared task on Mixed Script Information Retrieval [9] at FIRE 2016. We used basic text preprocessing such as tokenization and n-grams, along with count vectors and TF-IDF vectors, to generate features, and then fed the resulting training vectors into machine learning algorithms such as Naïve Bayes, SVM, and Random Forest.

2. RELATED WORK

A lot of similar work has been done in the past on mixed languages and on question classification over various sources. Khyathi and Manoj [5] classified Hindi-English mixed-language questions. They used SVM-based classification, with text preprocessing using transliteration and translation, followed by text features. Language identification [6] is another related preprocessing step for such a system, where one needs to know the language of each individual word for better results.

3. DATA SET DESCRIPTION

There were a total of 330 data points in the training set, containing code mixed Bengali-English text with 9 question classes to predict from. Each class describes the type of answer expected for subtask 1. The detailed distribution of classes and their counts for the CMCS task is given in Table 1.

Table 1. Classes & their counts for subtask 1

Class    Count
MNY      26
TEMP     61
DIST     24
LOC      26
ORG      67
MISC     5
OBJ      21
NUM      45
PER      55

Each data point contains a code mixed question and a corresponding class in the training data. The organizers also provided a test dataset for evaluation purposes, containing a total of 181 code mixed questions for which our system had to predict the class.
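As a quick sanity check on Table 1, the class counts can be tallied in a few lines. This is only an illustrative sketch, not part of the submitted system:

```python
# Class counts from Table 1, checked against the stated
# training-set size of 330 questions.
class_counts = {
    "MNY": 26, "TEMP": 61, "DIST": 24, "LOC": 26, "ORG": 67,
    "MISC": 5, "OBJ": 21, "NUM": 45, "PER": 55,
}

total = sum(class_counts.values())
print(total)  # 330

# The distribution is skewed: ORG is the largest class at ~20% of the
# data, while MISC has only 5 examples, which makes it hard to predict.
largest = max(class_counts, key=class_counts.get)
print(largest, class_counts[largest] / total)
```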
4. FEATURE GENERATION

As our solution is machine learning based, we need to convert text into vectors to train our model. Before converting the text into vectors, we applied some basic preprocessing to the raw code mixed text using a custom NLP pipeline.

4.1 Text Preprocessing

We used the NLTK tokenizer on all questions to split sentences into tokens. Further, we generated bigrams of the tokens, as bigrams help identify important phrases for feature generation, such as "koto run".

We also observed that proper noun entities such as person names and location names were causing a lot of noise, so we identified proper nouns and replaced them with XX.

We deliberately did not apply any stemming or stop word removal, as our analysis and tests showed that stop words like "koto", "where", and "who" are very important features for this classification task. So we kept all terms in our vocabulary, along with the new set of bigram terms.

Table 2. Preprocessing on code mixed script

PreProcess               Input                          Preprocessed
Tokenization             prepaid taxi kokhon chalu hoi  [prepaid, taxi, kokhon, chalu, hoi]
Bigrams                  prepaid taxi kokhon chalu hoi  [prepaid taxi, taxi kokhon, kokhon chalu, chalu hoi]
Proper Noun Replacement  Hazarduari te koto dorja ache  XX te koto dorja ache

4.2 Text Vectors (Features)

A set of text features was generated at the word and bigram level using text-to-vector conversions. Each feature is assigned a corresponding numeric value to train the model.

4.2.1 Count Vectorizer

The count vector simply counts the frequency of each token (unigrams and bigrams here) over the code mixed question corpus. This vector produces a sparse representation of the counts, and the number of features produced equals the vocabulary size. As we did not filter out any word except for the replacement of proper nouns, a total of 1030 features were generated by this vectorizer. Of these, as many as 667 terms had a frequency of 1, so we discarded those features and were left with 363 features. This is also called term frequency and is denoted as

    tf(t, d) = f(t, d)

where f(t, d) is the number of occurrences of term t in question d. A snapshot of the features generated by this vectorizer is shown in Figure 1.

[Figure 1. Term features by count vectorizer: a bar chart of frequent terms such as "xx te", "xx theke", "koto", "ache", "run", "time", and "bus", with counts on a 0-150 scale.]

4.2.2 TFIDF Vectorizer

This vectorizer takes in all the code mixed raw text and generates the tf-idf of all the terms, capturing the importance of each word and bigram. Tf-idf is a very useful feature generation model in text applications and is denoted as

    tfidf(t, d) = tf(t, d) x idf(t)

where idf is the inverse document frequency, denoted as

    idf(t) = log(N / df(t))

with N the total number of questions and df(t) the number of questions containing term t. We only used terms with a term frequency of more than 1, and hence this vectorizer finally produced 163 features with tf-idf values for each term. A snapshot of the features generated by the tf-idf vectorizer is shown in Figure 2.

[Figure 2. Term features by TFIDF vectorizer: a bar chart of high tf-idf terms such as "tuleche", "quarter", "mela", "ghumiye", "capital", "sobcheye", "masjid", "sera", "prepaid", "distance", and "er koto", with scores on a 0-6 scale.]

Overall, we generated ~500 features using both vectorizers, without removing any term, as we found stop words relevant for this task.

5. CLASSIFICATION

The text vectors generated in section 4.2 were used as training data to train classifiers with Python scikit-learn. A number of different algorithms with different parameters were tested before settling on the best algorithm and its parameters. Support Vector Machines (SVM), Logistic Regression, Random Forest, and Gradient Boosting were tested using grid search to find the best parameters and model.
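The feature generation and classifier setup described above could be wired together in scikit-learn roughly as follows. This is a minimal sketch, not the authors' released code: the toy questions and labels are invented for the example, and min_df=2 (keep terms that appear in more than one question) only approximates the frequency > 1 filter described in section 4.2.

```python
# Sketch: unigram+bigram count and tf-idf features feeding a Random Forest,
# with the forest parameters taken from the text.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import make_pipeline, make_union

# Toy code mixed questions (invented for illustration; proper nouns
# already replaced with XX, as in the paper's preprocessing).
questions = [
    "prepaid taxi kokhon chalu hoi",   # TEMP
    "XX te koto dorja ache",           # NUM
    "XX theke XX distance koto",       # DIST
    "bus kokhon chare",                # TEMP
    "ticket er dam koto",              # MNY
    "XX theke koto dure",              # DIST
]
labels = ["TEMP", "NUM", "DIST", "TEMP", "MNY", "DIST"]

# Unigrams and bigrams, no stop-word removal, terms seen in >1 question.
features = make_union(
    CountVectorizer(ngram_range=(1, 2), min_df=2),
    TfidfVectorizer(ngram_range=(1, 2), min_df=2),
)

# Random Forest with the parameter set reported in the paper.
model = make_pipeline(
    features,
    RandomForestClassifier(
        n_estimators=100, max_depth=10,
        min_samples_leaf=4, min_samples_split=4, random_state=0,
    ),
)

model.fit(questions, labels)
print(model.predict(["XX kokhon khole"]))
```

On the real 330-question training set, the grid search mentioned above would wrap this pipeline (e.g. with sklearn.model_selection.GridSearchCV) to select the forest parameters; the toy corpus here is far too small for that to be meaningful.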
Finally, a Random Forest with 100 n_estimators, a max_depth of 10, a min_samples_leaf of 4, and a min_samples_split of 4 was identified as the best model and parameter set. The overall training time for the model is less than 1 second on a quad-core machine with 8 GB of RAM.

6. RESULTS

We used 10-fold cross validation to compute the overall accuracy of the system. For CMCS subtask 1 we obtained an overall accuracy of 0.8495 with an F1 score of 0.84. The detailed performance matrix of the model over all classes is given in Table 3.

Table 3. Subtask 1 scores summary on train data

Class       Precision  Recall  F1-Score  Support
DIST        0.87       0.83    0.85      24
LOC         0.95       0.87    0.91      23
MISC        0          0       0         5
MNY         0.82       0.88    0.85      26
NUM         0.93       0.89    0.91      45
OBJ         0.58       0.71    0.64      21
ORG         0.76       0.72    0.74      61
PER         0.85       0.91    0.88      55
TEMP        0.93       0.97    0.95      59
avg/total   0.83       0.84    0.84      319

Location, temporal, and numeric questions were the best classified classes, while the model failed to predict the miscellaneous class. We made three submissions on the test data, and the best accuracy score we received is 81.11111, the 2nd best amongst all teams. The detailed performance matrix on the test data is given in Table 4.

Table 4. Subtask 1 scores summary on test data

Class       Precision  Recall  F1-Score  Support
DIST        0.87       0.83    0.85      24
LOC         0.95       0.87    0.91      23
MISC        0          0       0         5
MNY         0.82       0.88    0.85      26
NUM         0.93       0.89    0.91      45
OBJ         0.58       0.71    0.64      21
ORG         0.76       0.72    0.74      61
PER         0.85       0.91    0.88      55
TEMP        0.93       0.97    0.95      59
avg/total   0.83       0.84    0.84      319

7. CONCLUSION AND FUTURE SCOPE

In this paper, we have presented our approach to question classification on code mixed text. Question classification is a first step towards building a question answering system. Most of the work that has been done on code mixed languages in India concerns the Hindi-English [5] pair, and most current work in this domain revolves around a single language, which is unrealistic in countries like India, where most communication on social media is code mixed. This shared task is a milestone towards building such realistic applications. The future of Information Retrieval [7] systems lies in question answering systems, where the system can understand the question. We have used a very basic set of features here, simply computing the importance of words/phrases within the corpus itself. Using part of speech tags, an untouched area in our solution, could definitely add a lot to the solution. Also, using regional lexical dictionaries such as WordNet, e.g. Hindi WordNet for Hindi code mixed problems, would surely help to build a more sophisticated solution.
8. ACKNOWLEDGMENTS

We would like to thank the organizers for conducting this shared task and for building the training data. We would also like to thank Sapient Corporation for giving us the opportunity to work in and explore the world of text analytics.

9. REFERENCES

[1] Vijayanunni, M. (26-29 August 1998). "Planning for the 2001 Census of India based on the 1991 Census" (PDF). 18th Population Census Conference. Honolulu, Hawaii, USA: Association of National Census and Statistics Directors of America, Asia, and the Pacific. Archived from the original (PDF) on 19 November 2008. Retrieved 17 December 2014.

[2] Muysken, Pieter. 2000. Bilingual Speech: A Typology of Code-mixing. Cambridge University Press. ISBN 0-521-77168-4.

[3] Bokamba, Eyamba G. 1989. Are there syntactic constraints on code-mixing? World Englishes, 8(3), 277-292.

[4] Somnath Banerjee, Sudip Kumar Naskar, Paolo Rosso, and Sivaji Bandyopadhyay. 2016. The First Cross-Script Code-Mixed Question Answering Corpus. In: Modeling, Learning and Mining for Cross/Multilinguality Workshop, 38th European Conference on Information Retrieval (ECIR), pp. 56-65.

[5] Khyathi Chandu Raghavi, Manoj Kumar Chinnakotla, and Manish Shrivastava. Answer ka type kya he?: Learning to classify questions in code-mixed language. In Proceedings of the 24th International Conference on World Wide Web Companion, pages 853-858. International World Wide Web Conferences Steering Committee, 2015.

[6] G. Chittaranjan and Y. Vyas. Word-level Language Identification using CRF: Code-switching Shared Task Report of MSR India System. In EMNLP 2014, page 73.

[7] P. Gupta, K. Bali, R. E. Banchs, M. Choudhury, and P. Rosso. Query Expansion for Mixed-Script Information Retrieval. In SIGIR '14, pages 677-686, ACM, 2014.

[8] T. Hidayat. An analysis of Code Switching used by Facebookers. 2012.

[9] S. Banerjee, K. Chakma, S. K. Naskar, A. Das, P. Rosso, S. Bandyopadhyay, and M. Choudhury. Overview of the Mixed Script Information Retrieval at FIRE.
In Working notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings. CEUR-WS.org, 2016.