Modeling Classifier for Code Mixed Cross Script Questions

Rupal Bhargava(1), Shubham Khandelwal(2), Akshit Bhatia(3), Yashvardhan Sharma(4)
WiSoc Lab, Department of Computer Science
Birla Institute of Technology and Science, Pilani Campus, Pilani-333031
{rupal.bhargava1, f20131312, f20137223, yash4}@pilani.bits-pilani.ac.in

ABSTRACT
With the growth of the internet, the volume of social media text has been increasing day by day, and user generated content (such as tweets and blogs) in Indian languages is often written in the Roman script for various socio-cultural and technological reasons. A majority of these posts are multilingual in nature, and many involve code mixing, where lexical items and grammatical features from two languages appear in one sentence. In this multilingual scenario, code-mixed cross-script (i.e., non-native script) data gives rise to a new problem and poses serious challenges to automatic Question Answering (QA); question classification is therefore required as an important step towards QA. This paper proposes an approach to cross-script question classification, an important task of question analysis that detects the category of a question.

CCS Concepts
• Information systems → Information integration; Data analytics; Data mining; Social recommendation; Query representation; Query intent;

Keywords
Code Mixing, Code Switching, Question Classification, Machine Learning

1. INTRODUCTION
With the proliferation of social networks, large volumes of text are being generated daily. Traditional machine learning algorithms used for text analysis, such as Named Entity Recognition (NER), POS tagging, or parsing, are language dependent. These algorithms usually achieve their objective using co-occurrence patterns of features. Due to such language dependence, many studies have observed that a variety of problems related to social media text are hindered. One such problem is Question Answering (QA). Being a classic application of NLP, QA has practical applications in various domains such as education, health care, and personal assistance. QA is a retrieval task that is more challenging than common search, because the purpose of QA is to find an accurate and concise answer to a question rather than just retrieving relevant documents containing the answer [6].

Recently, Banerjee et al. [2] formally introduced the code-mixed cross-script QA problem. The first step of understanding a question is to perform question analysis. Question classification is an important task of question analysis which detects the answer type of the question. Question classification helps not only to filter out a wide range of candidate answers but also to determine answer selection strategies [6]. Furthermore, it has been observed that the performance of question classification has a significant influence on the overall performance of a QA system.

Subtask 1 of the shared task on Mixed Script Information Retrieval (MSIR) at FIRE 2016 addresses code-mixed cross-script question classification, where 'Q' represents a set of factoid questions written in Romanized Bengali along with English. The task is to classify each given question into one of the predefined coarse-grained classes. This paper proposes an algorithm for the question classification task defined by the MSIR, FIRE 2016 Subtask 1 organizers, using different machine learning algorithms.

The rest of the paper is organized as follows. Section 2 discusses related work. Section 3 presents an analysis of the dataset provided by the MSIR 2016 task organizers [1]. Section 4 explains the methodology used for the task, with flowcharts to illustrate the flow. Section 5 describes the algorithm proposed for question classification. Section 6 elaborates the evaluation, experimental results, and error analysis. Section 7 concludes the paper and presents future work.
2. RELATED WORK
Today, social media platforms are flooded with millions of posts every day on a wide range of topics, resulting in code mixing in multilingual countries like India. A lot of work was done at FIRE 2015 on language identification in cross-script information retrieval. Bhattu et al. [4] proposed a two-stage algorithm in which sentence-level n-gram classifiers were used in the first stage and word-level n-gram classifiers in the second. Bhargava et al. [3] proposed a hybrid approach to query labelling, generating character n-grams as features and using logistic regression for language labelling.

For question analysis of such data, question classification is performed to understand the question, which allows determining constraints that the question imposes on a possible answer. Zhang et al. [8] used bag of words and bag of n-grams as features and applied k-NN, SVM, and Naive Bayes to automate question classification, concluding that with surface text features SVM outperforms the other classifiers. Banerjee et al. [2] proposed a QA system that takes cross-script (non-native) code-mixed questions and provides a list of information responses to automate question answering; their work covered corpus acquisition from social media, unbiased question acquisition using a cloud-based service, corpus annotation, and an evaluation scheme suited to the corpus annotation. Li and Roth [6] studied question classification using semantic information, developing a hierarchical classifier guided by a layered semantic hierarchy of answer types.

3. DATA ANALYSIS
The training data provided [1] consists of 330 questions, each labelled with a specific coarse-grained question type class. In total there are 9 different question type classes in the data set. As shown in Figure 1, class types 'ORG' and 'TEMP' comprise the majority of the instances. Each class represents a particular type of question related to specific entities. Class type 'MNY' stands for money-related questions, whose instances contain words like 'fare' and 'price' and helping words like 'koto' (bn) and 'how much'. Class type 'PER' stands for person-related questions, mostly containing words like 'who' and 'whom', implying that the subject of the sentence is a person. Class type 'TEMP' covers time-related questions, mainly containing words like 'when' and 'at'. Class type 'OBJ' stands for entity/object, implying that the subject of the sentence is an entity, and mainly contains words like 'what' and 'kon'. Class type 'NUM' stands for numeric entity related questions, mainly involving words like 'how many' and 'koto'. Class type 'DIST' stands for distance and implies that the question is about the distance between places. Class type 'LOC' stands for location and therefore mainly contains words like 'where' and 'jabe'. Class type 'ORG' stands for organization and relates to questions centered on a particular organization, team, or any other group of people; these questions mainly contain words like 'which', 'what', and 'team'. Class type 'MISC' stands for miscellaneous; this class has the minimum representation in the data set and covers a variety of questions.

Figure 1: Class Distribution of Training Data Set

The entire data set consists of code-mixed sentences whose words belong to either Bengali or English. The data set does not contain any code mixing at the word level. There is no punctuation in the data set except the question mark (?), while a large number of named entities (belonging to both English and Bengali) are present in it.

4. PROPOSED TECHNIQUE
A word-level n-gram based approach is used to classify code-mixed cross-script question records (comprising words belonging to both English and Bengali) into nine coarse-grained question type classes. The proposed methodology involves a pipelined deployment of different techniques, as shown in Figure 2. The proposed technique can be divided into the following four phases:

1. Pre-processing
2. Named-entity recognition and removal
3. Translation
4. Classification

4.1 Pre-processing
Data is pre-processed for label separation and case conversion for the efficient application of the classifiers. The pre-processing techniques were deployed as follows:

1. Separation of class labels and training set entries: The data set comprises mixed-script question records labelled with a specific class type. The question entries and their respective class labels were segregated on the basis of the position of the question mark symbol in each entry. This segregation was done so that separate feature vectors of question records and class labels could be formed, as required by the classifiers.

2. Case conversion: For the purpose of normalization, all data entries were converted into lower case, by identifying upper-case letters and replacing them with their lower-case counterparts through manipulation of their ASCII codes.
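The following is a minimal sketch of these two pre-processing steps. The record format (class label following the question mark) and the labels shown are illustrative assumptions, not the official training file layout, and the function names are not the authors' actual code.

# Minimal pre-processing sketch: label separation and case conversion.
def label_separation(record):
    # Split a raw entry into (question, class label) at the question mark.
    question, _, label = record.partition("?")
    return question.strip() + "?", label.strip()

def case_conversion(text):
    # Lower-casing, the practical equivalent of the ASCII-based replacement described above.
    return text.lower()

raw_records = [
    "Hazarduari te koto dorja ache? NUM",           # label assumed for illustration
    "Hazarduari te how many dorja ache? NUM",
]

questions, labels = [], []
for record in raw_records:
    q, y = label_separation(record)
    questions.append(case_conversion(q))
    labels.append(y)

print(questions)   # ['hazarduari te koto dorja ache?', 'hazarduari te how many dorja ache?']
print(labels)      # ['NUM', 'NUM']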
4.2 Named-entity recognition and removal
The pre-processed data set contains entries with a large number of named entities. Named entities in a text refer to pre-defined categories such as names of persons, organizations, and locations, expressions of time, quantities, monetary values, percentages, etc. In English, named entities occur in a certain manner at certain positions according to the sentence structure, but in multilingual sentences the sentence structure varies considerably. Named entities are therefore identified using a dictionary-based approach. The dictionary used for NER mainly comprises entries from the FIRE 2015 Subtask 1 data set [5], which contains entries belonging to both Bengali and English. For the purpose of classifying the question records into one of the class types, these named entities are irrelevant, as they do not contribute to the question structure used for class-type determination, and hence their removal was mandatory.
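A small sketch of such dictionary-based removal is shown below. The lexicon stands in for the entity list drawn from the FIRE 2015 Subtask 1 data [5]; the entries shown are placeholders, not the actual corpus.

# Dictionary-based named-entity removal (sketch).
named_entity_lexicon = {"hazarduari", "kolkata", "murshidabad"}   # placeholder entries

def remove_named_entities(question, lexicon):
    # Keep only tokens that do not appear in the named-entity lexicon.
    kept = [tok for tok in question.split() if tok.strip("?") not in lexicon]
    return " ".join(kept)

print(remove_named_entities("hazarduari te koto dorja ache?", named_entity_lexicon))
# -> "te koto dorja ache?"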
4.3 Translation
After the first two phases, the remaining Bengali words are transliterated into their native script and then translated to their English counterparts using the Google translation API(1). This step creates a monolingual, single-script data set from the mixed-script data set provided, so that the classifiers can be applied efficiently. Using this approach, different code-mixed cross-script variants of the same question record (each using a different combination of Bengali and English words) are translated and thereby standardized to a single question record in English. For example, the records "Hazarduari te koto dorja ache?" and "Hazarduari te how many dorja ache?" refer to the same question but use different combinations of words; standardizing both to the English translation leads to an increase in accuracy.

(1) https://translate.google.com/
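The sketch below illustrates only the standardization effect of this phase. The real system transliterates the remaining Bengali tokens and translates them with the Google translation API(1); the toy lookup table here merely stands in for that service, and its translations are illustrative.

# Standardization sketch: a toy lookup table in place of the translation service.
toy_bn_to_en = {"te": "in", "koto": "how many", "dorja": "doors", "ache": "are there"}

def standardize(question, lookup):
    # Replace each token by its English counterpart when one is known.
    tokens = question.rstrip("?").split()
    return " ".join(lookup.get(tok, tok) for tok in tokens) + "?"

# Two code-mixed variants of the same question collapse to one English record.
print(standardize("te koto dorja ache?", toy_bn_to_en))
print(standardize("te how many dorja ache?", toy_bn_to_en))
# both print: "in how many doors are there?"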
4.4 Classification
The proposed approach takes the data set obtained from the translation phase and uses n-grams to form a feature vector for each record. A word-level implementation of n-grams is followed, with n varied in the range 2 to 4, to generate a feature vector for each question record in the training set. The transposed matrix of these feature vectors, along with the numerically encoded class label matrix, is then used as input to the classifiers [7]. For the three runs, the following classifiers are used:

1. Gaussian Naive Bayes classifier
2. Logistic Regression classifier
3. Random Forest classifier with random state = 1

Figure 2: Block Diagram for Proposed Technique

5. ALGORITHM
Algorithm 1 summarizes the proposed technique. The data set comprising mixed-script question records along with their respective class labels is used as input. This input is pre-processed by separating the class labels from the data entries (function Label_Separation()) and converting the records to lower case (function Case_Conversion()). Named entities (NE) are removed from this pre-processed data (function NE_Removal()). The remaining Bengali words in the data set are then translated to their English equivalents using the Google Translation API(1) (function Translation()). The technique of n-grams is then applied to this data set to form the corresponding feature vectors. First, a vectorizer that converts the textual entries into a matrix of n-gram token counts (word-level n-grams with n in the range 2 to 4) is created (function Count_Vectorizer()). This vectorizer is used to generate a callable (via build_analyzer()) which produces n-gram tokens when applied to each data set row (callable Analyzer()), and the word-level n-grams corresponding to each row are appended (function append()) to the n-gram list. Feature vectors for the data entries are generated from this n-gram list (function Create_Feature_Vector()). The class labels used for training are numerically encoded (function Encode_Class()) and the corresponding feature vectors for these class labels are generated. These two sets of feature vectors are then used as inputs to the classifier (function Classifier()), where Classifier() can be instantiated as GaussianNB, LogisticRegression, or RandomForestClassifier. The trained classifier is finally used to predict the class labels, which are produced as output.

Algorithm 1: Question classification for code-mixed cross-script questions
1: Input: Mixed-script (bn+en) question records S, training class labels T, test feature matrix Matrix_Test
2: Output: Predicted class labels P
3: Initialization: P = [], n_grams = []
4: for i = 0 to S.length do
5:   Label_Separation(S[i])
6:   Case_Conversion(S[i])
7:   NE_Removal(S[i])
8:   Translation(S[i])
9: end for
10: Vectorizer = Count_Vectorizer(ngram_range=(2,4))
11: Analyzer = Vectorizer.build_analyzer()
12: for i = 0 to S.length do
13:   row = Analyzer(S[i])
14:   for j = 0 to row.length do
15:     n_grams.append(row[j])
16:   end for
17: end for
18: Matrix_Data = Create_Feature_Vector(n_grams)
19: Class_List = Encode_Class(T)
20: Matrix_Class = Create_Feature_Vector(Class_List)
21: clf = Classifier(Matrix_Data, Matrix_Class)
22: clf.fit(Matrix_Data, Matrix_Class)
23: P = clf.predict(Matrix_Test)
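A short scikit-learn [7] sketch of this classification phase is given below. The training and test questions are placeholders; in the actual runs they are the pre-processed, entity-stripped, translated records described above.

# Sketch of the classification phase with scikit-learn, mirroring Algorithm 1.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

train_questions = ["in how many doors are there?",        # placeholder records
                   "which team won the series?"]
train_labels = ["NUM", "ORG"]                              # placeholder labels
test_questions = ["how many doors are there in it?"]

# Word-level n-grams with n varied from 2 to 4, as in Section 4.4.
vectorizer = CountVectorizer(analyzer="word", ngram_range=(2, 4))
X_train = vectorizer.fit_transform(train_questions).toarray()   # dense input for GaussianNB
X_test = vectorizer.transform(test_questions).toarray()

encoder = LabelEncoder()
y_train = encoder.fit_transform(train_labels)

# One classifier per run: GaussianNB (run 1), LogisticRegression (run 2),
# RandomForestClassifier with random_state=1 (run 3).
for clf in (GaussianNB(), LogisticRegression(), RandomForestClassifier(random_state=1)):
    clf.fit(X_train, y_train)
    predicted = encoder.inverse_transform(clf.predict(X_test))
    print(type(clf).__name__, predicted)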
6. EXPERIMENTS
MSIR, FIRE 2016 Subtask 1 involved classifying mixed-script (Bengali and English) questions into the nine coarse-grained question type classes discussed in Section 3. The training data set comprised 330 records (with class labels) and was used to classify a test data set of 180 mixed-script question records. In total, seven teams from different institutes of the country participated, each submitting three different classification approaches; the results are shown in Figure 3. The approach proposed in this paper used machine learning for classification, and three runs were submitted that differed in the classifier used (Gaussian Naive Bayes, Logistic Regression, and Random Forest). The Gaussian Naive Bayes classifier obtained an accuracy of 81.12%, Logistic Regression 80%, and Random Forest 72.78%. Detailed results, analysis, and comparisons are discussed below.

6.1 Evaluation and Discussion
The MSIR, FIRE 2016 Subtask 1 organizers evaluated the results, giving a comparison of the accuracy achieved by the 7 participating teams, as shown in Figure 3. The proposed approach (team BITS PILANI) was ranked 2nd with an accuracy of 81.12% for run 1, while the highest accuracy achieved was 83.34% (by team IINTU). The choice of the Gaussian Naive Bayes classifier leads to the maximum accuracy, as the proposed algorithm deals with a problem involving continuous attributes. Naive Bayes yields simple and highly scalable models that are fast and scale linearly with the number of predictors and rows, and building a Naive Bayes model is highly parallelizable, even at the level of scoring. It was also observed from the results that the proposed algorithm produced the highest F-measure scores for the Organization (ORG), Money (MNY), and Miscellaneous (MISC) classes.

Figure 3: Highest accuracy achieved by the 7 teams that participated in MSIR, FIRE 2016

Figure 4 shows the comparison of the F-measure scores obtained by the teams for the class Organization. The proposed algorithm (team BITS PILANI) achieved the highest score of 0.74418 using the Gaussian Naive Bayes approach. This implies that questions relating to a particular organization, mainly framed with words like "which" and "what", could be classified efficiently by this approach. These scores can also be attributed to the fact that ORG had the largest number of instances in the data set (67 out of 330, as discussed in Section 3). In addition, the proposed algorithm forms word-level n-grams, through which words and phrases like "which", "team", "series", and "sponsor" become associated, which might have contributed to the increase in the scores.

Figure 4: Comparison of F-Measure for Organization and Money class among different teams

Figure 4 also shows the comparison of the F-measure scores obtained by the teams for the class Money. The proposed algorithm (team BITS PILANI) achieved the highest score of 1 using Logistic Regression as the classifier (run 2). Hence all the questions relating to money, framed with words like "how much", "price", and "fare", could be classified efficiently by the proposed approach. These high F-scores can be attributed to the word-level n-gram technique, which links words like "fare", "how", "much", and "price" and thus might have contributed to the increase in accuracy.

The evaluated results also showed that only two teams (team BITS PILANI and team NLP-NITMZ) were able to identify instances of the Miscellaneous (MISC) class. This can be attributed to the fact that there were only 5 MISC instances out of 330 in the training data set. The proposed approach (team BITS PILANI) obtained the highest score of 0.2 using the Gaussian Naive Bayes classifier, which again reflects the simple GaussianNB model combined with the word-level n-gram features.

Figure 5: Comparison of F-Measure among different teams for various classes

Figure 5 shows a comparison of the F-measure obtained (taking the best of the three runs for each team) for classifying each of the nine classes. As evident from the figure, the proposed approach (team BITS PILANI) obtained satisfactory results in identifying the correct class labels, particularly for the MISC, ORG, MNY, NUM, and OBJ classes, with an F-measure score of 1 obtained for the class Money. Table 1 shows the precision, recall, and F-measure scores for each of the nine classes, as evaluated by the FIRE 2016 task organizers [1], for the proposed algorithm (implemented by team BITS PILANI) for the three runs submitted.

Table 1: Class wise score for all the runs submitted
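Class-wise precision, recall, and F-measure of the kind reported in Table 1 can be reproduced locally with scikit-learn, as sketched below; the gold and predicted label lists are made up for illustration, since the official scores were computed by the task organizers [1].

# Per-class precision/recall/F-measure sketch (labels are illustrative only).
from sklearn.metrics import classification_report

gold_labels      = ["ORG", "MNY", "NUM", "MISC", "ORG", "LOC"]
predicted_labels = ["ORG", "MNY", "NUM", "ORG",  "ORG", "LOC"]

print(classification_report(gold_labels, predicted_labels, zero_division=0))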
6.2 Error Analysis
There are a few phases at which the proposed approach could have contributed to the misclassification of some records. The approach uses a dictionary-based method for named entity recognition, and the corpus used had only a limited number of entries, so some entities might not have been recognized and removed. The data set also contains many named entities that refer to the same name but have similar yet different spellings; for instance, the words "masjid" and "mosjid" both refer to the same word meaning "mosque" but are spelled differently. Since the proposed approach uses a corpus for NER, such entities could not be removed unless all spellings of these words were added to the corpus.

The proposed approach also uses a translation system (the Google API(1)) to translate Bengali words into English, but since the translation system does not consider the semantics of the sentence in which a word is used, a particular Bengali word may have been translated incorrectly. The given data set also did not have a uniform distribution of class instances: as shown in Figure 1, only 1.51% of the instances belong to the MISC class while the ORG class comprises 20% of the entries, so the trained model could be biased. As mentioned before, most teams could not identify even a single MISC instance in the test data set, and even the proposed system obtained an F-measure of only 0.2 for this class because of the small number of instances.

7. CONCLUSIONS AND FUTURE WORK
In this paper, a word-level n-gram based approach for classifying code-mixed cross-script question records into nine coarse-grained question type classes has been presented for Subtask 1 of MSIR, FIRE 2016. The presented approach uses a pipeline of stages to classify questions with various machine learning algorithms (Gaussian Naive Bayes, Logistic Regression, and Random Forest). The proposed approach obtained its highest accuracy of 81.12% using the Gaussian Naive Bayes classifier among the three runs submitted. Future work could include improving the dictionaries used for named-entity recognition for Bengali and English; different named entity recognizers and taggers, along with trained NER models, could be deployed. It would also be interesting to explore how implicit features of the code-mixed cross-script data could be learned efficiently using deep learning algorithms. Machine learning based models for language identification, along with appropriate transliteration and translation tools that take the correct semantics into account, could also be improved further.
References
[1] S. Banerjee, K. Chakma, S. K. Naskar, A. Das, P. Rosso, S. Bandyopadhyay, and M. Choudhury. Overview of the Mixed Script Information Retrieval (MSIR) at FIRE. In Working notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings. CEUR-WS.org, 2016.
[2] S. Banerjee, S. K. Naskar, P. Rosso, and S. Bandyopadhyay. The first cross-script code-mixed question answering corpus. In First Workshop on Modeling, Learning and Mining for Cross/Multilinguality (MultiLingMine 2016), co-located with the 38th European Conference on Information Retrieval (ECIR 2016), volume 1589, pages 56-65, 2016.
[3] R. Bhargava, Y. Sharma, S. Sharma, and A. Baid. Query labelling for Indic languages using a hybrid approach. In Working notes of FIRE 2015 - Forum for Information Retrieval Evaluation, Gandhinagar, India, December, 2015, volume 1587 of CEUR Workshop Proceedings, pages 40-42. CEUR-WS.org, 2015.
[4] S. N. Bhattu and V. Ravi. Language identification in mixed script social media text. In Working notes of FIRE 2015 - Forum for Information Retrieval Evaluation, Gandhinagar, India, December, 2015, volume 1587 of CEUR Workshop Proceedings, pages 37-39. CEUR-WS.org, 2015.
[5] M. Choudhury, P. Gupta, P. Rosso, S. Kumar, S. Banerjee, S. K. Naskar, S. Bandyopadhyay, G. Chittaranjan, A. Das, and K. Chakma. Overview of FIRE-2015 shared task on mixed script information retrieval. In Working notes of FIRE 2015 - Forum for Information Retrieval Evaluation, Gandhinagar, India, December, 2015, pages 19-25. CEUR-WS.org, 2015.
[6] X. Li and D. Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics - Volume 1, pages 1-7. Association for Computational Linguistics, 2002.
[7] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825-2830, 2011.
[8] D. Zhang and W. S. Lee. Question classification using support vector machines. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 26-32. ACM, 2003.