NLP-NITMZ @ MSIR 2016 System for Code-Mixed Cross-Script Question Classification

Goutam Majumder
National Institute of Technology Mizoram
Deptt. of Computer Science & Engg.
Mizoram, India
goutam.nita@gmail.com

Partha Pakray
National Institute of Technology Mizoram
Deptt. of Computer Science & Engg.
Mizoram, India
parthapakray@gmail.com

ABSTRACT
This paper describes our approach to the Code-Mixed Cross-Script Question Classification task, Subtask 1 of MSIR 2016. MSIR is the Mixed Script Information Retrieval shared task held in conjunction with FIRE 2016, the 8th meeting of the Forum for Information Retrieval Evaluation. For this task, our team NLP-NITMZ submitted three system runs: i) using a direct feature set; ii) using the direct and dependent feature sets together; and iii) using a Naive Bayes classifier. The first system is our baseline; it relies on a direct feature set built from a group of keywords. When identifying question classes, this baseline suffers from ambiguity (one question may be tagged with multiple classes). To deal with this ambiguity we developed a second feature set, which we call the dependent feature set because its keywords work together with the direct feature set. The highest accuracy of our system is 78.89%, obtained with Method-2 and submitted as run-3. Our other two runs have the same accuracy of 74.44%.

Keywords
Natural Language Processing; Question Answering; Information Retrieval; Question Analysis

Table 1: Example of Question Classes
Class   Example Question
MNY     Airport theke Howrah Station volvo bus fare koto?
TEMP    Volvo bus howrah station jete koto time nei?
DIST    Airport theke howrah station distance koto?
LOC     Airport theke kothai jabar bus nei?
ORG     Prepaid taxi counter naam ki?
OBJ     Murshidabad kon nodir tire obosthito?
NUM     Hazarduari te koto dorja ache?
PER     Ke Hazarduari toiri kore?
MISC    Early morning journey hole kon service valo?

1. INTRODUCTION
Question Answering (QA) is concerned with building systems that can automatically answer questions posed by humans. QA is a common discipline within the fields of Information Retrieval (IR) and Natural Language Processing (NLP): a QA program queries a structured or unstructured database of knowledge or information and constructs an answer [6]. Current QA research deals with a wide range of question types such as fact-based, hypothetical, semantically constrained, and cross-lingual questions. The need for and importance of QA systems was first highlighted in 1999 by the first QA task at TREC-8 (Text REtrieval Conference), which revealed the need for sophisticated search engines that can retrieve the specific piece of information that may be considered the best possible answer to a user question. Nowadays such QA systems serve as a backbone for successful e-commerce businesses, where many frequently asked question (FAQ) files are generated from the most frequently asked user questions and their types [4].

Being a classic application of NLP, QA has practical applications in various domains such as education, health care, and personal assistance. QA is a retrieval task that is more challenging than common search-engine retrieval, because the purpose of QA is to find an accurate and concise answer to a question rather than just retrieving relevant documents that contain the answer [5].

In this paper we report our participation in Subtask 1, the code-mixed cross-script Question Classification task [2] of MSIR 2016 (Shared Task on Mixed Script Information Retrieval) [1]. The first step in understanding a question is question analysis. Question classification is an important part of question analysis: it detects the answer type of a question, which not only helps to filter out a wide range of candidate answers but also determines the answer selection strategy [3], [5].

In Subtask 1, two sets are given: Q = {q1, q2, ..., qn} and C = {c1, c2, ..., cn}, where Q is a set of factoid questions written in Romanized Bengali mixed with English and C is the set of question classes. The task is to classify each given question into one of the predefined coarse-grained classes. A total of 9 question classes are given for the classification task; an example of each question class with its tag is listed in Table 1.

The rest of the paper is organized as follows: Section 2 discusses the three methods in detail, Section 3 analyses the performance of the three systems and compares them with the other submitted systems, and Section 4 draws the conclusions of this report.
2. THE PROPOSED METHOD
Three methods were developed for MSIR16 Subtask 1 to identify the question classes. Two systems are based on feature sets, and the identification stage for a question depends on these features. Two feature sets are identified from the training dataset: we consider one set as direct and the other as dependent. For the third system we combine these two sets and build machine-learning features for a Naive Bayes classifier. Details of the three methods are discussed next.

Table 2: Features of MNY Class with Example
Feature   Question as MNY Class
charge    Semi-official guide koto taka charge nei?
daam      Fuchhka r koto daam Darjeeling e?
price     Darjeeling e momo r price koto?
dam       Chicken momo r dam koto Darjeeling e?
khoroch   Pykare te boating e koto khoroch hobe?
fee       Wasef Manzil er entry fee koto?
tax       Koto travel tax pore India border e?
pore      Koto travel tax pore India border e?
fare      Darjeeling e dedicated taxi fare koto?
taka      Digha te Veg meal koto taka?

Table 3: Features of DIST Class with Example
Feature   Question as DIST Class
distance  Kolkata theke bishnupur er distance koto?
duroto    Bangalore theke Ooty r by road duroto koto?
area      Ooty botanical garden er area koto hobe?
height    Susunia Pahar er height koto hobe?
dure      Puri Bhubaneshwar theke koto dure?
uchute    Ooty sea level theke koto uchute?
km        Kolkata theke bishnupur koto km?

Table 4: Features of ORG Class with Example
Feature   Question as ORG Class
ki        Prepaid taxi counter naam ki?
kara      World champion kara?
kon       kon team Ashes hereche?
ke        Ashes hereche ke?

2.1 Method-1 (using direct feature set)

1. MNY: To identify the 'MNY' class, 10 features/keywords are identified from the training dataset, and questions containing these keywords are tagged with the 'MNY' class. Table 2 lists all of these features with example questions. Along with these 10 features we also identified another keyword, koto, for tagging questions as 'MNY'. However, our analysis showed that if koto is taken as a direct feature for the 'MNY' tag, questions of other classes are also tagged as 'MNY', for example 'Shankarpur Digha theke koto dure?', which is a 'DIST' class question. Like koto, the taka feature alone is also unable to tag some 'MNY' questions, so we identified two other keywords as a dependent feature set, which is discussed in Section 2.2.

2. DIST: Seven keywords are identified as the direct feature set to tag the 'DIST' question class, and the same set of features is used in the second method. All the identified keywords carry the meaning of distance in Bengali or in English. For this class no keyword is found for the dependent set. Table 3 lists all the features with example questions.

3. TEMP: Questions containing any temporal unit such as somoi, time, month, year, etc. are tagged with the TEMP class. For the temporal question class eight keywords are identified; all of them are considered direct features and no dependent features are considered. Examples of the direct features are listed below:
• time–Koto time lage Bangalore theke Ooty by road?
• kobe–Kobe Jorbangla Temple build kora hoyechilo?
• kokhon–Shyamrai Temple toiri hoi kokhon?

4. LOC: To tag the location class only one direct feature, kothai, is identified; this keyword is also used in the second method. Examples for this class are given below:
• kothai ras mela hoi?
• train r jonno kothai advice nite hobe?

5. ORG: For the organization class four direct features are identified; these features with example questions are listed in Table 4. Among these four features, the ki feature is ambiguous with other question classes such as 'OBJ' and 'PER'. Example questions of other classes that contain the ki feature are listed below:
• OBJ–Ekhon ki museum hoye geche?
• PER–Rabindranath er babar naam ki?
The kon feature of the 'ORG' class is also ambiguous with other classes such as 'TEMP' and 'OBJ'. Questions of other classes with the kon feature are listed below:
• TEMP–Kon month e vasanta utsob hoi shantiniketan e?
• OBJ–Kon mountain er upor Ooty ache?
These ambiguity issues are addressed in Section 2.2 with the help of the dependent feature set.
6. NUM: Two direct features, koiti and koto, are identified to tag questions as 'NUM'. However, the koto keyword is ambiguous, and questions of the 'MNY' class end up tagged as 'NUM'. A dependent feature set is therefore identified and merged with the koto feature; this issue is discussed in Section 2.2. So in this method only the keyword koiti is considered as a feature to identify 'NUM' questions, for example:
• Bishnupur e koiti gate ache?
• Leie koiti wicket niyeche?

7. PER: To identify the 'PER' class five direct features are identified. Three of them, ke, kake, and ki, work with dependent feature sets, which are discussed in Section 2.2; the other two, kar and kader, are used as direct features. Example questions for the direct features are listed below:
• Kar wall e sri krishna er life dekte paoya jabe?
• Jagannath temple e kader dekha paben?

8. OBJ: Two direct rules are found, but these rules are not able to identify the questions of the 'OBJ' class reliably. The rules are as follows:
• Ekhon ki museum hoye geche?
• Hazarduari er opposit e kon masjid ache?
From these questions it is clear that the 'OBJ' class is ambiguous with the 'ORG' class, so these two features are used together with other dependent features (Section 2.2) for question classification.

9. MISC: If no rule is satisfied, the question is classified as the 'MISC' class.
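To make the rule flow above concrete, the following minimal Python sketch reads Method-1 as a first-match keyword lookup over the direct features of Tables 2-4. The keyword lists are abbreviated and the matching order is an illustrative assumption of ours, not the exact rule priority of the submitted run.

import re

# Direct feature keywords, abbreviated from Tables 2-4 and the items above.
# The dictionary order, used here as matching priority, is an assumption.
DIRECT_FEATURES = {
    "MNY":  ["charge", "daam", "price", "dam", "khoroch", "fee", "tax", "fare", "taka"],
    "DIST": ["distance", "duroto", "area", "height", "dure", "uchute", "km"],
    "TEMP": ["somoi", "time", "month", "year", "kobe", "kokhon"],
    "LOC":  ["kothai"],
    "ORG":  ["ki", "kara", "kon", "ke"],
    "NUM":  ["koiti"],
    "PER":  ["kar", "kader"],
}

def classify_direct(question):
    """Tag a question with the first class whose direct keyword it contains."""
    tokens = set(re.findall(r"\w+", question.lower()))
    for label, keywords in DIRECT_FEATURES.items():
        if tokens.intersection(keywords):
            return label
    return "MISC"  # no rule fired

print(classify_direct("Kolkata theke bishnupur koto km?"))  # -> DIST

With a first-match lookup like this, the ambiguities noted above (for example koto, ki, and kon) surface immediately, which is what motivates the dependent feature set of Method-2.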
2.2 Method-2 (using direct and dependent feature set)
The dependent feature set is identified to improve the efficiency of the first method. The second set is named dependent because some features of the direct set are unable to identify the questions on their own and work correctly only when a dependent feature is also present in the question.

1. MNY: To identify the MNY class, two dependent features, charge and koto, are identified along with the ten direct features. If a question contains either of these two features, the system also looks for the direct feature taka; otherwise the 'MNY' class is not considered for this question. Examples are listed below:
• koto–Digha te Veg meal koto taka?
• charge–Semi-official guide koto taka charge nei?

2. DIST: Same as Method-1, using the full set of direct features only.

3. TEMP: All direct features are used to tag the 'TEMP' class.

4. LOC: No dependent features; the same set of direct features as in Method-1 is used.

5. ORG: To handle the ambiguity issues of the 'ORG' class, three sets of dependent features are identified, which improves the system accuracy. The first feature set addresses the ambiguity with the 'OBJ' class by identifying a term such as museum, mondir, or mosque that qualifies a question as 'OBJ'. Examples of these features are as follows:
• museum–ekhon ki museum hoye geche?
• lake–murshidabad e ki lake ache?
Such questions are not identified as 'ORG'; instead they are forwarded to the other feature sets for prediction. In the second set, features are identified to handle the issues related to the 'PER' class, and the features are as follows:
• (*eche)–ke jiteche, ke hereche, ke hoyeche
• team–kon team Ashes hereche?
If a token in the question ends with the suffix eche and the question also has 'ORG' features, the question is classified as 'PER'. The third feature set deals with the ambiguity of the kon keyword of the direct feature set used in Method-1. In this set we explicitly identified words that denote an organization, such as shop, hotel, city, town, etc.; examples are listed below:
• shop–kon shop e tea kena jete pare?
• town–rat 9 PM kon town ghumiye pore?

6. NUM: The koto keyword of the direct feature set for the 'NUM' class is also ambiguous with other classes such as 'DIST', 'TEMP', and 'MNY'. So the direct features are not used here to tag questions as NUM directly; instead they are used to check whether the rules of those classes are present in the question. If they are, the question is not tagged; otherwise it is tagged with the 'NUM' class. An example of each ambiguity is listed in Table 5.

Table 5: Ambiguity Classes with NUM Class
Ambiguity Class  Question in Ambiguity Class
DIST             Kolkata theke bishnupur er distance koto?
TEMP             Bishnupur e jete bus e koto time lagbe?
MNY              Indian citizen der entry fee koto taka?

7. PER: We use two dependent features to predict the questions of the 'PER' class. In this method the dependent features also work with the direct features for question prediction. Example questions of the 'PER' class using direct and dependent features are listed below:
• Chilka lake jaber tour conduct ke kore?
• Woakes kake run out koreche?
• Bangladesh r leading T20 wicket-taker r naam ki?

8. OBJ: A dependent feature set is identified that contains all words qualifying as an object name, to handle the ambiguity of the 'OBJ' class with the 'ORG' class. These object names are combined with the two direct features ki and kon. Examples of such ambiguous questions are listed below:
• ekhon ki museum hoye geche?
• bengal r sobcheye boro mosjid ki?
• Nawab Wasef Ali Mirza r residence ki chilo?
From these dependent features it is clear that the direct features look for a token in the question that denotes an entity of object type; these entities are shown in bold face in the examples. For the kon direct feature, the same set of dependent features is used to identify the 'OBJ' class. Examples are as follows:
• Berhampore-Lalgola Road e kon mosjid ache?
• Murshidabad kon nodir tire obosthito?

9. MISC: Same as Method-1.
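As a small illustration of how a dependent keyword gates a direct one, the sketch below implements the 'MNY' rule from item 1 above: koto or charge only votes for 'MNY' when the direct keyword taka is also present. Function and variable names are ours; this is a sketch of the idea, not the submitted implementation.

import re

MNY_DEPENDENT = {"koto", "charge"}   # dependent keywords (item 1 above)
MNY_DIRECT = {"taka"}                # direct keyword that must co-occur

def is_mny(question):
    """Return True only if a dependent MNY keyword co-occurs with 'taka'."""
    tokens = set(re.findall(r"\w+", question.lower()))
    if tokens & MNY_DEPENDENT:
        return bool(tokens & MNY_DIRECT)
    return False

print(is_mny("Digha te Veg meal koto taka?"))       # True  -> tagged MNY
print(is_mny("Shankarpur Digha theke koto dure?"))  # False -> left for the DIST rules

The other dependent sets (the *eche suffix and the organization words used with kon) follow the same pattern: the ambiguous direct keyword is trusted only when the disambiguating token is present.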
2.3 Method-3 (Using Naive Bayes Classifier)
In this method a Naive Bayes classifier is used to train the model. For training, a feature matrix with the probable class tags is given as input to the Bayes classifier. One feature is considered for each question in the training set, and the last column of the feature matrix represents the question class. This feature matrix is generated using the sets of direct and dependent features of Methods 1 and 2.
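The paper does not state which Naive Bayes implementation was used, so the following sketch shows one possible realisation with scikit-learn's BernoulliNB over binary keyword-presence features. The vocabulary and the toy training examples are abbreviated illustrations, not the actual feature matrix of the submitted run.

import re
from sklearn.naive_bayes import BernoulliNB

# Combined direct and dependent keywords from Methods 1 and 2 (abbreviated).
VOCAB = ["taka", "koto", "km", "distance", "dure", "kothai",
         "ke", "ki", "kon", "koiti", "time", "kobe"]

def featurize(question):
    """Binary presence vector over the keyword vocabulary."""
    tokens = set(re.findall(r"\w+", question.lower()))
    return [1 if word in tokens else 0 for word in VOCAB]

# Toy training questions in the spirit of the released data (not the real corpus).
train_questions = [
    "Digha te Veg meal koto taka?",
    "Kolkata theke bishnupur er distance koto?",
    "Bishnupur e koiti gate ache?",
    "kothai ras mela hoi?",
]
train_classes = ["MNY", "DIST", "NUM", "LOC"]

model = BernoulliNB()
model.fit([featurize(q) for q in train_questions], train_classes)

test_question = "Airport theke howrah station distance koto?"
print(model.predict([featurize(test_question)]))  # -> ['DIST']

Replacing BernoulliNB with MultinomialNB or adding count features would be a natural variation; the paper only specifies that a Naive Bayes classifier over the Method-1 and Method-2 features was used.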
3. EXPERIMENT RESULTS

3.1 Data and Resources
Two datasets, a training set and a testing set, were released for this task [1], and participants were allowed to use any number of additional resources. Each entry in the dataset has the format q_no q_string q_class, referring to the question number, the code-mixed cross-script question string, and the class of the question, respectively. The training dataset contains a total of 330 questions tagged with the 9 question classes; the statistics of the training data per question class are given in Table 6.

Table 6: Statistics of Training Data Set
Sl. No.  Question Class           Total
1        Money as MNY             26
2        Temporal as TEMP         61
3        Distance as DIST         24
4        Location as LOC          26
5        Object as OBJ            21
6        Organization as ORG      67
7        Number as NUM            45
8        Person as PER            55
9        Miscellaneous as MISC    5

3.2 Results
For this task the NLP-NITMZ team submitted 3 system runs. Two of them are feature/keyword based and the third run is based on machine-learning features. For the first run, a different set of rules is identified for each question class.

3.2.1 Run-1
The first run was conducted using Method-1, which uses all the direct rules. In this run, questions are identified and tagged with a class based on these direct rules. We achieved a success rate of 74.44% using Method-1; the per-class performance is listed in Table 7.

Table 7: Success rates using rules of the Direct set
Class  Precision  Recall  F-1 Score
MNY    0.80       0.50    0.62
DIST   0.95       0.90    0.92
TEMP   1.00       0.96    0.98
LOC    0.86       0.85    0.84
ORG    0.47       0.75    0.57
NUM    0.72       1.00    0.85
PER    0.82       0.67    0.73
OBJ    0.50       0.20    0.29
MISC   0.17       0.13    0.14

3.2.2 Run-2
The Naive Bayes classifier is used for this run. After the model is trained on the training dataset, it is tested on the test data and the class labels are predicted as the classifier output. This run has an accuracy of 74.44%, the same as Run-1. Table 8 lists the precision, recall and F-1 score for each class label.

Table 8: Success rates using Naive Bayes Classifier
Class  Precision  Recall  F-1 Score
MNY    0.79       0.69    0.73
DIST   1.00       0.95    0.98
TEMP   0.68       0.96    0.80
LOC    0.86       0.85    0.84
ORG    0.47       0.71    0.57
NUM    0.99       1.00    0.96
PER    0.83       0.89    0.86
OBJ    0.75       0.30    0.45
MISC   0.17       0.125   0.14

3.2.3 Run-3
For this run the direct and dependent feature sets are used together to address the ambiguity issues among the question classes. The success rate achieved in this run, using Method-2, is 78.89%; the per-class accuracy is listed in Table 9.

Table 9: Success rates using Direct and Dependent rules
Class  Precision  Recall  F-1 Score
MNY    0.91       0.63    0.74
DIST   0.95       0.90    0.93
TEMP   1.00       0.92    0.96
LOC    0.77       0.87    0.82
ORG    0.88       0.58    0.70
NUM    0.81       1.00    0.90
PER    0.83       0.89    0.86
OBJ    0.38       0.30    0.33
MISC   0.17       0.25    0.20

3.3 Comparative Analysis
In this Subtask 1, a total of 20 system runs were submitted by 7 teams; on average 140 questions were tagged successfully by these teams, with an average of 40 unsuccessful tags. The IINTU team achieved the highest accuracy of 83.33%, while our team NLP-NITMZ reached a best accuracy of 78.89%. Among the 9 question classes, the 'DIST' class has the highest precision value of 0.9903, the 'NUM' class has the highest recall value of 0.9961, and the temporal class achieved the highest F-1 score of 0.9612.

4. CONCLUSIONS
We submitted 3 system runs, and the accuracies of our systems are 74.44%, 78.89%, and 74.44% respectively for the three methods. For this Subtask 1 of MSIR16, our system gave its best performance using Method-2, and we submitted the output of this method as Run-3. In this run the two types of features work together, and we named the two sets direct and dependent. Between these two sets, the dependent features mainly serve to tag the questions without ambiguity. We obtained the 7th and 9th positions with system runs 3 and 2, respectively.

5. ACKNOWLEDGMENTS
The work presented here was carried out under research project Grant No. YSS/2015/000988, supported by the Department of Science & Technology (DST) and the Science and Engineering Research Board (SERB), Govt. of India. The authors also acknowledge the Department of Computer Science & Engineering of the National Institute of Technology Mizoram, India for providing infrastructural facilities.

6. REFERENCES
[1] S. Banerjee, K. Chakma, S. K. Naskar, A. Das, P. Rosso, S. Bandyopadhyay, and M. Choudhury. Overview of the Mixed Script Information Retrieval (MSIR) at FIRE. In Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings. CEUR-WS.org, 2016.
[2] S. Banerjee, S. K. Naskar, P. Rosso, and S. Bandyopadhyay. The First Cross-Script Code-Mixed Question Answering Corpus. In Proceedings of the Workshop on Modeling, Learning and Mining for Cross/Multilinguality (MultiLingMine 2016), co-located with the 38th European Conference on Information Retrieval (ECIR), 2016.
[3] B. Gambäck and A. Das. Comparing the level of code-switching in corpora. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), May 2016.
[4] P. Gupta, K. Bali, R. E. Banchs, M. Choudhury, and P. Rosso. Query expansion for mixed-script information retrieval. In The 37th Annual ACM SIGIR Conference, SIGIR 2014, pages 677-686, Gold Coast, Australia, June 2014.
[5] X. Li and D. Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics (COLING 2002), Volume 1, pages 1-7. Association for Computational Linguistics, 2002.
[6] P. Pakray and G. Majumder. NLP-NITMZ: Part-of-Speech Tagging on Italian Social Media Text using Hidden Markov Model. In the Shared Task on PoSTWITA - POS Tagging for Italian Social Media Texts, EVALITA 2016 (accepted).