=Paper= {{Paper |id=Vol-1737/T3-7 |storemode=property |title= Amrita-CEN@MSIR-FIRE2016:Code-Mixed Question Classification using BoWs and RNN Embeddings |pdfUrl=https://ceur-ws.org/Vol-1737/T3-7.pdf |volume=Vol-1737 |authors=Anand Kumar M,Soman K P |dblpUrl=https://dblp.org/rec/conf/fire/MP16 }} == Amrita-CEN@MSIR-FIRE2016:Code-Mixed Question Classification using BoWs and RNN Embeddings == https://ceur-ws.org/Vol-1737/T3-7.pdf
Amrita_CEN@MSIR-FIRE2016: Code-Mixed Question Classification using BoWs and RNN Embeddings

Anand Kumar M (m_anandkumar@cb.amrita.edu) and Soman K P (kp_soman@amrita.edu)
Center for Computational Engineering and Networking (CEN)
Amrita School of Engineering, Coimbatore
Amrita Vishwa Vidyapeetham, Amrita University

ABSTRACT
Question classification is a key task in many question answering applications. Nearly all previous work on question classification has used machine learning and knowledge-based methods. This working note presents an embedding based Bag-of-Words method and a Recurrent Neural Network to achieve automatic question classification on code-mixed Bengali-English text. We build two systems that classify questions mostly at the sentence level. We used a recurrent neural network for extracting features from the questions and logistic regression for classification. We conducted experiments on the Mixed Script Information Retrieval (MSIR) Task 1 dataset at FIRE 2016¹. The experimental results show that the proposed method is appropriate for the question classification task.

CCS Concepts
• Information systems~Clustering and classification • Information systems~Question answering • Computing methodologies~Learning latent representations

Keywords
Question Classification; BoW; Recurrent Neural Nets; Embeddings; Code-Mixed Script

1. INTRODUCTION
Question answering systems can be viewed as an inevitable element of information retrieval systems, allowing users to ask questions in natural language text and receive brief answers. Earlier research has shown explicitly that the correct classification of questions to the expected answer type is necessary for any successful question answering system. Question classification is the task of automatically recognizing the answer type of a given query written in natural language text. For example, given the query "What is the Capital of India?", the task of a question classification system is to recognize the type "Location" for this question, because the expected answer to this query is a named entity of type "Location". Classification of queries is also treated as answer type prediction, since the type of the answer is predicted. Many existing question answering systems used manually built sets of rules to map a question to the correct type, which is language specific and not efficient to maintain and upgrade. Machine learning approaches are often used to identify the expected answer types. The motivation for using a Recurrent Neural Network (RNN) based embedding is that an RNN captures contextual information in a better way.

2. RELATED WORKS ON QUESTION CLASSIFICATION
Basically, there are two different methods commonly used in question classification: knowledge-based and machine learning based. There are also some combined approaches which connect rule-based and machine learning approaches (Huang et al., 2008; Silva et al., 2011; Ray et al., 2010) [1, 2, 7]. Rule-based methods classify the questions with hand-crafted rules (Hull, 1999; Prager et al., 1999) [3, 4]. However, these approaches suffer from too many rules (Li and Roth, 2004) [5] and only perform well on a particular dataset. Recent NLP research for Indian languages is moving towards social media content, which is informal and often code-mixed. Researchers have focused on developing conventional Natural Language Processing (NLP) applications for handling social media content. Standard shared tasks and workshops like FIRE and the ICON² Tools contest are giving preference to this new genre of text. The large-scale use of the code-mixed style on social media platforms motivates researchers to carry out this type of research in Indian languages. A significant amount of research is going on in social media text and code-mixed text; notable areas are language identification [8], question answering [11], POS tagging [15], polarity detection [13], and entity extraction for Indian languages [12, 14]. Barman et al. [9] presented the challenges of language identification in code-mixed text and claimed that code-mixing is common among users who are multilingual. Vyas et al. [15] discussed the efforts taken to POS tag social media content from English-Hindi code-mixed text while trying to address the complexities of code-mixing. The impact of code-mixing on the effectiveness of information retrieval has been discussed by Gupta et al. [16] in query expansion for mixed-script and code-mixed queries. Recently, Banerjee et al. (2015) [17, 18] formally introduced code-mixed cross-script question answering as a research problem. Banerjee et al. [19] explain, for the first time, the use of growing user-generated content to serve as an information collection source for the question answering task in a low-resource language, and describe their cross-script code-mixed question answering corpus.

1 http://fire.irsi.res.in/fire/2016/home
2 http://amitavadas.com/Code-Mixing.html
3. TASK DESCRIPTION
The code-mixed cross-script question classification is Subtask-1 of the shared task on Mixed Script Information Retrieval (MSIR³) at FIRE 2016 [23].

Let Q = {q1, q2, . . . , qn} be a set of factoid questions written in code-mixed Bengali-English text (Romanized Bengali along with English), and let T = {t1, t2, . . . , tm} be the set of question types. The task is to classify each given question q ∈ Q into one of the predefined coarse-grained question types t ∈ T. An example for the code-mixed question classification task is given below:

Question:       last volvo bus kokhon chare ?
                [When is the last Volvo bus..]
Question Type:  TEMPORAL

The number of queries, the total number of words, and the average number of words per query in the training and testing data are given in Table 1. In total, 9 different coarse-grained question types are used in this question classification task. The various question types and their corresponding frequencies in the training data are shown in Table 2; the table also gives the percentage of each question type in the training data. More than 65% of the training data set belongs to the 4 primary query types: Organization, Temporal, Person, and Number.

Table 1. MSIR Subtask-1 data facts

  Model      Queries   Total Words   Average Words
  Training     330        1756           5.321
  Testing      120         858           7.15

Table 2. Question types and their counts

  Types   Count   Percentage
  ORG      67       20.3
  TEMP     61       18.5
  PER      55       16.7
  NUM      45       13.6
  LOC      26        7.9
  MNY      26        7.9
  DIST     24        7.3
  OBJ      21        6.4
  MISC      5        1.5
  Total   330      100.0

3 https://msir2016.github.io/

4. QUESTION CLASSIFICATION FOR CODE-MIXED BENGALI-ENGLISH TEXT
We submitted two runs for question classification on code-mixed text. In the first run, we used the traditional BoW model with logistic regression; in order to apply the regression, we represent each word-type as a random vector of floating-point numbers using the categorical variable function in TensorFlow [10]. In the second run, we tried a Recurrent Neural Network embedding with logistic regression. Since the dataset is very small, the RNN based method trails the traditional methods. Even though the accuracy of the RNN based method is lower than that of the other methods, the performance of the RNN based embedding is significant given the very limited data. This is encouraging for applying RNNs to code-mixed NLP tasks.

4.1 Bag-of-Words Model for Question Classification (Run1)
We developed a question classification system with a BoW model using TensorFlow [10]. Here the maximum query length is fixed at 15 words and the embedding size at 50. Each word-type in the query is converted into a 50-dimensional vector. For the given 330 queries in the training set, we formed an input matrix of size 330 x 15; substituting the random word embedding (categorical word representation) for each word gives an input tensor of size 330 x 15 x 50. We then apply max pooling, choosing the maximum value across the maximum word length of 15. This reduces the tensor to a matrix of size 330 x 50, which is taken as the query embeddings and given to a logistic regression classifier with default parameters. Finally, we use the arg-max function to choose the best question type.
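To make Run1 concrete, here is a minimal sketch of the embed, max-pool, and classify steps described above. It is illustrative only: NumPy and scikit-learn stand in for the TensorFlow ops used in the paper, random vectors play the role of the categorical word representation, and all function names are our own.

```python
# Minimal sketch of the Run1 pipeline (assumed NumPy/scikit-learn stand-ins
# for the TensorFlow ops described in Section 4.1).
import numpy as np
from sklearn.linear_model import LogisticRegression

MAX_LEN, EMB_SIZE = 15, 50  # fixed maximum query length and embedding size

def build_embeddings(queries, seed=0):
    """Assign every word-type a random 50-dimensional vector."""
    rng = np.random.RandomState(seed)
    vocab = {w for q in queries for w in q.split()}
    return {w: rng.randn(EMB_SIZE) for w in vocab}

def query_embedding(query, emb):
    """Build the 15 x 50 word matrix, then max-pool over the word axis."""
    mat = np.zeros((MAX_LEN, EMB_SIZE))
    for i, word in enumerate(query.split()[:MAX_LEN]):
        mat[i] = emb.get(word, mat[i])  # zeros for out-of-vocabulary words
    return mat.max(axis=0)  # 50-dim query embedding

# queries: the 330 training questions; labels: their coarse-grained types
# emb = build_embeddings(queries)
# X = np.stack([query_embedding(q, emb) for q in queries])  # 330 x 50
# clf = LogisticRegression().fit(X, labels)  # default parameters
# clf.predict(...) then takes the arg-max over the per-type scores
```

Note that max pooling over the word axis discards word order, which is precisely what the RNN run below is meant to recover.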
4.2 Recurrent Neural Net based Question Classification System (Run2)
Recurrent Neural Networks (RNNs) are successful models that have shown prominent improvements in many NLP applications. The idea behind RNNs is to make use of sequential information [21]: to predict the subsequent word in a sentence, it helps to know which words appeared before it. RNNs are called recurrent because they carry out the same task for every element of a sequence, with the output depending on the previous computations.

In our second submission, we developed a Recurrent Neural Network based question classification system using TensorFlow [10]. We followed the same procedure as in Run1 for creating the input tensor of size 330 x 15 x 50. The initial 15 x 50 matrix embedding of each query is reduced to a 50-dimensional embedding vector by a Gated Recurrent Unit (GRU), a slight variation on the LSTM introduced by [22]. The resulting model is simpler than standard LSTM models and has been growing increasingly popular. Finally, we take the encoding of the last step and pass it as features to logistic regression for training.
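As a rough illustration of what the GRU computes at each time step, the sketch below implements the reset and update gates of [22] in NumPy and returns the last hidden state, which is the feature vector handed to logistic regression. The random weight initialisation and the particular gating convention shown here are illustrative assumptions, not the trained TensorFlow model.

```python
# Illustrative NumPy GRU (gating equations after [22]); weights are
# hypothetical random initial values, not trained parameters.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    def __init__(self, n_in=50, n_hid=50, seed=0):
        rng = np.random.RandomState(seed)
        def gate():
            return (0.1 * rng.randn(n_hid, n_in),   # input weights W
                    0.1 * rng.randn(n_hid, n_hid),  # recurrent weights U
                    np.zeros(n_hid))                # bias b
        self.Wz, self.Uz, self.bz = gate()  # update gate
        self.Wr, self.Ur, self.br = gate()  # reset gate
        self.Wh, self.Uh, self.bh = gate()  # candidate state
        self.n_hid = n_hid

    def step(self, x, h):
        z = sigmoid(self.Wz @ x + self.Uz @ h + self.bz)             # update gate
        r = sigmoid(self.Wr @ x + self.Ur @ h + self.br)             # reset gate
        h_cand = np.tanh(self.Wh @ x + self.Uh @ (r * h) + self.bh)  # candidate
        return (1.0 - z) * h + z * h_cand  # blend old state and candidate

def encode(word_matrix, cell):
    """Run the GRU over a 15 x 50 query matrix; the final hidden state
    serves as the classifier features for this query."""
    h = np.zeros(cell.n_hid)
    for x in word_matrix:  # one 50-dim word vector per time step
        h = cell.step(x, h)
    return h
```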
5. EXPERIMENTS AND RESULTS
In this section, the detailed cross-validation results and the accuracy figures released by the task organizers are elucidated.

5.1 Cross-validation Results
We randomly split the 330 queries of the training set into 281 and 49 queries, used as the training and development sets respectively. This data set was used for validating our methods over two parameters: embedding size and maximum query length. We varied the maximum query length over 10, 15, 20, 25, and 30, and used only two different embedding sizes, 50 and 100, for both the BoW and RNN based methods. Figure 1 compares the BoW and RNN based methods across the different query lengths and embedding sizes. We fixed the query length at 15 and the embedding size at 50 in our experiments.
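The validation protocol above amounts to a small grid search over the two parameters; a hedged sketch follows, where evaluate() is a hypothetical helper (not part of the paper's code) that trains one configuration on the 281 training queries and returns its accuracy on the 49 development queries.

```python
# Sketch of the cross-validation grid from Section 5.1; evaluate() is a
# hypothetical helper, not part of the paper's code.
import random

random.seed(1)
idx = list(range(330))
random.shuffle(idx)
train_idx, dev_idx = idx[:281], idx[281:]  # 281 train / 49 development

for model in ("BoW", "RNN"):
    for emb_size in (50, 100):
        for max_len in (10, 15, 20, 25, 30):
            acc = evaluate(model, max_len, emb_size, train_idx, dev_idx)
            print(f"{model}-{emb_size} len={max_len}: {acc:.3f}")
```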
[Figure 1. Cross-validated results with different query length and embedding size: development-set accuracy (roughly 0.65-0.9) of BoW-50, BoW-100, RNN-50, and RNN-100 for maximum query lengths 10, 15, 20, 25, and 30.]

Table 4. Top accuracy of each team, irrespective of run

  Team             Correct   Incorrect   Accuracy
  AmritaCEN          145        35       80.55556
  AMRITA-CEN-NLP     143        37       79.44444
  Anuj               146        34       81.11111
  BITS_PILANI        146        34       81.11111
  IINTU              150        30       83.33333
  IIT(ISM)D*         144        36       80
  NLP-NITMZ          142        38       78.88889
5.2 MSIR Subtask-1 Results
Here, the accuracy figures released by the task organizers are explained. The organizers evaluated the submitted systems based on accuracy; overall performance and in-depth accuracy per question type were also released [20]. The overall accuracy of our submissions is shown in Table 3, and the highest accuracies of the other teams are shown in Table 4. The IINTU team is placed first, followed by Anuj, BITS_PILANI, and our team (Amrita_CEN). Figure 2 shows the query types and their corresponding accuracies for our submissions. It is interesting to note that the RNN based model outperforms the BoW model on ORGANIZATION type questions, whose count is the highest in the training dataset. At the same time, for the OBJ and MISC types, which are fewer in count, the accuracies of the RNN based model are comparably low.

Table 3. Overall accuracy of our two submissions

  Runs        Run1 (BoW)   Run2 (RNN)
  Correct        145           133
  Incorrect       35            47
  Accuracy     80.55556     73.8888889

[Figure 2. Query types and accuracies: per-question-type accuracy (0-1) of runs R1 and R2.]

6. CONCLUSION
Question classification is an inevitable module in a question answering system. This working note presents a code-mixed question classification system using BoWs and RNN embeddings. To our knowledge, this is the first time that RNN embeddings have been applied to this question classification task. Since the training corpus is small and unsupervised code-mixed data is unavailable, the performance of the RNN based system trails the traditional BoWs method. Nevertheless, the performance of the RNN based embedding is not that poor, and it paves the way for future application to code-mixed script analysis. It is exciting to note that the RNN based model outperforms the BoWs model on ORGANIZATION type questions, whose occurrence is high in the training dataset. At the same time, for OBJ and MISC type queries, which are fewer in count, the accuracies of the RNN based model are comparably low. Finally, our team (Amrita_CEN) placed third in the overall performance.

7. REFERENCES
[1] Zhiheng Huang, Marcus Thint, and Zengchang Qin. Question classification using headwords and their hypernyms. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '08), pages 927–936, 2008.
[2] Joao Silva, Luísa Coheur, Ana Mendes, and Andreas Wichert. From symbolic to sub-symbolic information in question classification. Artificial Intelligence Review, 35(2):137–154, 2011.
[3] E. Voorhees. The TREC-8 Question Answering Track Report. In Proceedings of the 8th Text Retrieval Conference (TREC-8), pages 77–82, NIST, Gaithersburg, MD, 1999.
[4] John Prager, Dragomir Radev, Eric Brown, and Anni Coden. The use of predictive annotation for question answering in TREC-8. In NIST Special Publication 500-246: The Eighth Text Retrieval Conference (TREC), pages 399–411. NIST, 1999.
[5] Xin Li and Dan Roth. Learning question classifiers: The role of semantic information. In COLING, pages 556–562, 2004.
[6] Zhiheng Huang, Marcus Thint, and Asli Celikyilmaz. Investigation of question classifier in question answering. In EMNLP, pages 543–550, 2009.
[7] Santosh Kumar Ray, Shailendra Singh, and B. P. Joshi. A semantic approach for question classification using WordNet and Wikipedia. Pattern Recognition Letters, 31:1935–1943, 2010.
[8] Rahul Venkatesh Kumar, R.M., Anand Kumar, M., and Soman, K.P. AmritaCEN-NLP @ FIRE 2015: Language identification for Indian languages in social media text. CEUR Workshop Proceedings, 1587, pages 26–28, 2015.
[9] U. Barman, A. Das, J. Wagner, and J. Foster. Code Mixing: A Challenge for Language Identification in the Language of Social Media. In First Workshop on Computational Approaches to Code Switching, pages 21–3, 2014.
[10] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
[11] Khyathi Chandu Raghavi, Manoj Kumar Chinnakotla, and Manish Shrivastava. "Answer ka type kya he?": Learning to Classify Questions in Code-Mixed Language. In Proceedings of the 24th International Conference on World Wide Web (WWW '15 Companion), ACM, New York, NY, USA, pages 853–858, 2015.
[12] Devi, G.R., Veena, P.V., Kumar, M.A., and Soman, K.P. Entity extraction for Malayalam social media text using structured skip-gram based embedding features from unlabeled data. Procedia Computer Science, 93, pages 547–553, 2016.
[13] Nivedhitha, E., Sanjay, S.P., Anand Kumar, M., and Soman, K.P. Unsupervised word embedding based polarity detection for Tamil tweets. International Journal of Control Theory and Applications, 9(10), pages 4631–4638, 2016.
[14] Anand Kumar, M., Se, S., and Soman, K.P. AMRITA-CEN@FIRE 2015: Extracting entities for social media texts in Indian languages. CEUR Workshop Proceedings, 1587, pages 85–88, 2015.
[15] Y. Vyas, S. Gella, J. Sharma, K. Bali, and M. Choudhury. POS Tagging of English-Hindi Code-Mixed Social Media Content. In EMNLP 2014, pages 974–979, October 2014.
[16] P. Gupta, K. Bali, R. E. Banchs, M. Choudhury, and P. Rosso. Query Expansion for Mixed-Script Information Retrieval. In SIGIR '14, pages 677–686, ACM, 2014.
[17] Banerjee, S. and Bandyopadhyay, S. Ensemble Approach for Fine-Grained Question Classification in Bengali. In Proceedings of the 27th Pacific Asia Conference on Language, Information, and Computation (PACLIC), Taiwan, pages 75–84, 2013.
[18] Banerjee, S. and Bandyopadhyay, S. An Empirical Study of Combining Multiple Models in Bengali Question Classification. In Proceedings of the International Joint Conference on Natural Language Processing (IJCNLP), Japan, pages 892–896, 2013.
[19] Somnath Banerjee, Sudip Kumar Naskar, Paolo Rosso, and Sivaji Bandyopadhyay. The first cross-script code-mixed question answering corpus. In Modelling, Learning and Mining for Cross/Multilinguality Workshop, 38th European Conference on Information Retrieval (ECIR), pages 56–65, 2016.
[20] Somnath Banerjee, Sudip Naskar, Paolo Rosso, Sivaji Bandyopadhyay, Kunal Chakma, Amitava Das, and Monojit Choudhury. MSIR@FIRE: Overview of the Mixed Script Information Retrieval. In Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings, CEUR-WS.org, 2016.
[21] http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/
[22] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv:1409.0473 [cs.CL], September 2014.
[23] S. Banerjee, K. Chakma, S. K. Naskar, A. Das, P. Rosso, S. Bandyopadhyay, and M. Choudhury. Overview of the Mixed Script Information Retrieval at FIRE. In Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings, CEUR-WS.org, 2016.