=Paper=
{{Paper
|id=Vol-1737/T3-7
|storemode=property
|title= Amrita-CEN@MSIR-FIRE2016:Code-Mixed Question Classification using BoWs and RNN Embeddings
|pdfUrl=https://ceur-ws.org/Vol-1737/T3-7.pdf
|volume=Vol-1737
|authors=Anand Kumar M,Soman K P
|dblpUrl=https://dblp.org/rec/conf/fire/MP16
}}
== Amrita-CEN@MSIR-FIRE2016: Code-Mixed Question Classification using BoWs and RNN Embeddings ==
Anand Kumar M and Soman K P
Center for Computational Engineering and Networking (CEN)
Amrita School of Engineering, Coimbatore
Amrita Vishwa Vidyapeetham, Amrita University
m_anandkumar@cb.amrita.edu, kp_soman@amrita.edu

ABSTRACT

Question classification is a key task in many question answering applications. Nearly all previous work on question classification has used machine learning or knowledge-based methods. This working note presents an embedding-based Bag-of-Words method and a Recurrent Neural Network to achieve automatic question classification on code-mixed Bengali-English text. We build two systems that classify questions mostly at the sentence level. We used a recurrent neural network for extracting features from the questions and logistic regression for classification. We conducted experiments on the Mixed Script Information Retrieval (MSIR) Task 1 dataset at FIRE 2016 (http://fire.irsi.res.in/fire/2016/home). The experimental results show that the proposed method is appropriate for the question classification task.

CCS Concepts

• Information systems~Clustering and classification • Information systems~Question answering • Computing methodologies~Learning latent representations

Keywords

Question Classification; BoW; Recurrent Neural Nets; Embeddings; Code-Mixed Script

1. INTRODUCTION

Question answering systems can be viewed as an inevitable element of information retrieval systems, allowing users to ask questions in natural language and receive brief answers. Earlier research has shown explicitly that correctly classifying questions to the expected answer type is necessary for any successful question answering system. Question classification is the task of automatically recognizing the answer type of a given query written in natural language. For example, for the query "What is the Capital of India?", the task of a question classification system is to assign the type "Location", because the expected answer to this query is a named entity of type "Location". Classification of queries is also treated as answer-type prediction, since the type of the answer is predicted. Many existing question answering systems use manually built sets of rules to map a question to the correct type; such rules are language specific and not efficient to maintain and upgrade. Machine learning approaches are therefore often used to identify the expected answer types. The advantage of a Recurrent Neural Network (RNN) based embedding is that the RNN captures contextual information in a better way.

2. RELATED WORKS ON QUESTION CLASSIFICATION

Basically, there are two different methods commonly used in question classification: knowledge-based and machine-learning based. There are also some combined approaches which connect rule-based and machine learning approaches (Huang et al., 2008; Silva et al., 2011; Ray et al., 2010) [1, 2, 7]. Rule-based methods classify questions with hand-crafted rules (Hull, 1999; Prager et al., 1999) [3, 4]. However, these approaches suffer from requiring too many rules (Li and Roth, 2004) [5] and only perform well on a particular dataset.

Recent NLP research for Indian languages is moving towards social media content, which is informal and often code-mixed. Researchers have focused on adapting conventional Natural Language Processing (NLP) applications to handle social media content. Standard shared tasks and workshops like FIRE and the ICON tools contest (http://amitavadas.com/Code-Mixing.html) are giving preference to this new genre of text. The large-scale use of code-mixed style on social media platforms motivates researchers to carry out this type of research in Indian languages. A significant amount of research is going on in social media text and code-mixed text; notable areas are language identification [8], question answering [11], POS tagging [15], polarity detection [13], and entity extraction for Indian languages [12, 14]. Barman et al. [9] presented the challenges of language identification in code-mixed text and claimed that code-mixing is common among multilingual users. Vyas et al. [15] discussed efforts to POS tag social media content from English-Hindi code-mixed text while trying to address the complexities of code-mixing. The impact of code-mixing on the effectiveness of information retrieval has been discussed by Gupta et al. [16] in the context of query expansion for mixed-script and code-mixed queries. Recently, Banerjee et al. [17, 18] formally introduced code-mixed cross-script question answering as a research problem. Banerjee et al. [19] explain, for the first time, the use of growing user-generated content as an information collection source for the question answering task in a low-resource language, and describe their cross-script code-mixed question answering corpus.

3. TASK DESCRIPTION

Code-mixed cross-script question classification is Subtask-1 of the shared task on Mixed Script Information Retrieval (MSIR, https://msir2016.github.io/) at FIRE 2016 [23].

Let Q = {q1, q2, ..., qn} be a set of factoid questions written in code-mixed Bengali-English text (Romanized Bengali along with English), and let T = {t1, t2, ..., tn} be the set of question types. The task is to classify each given question q ∈ Q into one of the predefined coarse-grained question types t ∈ T. An example for the code-mixed question classification task:

Question: last volvo bus kokhon chare ? [When does the last Volvo bus leave?]
Question Type: TEMPORAL
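To make the formulation concrete, the q → t interface can be illustrated with a toy keyword-rule classifier. The keyword hints below are invented purely for illustration; the submitted systems (Section 4) learn this mapping from data instead of using rules.

```python
# Toy illustration of the Subtask-1 interface: map a code-mixed
# Bengali-English question to one of the nine coarse-grained types.
# The keyword hints are hypothetical and purely illustrative.

TYPES = ["ORG", "TEMP", "PER", "NUM", "LOC", "MNY", "DIST", "OBJ", "MISC"]

KEYWORD_HINTS = {
    "kokhon": "TEMP",  # "when"
    "kothay": "LOC",   # "where"
    "ke": "PER",       # "who"
    "koto": "NUM",     # "how many / how much"
}

def classify(question: str) -> str:
    """Return a coarse question type t in T for a question q in Q."""
    for token in question.lower().split():
        if token in KEYWORD_HINTS:
            return KEYWORD_HINTS[token]
    return "MISC"  # catch-all type

print(classify("last volvo bus kokhon chare ?"))  # prints "TEMP" (TEMPORAL)
```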
The number of queries, the total number of words, and the average words per query in the training and testing data are shown in Table 1. In total, 9 different coarse-grained question types are used in this classification task. The various question types and their corresponding frequencies in the training data, together with the percentage of each type, are shown in Table 2. More than 65% of the training set belongs to 4 primary query types: Organization, Temporal, Person, and Number.

Table 1. MSIR Subtask-1 data facts
  Model     Queries  Total Words  Average Words
  Training  330      1756         5.321
  Testing   120      858          7.15

Table 2. Question types and their counts
  Type   Count  Percentage
  ORG    67     20.3
  TEMP   61     18.5
  PER    55     16.7
  NUM    45     13.6
  LOC    26     7.9
  MNY    26     7.9
  DIST   24     7.3
  OBJ    21     6.4
  MISC   5      1.5
  Total  330    100

4. QUESTION CLASSIFICATION FOR CODE-MIXED BENGALI-ENGLISH TEXT

We submitted two runs for question classification on code-mixed text. In the first run, we used the traditional BoW model with logistic regression: in order to apply the regression, we represent each word-type as a random vector of floating-point numbers using the categorical variable function in TensorFlow [10]. In the second run, we tried a Recurrent Neural Network embedding with logistic regression. Since the dataset is very small, the RNN based method trails the traditional method; even though its accuracy is lower, the performance of the RNN based embedding is significant for such limited data, which is encouraging for applying RNNs to code-mixed NLP tasks.

4.1 Bag-of-Words Model for Question Classification (Run1)

We developed a question classification system with a BoW model using TensorFlow [10]. The maximum word length is fixed at 15 and the embedding size at 50, so each word-type in a query is converted into a 50-dimensional vector. For the 330 queries in the training set we form an input matrix of size 330 x 15; after substituting the random word embeddings (categorical word representation) for each word, the input tensor has size 330 x 15 x 50. We then use max pooling, choosing the maximum value across the maximum word length of 15. This reduces the tensor to a matrix of size 330 x 50, which is taken as the query embeddings and given to a logistic regression classifier with default parameters. Finally, we use the arg-max function to choose the best question type.

4.2 Recurrent Neural Net based Question Classification System (Run2)

Recurrent Neural Networks (RNNs) are successful models that have shown prominent improvements in many NLP applications. The idea behind RNNs is to make use of sequential information [21]: to predict the subsequent word in a sentence, it helps to know which words appeared before it. RNNs are called recurrent because they carry out the same task for every element of a sequence, with the output depending on the previous computations.

In our second submission, we developed a Recurrent Neural Network based question classification system using TensorFlow [10]. We followed the same procedure as Run1 for creating the input tensor of size 330 x 15 x 50. The initial 15 x 50 embedding matrix of each query is reduced to a 50-dimensional embedding vector by a Gated Recurrent Unit (GRU), a slight variation on the LSTM introduced by [22]. The resulting model is simpler than standard LSTM models and has been growing increasingly popular. Finally, we take the encoding of the last step and pass it as features to logistic regression for training.
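The two feature extractors can be sketched as follows. This is not the authors' TensorFlow code: it is a pure-Python stand-in with deliberately tiny sizes (MAX_LEN=5, EMB=4 instead of the paper's 15 and 50), and the GRU is simplified to a single toy scalar weight per gate in place of learned weight matrices.

```python
# Sketch of Run1 (max-pooled random embeddings) and Run2 (GRU last state).
# Pure-Python illustration only; sizes and weights are toy values.
import math
import random

MAX_LEN, EMB = 5, 4
random.seed(0)
_lookup = {}

def embed(word):
    """Random (categorical) word embedding, fixed per word-type."""
    if word not in _lookup:
        _lookup[word] = [random.uniform(-1.0, 1.0) for _ in range(EMB)]
    return _lookup[word]

def query_tensor(question):
    """MAX_LEN x EMB matrix: one embedding row per word, zero-padded."""
    words = question.lower().split()[:MAX_LEN]
    rows = [embed(w) for w in words]
    rows += [[0.0] * EMB] * (MAX_LEN - len(rows))
    return rows

def maxpool_features(rows):
    """Run1: element-wise max over word positions -> EMB-dim query embedding."""
    return [max(row[j] for row in rows) for j in range(EMB)]

def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_features(rows):
    """Run2 (simplified): run a GRU over the rows, return the last hidden state."""
    W = 0.1  # toy scalar standing in for learned weight matrices
    h = [0.0] * EMB
    for x in rows:
        z = [_sigmoid(W * (x[j] + h[j])) for j in range(EMB)]  # update gate
        r = [_sigmoid(W * (x[j] + h[j])) for j in range(EMB)]  # reset gate
        h_cand = [math.tanh(W * (x[j] + r[j] * h[j])) for j in range(EMB)]
        h = [(1.0 - z[j]) * h[j] + z[j] * h_cand[j] for j in range(EMB)]
    return h

question = "last volvo bus kokhon chare ?"
run1_feats = maxpool_features(query_tensor(question))
run2_feats = gru_features(query_tensor(question))
# Either feature vector is then handed to a logistic regression classifier.
```

In both runs the classifier input is a fixed-size vector per query, which is what lets a plain logistic regression be reused unchanged across the two feature extractors.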
5. EXPERIMENTS AND RESULTS

In this section, we present detailed cross-validation results and the accuracy reported by the task organizers.

5.1 Cross-validation Results

We randomly split the 330 queries of the training set into 281 and 49 queries, used as training and development sets respectively. This data set was used to validate our methods over two parameters: embedding size and maximum query length. We varied the maximum query length over 10, 15, 20, 25, and 30, and used two embedding sizes, 50 and 100. We tried both the BoW and RNN based methods for developing the code-mixed question classification system. Figure 1 compares the BoW and RNN based methods across these query lengths and embedding sizes. We fixed the query length at 15 and the embedding size at 50 in our experiments.

[Figure 1. Cross-validated results with different query length and embedding size; curves BoW-50, BoW-100, RNN-50, RNN-100 over query lengths 10-30, with accuracies roughly between 0.65 and 0.9.]

5.2 MSIR Sub Task-1 Results

Here we report the accuracy given by the task organizers, who evaluated the submitted systems based on accuracy. The overall performance and the in-depth accuracy per question type were also released by the organizers [20]. The overall accuracy of our submissions is shown in Table 3, and the highest accuracies of the other teams are shown in Table 4. The IINTU team placed first, followed by Anuj, BITS_PILANI, and our team (Amrita_CEN). Figure 2 shows the query types and their corresponding accuracies for our submissions.

Table 4. Top accuracies of each team, irrespective of run
  Team            Correct  Incorrect  Accuracy
  AmritaCEN       145      35         80.55556
  AMRITA-CEN-NLP  143      37         79.44444
  Anuj            146      34         81.11111
  BITS_PILANI     146      34         81.11111
  IINTU           150      30         83.33333
  IIT(ISM)D*      144      36         80
  NLP-NITMZ       142      38         78.88889

[Figure 2. Query types and accuracies: per-type accuracy bars for Run1 (R1) and Run2 (R2), on a 0-1 scale.]
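The validation protocol of Section 5.1 can be sketched as a hold-out split plus a parameter grid. The helper `train_and_score` below is a hypothetical stand-in for training either the BoW or the RNN system and returning development-set accuracy.

```python
# Sketch of the Section 5.1 sweep: a random 281/49 split of the 330
# training queries, then a grid over maximum query length and embedding
# size. `train_and_score` is a hypothetical placeholder.
import random

def holdout_split(queries, dev_size=49, seed=0):
    """Randomly split the training queries into train (281) and dev (49)."""
    rng = random.Random(seed)
    shuffled = queries[:]
    rng.shuffle(shuffled)
    return shuffled[dev_size:], shuffled[:dev_size]

def sweep(queries, train_and_score):
    """Evaluate every (max_query_length, embedding_size) combination."""
    train, dev = holdout_split(queries)
    results = {}
    for max_len in (10, 15, 20, 25, 30):
        for emb_size in (50, 100):
            results[(max_len, emb_size)] = train_and_score(train, dev, max_len, emb_size)
    return results

# Toy usage with a dummy scorer; the paper settles on max_len=15, emb=50.
queries = [f"q{i}" for i in range(330)]
scores = sweep(queries, lambda tr, dv, m, e: 0.0)
print(len(scores))  # 10 combinations
```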
Table 3. Overall accuracy of our two submissions
  Runs       Run1 (BoW)  Run2 (RNN)
  Correct    145         133
  Incorrect  35          47
  Accuracy   80.55556    73.88889

It is interesting to note that the RNN based model outperforms the BoW model on ORGANIZATION type questions, whose count is highest in the training dataset. At the same time, for the OBJ and MISC types, which have low counts, the accuracies of the RNN based model are comparably low.

6. CONCLUSION

Question classification is an inevitable module in a question answering system. This working note presented a code-mixed question classification system using BoWs and RNN embeddings. To our knowledge, this is the first time an RNN embedding has been applied to this question classification task. Since the training corpus is small and unsupervised code-mixed data is unavailable, the performance of the RNN based system trails the traditional BoWs method. Still, the performance of the RNN based embedding is not that poor, and it paves the way to apply it to code-mixed script analysis in future. It is exciting to note that the RNN based model outperforms the BoWs on ORGANIZATION type questions, whose occurrence is high in the training dataset, while for OBJ and MISC type queries, which are fewer in count, its accuracies are comparably low. Finally, our team (Amrita_CEN) placed third in overall performance.

7. REFERENCES

[1] Zhiheng Huang, Marcus Thint, and Zengchang Qin. Question classification using head words and their hypernyms. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '08), pages 927-936, 2008.
[2] Joao Silva, Luisa Coheur, Ana Mendes, and Andreas Wichert. From symbolic to sub-symbolic information in question classification. Artificial Intelligence Review, 35(2):137-154, 2011.
[3] E. Voorhees. The TREC-8 Question Answering Track Report. In Proceedings of the 8th Text Retrieval Conference (TREC-8), pages 77-82, NIST, Gaithersburg, MD, 1999.
[4] John Prager, Dragomir Radev, Eric Brown, and Anni Coden. The use of predictive annotation for question answering in TREC8. In NIST Special Publication 500-246: The Eighth Text Retrieval Conference (TREC), pages 399-411. NIST, 1999.
[5] Xin Li and Dan Roth. Learning question classifiers: The role of semantic information. In COLING, pages 556-562, 2004.
[6] Zhiheng Huang, Marcus Thint, and Asli Celikyilmaz. Investigation of question classifier in question answering. In EMNLP, pages 543-550, 2009.
[7] Santosh Kumar Ray, Shailendra Singh, and B. P. Joshi. A semantic approach for question classification using WordNet and Wikipedia. Pattern Recognition Letters, 31:1935-1943, 2010.
[8] Rahul Venkatesh Kumar, R.M., Anand Kumar, M., and Soman, K.P. AmritaCEN-NLP @ FIRE 2015: Language identification for Indian languages in social media text. CEUR Workshop Proceedings, 1587, pages 26-28, 2015.
[9] Barman, A. Das, J. Wagner, and J. Foster. Code Mixing: A Challenge for Language Identification in the Language of Social Media. In First Workshop on Computational Approaches to Code Switching, pages 21-3, 2014.
[10] Abadi, Martin, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
[11] Khyathi Chandu Raghavi, Manoj Kumar Chinnakotla, and Manish Shrivastava. "Answer ka type kya he?": Learning to Classify Questions in Code-Mixed Language. In Proceedings of the 24th International Conference on World Wide Web (WWW '15 Companion), ACM, New York, NY, USA, pages 853-858, 2015.
[12] Devi, G.R., Veena, P.V., Kumar, M.A., and Soman, K.P. Entity extraction for Malayalam social media text using structured skip-gram based embedding features from unlabeled data. Procedia Computer Science, 93, pages 547-553, 2016.
[13] Nivedhitha, E., Sanjay, S.P., Anand Kumar, M., and Soman, K.P. Unsupervised word embedding based polarity detection for Tamil tweets. International Journal of Control Theory and Applications, 9(10), pages 4631-4638, 2016.
[14] Anand Kumar, M., Se, S., and Soman, K.P. AMRITA-CEN@FIRE 2015: Extracting entities for social media texts in Indian languages. CEUR Workshop Proceedings, 1587, pages 85-88, 2015.
[15] Y. Vyas, S. Gella, J. Sharma, K. Bali, and M. Choudhury. POS Tagging of English-Hindi Code-Mixed Social Media Content. In EMNLP 2014, pages 974-979, October 2014.
[16] P. Gupta, K. Bali, R. E. Banchs, M. Choudhury, and P. Rosso. Query Expansion for Mixed-Script Information Retrieval. In SIGIR '14, pages 677-686, ACM, 2014.
[17] Banerjee, S., and Bandyopadhyay, S. Ensemble Approach for Fine-Grained Question Classification in Bengali. In Proceedings of the 27th Pacific Asia Conference on Language, Information, and Computation (PACLIC), Taiwan, pages 75-84, 2013.
[18] Banerjee, S., and Bandyopadhyay, S. An Empirical Study of Combining Multiple Models in Bengali Question Classification. In Proceedings of the International Joint Conference on Natural Language Processing (IJCNLP), Japan, pages 892-896, 2013.
[19] Somnath Banerjee, Sudip Kumar Naskar, Paolo Rosso, and Sivaji Bandyopadhyay. The first cross-script code-mixed question answering corpus. In Modelling, Learning and Mining for Cross/Multilinguality Workshop, 38th European Conference on Information Retrieval (ECIR), pages 56-65, 2016.
[20] Somnath Banerjee, Sudip Naskar, Paolo Rosso, Sivaji Bandyopadhyay, Kunal Chakma, Amitava Das, and Monojit Choudhury. MSIR@FIRE: Overview of the Mixed Script Information Retrieval. In Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings, CEUR-WS.org, 2016.
[21] http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/
[22] Bahdanau, Dzmitry, Cho, Kyunghyun, and Bengio, Yoshua. Neural machine translation by jointly learning to align and translate. arXiv:1409.0473 [cs.CL], September 2014.
[23] S. Banerjee, K. Chakma, S. K. Naskar, A. Das, P. Rosso, S. Bandyopadhyay, and M. Choudhury. Overview of the Mixed Script Information Retrieval at FIRE. In Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings, CEUR-WS.org, 2016.