=Paper=
{{Paper
|id=Vol-1737/T3-6
|storemode=property
|title= Ensemble Classifier based approach for Code-Mixed Cross-Script Question Classification
|pdfUrl=https://ceur-ws.org/Vol-1737/T3-6.pdf
|volume=Vol-1737
|authors=Debjyoti Bhattacharjee,Paheli Bhattacharya
|dblpUrl=https://dblp.org/rec/conf/fire/BhattacharjeeB16
}}
== Ensemble Classifier based approach for Code-Mixed Cross-Script Question Classification==
Ensemble Classifier based approach for Code-Mixed Cross-Script Question Classification [Team IINTU]

Debjyoti Bhattacharjee, School of Computer Science and Engineering, Nanyang Technological University, Singapore (debjyoti001@ntu.edu.sg)
Paheli Bhattacharya, Dept. of Computer Science and Engineering, Indian Institute of Technology Kharagpur, West Bengal, India (paheli@iitkgp.ac.in)

ABSTRACT

With the increasing popularity of social media, people post updates that aid other users in finding answers to their questions. Most user-generated data on social media is in code-mixed or multi-script form, where words are represented phonetically in a non-native script. We address the problem of question classification on social-media data. We propose an ensemble-classifier-based approach to question classification when the questions are written in mixed script, specifically the Roman script for the Bengali language. We separately train Random Forest, One-vs-Rest and k-NN classifiers and then build an ensemble classifier that combines the best of the three. We achieve an accuracy of approximately 82%, suggesting that the method works well for the task.

CCS Concepts

• Information systems → Question answering; • Computing methodologies → Machine learning; Cross-validation

Keywords

Mixed Script Information Retrieval; Question Answering System; Question Classification
1. INTRODUCTION

With the increase in popularity of the Web, users from all over the world now opt to write in their native language instead of English. A large number of South and South-East Asian languages are written in transliterated form (phonetically representing words in a non-native script) using the Roman script; such texts are said to be written in Mixed-Script. Since there are font-encoding issues in using the original script (for example, Devanagari for Hindi), people tend to transliterate, i.e. phonetically represent the words of the original language using the Roman script. To define Mixed Script Information Retrieval formally [6], we consider a set of natural languages L = {l1, l2, ..., ln} and a set of scripts S = {s1, s2, ..., sn} such that si is the native script for language li. Given a word w, we represent it as a two-tuple ⟨li, sj⟩ to denote that w is in language li and written using script sj. When i = j, the word is written in its native script; otherwise it has been transliterated into another script sj. In practice, textual content that mixes words from various languages or scripts or both is called Multi-Script (MS) or Code-Mixed. For instance, "Kharagpur theke Howrah cab fare koto?" (gloss: What is the cab fare from Kharagpur to Howrah?) has words from a single script (Roman) but from two different languages: English ("cab", "fare") and Bengali ("theke", "koto"), where the Bengali words have been transliterated into the Roman script. Intuitively, this is a very easy form of writing, so people who are not as well-versed in English as in their native language tend to use it for conversing on social media. With the rise in popularity of social media, people constantly post updates from their daily lives, ranging from (but not limited to) sports and score updates, travel updates, food, hotel, transport and movie reviews, and user feedback for customer-support systems through tweets and blogs.

Although Question Answering (QA) is a well-addressed research problem with systems providing reasonable accuracy, QA on social-media text in mixed script is challenging mainly because there is no standardization of spellings for words written in a non-native script. For instance, the Bengali word "ekhon" (meaning "now") may have multiple spellings: "akhan", "ekhon", "ekhan", "akon", etc. Categorizing a question into a specific set of classes and then dealing with each class separately is an efficient method for QA systems; this is called Question Classification. Question classification helps reduce the number of candidate answers and can also be used to determine answer-selection strategies effectively [8]. In this paper, we deal with the problem of question classification for multi-script, or code-mixed, data. We experimented with three machine-learning classifiers (Random Forest, One-vs-Rest and k-NN) and then built an ensemble of these classifiers to achieve a higher accuracy.
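The ⟨language, script⟩ representation defined above can be illustrated on the example question. This is a small sketch of the definition, not code from the paper; the language labels for the place names "Kharagpur" and "Howrah" are an assumption for illustration.

```python
# Each word w as a (word, language, script) triple per the formal
# definition: w is <l_i, s_j>, language l_i written in script s_j.
# Every word here uses the Roman script, so the Bengali words are
# transliterated (their native script would be the Bengali script).
question = [
    ("Kharagpur", "Bengali", "Roman"),  # place name (label assumed)
    ("theke",     "Bengali", "Roman"),  # transliterated Bengali
    ("Howrah",    "Bengali", "Roman"),  # place name (label assumed)
    ("cab",       "English", "Roman"),  # English in its native script
    ("fare",      "English", "Roman"),
    ("koto",      "Bengali", "Roman"),  # transliterated Bengali
]

# Code-mixed text: a single script but more than one language.
languages = {lang for _, lang, _ in question}
scripts = {s for _, _, s in question}
print(len(languages) > 1 and len(scripts) == 1)  # True
```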
2. RELATED WORK

Jamatia et al. [7] experiment with code-mixed English-Hindi social-media text for part-of-speech tagging, using both coarse- and fine-grained tagsets. They apply four machine-learning algorithms (Conditional Random Fields, Sequential Minimal Optimization, Naïve Bayes and Random Forests) and report the highest accuracy with the Random Forest based classifier. Information retrieval on multi-script data has also been studied [6]. Recent work on question classification includes a machine-learning approach [8] in which a hierarchical classifier first classifies the question into coarse-grained classes and then into fine-grained classes; the feature space consists of primitive features such as POS tags, chunks and named entities, as well as complex features such as conjunctive n-gram features and relational features. Question-answering corpus acquisition using social-media content, with question acquisition involving humans, has been reported in [2].
In FIRE 2015, the Transliterated Search track introduced three subtasks: language labelling of words in code-mixed text fragments; ad-hoc retrieval of Hindi film lyrics, movie reviews and astrology documents; and transliterated question answering, where the documents as well as the questions were in Bangla script or Roman-transliterated Bangla [4].

3. TASK DESCRIPTION

Question Answering systems are a classic application of natural language processing, where the retrieval task is to find a concise and accurate answer to a given question. Question classification is one of the subtasks of a QA system, required to determine the type of the answer corresponding to a question.

The Code-Mixed Cross-Script Question Classification task can be described as follows. Given a question Q written in Romanized Bengali, which can contain English words and phrases, and a set C = {c1, c2, ..., cn} of question classes, the task is to classify the question Q into one of these predefined classes.

Example:
Question: airport theke howrah station distance koto ?
Question Class: DIST

3.1 Dataset description

The training dataset consists of 330 questions, each assigned to a single question class. There are 9 question classes in all; the number of questions per class is shown in Table 1. The minimum and maximum number of words in a question are 2 and 11 respectively, while each question has 5.3 words on average.

Table 1: Dataset classes and #Q per class

Class | #Q
DIST  | 24
LOC   | 26
MISC  | 5
MNY   | 26
NUM   | 45
OBJ   | 21
ORG   | 67
PER   | 55
TEMP  | 61
Total | 330
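As a quick sanity check on Table 1 (a small sketch using only the per-class counts reported above), the counts can be verified to sum to the stated 330 questions, and the class imbalance is visible: ORG is the largest class while MISC has only 5 questions.

```python
# Per-class question counts from Table 1.
class_counts = {
    "DIST": 24, "LOC": 26, "MISC": 5, "MNY": 26, "NUM": 45,
    "OBJ": 21, "ORG": 67, "PER": 55, "TEMP": 61,
}

total = sum(class_counts.values())
largest = max(class_counts, key=class_counts.get)

print(total)    # 330, matching the stated dataset size
print(largest)  # ORG, the most frequent class (67 questions)
```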
4. PROPOSED APPROACH

To classify the questions into the specified classes, we created a vector representation of each question, which is used as input to the classifiers. We considered the top 2000 most frequently occurring words in the supplied training dataset as features, so each question is represented as a 2000-element binary vector: element ei = 1 if the ith most frequent word is present in the question, and 0 otherwise.

We used three separate classifiers, namely a Random Forest (RF), a One-vs-Rest (OvR) classifier and a k-Nearest Neighbour (k-NN) classifier, and then built an ensemble classifier from these three for the classification task.

In k-NN classification, a sample is classified by a majority vote of its neighbours: the object is assigned to the most common class among its k nearest neighbours (k > 0, k ∈ I). k-NN is a lazy learning method which defers computation until classification is performed, and is one of the simplest classifiers.

The One-vs-Rest strategy fits one classifier per class, with each classifier trained for its class against all the others. The approach allows gaining information about each class by inspecting the classifier trained for that class. In OvR, each classifier is trained on the entire data set, while in RF, samples drawn from the original data set are used for training.

A Random Forest is an ensemble learning method that can be used for classification [3]. It fits a number of decision trees on various sub-samples of the dataset, with the samples drawn from the original dataset with or without replacement. Random Forests overcome the tendency of decision trees to overfit their training set [5].

Using the above three classifiers, we built an ensemble classifier (EC). The ensemble classifier takes the output label of each individual classifier and emits the majority label; if there is no majority, one of the predicted labels is chosen at random. Each individual classifier is trained on a subset of the original training dataset, obtained by sampling with replacement. In the following section, we describe the implementation details of the classifiers and the results obtained.
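The binary bag-of-words representation described in this section can be sketched as follows. This is a minimal illustration rather than the authors' code; whitespace tokenization and lowercasing are assumptions.

```python
from collections import Counter

def build_vocab(questions, k=2000):
    """Top-k most frequent words across the training questions."""
    counts = Counter(w for q in questions for w in q.lower().split())
    return [w for w, _ in counts.most_common(k)]

def vectorize(question, vocab):
    """Binary vector: element i is 1 iff the i-th most frequent
    word of the training data occurs in the question, else 0."""
    words = set(question.lower().split())
    return [1 if w in words else 0 for w in vocab]
```

With a vocabulary built from the training questions, a question such as "airport theke howrah station distance koto ?" becomes a fixed-length 0/1 vector suitable as input to any of the three classifiers.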
5. EXPERIMENTAL SETUP AND RESULTS

We implemented the proposed approach in Python 3 and used the scikit-learn toolkit for the classifiers. The following instantiations, available in scikit-learn, were used for the first three classifiers; we implemented the ensemble classifier ourselves.

rf = RandomForestClassifier(n_estimators=100)
ovr = OneVsRestClassifier(LinearSVC(random_state=0))
clf = neighbors.KNeighborsClassifier(30, weights='uniform')

We split the labelled data set into two parts: a training set (90%) and a validation set (10%). The RF classifier performed best, followed by EC, OvR and k-NN in decreasing order of classification accuracy. Thereafter, we used these trained classifiers to classify the test data set. During classification, we marked the samples for which all four classifiers predicted the same label, and used these samples, in addition to the original labelled data set, to retrain the classifiers.

The results on the test data set for the classifiers RF, EC and OvR were submitted as final runs and are summarized in Table 2 and Table 3. The classification results of the k-NN classifier were not submitted as a run, hence the accuracy of its results is not available.

Table 2: Results of individual classes

Classifier | Class | I  | IC | P    | R    | F-1
EC         | PER   | 24 | 20 | 0.83 | 0.74 | 0.78
RF         | PER   | 25 | 21 | 0.84 | 0.77 | 0.80
OvR        | PER   | 23 | 19 | 0.82 | 0.70 | 0.76
EC         | LOC   | 26 | 21 | 0.80 | 0.91 | 0.85
RF         | LOC   | 26 | 22 | 0.84 | 0.95 | 0.89
OvR        | LOC   | 26 | 21 | 0.80 | 0.91 | 0.85
EC         | ORG   | 36 | 19 | 0.52 | 0.79 | 0.63
RF         | ORG   | 34 | 19 | 0.55 | 0.79 | 0.65
OvR        | ORG   | 40 | 19 | 0.47 | 0.79 | 0.59
EC         | NUM   | 30 | 26 | 0.86 | 1    | 0.92
RF         | NUM   | 29 | 26 | 0.89 | 1    | 0.94
OvR        | NUM   | 29 | 26 | 0.89 | 1    | 0.94
EC         | TEMP  | 25 | 25 | 1    | 1    | 1
RF         | TEMP  | 25 | 25 | 1    | 1    | 1
OvR        | TEMP  | 25 | 25 | 1    | 1    | 1
EC         | MNY   | 16 | 13 | 0.81 | 0.81 | 0.81
RF         | MNY   | 16 | 13 | 0.81 | 0.81 | 0.81
OvR        | MNY   | 12 | 12 | 1    | 0.75 | 0.85
EC         | DIST  | 20 | 20 | 1    | 0.95 | 0.97
RF         | DIST  | 20 | 20 | 1    | 0.95 | 0.97
OvR        | DIST  | 22 | 21 | 0.95 | 1    | 0.97
EC         | OBJ   | 3  | 3  | 1    | 0.3  | 0.46
RF         | OBJ   | 5  | 4  | 0.8  | 0.4  | 0.53
OvR        | OBJ   | 3  | 3  | 1    | 0.3  | 0.46
EC         | MISC  | 0  | 0  | NA   | NA   | NA
RF         | MISC  | 0  | 0  | NA   | NA   | NA
OvR        | MISC  | 0  | 0  | NA   | NA   | NA

Table 3: Overall Results

Classifier | Correct | Incorrect | Accuracy (%)
EC         | 147     | 33        | 81.66
RF         | 150     | 30        | 83.33
OvR        | 146     | 34        | 81.11
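The majority-vote rule of the ensemble classifier (EC) can be sketched as follows. This is a minimal sketch of the rule as described in Section 4 (majority label, with ties broken by a random choice among the predicted labels), not the authors' implementation.

```python
import random
from collections import Counter

def ensemble_predict(labels, rng=random):
    """Majority label among the individual classifiers' predictions.

    When no single label has the strictly highest count, one of the
    tied top labels is chosen at random, as described in the paper.
    """
    counts = Counter(labels).most_common()
    top_count = counts[0][1]
    best = [label for label, c in counts if c == top_count]
    return best[0] if len(best) == 1 else rng.choice(best)
```

For example, predictions ["TEMP", "TEMP", "LOC"] from the three base classifiers yield "TEMP", while three mutually different predictions yield one of them at random.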
6. CONCLUSION AND FUTURE WORK

In this paper we have addressed the problem of question classification for Bengali-English code-mixed social-media data. We experimented with three machine-learning classifiers (Random Forest, One-vs-Rest and k-NN) and then built an ensemble of these classifiers to achieve the best results. The method scales to other code-mixed languages mainly because it does not perform any language- or script-based feature engineering. We would like to experiment with other multi-script data where more than two languages are mixed, and we aim to apply other machine-learning algorithms with more linguistic and syntactic features.

7. REFERENCES

[1] S. Banerjee, K. Chakma, S. K. Naskar, A. Das, P. Rosso, S. Bandyopadhyay, and M. Choudhury. Overview of the Mixed Script Information Retrieval (MSIR) at FIRE. In Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings. CEUR-WS.org, 2016.
[2] S. Banerjee, S. K. Naskar, P. Rosso, and S. Bandyopadhyay. The first cross-script code-mixed question answering corpus. In Modelling, Learning and Mining for Cross/Multilinguality Workshop, 38th European Conference on Information Retrieval (ECIR), pages 56-65, 2016.
[3] Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1-127, 2009.
[4] M. Choudhury, P. Gupta, P. Rosso, S. Kumar, S. Banerjee, S. K. Naskar, S. Bandyopadhyay, G. Chittaranjan, A. Das, and K. Chakma. Overview of FIRE-2015 shared task on mixed script information retrieval.
[5] J. Friedman, T. Hastie, and R. Tibshirani. The Elements of Statistical Learning, volume 1. Springer Series in Statistics, Springer, Berlin, 2001.
[6] P. Gupta, K. Bali, R. E. Banchs, M. Choudhury, and P. Rosso. Query expansion for mixed-script information retrieval. In The 37th Annual ACM SIGIR Conference, pages 677-686, 2014.
[7] A. Jamatia, B. Gambäck, and A. Das. Part-of-speech tagging for code-mixed English-Hindi Twitter and Facebook chat messages. In 10th Recent Advances in Natural Language Processing (RANLP), pages 239-248, 2015.
[8] X. Li and D. Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics (COLING), Volume 1, pages 1-7. Association for Computational Linguistics, 2002.