Modeling Classifier for Code Mixed Cross Script Questions

Rupal Bhargava(1), Shubham Khandelwal(2), Akshit Bhatia(3), Yashvardhan Sharma(4)
WiSoc Lab, Department of Computer Science
Birla Institute of Technology and Science, Pilani Campus, Pilani-333031
{rupal.bhargava1, f20131312, f20137223, yash4}@pilani.bits-pilani.ac.in

ABSTRACT
With the growth of the internet, the volume of social media text has been increasing day by day, and user generated content (such as tweets and blogs) in Indian languages is often written in the Roman script for various socio-cultural and technological reasons. A majority of these posts are multilingual in nature, and many involve code mixing, where lexical items and grammatical features from two languages appear in one sentence. In this multilingual scenario, code-mixed cross-script (i.e., non-native script) data gives rise to a new problem and poses serious challenges to automatic Question Answering (QA); question classification is therefore required as an important step towards QA. This paper proposes an approach to cross-script question classification, an important task of question analysis that detects the category of a question.

CCS Concepts
• Information systems → Information integration; Data analytics; Data mining; Social recommendation; Query representation; Query intent;

Keywords
Code Mixing, Code Switching, Question Classification, Machine Learning

1. INTRODUCTION
With the proliferation of social networks, large volumes of text are being generated daily. Traditional machine learning algorithms used for text analysis, such as Named Entity Recognition (NER), POS tagging, or parsing, are language dependent. These algorithms usually achieve their objective using co-occurrence patterns of features. Due to such language dependence, many studies have observed that a variety of problems related to social media text are hindered. One such problem is Question Answering (QA). Being a classic application of NLP, QA has practical applications in various domains such as education, health care, and personal assistance. QA is a retrieval task that is more challenging than common search, because the purpose of QA is to find an accurate and concise answer to a question rather than just retrieving relevant documents containing the answer [6].

Recently, Banerjee et al. [2] formally introduced the code-mixed cross-script QA problem. The first step of understanding a question is to perform question analysis. Question classification is an important task of question analysis which detects the answer type of the question. Question classification helps not only to filter out a wide range of candidate answers but also to determine answer selection strategies [6]. Furthermore, it has been observed that the performance of question classification has a significant influence on the overall performance of a QA system.

Subtask 1 of the shared task on Mixed Script Information Retrieval (MSIR) at FIRE 2016 addresses code-mixed cross-script question classification, where 'Q' represents a set of factoid questions written in Romanized Bengali along with English. The task is to classify each given question into one of the predefined coarse-grained classes. This paper proposes an algorithm for the question classification task defined by the MSIR, FIRE 2016 Subtask 1 organizers, using different machine learning algorithms.

The rest of the paper is organized as follows. Section 2 discusses related work. Section 3 presents an analysis of the dataset provided by the MSIR 2016 task organizers [1]. Section 4 explains the methodology used for the task, with flowcharts to illustrate the flow. Section 5 describes the algorithm proposed for question classification. Section 6 elaborates the evaluation, experimental results, and error analysis. Section 7 concludes the paper and presents future work.
2. RELATED WORK
Today, social media platforms are flooded with millions of posts every day on a wide range of topics, resulting in code mixing in multilingual countries like India. A lot of work was done at FIRE 2015 on language identification in cross-script information retrieval. Bhattu et al. [4] proposed a two-stage algorithm in which sentence-level n-gram classifiers were used in the first stage and word-level n-gram classifiers in the second. Bhargava et al. [3] proposed a hybrid approach to query labelling, generating character n-grams as features and using logistic regression for language labelling.

For question analysis of such data, question classification is performed to understand the question, which allows determining constraints that the question imposes on a possible answer. Zhang et al. [8] used bag of words and bag of n-grams as features and applied k-NN, SVM, and Naive Bayes to automate question classification, concluding that with surface text features SVM outperforms the other classifiers. Banerjee et al. [2] proposed a QA system that takes cross-script (non-native) code-mixed questions and provides a list of information responses to automate question answering; their work covered corpus acquisition from social media, unbiased question acquisition using a cloud-based service, corpus annotation, and an evaluation scheme suited to the corpus annotation. Li and Roth [6] studied question classification using semantic information, developing a hierarchical classifier guided by a layered semantic hierarchy of answer types.

3. DATA ANALYSIS
The training data provided [1] consists of 330 questions, each labelled with a specific coarse-grained question type class. In total there are 9 different question type classes in the data set. As shown in Figure 1, class types 'ORG' and 'TEMP' comprise the majority of the instances. Each class represents a particular type of question related to specific entities. Class type 'MNY' stands for money-related questions, whose instances contain words like 'fare' and 'price' and helping words like 'koto' (bn) and 'how much'. Class type 'PER' stands for person-related questions, mostly containing words like 'who' and 'whom', implying that the subject of the sentence is a person. Class type 'TEMP' covers time-related questions, mainly containing words like 'when' and 'at'. Class type 'OBJ' stands for entity/object, implying that the subject of the sentence is an entity, and mainly contains words like 'what' and 'kon'. Class type 'NUM' stands for numeric entity related questions, mainly involving words like 'how many' and 'koto'. Class type 'DIST' stands for distance and implies that the question is about the distance between places. Class type 'LOC' stands for location and therefore mainly contains words like 'where' and 'jabe'. Class type 'ORG' stands for organization and relates to questions centered on a particular organization, team, or any other group of people; these questions mainly contain words like 'which', 'what', and 'team'. Class type 'MISC' stands for miscellaneous; this class has the minimum representation in the data set and covers a variety of questions.

Figure 1: Class Distribution of Training Data Set

The entire data set consists of code-mixed sentences whose words belong to either Bengali or English. The data set does not contain any code mixing at the word level. There is no punctuation in the data set except the question mark (?), while a large number of named entities (belonging to both English and Bengali) are present in it.

4. PROPOSED TECHNIQUE
A word-level n-gram based approach is used to classify code-mixed cross-script question records (comprising words belonging to both English and Bengali) into nine coarse-grained question type classes. The proposed methodology involves a pipelined deployment of different techniques, as shown in Figure 2. The proposed technique can be divided into the following four phases:

1. Pre-processing
2. Named-entity recognition and removal
3. Translation
4. Classification

4.1 Pre-processing
Data is pre-processed for label separation and case conversion for the efficient application of the classifiers. The pre-processing techniques were deployed as follows:

1. Separation of class labels and training set entries: The data set comprises mixed-script question records labelled with a specific class type. The question entries and their respective class labels were segregated on the basis of the position of the question mark symbol in each entry. This segregation was done so that separate feature vectors of question records and class labels could be formed, as required by the classifiers.

2. Case conversion: For the purpose of normalization, all data entries were converted into lower case, by identifying upper-case letters and replacing them with their lower-case counterparts through manipulation of their ASCII codes.
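The following is a minimal sketch of these two pre-processing steps. The record format (class label following the question mark) and the labels shown are illustrative assumptions, not the official training file layout, and the function names are not the authors' actual code.

# Minimal pre-processing sketch: label separation and case conversion.
def label_separation(record):
    # Split a raw entry into (question, class label) at the question mark.
    question, _, label = record.partition("?")
    return question.strip() + "?", label.strip()

def case_conversion(text):
    # Lower-casing, the practical equivalent of the ASCII-based replacement described above.
    return text.lower()

raw_records = [
    "Hazarduari te koto dorja ache? NUM",           # label assumed for illustration
    "Hazarduari te how many dorja ache? NUM",
]

questions, labels = [], []
for record in raw_records:
    q, y = label_separation(record)
    questions.append(case_conversion(q))
    labels.append(y)

print(questions)   # ['hazarduari te koto dorja ache?', 'hazarduari te how many dorja ache?']
print(labels)      # ['NUM', 'NUM']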
4.2 Named-entity recognition and removal
The pre-processed data set contains entries with a large number of named entities. Named entities in a text refer to pre-defined categories such as names of persons, organizations, and locations, expressions of time, quantities, monetary values, percentages, etc. In English, named entities occur in a certain manner at certain positions according to the sentence structure, but in multilingual sentences the sentence structure varies considerably. Named entities are therefore identified using a dictionary-based approach. The dictionary used for NER mainly comprises entries from the FIRE 2015 Subtask 1 data set [5], which contains entries belonging to both Bengali and English. For the purpose of classifying the question records into one of the class types, these named entities are irrelevant, as they do not contribute to the question structure used for class-type determination, and hence their removal was mandatory.
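A small sketch of such dictionary-based removal is shown below. The lexicon stands in for the entity list drawn from the FIRE 2015 Subtask 1 data [5]; the entries shown are placeholders, not the actual corpus.

# Dictionary-based named-entity removal (sketch).
named_entity_lexicon = {"hazarduari", "kolkata", "murshidabad"}   # placeholder entries

def remove_named_entities(question, lexicon):
    # Keep only tokens that do not appear in the named-entity lexicon.
    kept = [tok for tok in question.split() if tok.strip("?") not in lexicon]
    return " ".join(kept)

print(remove_named_entities("hazarduari te koto dorja ache?", named_entity_lexicon))
# -> "te koto dorja ache?"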
4.3 Translation
After the first two phases, the remaining Bengali words are transliterated into their native script and then translated to their English counterparts using the Google translation API(1). This step creates a monolingual, single-script data set from the mixed-script data set provided, so that the classifiers can be applied efficiently. Using this approach, different code-mixed cross-script variants of the same question record (each using a different combination of Bengali and English words) are translated and thereby standardized to a single question record in English. For example, the records "Hazarduari te koto dorja ache?" and "Hazarduari te how many dorja ache?" refer to the same question but use different combinations of words; standardizing both to the English translation leads to an increase in accuracy.

(1) https://translate.google.com/
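The sketch below illustrates only the standardization effect of this phase. The real system transliterates the remaining Bengali tokens and translates them with the Google translation API(1); the toy lookup table here merely stands in for that service, and its translations are illustrative.

# Standardization sketch: a toy lookup table in place of the translation service.
toy_bn_to_en = {"te": "in", "koto": "how many", "dorja": "doors", "ache": "are there"}

def standardize(question, lookup):
    # Replace each token by its English counterpart when one is known.
    tokens = question.rstrip("?").split()
    return " ".join(lookup.get(tok, tok) for tok in tokens) + "?"

# Two code-mixed variants of the same question collapse to one English record.
print(standardize("te koto dorja ache?", toy_bn_to_en))
print(standardize("te how many dorja ache?", toy_bn_to_en))
# both print: "in how many doors are there?"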
4.4 Classification
The proposed approach takes the data set obtained from the translation phase and uses n-grams to form a feature vector for each record. A word-level implementation of n-grams is followed, with n varied in the range 2 to 4, to generate a feature vector for each question record in the training set. The transposed matrix of these feature vectors, along with the numerically encoded class label matrix, is then used as input to the classifiers [7]. For the three runs, the following classifiers are used:

1. Gaussian Naive Bayes classifier
2. Logistic Regression classifier
3. Random Forest classifier with random state = 1

Figure 2: Block Diagram for Proposed Technique

5. ALGORITHM
Algorithm 1 summarizes the proposed technique. The data set comprising mixed-script question records along with their respective class labels is used as input. This input is pre-processed by separating the class labels from the data entries (function Label_Separation()) and converting the records to lower case (function Case_Conversion()). Named entities (NE) are removed from this pre-processed data (function NE_Removal()). The remaining Bengali words in the data set are then translated to their English equivalents using the Google Translation API(1) (function Translation()). The technique of n-grams is then applied to this data set to form the corresponding feature vectors. First, a vectorizer that converts the textual entries into a matrix of n-gram token counts (word-level n-grams with n in the range 2 to 4) is created (function Count_Vectorizer()). This vectorizer is used to generate a callable (via build_analyzer()) which produces n-gram tokens when applied to each data set row (callable Analyzer()), and the word-level n-grams corresponding to each row are appended (function append()) to the n-gram list. Feature vectors for the data entries are generated from this n-gram list (function Create_Feature_Vector()). The class labels used for training are numerically encoded (function Encode_Class()) and the corresponding feature vectors for these class labels are generated. These two sets of feature vectors are then used as inputs to the classifier (function Classifier()), where Classifier() can be instantiated as GaussianNB, LogisticRegression, or RandomForestClassifier. The trained classifier is finally used to predict the class labels, which are produced as output.

Algorithm 1: Question classification for code-mixed cross-script questions
1: Input: Mixed-script (bn+en) question records S, training class labels T, test feature matrix Matrix_Test
2: Output: Predicted class labels P
3: Initialization: P = [], n_grams = []
4: for i = 0 to S.length do
5:   Label_Separation(S[i])
6:   Case_Conversion(S[i])
7:   NE_Removal(S[i])
8:   Translation(S[i])
9: end for
10: Vectorizer = Count_Vectorizer(ngram_range=(2,4))
11: Analyzer = Vectorizer.build_analyzer()
12: for i = 0 to S.length do
13:   row = Analyzer(S[i])
14:   for j = 0 to row.length do
15:     n_grams.append(row[j])
16:   end for
17: end for
18: Matrix_Data = Create_Feature_Vector(n_grams)
19: Class_List = Encode_Class(T)
20: Matrix_Class = Create_Feature_Vector(Class_List)
21: clf = Classifier(Matrix_Data, Matrix_Class)
22: clf.fit(Matrix_Data, Matrix_Class)
23: P = clf.predict(Matrix_Test)
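A short scikit-learn [7] sketch of this classification phase is given below. The training and test questions are placeholders; in the actual runs they are the pre-processed, entity-stripped, translated records described above.

# Sketch of the classification phase with scikit-learn, mirroring Algorithm 1.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

train_questions = ["in how many doors are there?",        # placeholder records
                   "which team won the series?"]
train_labels = ["NUM", "ORG"]                              # placeholder labels
test_questions = ["how many doors are there in it?"]

# Word-level n-grams with n varied from 2 to 4, as in Section 4.4.
vectorizer = CountVectorizer(analyzer="word", ngram_range=(2, 4))
X_train = vectorizer.fit_transform(train_questions).toarray()   # dense input for GaussianNB
X_test = vectorizer.transform(test_questions).toarray()

encoder = LabelEncoder()
y_train = encoder.fit_transform(train_labels)

# One classifier per run: GaussianNB (run 1), LogisticRegression (run 2),
# RandomForestClassifier with random_state=1 (run 3).
for clf in (GaussianNB(), LogisticRegression(), RandomForestClassifier(random_state=1)):
    clf.fit(X_train, y_train)
    predicted = encoder.inverse_transform(clf.predict(X_test))
    print(type(clf).__name__, predicted)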
6. EXPERIMENTS
MSIR, FIRE 2016 Subtask 1 involved classifying mixed-script (Bengali and English) questions into the nine coarse-grained question type classes discussed in Section 3. The training data set comprised 330 records (with class labels) and was used to classify a test data set of 180 mixed-script question records. In total, seven teams from different institutes of the country participated, each submitting three different classification approaches; the results are shown in Figure 3. The approach proposed in this paper used machine learning for classification, and three runs were submitted that differed in the classifier used (Gaussian Naive Bayes, Logistic Regression, and Random Forest). The Gaussian Naive Bayes classifier obtained an accuracy of 81.12%, Logistic Regression 80%, and Random Forest 72.78%. Detailed results, analysis, and comparisons are discussed below.

6.1 Evaluation and Discussion
The MSIR, FIRE 2016 Subtask 1 organizers evaluated the results, giving a comparison of the accuracy achieved by the 7 participating teams, as shown in Figure 3. The proposed approach (team BITS PILANI) was ranked 2nd with an accuracy of 81.12% for run 1, while the highest accuracy achieved was 83.34% (by team IINTU). The choice of the Gaussian Naive Bayes classifier leads to the maximum accuracy, as the proposed algorithm deals with a problem involving continuous attributes. Naive Bayes yields simple and highly scalable models that are fast and scale linearly with the number of predictors and rows, and building a Naive Bayes model is highly parallelizable, even at the level of scoring. It was also observed from the results that the proposed algorithm produced the highest F-measure scores for the Organization (ORG), Money (MNY), and Miscellaneous (MISC) classes.

Figure 3: Highest accuracy achieved by the 7 teams that participated in MSIR, FIRE 2016

Figure 4 shows the comparison of the F-measure scores obtained by the teams for the class Organization. The proposed algorithm (team BITS PILANI) achieved the highest score of 0.74418 using the Gaussian Naive Bayes approach. This implies that questions relating to a particular organization, mainly framed with words like "which" and "what", could be classified efficiently by this approach. These scores can also be attributed to the fact that ORG had the largest number of instances in the data set (67 out of 330, as discussed in Section 3). In addition, the proposed algorithm forms word-level n-grams, through which words and phrases like "which", "team", "series", and "sponsor" become associated, which might have contributed to the increase in the scores.

Figure 4: Comparison of F-Measure for Organization and Money class among different teams

Figure 4 also shows the comparison of the F-measure scores obtained by the teams for the class Money. The proposed algorithm (team BITS PILANI) achieved the highest score of 1 using Logistic Regression as the classifier (run 2). Hence all the questions relating to money, framed with words like "how much", "price", and "fare", could be classified efficiently by the proposed approach. These high F-scores can be attributed to the word-level n-gram technique, which links words like "fare", "how", "much", and "price" and thus might have contributed to the increase in accuracy.

The evaluated results also showed that only two teams (team BITS PILANI and team NLP-NITMZ) were able to identify instances of the Miscellaneous (MISC) class. This can be attributed to the fact that there were only 5 MISC instances out of 330 in the training data set. The proposed approach (team BITS PILANI) obtained the highest score of 0.2 using the Gaussian Naive Bayes classifier, which again reflects the simple GaussianNB model combined with the word-level n-gram features.

Figure 5: Comparison of F-Measure among different teams for various classes

Figure 5 shows a comparison of the F-measure obtained (taking the best of the three runs for each team) for classifying each of the nine classes. As evident from the figure, the proposed approach (team BITS PILANI) obtained satisfactory results in identifying the correct class labels, particularly for the MISC, ORG, MNY, NUM, and OBJ classes, with an F-measure score of 1 obtained for the class Money. Table 1 shows the precision, recall, and F-measure scores for each of the nine classes, as evaluated by the FIRE 2016 task organizers [1], for the proposed algorithm (implemented by team BITS PILANI) for the three runs submitted.

Table 1: Class wise score for all the runs submitted
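Class-wise precision, recall, and F-measure of the kind reported in Table 1 can be reproduced locally with scikit-learn, as sketched below; the gold and predicted label lists are made up for illustration, since the official scores were computed by the task organizers [1].

# Per-class precision/recall/F-measure sketch (labels are illustrative only).
from sklearn.metrics import classification_report

gold_labels      = ["ORG", "MNY", "NUM", "MISC", "ORG", "LOC"]
predicted_labels = ["ORG", "MNY", "NUM", "ORG",  "ORG", "LOC"]

print(classification_report(gold_labels, predicted_labels, zero_division=0))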
6.2 Error Analysis
There are a few phases at which the proposed approach could have contributed to the misclassification of some records. The approach uses a dictionary-based method for named entity recognition, and the corpus used had only a limited number of entries, so some entities might not have been recognized and removed. The data set also contains many named entities that refer to the same name but have similar yet different spellings; for instance, the words "masjid" and "mosjid" both refer to the same word meaning "mosque" but are spelled differently. Since the proposed approach uses a corpus for NER, such entities could not be removed unless all spellings of these words were added to the corpus.

The proposed approach also uses a translation system (the Google API(1)) to translate Bengali words into English, but since the translation system does not consider the semantics of the sentence in which a word is used, a particular Bengali word may have been translated incorrectly. The given data set also did not have a uniform distribution of class instances: as shown in Figure 1, only 1.51% of the instances belong to the MISC class while the ORG class comprises 20% of the entries, so the trained model could be biased. As mentioned before, most teams could not identify even a single MISC instance in the test data set, and even the proposed system obtained an F-measure of only 0.2 for this class because of the small number of instances.

7. CONCLUSIONS AND FUTURE WORK
In this paper, a word-level n-gram based approach for classifying code-mixed cross-script question records into nine coarse-grained question type classes has been presented for Subtask 1 of MSIR, FIRE 2016. The presented approach uses a pipeline of stages to classify questions with various machine learning algorithms (Gaussian Naive Bayes, Logistic Regression, and Random Forest). The proposed approach obtained its highest accuracy of 81.12% using the Gaussian Naive Bayes classifier among the three runs submitted. Future work could include improving the dictionaries used for named-entity recognition for Bengali and English; different named entity recognizers and taggers, along with trained NER models, could be deployed. It would also be interesting to explore how implicit features of the code-mixed cross-script data could be learned efficiently using deep learning algorithms. Machine learning based models for language identification, along with appropriate transliteration and translation tools that take the correct semantics into account, could also be improved further.
References
[1] S. Banerjee, K. Chakma, S. K. Naskar, A. Das, P. Rosso, S. Bandyopadhyay, and M. Choudhury. Overview of the Mixed Script Information Retrieval (MSIR) at FIRE. In Working notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings. CEUR-WS.org, 2016.
[2] S. Banerjee, S. K. Naskar, P. Rosso, and S. Bandyopadhyay. The first cross-script code-mixed question answering corpus. In First Workshop on Modeling, Learning and Mining for Cross/Multilinguality (MultiLingMine 2016), co-located with the 38th European Conference on Information Retrieval (ECIR 2016), volume 1589, pages 56-65, 2016.
[3] R. Bhargava, Y. Sharma, S. Sharma, and A. Baid. Query labelling for Indic languages using a hybrid approach. In Working notes of FIRE 2015 - Forum for Information Retrieval Evaluation, Gandhinagar, India, December, 2015, volume 1587 of CEUR Workshop Proceedings, pages 40-42. CEUR-WS.org, 2015.
[4] S. N. Bhattu and V. Ravi. Language identification in mixed script social media text. In Working notes of FIRE 2015 - Forum for Information Retrieval Evaluation, Gandhinagar, India, December, 2015, volume 1587 of CEUR Workshop Proceedings, pages 37-39. CEUR-WS.org, 2015.
[5] M. Choudhury, P. Gupta, P. Rosso, S. Kumar, S. Banerjee, S. K. Naskar, S. Bandyopadhyay, G. Chittaranjan, A. Das, and K. Chakma. Overview of FIRE-2015 shared task on mixed script information retrieval. In Working notes of FIRE 2015 - Forum for Information Retrieval Evaluation, Gandhinagar, India, December, 2015, pages 19-25. CEUR-WS.org, 2015.
[6] X. Li and D. Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics - Volume 1, pages 1-7. Association for Computational Linguistics, 2002.
[7] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825-2830, 2011.
[8] D. Zhang and W. S. Lee. Question classification using support vector machines. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 26-32. ACM, 2003.