NLP-NITMZ @ MSIR 2016 System for Code-Mixed Cross-Script Question Classification

Goutam Majumder
National Institute of Technology Mizoram
Deptt. of Computer Science & Engg.
Mizoram, India
goutam.nita@gmail.com

Partha Pakray
National Institute of Technology Mizoram
Deptt. of Computer Science & Engg.
Mizoram, India
parthapakray@gmail.com

ABSTRACT
This paper describes our approach to the Code-Mixed Cross-Script Question Classification task, Subtask 1 of MSIR 2016. MSIR is the Mixed Script Information Retrieval shared task held in conjunction with FIRE 2016, the 8th meeting of the Forum for Information Retrieval Evaluation. For this task, our team NLP-NITMZ submitted three system runs: i) using a direct feature set; ii) using the direct and dependent feature sets together; and iii) using a Naive Bayes classifier. The first system is our baseline; it relies on a direct feature set built from a group of keywords. When identifying question classes, this baseline suffers from ambiguity (one question may be tagged with multiple classes). To deal with this ambiguity we developed a second feature set, which we call the dependent feature set because its keywords work together with the direct feature set. The highest accuracy of our system is 78.89%, obtained with Method-2 and submitted as run-3. Our other two runs have the same accuracy of 74.44%.

Keywords
Natural Language Processing; Question Answering; Information Retrieval; Question Analysis

Table 1: Example of Question Classes
Class   Example Question
MNY     Airport theke Howrah Station volvo bus fare koto?
TEMP    Volvo bus howrah station jete koto time nei?
DIST    Airport theke howrah station distance koto?
LOC     Airport theke kothai jabar bus nei?
ORG     Prepaid taxi counter naam ki?
OBJ     Murshidabad kon nodir tire obosthito?
NUM     Hazarduari te koto dorja ache?
PER     Ke Hazarduari toiri kore?
MISC    Early morning journey hole kon service valo?

1. INTRODUCTION
Question Answering (QA) is concerned with building systems that can automatically answer questions posed by humans. QA is a common discipline within the fields of Information Retrieval (IR) and Natural Language Processing (NLP): a QA program queries a structured or unstructured database of knowledge or information and constructs an answer [6]. Current QA research deals with a wide range of question types such as fact-based, hypothetical, semantically constrained, and cross-lingual questions. The need for and importance of QA systems was first highlighted in 1999 by the first QA task at TREC-8 (Text REtrieval Conference), which revealed the need for sophisticated search engines that can retrieve the specific piece of information that may be considered the best possible answer to a user question. Nowadays such QA systems serve as a backbone for successful e-commerce businesses, where many frequently asked question (FAQ) files are generated from the most frequently asked user questions and their types [4].

Being a classic application of NLP, QA has practical applications in various domains such as education, health care, and personal assistance. QA is a retrieval task that is more challenging than common search-engine retrieval, because the purpose of QA is to find an accurate and concise answer to a question rather than just retrieving relevant documents that contain the answer [5].

In this paper we report our participation in Subtask 1, the code-mixed cross-script Question Classification task [2] of MSIR 2016 (Shared Task on Mixed Script Information Retrieval) [1]. The first step in understanding a question is question analysis. Question classification is an important part of question analysis: it detects the answer type of a question, which not only helps to filter out a wide range of candidate answers but also determines the answer selection strategy [3], [5].

In Subtask 1, two sets are given: Q = {q1, q2, ..., qn} and C = {c1, c2, ..., cn}, where Q is a set of factoid questions written in Romanized Bengali mixed with English and C is the set of question classes. The task is to classify each given question into one of the predefined coarse-grained classes. A total of 9 question classes are given for the classification task; an example of each question class with its tag is listed in Table 1.

The rest of the paper is organized as follows: Section 2 discusses the three methods in detail, Section 3 analyses the performance of the three systems and compares them with the other submitted systems, and Section 4 draws the conclusions of this report.
2. THE PROPOSED METHOD
Three methods were developed for MSIR16 Subtask 1 to identify the question classes. Two systems are based on feature sets, and the identification stage for a question depends on these features. Two feature sets are identified from the training dataset: we consider one set as direct and the other as dependent. For the third system we combine these two sets and build machine-learning features for a Naive Bayes classifier. Details of the three methods are discussed next.

Table 2: Features of MNY Class with Example
Feature   Question as MNY Class
charge    Semi-official guide koto taka charge nei?
daam      Fuchhka r koto daam Darjeeling e?
price     Darjeeling e momo r price koto?
dam       Chicken momo r dam koto Darjeeling e?
khoroch   Pykare te boating e koto khoroch hobe?
fee       Wasef Manzil er entry fee koto?
tax       Koto travel tax pore India border e?
pore      Koto travel tax pore India border e?
fare      Darjeeling e dedicated taxi fare koto?
taka      Digha te Veg meal koto taka?

Table 3: Features of DIST Class with Example
Feature   Question as DIST Class
distance  Kolkata theke bishnupur er distance koto?
duroto    Bangalore theke Ooty r by road duroto koto?
area      Ooty botanical garden er area koto hobe?
height    Susunia Pahar er height koto hobe?
dure      Puri Bhubaneshwar theke koto dure?
uchute    Ooty sea level theke koto uchute?
km        Kolkata theke bishnupur koto km?

Table 4: Features of ORG Class with Example
Feature   Question as ORG Class
ki        Prepaid taxi counter naam ki?
kara      World champion kara?
kon       kon team Ashes hereche?
ke        Ashes hereche ke?

2.1 Method-1 (using direct feature set)

1. MNY: To identify the 'MNY' class, 10 features/keywords are identified from the training dataset, and questions containing these keywords are tagged with the 'MNY' class. Table 2 lists all of these features with example questions. Along with these 10 features we also identified another keyword, koto, for tagging questions as 'MNY'. However, our analysis showed that if koto is taken as a direct feature for the 'MNY' tag, questions of other classes are also tagged as 'MNY', for example 'Shankarpur Digha theke koto dure?', which is a 'DIST' class question. Like koto, the taka feature alone is also unable to tag some 'MNY' questions, so we identified two other keywords as a dependent feature set, which is discussed in Section 2.2.

2. DIST: Seven keywords are identified as the direct feature set to tag the 'DIST' question class, and the same set of features is used in the second method. All the identified keywords carry the meaning of distance in Bengali or in English. For this class no keyword is found for the dependent set. Table 3 lists all the features with example questions.

3. TEMP: Questions containing any temporal unit such as somoi, time, month, year, etc. are tagged with the TEMP class. For the temporal question class eight keywords are identified; all of them are considered direct features and no dependent features are considered. Examples of the direct features are listed below:
• time–Koto time lage Bangalore theke Ooty by road?
• kobe–Kobe Jorbangla Temple build kora hoyechilo?
• kokhon–Shyamrai Temple toiri hoi kokhon?

4. LOC: To tag the location class only one direct feature, kothai, is identified; this keyword is also used in the second method. Examples for this class are given below:
• kothai ras mela hoi?
• train r jonno kothai advice nite hobe?

5. ORG: For the organization class four direct features are identified; these features with example questions are listed in Table 4. Among these four features, the ki feature is ambiguous with other question classes such as 'OBJ' and 'PER'. Example questions of other classes that contain the ki feature are listed below:
• OBJ–Ekhon ki museum hoye geche?
• PER–Rabindranath er babar naam ki?
The kon feature of the 'ORG' class is also ambiguous with other classes such as 'TEMP' and 'OBJ'. Questions of other classes with the kon feature are listed below:
• TEMP–Kon month e vasanta utsob hoi shantiniketan e?
• OBJ–Kon mountain er upor Ooty ache?
These ambiguity issues are addressed in Section 2.2 with the help of the dependent feature set.
6. NUM: Two direct features, koiti and koto, are identified to tag questions as 'NUM'. However, the koto keyword is ambiguous, and questions of the 'MNY' class end up tagged as 'NUM'. A dependent feature set is therefore identified and merged with the koto feature; this issue is discussed in Section 2.2. So in this method only the keyword koiti is considered as a feature to identify 'NUM' questions, for example:
• Bishnupur e koiti gate ache?
• Leie koiti wicket niyeche?

7. PER: To identify the 'PER' class five direct features are identified. Three of them, ke, kake, and ki, work with dependent feature sets, which are discussed in Section 2.2; the other two, kar and kader, are used as direct features. Example questions for the direct features are listed below:
• Kar wall e sri krishna er life dekte paoya jabe?
• Jagannath temple e kader dekha paben?

8. OBJ: Two direct rules are found, but these rules are not able to identify the questions of the 'OBJ' class reliably. The rules are as follows:
• Ekhon ki museum hoye geche?
• Hazarduari er opposit e kon masjid ache?
From these questions it is clear that the 'OBJ' class is ambiguous with the 'ORG' class, so these two features are used together with other dependent features (Section 2.2) for question classification.

9. MISC: If no rule is satisfied, the question is classified as the 'MISC' class.
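To make the rule flow above concrete, the following minimal Python sketch reads Method-1 as a first-match keyword lookup over the direct features of Tables 2-4. The keyword lists are abbreviated and the matching order is an illustrative assumption of ours, not the exact rule priority of the submitted run.

import re

# Direct feature keywords, abbreviated from Tables 2-4 and the items above.
# The dictionary order, used here as matching priority, is an assumption.
DIRECT_FEATURES = {
    "MNY":  ["charge", "daam", "price", "dam", "khoroch", "fee", "tax", "fare", "taka"],
    "DIST": ["distance", "duroto", "area", "height", "dure", "uchute", "km"],
    "TEMP": ["somoi", "time", "month", "year", "kobe", "kokhon"],
    "LOC":  ["kothai"],
    "ORG":  ["ki", "kara", "kon", "ke"],
    "NUM":  ["koiti"],
    "PER":  ["kar", "kader"],
}

def classify_direct(question):
    """Tag a question with the first class whose direct keyword it contains."""
    tokens = set(re.findall(r"\w+", question.lower()))
    for label, keywords in DIRECT_FEATURES.items():
        if tokens.intersection(keywords):
            return label
    return "MISC"  # no rule fired

print(classify_direct("Kolkata theke bishnupur koto km?"))  # -> DIST

With a first-match lookup like this, the ambiguities noted above (for example koto, ki, and kon) surface immediately, which is what motivates the dependent feature set of Method-2.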
2.2 Method-2 (using direct and dependent feature set)
The dependent feature set is identified to improve the efficiency of the first method. The second set is named dependent because some features of the direct set are unable to identify the questions on their own and work correctly only when a dependent feature is also present in the question.

1. MNY: To identify the MNY class, two dependent features, charge and koto, are identified along with the ten direct features. If a question contains either of these two features, the system also looks for the direct feature taka; otherwise the 'MNY' class is not considered for this question. Examples are listed below:
• koto–Digha te Veg meal koto taka?
• charge–Semi-official guide koto taka charge nei?

2. DIST: Same as Method-1, using the full set of direct features only.

3. TEMP: All direct features are used to tag the 'TEMP' class.

4. LOC: No dependent features; the same set of direct features as in Method-1 is used.

5. ORG: To handle the ambiguity issues of the 'ORG' class, three sets of dependent features are identified, which improves the system accuracy. The first feature set addresses the ambiguity with the 'OBJ' class by identifying a term such as museum, mondir, or mosque that qualifies a question as 'OBJ'. Examples of these features are as follows:
• museum–ekhon ki museum hoye geche?
• lake–murshidabad e ki lake ache?
Such questions are not identified as 'ORG'; instead they are forwarded to the other feature sets for prediction. In the second set, features are identified to handle the issues related to the 'PER' class, and the features are as follows:
• (*eche)–ke jiteche, ke hereche, ke hoyeche
• team–kon team Ashes hereche?
If a token in the question ends with the suffix eche and the question also has 'ORG' features, the question is classified as 'PER'. The third feature set deals with the ambiguity of the kon keyword of the direct feature set used in Method-1. In this set we explicitly identified words that denote an organization, such as shop, hotel, city, town, etc.; examples are listed below:
• shop–kon shop e tea kena jete pare?
• town–rat 9 PM kon town ghumiye pore?

6. NUM: The koto keyword of the direct feature set for the 'NUM' class is also ambiguous with other classes such as 'DIST', 'TEMP', and 'MNY'. So the direct features are not used here to tag questions as NUM directly; instead they are used to check whether the rules of those classes are present in the question. If they are, the question is not tagged; otherwise it is tagged with the 'NUM' class. An example of each ambiguity is listed in Table 5.

Table 5: Ambiguity Classes with NUM Class
Ambiguity Class  Question in Ambiguity Class
DIST             Kolkata theke bishnupur er distance koto?
TEMP             Bishnupur e jete bus e koto time lagbe?
MNY              Indian citizen der entry fee koto taka?

7. PER: We use two dependent features to predict the questions of the 'PER' class. In this method the dependent features also work with the direct features for question prediction. Example questions of the 'PER' class using direct and dependent features are listed below:
• Chilka lake jaber tour conduct ke kore?
• Woakes kake run out koreche?
• Bangladesh r leading T20 wicket-taker r naam ki?

8. OBJ: A dependent feature set is identified that contains all words qualifying as an object name, to handle the ambiguity of the 'OBJ' class with the 'ORG' class. These object names are combined with the two direct features ki and kon. Examples of such ambiguous questions are listed below:
• ekhon ki museum hoye geche?
• bengal r sobcheye boro mosjid ki?
• Nawab Wasef Ali Mirza r residence ki chilo?
From these dependent features it is clear that the direct features look for a token in the question that denotes an entity of object type; these entities are shown in bold face in the examples. For the kon direct feature, the same set of dependent features is used to identify the 'OBJ' class. Examples are as follows:
• Berhampore-Lalgola Road e kon mosjid ache?
• Murshidabad kon nodir tire obosthito?

9. MISC: Same as Method-1.
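As a small illustration of how a dependent keyword gates a direct one, the sketch below implements the 'MNY' rule from item 1 above: koto or charge only votes for 'MNY' when the direct keyword taka is also present. Function and variable names are ours; this is a sketch of the idea, not the submitted implementation.

import re

MNY_DEPENDENT = {"koto", "charge"}   # dependent keywords (item 1 above)
MNY_DIRECT = {"taka"}                # direct keyword that must co-occur

def is_mny(question):
    """Return True only if a dependent MNY keyword co-occurs with 'taka'."""
    tokens = set(re.findall(r"\w+", question.lower()))
    if tokens & MNY_DEPENDENT:
        return bool(tokens & MNY_DIRECT)
    return False

print(is_mny("Digha te Veg meal koto taka?"))       # True  -> tagged MNY
print(is_mny("Shankarpur Digha theke koto dure?"))  # False -> left for the DIST rules

The other dependent sets (the *eche suffix and the organization words used with kon) follow the same pattern: the ambiguous direct keyword is trusted only when the disambiguating token is present.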
2.3 Method-3 (Using Naive Bayes Classifier)
In this method a Naive Bayes classifier is used to train the model. For training, a feature matrix with the probable class tags is given as input to the Bayes classifier. One feature is considered for each question in the training set, and the last column of the feature matrix represents the question class. This feature matrix is generated using the sets of direct and dependent features of Methods 1 and 2.
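The paper does not state which Naive Bayes implementation was used, so the following sketch shows one possible realisation with scikit-learn's BernoulliNB over binary keyword-presence features. The vocabulary and the toy training examples are abbreviated illustrations, not the actual feature matrix of the submitted run.

import re
from sklearn.naive_bayes import BernoulliNB

# Combined direct and dependent keywords from Methods 1 and 2 (abbreviated).
VOCAB = ["taka", "koto", "km", "distance", "dure", "kothai",
         "ke", "ki", "kon", "koiti", "time", "kobe"]

def featurize(question):
    """Binary presence vector over the keyword vocabulary."""
    tokens = set(re.findall(r"\w+", question.lower()))
    return [1 if word in tokens else 0 for word in VOCAB]

# Toy training questions in the spirit of the released data (not the real corpus).
train_questions = [
    "Digha te Veg meal koto taka?",
    "Kolkata theke bishnupur er distance koto?",
    "Bishnupur e koiti gate ache?",
    "kothai ras mela hoi?",
]
train_classes = ["MNY", "DIST", "NUM", "LOC"]

model = BernoulliNB()
model.fit([featurize(q) for q in train_questions], train_classes)

test_question = "Airport theke howrah station distance koto?"
print(model.predict([featurize(test_question)]))  # -> ['DIST']

Replacing BernoulliNB with MultinomialNB or adding count features would be a natural variation; the paper only specifies that a Naive Bayes classifier over the Method-1 and Method-2 features was used.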
3. EXPERIMENT RESULTS

3.1 Data and Resources
Two datasets, a training set and a testing set, were released for this task [1], and participants were allowed to use any number of additional resources. Each entry in the dataset has the format q_no q_string q_class, referring to the question number, the code-mixed cross-script question string, and the class of the question, respectively. The training dataset contains a total of 330 questions tagged with the 9 question classes; the statistics of the training data per question class are given in Table 6.

Table 6: Statistics of Training Data Set
Sl. No.  Question Class           Total
1        Money as MNY             26
2        Temporal as TEMP         61
3        Distance as DIST         24
4        Location as LOC          26
5        Object as OBJ            21
6        Organization as ORG      67
7        Number as NUM            45
8        Person as PER            55
9        Miscellaneous as MISC    5

3.2 Results
For this task the NLP-NITMZ team submitted 3 system runs. Two of them are feature/keyword based and the third run is based on machine-learning features. For the first run, a different set of rules is identified for each question class.

3.2.1 Run-1
The first run was conducted using Method-1, which uses all the direct rules. In this run, questions are identified and tagged with a class based on these direct rules. We achieved a success rate of 74.44% using Method-1; the per-class performance is listed in Table 7.

Table 7: Success rates using rules of the Direct set
Class  Precision  Recall  F-1 Score
MNY    0.80       0.50    0.62
DIST   0.95       0.90    0.92
TEMP   1.00       0.96    0.98
LOC    0.86       0.85    0.84
ORG    0.47       0.75    0.57
NUM    0.72       1.00    0.85
PER    0.82       0.67    0.73
OBJ    0.50       0.20    0.29
MISC   0.17       0.13    0.14

3.2.2 Run-2
The Naive Bayes classifier is used for this run. After the model is trained on the training dataset, it is tested on the test data and the class labels are predicted as the classifier output. This run has an accuracy of 74.44%, the same as Run-1. Table 8 lists the precision, recall and F-1 score for each class label.

Table 8: Success rates using Naive Bayes Classifier
Class  Precision  Recall  F-1 Score
MNY    0.79       0.69    0.73
DIST   1.00       0.95    0.98
TEMP   0.68       0.96    0.80
LOC    0.86       0.85    0.84
ORG    0.47       0.71    0.57
NUM    0.99       1.00    0.96
PER    0.83       0.89    0.86
OBJ    0.75       0.30    0.45
MISC   0.17       0.125   0.14

3.2.3 Run-3
For this run the direct and dependent feature sets are used together to address the ambiguity issues among the question classes. The success rate achieved in this run, using Method-2, is 78.89%; the per-class accuracy is listed in Table 9.

Table 9: Success rates using Direct and Dependent rules
Class  Precision  Recall  F-1 Score
MNY    0.91       0.63    0.74
DIST   0.95       0.90    0.93
TEMP   1.00       0.92    0.96
LOC    0.77       0.87    0.82
ORG    0.88       0.58    0.70
NUM    0.81       1.00    0.90
PER    0.83       0.89    0.86
OBJ    0.38       0.30    0.33
MISC   0.17       0.25    0.20

3.3 Comparative Analysis
In this Subtask 1, a total of 20 system runs were submitted by 7 teams; on average 140 questions were tagged successfully by these teams, with an average of 40 unsuccessful tags. The IINTU team achieved the highest accuracy of 83.33%, while our team NLP-NITMZ reached a best accuracy of 78.89%. Among the 9 question classes, the 'DIST' class has the highest precision value of 0.9903, the 'NUM' class has the highest recall value of 0.9961, and the temporal class achieved the highest F-1 score of 0.9612.

4. CONCLUSIONS
We submitted 3 system runs, and the accuracies of our systems are 74.44%, 78.89%, and 74.44% respectively for the three methods. For this Subtask 1 of MSIR16, our system gave its best performance using Method-2, and we submitted the output of this method as Run-3. In this run the two types of features work together, and we named the two sets direct and dependent. Between these two sets, the dependent features mainly serve to tag the questions without ambiguity. We obtained the 7th and 9th positions with system runs 3 and 2, respectively.

5. ACKNOWLEDGMENTS
The work presented here was carried out under research project Grant No. YSS/2015/000988, supported by the Department of Science & Technology (DST) and the Science and Engineering Research Board (SERB), Govt. of India. The authors also acknowledge the Department of Computer Science & Engineering of the National Institute of Technology Mizoram, India for providing infrastructural facilities.

6. REFERENCES
[1] S. Banerjee, K. Chakma, S. K. Naskar, A. Das, P. Rosso, S. Bandyopadhyay, and M. Choudhury. Overview of the Mixed Script Information Retrieval (MSIR) at FIRE. In Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings. CEUR-WS.org, 2016.
[2] S. Banerjee, S. K. Naskar, P. Rosso, and S. Bandyopadhyay. The First Cross-Script Code-Mixed Question Answering Corpus. In Proceedings of the Workshop on Modeling, Learning and Mining for Cross/Multilinguality (MultiLingMine 2016), co-located with the 38th European Conference on Information Retrieval (ECIR), 2016.
[3] B. Gambäck and A. Das. Comparing the level of code-switching in corpora. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), May 2016.
[4] P. Gupta, K. Bali, R. E. Banchs, M. Choudhury, and P. Rosso. Query expansion for mixed-script information retrieval. In The 37th Annual ACM SIGIR Conference, SIGIR 2014, pages 677-686, Gold Coast, Australia, June 2014.
[5] X. Li and D. Roth. Learning question classifiers. In Proceedings of the 19th International Conference on Computational Linguistics (COLING 2002), Volume 1, pages 1-7. Association for Computational Linguistics, 2002.
[6] P. Pakray and G. Majumder. NLP-NITMZ: Part-of-Speech Tagging on Italian Social Media Text using Hidden Markov Model. In the Shared Task on PoSTWITA - POS Tagging for Italian Social Media Texts, EVALITA 2016 (accepted).