<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Ensemble Classifier based approach for Code-Mixed Cross-Script Question Classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Debjyoti Bhattacharjee</string-name>
          <email>debjyoti001@ntu.edu.sg</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paheli Bhattacharya</string-name>
          <email>paheli@iitkgp.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dept. of Computer Science and Engineering, Indian Institute of Technology Kharagpur</institution>
          ,
          <addr-line>West Bengal</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Computer Science and Engineering, Nanyang Technological University</institution>
          ,
          <country country="SG">Singapore</country>
        </aff>
      </contrib-group>
      <issue>30</issue>
      <abstract>
<p>With the increasing popularity of social-media, people post updates that aid other users in finding answers to their questions. Most of the user-generated data on social-media are in code-mixed or multi-script form, where words are represented phonetically in a non-native script. We address the problem of Question Classification on social-media data. We propose an ensemble classifier based approach towards question classification when the questions are written in mixed script, specifically, the Roman script for the Bengali language. We separately train Random Forests, One-vs-Rest and k-NN classifiers and then build an ensemble classifier that combines the best from the three worlds. We achieve an accuracy of approximately 82%, suggesting that the method works well for the task.</p>
        <p>CCS Concepts: Information systems → Question answering; Computing methodologies → Machine learning; Cross-validation.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        With the increase in popularity of the Web, users from
all over the world now opt to write in their native language
instead of English. A large number of South and South-East
Asian languages are written in a transliterated form
(phonetically representing words in a non-native script)
using the Roman script. Such texts are said to be written in
Mixed Script. Since there are font-encoding issues in using
the original script (for example, Devanagari for Hindi),
people tend to transliterate, or phonetically represent, the words
of the original language using the Roman script. To define
Mixed Script Information Retrieval formally [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], we consider
a set of natural languages L = {l1, l2, ..., ln} and a set of
scripts S = {s1, s2, ..., sn} such that si is the native script
for the language li. Given a word w, we represent it as a
two-tuple ⟨li, sj⟩ to imply that w is in language li and written
using script sj. When i = j, we say that the word is
written in its native script; otherwise, it has been transliterated into
another script sj. In practice, when textual content is a
mixture of words from various languages or scripts or both,
it is called Multi-Script (MS) or Code-Mixed. For instance,
"Kharagpur theke Howrah cab fare koto?" (gloss: What is
the cab fare from Kharagpur to Howrah?) has words from
a single script (Roman) but from two different languages:
English (cab, fare) and Bengali (theke, koto). The Bengali
words have been transliterated into the Roman script.
Intuitively, this is a very easy form of writing, and people not as
well versed in English as in their native language tend to use
it for conversing on social media. Although Question Answering
(QA) is a well-addressed research problem, with systems
providing reasonable accuracy, QA on social-media text in
mixed script is challenging mainly because there is no
standardization of spellings for words written in a non-native
script. For instance, the Bengali word "ekhon" (meaning,
now) may have multiple spellings: "akhan", "ekhon",
"ekhan", "akon", etc.
      </p>
    </sec>
    <sec id="sec-2">
      <title>RELATED WORK</title>
      <p>
        Jamatia et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] experiments with code-mixed
EnglishHindi social-media text for Part-of-Speech tagging. They
use both coarse and ne-grained tagsets for the task. Four
machine learning algorithms Conditional Random Fields,
Sequential Minimal Optimization, Nave Bayes and
Random Forests, reporting highest accuracy with Random
Forest based classi er. Information Retrieval on Multi-Script
data has also been looked into [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Recent works on question
classi cation include a machine learning based approach [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
towards question class. A hierarchial classi er is rst used
to classify the question into coarse-grained classes and then
into ne-grained classes. The feature space consisted of
primitive ones like pos tags, chunks, named entities and
also complex features such as conjunctive n-gram features
and relational features. Question-Answering corpus
acquisition using social-media content and question acquisition
with human involvement have been reported in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In FIRE
2015, the Transliterated Search track introduced three
subtasks | language labelling of words in code-mixed text
fragments, ad-hoc retrieval of Hindi lm lyrics, movie reviews
and astrology documents and transliterated question
answering where the documents as well as questions were in
Bangla script or Roman transliterated Bangla [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>TASK DESCRIPTION</title>
      <p>Question Answering systems are a classic application of
natural language processing, where the retrieval task has
to find a concise and accurate answer to a given question.
Question classification is one of the subtasks of a QA system,
required to determine the type of the answer corresponding
to a question.</p>
      <p>The Code-Mixed Cross-Script Question Classification task
can be described as follows. Given a question Q written in
Romanized Bengali, which can contain English words and
phrases, and a set C = {c1, c2, ..., cn} of question classes, the
task is to classify the question Q into one of these predefined
classes.</p>
      <p>Example:
Question: airport theke howrah station distance koto ?
Question Class: DIST</p>
    </sec>
    <sec id="sec-4">
      <title>Dataset description</title>
      <p>The training dataset consists of 330 questions, and each
question is assigned to a single question class. There are 9
question classes in all, and the number of questions in each
class is shown in Table 1. The minimum and maximum
numbers of words in a question are 2 and 11 respectively, while
each question has 5.3 words on average.</p>
    </sec>
    <sec id="sec-5">
      <title>PROPOSED APPROACH</title>
      <p>To build a classifier that assigns the questions to the
specified classes, we created a vector representation of each
question, which is used as input to the classifier. We
considered the top 2000 most frequently occurring words in the
supplied training dataset as features. Each question is
represented as a 2000-element binary vector: element ei = 1
if the i-th most frequent word is present in the question,
and 0 otherwise.</p>
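      <p>The featurization above can be sketched as follows; the whitespace tokenizer, lowercasing and variable names are illustrative assumptions rather than details from the paper.

```python
from collections import Counter

def build_vocabulary(questions, size=2000):
    # Top `size` most frequently occurring words in the training questions.
    counts = Counter(word for q in questions for word in q.lower().split())
    return [word for word, _ in counts.most_common(size)]

def vectorize(question, vocabulary):
    # Binary vector: element e_i = 1 iff the i-th most frequent word occurs.
    words = set(question.lower().split())
    return [1 if w in words else 0 for w in vocabulary]

train = ["airport theke howrah station distance koto ?",
         "kharagpur theke howrah cab fare koto ?"]
vocab = build_vocabulary(train)
vec = vectorize("howrah theke airport fare koto ?", vocab)
```

A question is thus mapped to a fixed-length 0/1 vector over the training vocabulary; words outside the top-2000 list are simply ignored.</p>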
      <p>We used three separate classifiers, namely Random Forests
(RF), a One-vs-Rest (OvR) classifier and a k-Nearest Neighbour
(k-NN) classifier, and then built an ensemble classifier
from these three for the classification task.</p>
      <p>In k-NN classification, a sample is classified by a majority
vote of its neighbours, with the object being assigned to the
most common class among its k nearest neighbours (k &gt; 0,
k an integer). k-NN classification is a lazy learning method
which defers computation until the classification is performed.
k-NN is one of the simplest classifiers.</p>
      <p>The One-vs-Rest strategy fits one classifier per class.
Each classifier is trained against all the other classes. The
approach allows gaining information about each class by
inspecting the classifier trained for that class. In OvR, each
classifier is trained with the entire dataset, while in RF, samples
drawn from the original dataset are used for training.</p>
      <p>
        A Random Forest is an ensemble learning method that
can be used for classification [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. A Random Forest fits a number of
decision trees on various sub-samples of the dataset, with
the samples drawn from the original dataset with or
without replacement. Random Forests overcome the problem of
decision trees overfitting their training set [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>Using the above three classifiers, we built an ensemble
classifier (EC). The ensemble classifier takes the output label
of each individual classifier and gives the majority
label as output; if there is no majority, one of the predicted
labels is chosen at random as output. Each of the individual
classifiers is trained on a subset of the original training dataset,
obtained by sampling with replacement.</p>
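      <p>A minimal sketch of the majority vote with random tie-breaking and the bootstrap sampling described above; the function names are ours, not the paper's.

```python
import random
from collections import Counter

def ensemble_predict(labels, rng=random):
    # Majority label among the individual classifiers' outputs; if no
    # single label wins, one of the tied labels is chosen at random.
    counts = Counter(labels)
    best = max(counts.values())
    tied = [label for label, c in counts.items() if c == best]
    return tied[0] if len(tied) == 1 else rng.choice(tied)

def bootstrap_sample(X, y, rng=random):
    # Draw len(X) training instances with replacement for one classifier.
    idx = [rng.randrange(len(X)) for _ in range(len(X))]
    return [X[i] for i in idx], [y[i] for i in idx]
```

For example, ensemble_predict(["DIST", "DIST", "MNY"]) returns "DIST", while a three-way tie falls back to a random choice among the tied labels.</p>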
      <p>In the following section, we describe the implementation
details of the classifiers and the obtained results.</p>
    </sec>
    <sec id="sec-6">
      <title>EXPERIMENTAL SETUP AND RESULTS</title>
      <p>We implemented the proposed approach using Python 3
and used the scikit-learn toolkit for the classifiers. The
following instantiations, available in scikit-learn, were used
for the three individual classifiers; we implemented the
ensemble classifier ourselves.
rf = RandomForestClassifier(n_estimators=100)
ovr = OneVsRestClassifier(LinearSVC(random_state=0))
clf = neighbors.KNeighborsClassifier(30, weights='uniform')</p>
      <p>We split the labelled dataset into two parts: a
training set (90%) and a validation set (10%). The RF classifier
performed best, followed by EC, OvR and k-NN in
decreasing order of classification accuracy. Thereafter, we used
these trained classifiers to classify the test dataset.
During classification, we marked the samples for which all the
classifiers predicted the same label. We used these samples,
in addition to the original labelled dataset, for retraining
the classifiers.</p>
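      <p>This retraining step can be sketched as a simple self-training loop; the stand-in classifier below only mimics the scikit-learn fit/predict interface, since the actual models used are RF, OvR and k-NN from scikit-learn.

```python
class _ConstantClassifier:
    # Illustrative stand-in with a scikit-learn-like fit/predict interface.
    def __init__(self, label):
        self.label = label
        self.n_fit = 0
    def fit(self, X, y):
        self.n_fit = len(X)
    def predict(self, X):
        return [self.label] * len(X)

def retrain_with_agreed(classifiers, X_train, y_train, X_test):
    # Add every test sample on which all classifiers predict the same
    # label to the training set, then refit each classifier.
    all_preds = [clf.predict(X_test) for clf in classifiers]
    X_aug, y_aug = list(X_train), list(y_train)
    for i, x in enumerate(X_test):
        labels = {preds[i] for preds in all_preds}
        if len(labels) == 1:  # unanimous prediction
            X_aug.append(x)
            y_aug.append(labels.pop())
    for clf in classifiers:
        clf.fit(X_aug, y_aug)

clfs = [_ConstantClassifier("DIST") for _ in range(3)]
retrain_with_agreed(clfs, [[0], [1]], ["DIST", "MNY"], [[2], [3], [4]])
```
</p>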
      <p>The results on the test dataset for the classifiers RF, EC
and OvR were submitted as final runs and are summarized
in Table 2 and Table 3. The classification results of the k-NN
classifier were not submitted as a run, and hence the accuracy
of those results is not available.</p>
    </sec>
    <sec id="sec-7">
      <title>CONCLUSION AND FUTURE WORK</title>
      <p>In this paper we have addressed the problem of question
classification for Bengali-English code-mixed social-media
data. We have experimented with three machine learning
based classifiers, Random Forests, One-vs-Rest and k-NN,
and then built an ensemble of these classifiers to achieve the
best results. The method is scalable to other code-mixed
languages mainly because it does not perform any language-
or script-based feature engineering.</p>
      <p>We would like to experiment with other multi-script data
where more than two languages have been mixed. We aim
to apply other machine learning algorithms with more
linguistic and syntactic features.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Banerjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chakma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Naskar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bandyopadhyay</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Choudhury</surname>
          </string-name>
          .
          <article-title>Overview of the Mixed Script Information Retrieval (MSIR) at FIRE</article-title>
          .
          <source>In Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings</source>
          . CEUR-WS.org,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Banerjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Naskar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Bandyopadhyay</surname>
          </string-name>
          .
          <article-title>The first cross-script code-mixed question answering corpus</article-title>
          .
          <source>In Modelling, Learning and mining for Cross/Multilinguality Workshop, 38th European Conference on Information Retrieval (ECIR)</source>
          , pages
          <fpage>56</fpage>
          -
          <lpage>65</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <article-title>Learning deep architectures for AI</article-title>
          .
          <source>Foundations and Trends in Machine Learning</source>
          ,
          <volume>2</volume>
          (
          <issue>1</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>127</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Choudhury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Banerjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Naskar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bandyopadhyay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Chittaranjan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Das</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Chakma</surname>
          </string-name>
          .
          <article-title>Overview of FIRE-2015 shared task on mixed script information retrieval</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Friedman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hastie</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Tibshirani</surname>
          </string-name>
          .
          <article-title>The elements of statistical learning</article-title>
          , volume
          <volume>1</volume>
          . Springer Series in Statistics. Springer, Berlin,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Bali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. E.</given-names>
            <surname>Banchs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Choudhury</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          .
          <article-title>Query expansion for mixed-script information retrieval</article-title>
          .
          <source>In The 37th Annual ACM SIGIR Conference</source>
          , pages
          <fpage>677</fpage>
          -
          <lpage>686</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Jamatia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Gambäck</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Das</surname>
          </string-name>
          .
          <article-title>Part-of-speech tagging for code-mixed English-Hindi Twitter and Facebook chat messages</article-title>
          .
          <source>In 10th Recent Advances of Natural Language Processing (RANLP)</source>
          , pages
          <fpage>239</fpage>
          -
          <lpage>248</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Roth</surname>
          </string-name>
          .
          <article-title>Learning question classifiers</article-title>
          .
          <source>In Proceedings of the 19th International Conference on Computational Linguistics - Volume 1</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          . Association for Computational Linguistics,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>