<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Ensemble Classifier based approach for Code-Mixed Cross-Script Question Classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Debjyoti Bhattacharjee</string-name>
          <email>debjyoti001@ntu.edu.sg</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paheli Bhattacharya</string-name>
          <email>paheli@iitkgp.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dept. of Computer Science and Engineering, Indian Institute of Technology Kharagpur</institution>
          ,
          <addr-line>West Bengal</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Computer Science and Engineering, Nanyang Technological University</institution>
          ,
          <country country="SG">Singapore</country>
        </aff>
      </contrib-group>
      <issue>30</issue>
      <abstract>
<p>With the increasing popularity of social-media, people post updates that aid other users in finding answers to their questions. Most of the user-generated data on social-media are in code-mixed or multi-script form, where words are represented phonetically in a non-native script. We address the problem of Question Classification on social-media data. We propose an ensemble classifier based approach towards question classification when the questions are written in mixed script, specifically, the Roman script for the Bengali language. We separately train Random Forests, One-vs-Rest and k-NN classifiers and then build an ensemble classifier that combines the best from the three worlds. We achieve an accuracy of approximately 82%, suggesting that the method works well for the task.</p>
        <p>CCS Concepts: Information systems → Question answering; Computing methodologies → Machine learning; Cross-validation.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        With the increase in popularity of the Web, users from
all over the world now opt to write in their native language
instead of English. A large number of South and South-East
Asian languages are written in a transliterated form
(phonetically representing words in a non-native script)
using the Roman script. Such texts are said to be written in
Mixed Script. Since there are font-encoding issues in using
the original script (for example, Devanagari for Hindi),
people tend to transliterate, or phonetically represent, the words
of the original language using the Roman script. To define
Mixed Script Information Retrieval formally [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], we consider
a set of natural languages L = {l1, l2, ..., ln} and a set of
scripts S = {s1, s2, ..., sn} such that si is the native script
for the language li. Given a word w, we represent it as a
two-tuple ⟨li, sj⟩ to imply that w is in language li and written
using script sj. When i = j, we say that the word is
written in its native script; otherwise, it has been transliterated into
another script sj. In practice, when textual content is a
mixture of words from various languages or scripts or both,
it is called Multi-Script (MS) or Code-Mixed. For instance,
"Kharagpur theke Howrah cab fare koto?" (gloss: What is
the cab fare from Kharagpur to Howrah?) has words from
a single script (Roman) but from two different languages:
English (cab, fare) and Bengali (theke, koto). The Bengali
words have been transliterated into the Roman script.
Intuitively, this is a very easy form of writing, and people not as
well versed in English as in their native language tend to use
it for conversing on social media. Although Question Answering
(QA) is a well-addressed research problem, with systems
providing reasonable accuracy, QA on social-media text in
mixed script is challenging mainly because there is no
standardization of spellings for words written in a non-native
script. For instance, the Bengali word "ekhon" (meaning,
now) may have multiple spellings: "akhan", "ekhon",
"ekhan", "akon", etc.
      </p>
    </sec>
    <sec id="sec-2">
      <title>RELATED WORK</title>
      <p>
        Jamatia et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] experiments with code-mixed
EnglishHindi social-media text for Part-of-Speech tagging. They
use both coarse and ne-grained tagsets for the task. Four
machine learning algorithms Conditional Random Fields,
Sequential Minimal Optimization, Nave Bayes and
Random Forests, reporting highest accuracy with Random
Forest based classi er. Information Retrieval on Multi-Script
data has also been looked into [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Recent works on question
classi cation include a machine learning based approach [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
towards question class. A hierarchial classi er is rst used
to classify the question into coarse-grained classes and then
into ne-grained classes. The feature space consisted of
primitive ones like pos tags, chunks, named entities and
also complex features such as conjunctive n-gram features
and relational features. Question-Answering corpus
acquisition using social-media content and question acquisition
with human involvement have been reported in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In FIRE
2015, the Transliterated Search track introduced three
subtasks | language labelling of words in code-mixed text
fragments, ad-hoc retrieval of Hindi lm lyrics, movie reviews
and astrology documents and transliterated question
answering where the documents as well as questions were in
Bangla script or Roman transliterated Bangla [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>TASK DESCRIPTION</title>
      <p>Question Answering systems are a classic application of
natural language processing, where the retrieval task has
to find a concise and accurate answer to a given question.
Question classification is one of the subtasks of a QA system,
required to determine the type of the answer corresponding
to a question.</p>
      <p>The Code-Mixed Cross-Script Question Classification task
can be described as follows. Given a question Q written in
Romanized Bengali, which can contain English words and
phrases, and a set C = {c1, c2, ..., cn} of question classes, the
task is to classify the question Q into one of these predefined
classes.</p>
      <p>Example:
Question: airport theke howrah station distance koto ?
Question Class: DIST</p>
    </sec>
    <sec id="sec-4">
      <title>Dataset description</title>
      <p>The training dataset consists of 330 questions, and each
question is assigned to a single question class. There are 9
question classes in all, and the number of questions in each
class is shown in Table 1. The minimum and maximum
numbers of words in a question are 2 and 11 respectively, while
each question has 5.3 words on average.</p>
    </sec>
    <sec id="sec-5">
      <title>PROPOSED APPROACH</title>
      <p>To build a classifier that assigns the questions to the
specified classes, we created a vector representation of each
question, which is used as input to the classifier. We
considered the top 2000 most frequently occurring words in the
supplied training dataset as features. Each question is
represented as a 2000-element binary vector: element ei = 1
if the i-th most frequent word is present in the question,
and 0 otherwise.</p>
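      <p>The featurization above can be sketched as follows; the whitespace tokenizer, lowercasing and variable names are illustrative assumptions rather than details from the paper.

```python
from collections import Counter

def build_vocabulary(questions, size=2000):
    # Top `size` most frequently occurring words in the training questions.
    counts = Counter(word for q in questions for word in q.lower().split())
    return [word for word, _ in counts.most_common(size)]

def vectorize(question, vocabulary):
    # Binary vector: element e_i = 1 iff the i-th most frequent word occurs.
    words = set(question.lower().split())
    return [1 if w in words else 0 for w in vocabulary]

train = ["airport theke howrah station distance koto ?",
         "kharagpur theke howrah cab fare koto ?"]
vocab = build_vocabulary(train)
vec = vectorize("howrah theke airport fare koto ?", vocab)
```

A question is thus mapped to a fixed-length 0/1 vector over the training vocabulary; words outside the top-2000 list are simply ignored.</p>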
      <p>We used three separate classifiers, namely Random Forests
(RF), a One-vs-Rest (OvR) classifier and a k-Nearest Neighbour
(k-NN) classifier, and then built an ensemble classifier
from these three for the classification task.</p>
      <p>In k-NN classification, a sample is classified by a majority
vote of its neighbours, with the object being assigned to the
most common class among its k nearest neighbours (k &gt; 0,
k an integer). k-NN classification is a lazy learning method
which defers computation until the classification is performed.
k-NN is one of the simplest classifiers.</p>
      <p>The One-vs-Rest strategy fits one classifier per class.
Each classifier is trained against all the other classes. The
approach allows gaining information about each class by
inspecting the classifier trained for that class. In OvR, each
classifier is trained with the entire dataset, while in RF, samples
drawn from the original dataset are used for training.</p>
      <p>
        A Random Forest is an ensemble learning method that
can be used for classification [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. A Random Forest fits a number of
decision trees on various sub-samples of the dataset, with
the samples drawn from the original dataset with or
without replacement. Random Forests overcome the problem of
decision trees overfitting their training set [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>Using the above three classifiers, we built an ensemble
classifier (EC). The ensemble classifier takes the output label
of each individual classifier and gives the majority
label as output; if there is no majority, one of the predicted
labels is chosen at random as output. Each of the individual
classifiers is trained on a subset of the original training dataset,
obtained by sampling with replacement.</p>
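      <p>A minimal sketch of the majority vote with random tie-breaking and the bootstrap sampling described above; the function names are ours, not the paper's.

```python
import random
from collections import Counter

def ensemble_predict(labels, rng=random):
    # Majority label among the individual classifiers' outputs; if no
    # single label wins, one of the tied labels is chosen at random.
    counts = Counter(labels)
    best = max(counts.values())
    tied = [label for label, c in counts.items() if c == best]
    return tied[0] if len(tied) == 1 else rng.choice(tied)

def bootstrap_sample(X, y, rng=random):
    # Draw len(X) training instances with replacement for one classifier.
    idx = [rng.randrange(len(X)) for _ in range(len(X))]
    return [X[i] for i in idx], [y[i] for i in idx]
```

For example, ensemble_predict(["DIST", "DIST", "MNY"]) returns "DIST", while a three-way tie falls back to a random choice among the tied labels.</p>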
      <p>In the following section, we describe the implementation
details of the classifiers and the obtained results.</p>
    </sec>
    <sec id="sec-6">
      <title>EXPERIMENTAL SETUP AND RESULTS</title>
      <p>We implemented the proposed approach using Python 3
and used the scikit-learn toolkit for the classifiers. The
following instantiations, available in scikit-learn, were used
for the three individual classifiers; we implemented the
ensemble classifier ourselves.
rf = RandomForestClassifier(n_estimators=100)
ovr = OneVsRestClassifier(LinearSVC(random_state=0))
clf = neighbors.KNeighborsClassifier(30, weights='uniform')</p>
      <p>We split the labelled dataset into two parts: a
training set (90%) and a validation set (10%). The RF classifier
performed best, followed by EC, OvR and k-NN in
decreasing order of classification accuracy. Thereafter, we used
these trained classifiers to classify the test dataset.
During classification, we marked the samples for which all the
classifiers predicted the same label. We used these samples,
in addition to the original labelled dataset, for retraining
the classifiers.</p>
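      <p>This retraining step can be sketched as a simple self-training loop; the stand-in classifier below only mimics the scikit-learn fit/predict interface, since the actual models used are RF, OvR and k-NN from scikit-learn.

```python
class _ConstantClassifier:
    # Illustrative stand-in with a scikit-learn-like fit/predict interface.
    def __init__(self, label):
        self.label = label
        self.n_fit = 0
    def fit(self, X, y):
        self.n_fit = len(X)
    def predict(self, X):
        return [self.label] * len(X)

def retrain_with_agreed(classifiers, X_train, y_train, X_test):
    # Add every test sample on which all classifiers predict the same
    # label to the training set, then refit each classifier.
    all_preds = [clf.predict(X_test) for clf in classifiers]
    X_aug, y_aug = list(X_train), list(y_train)
    for i, x in enumerate(X_test):
        labels = {preds[i] for preds in all_preds}
        if len(labels) == 1:  # unanimous prediction
            X_aug.append(x)
            y_aug.append(labels.pop())
    for clf in classifiers:
        clf.fit(X_aug, y_aug)

clfs = [_ConstantClassifier("DIST") for _ in range(3)]
retrain_with_agreed(clfs, [[0], [1]], ["DIST", "MNY"], [[2], [3], [4]])
```
</p>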
      <p>The results on the test dataset for the classifiers RF, EC
and OvR were submitted as final runs and are summarized
in Table 2 and Table 3. The classification results of the k-NN
classifier were not submitted as a run, and hence the accuracy
of those results is not available.</p>
    </sec>
    <sec id="sec-7">
      <title>CONCLUSION AND FUTURE WORK</title>
      <p>In this paper we have addressed the problem of question
classification for Bengali-English code-mixed social-media
data. We have experimented with three machine learning
based classifiers, Random Forests, One-vs-Rest and k-NN,
and then built an ensemble of these classifiers to achieve the
best results. The method is scalable to other code-mixed
languages mainly because it does not perform any language-
or script-based feature engineering.</p>
      <p>We would like to experiment with other multi-script data
where more than two languages have been mixed. We aim
to apply other machine learning algorithms with more
linguistic and syntactic features.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Banerjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chakma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Naskar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bandyopadhyay</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Choudhury</surname>
          </string-name>
          .
          <article-title>Overview of the Mixed Script Information Retrieval (MSIR) at FIRE</article-title>
          .
          <source>In Working Notes of FIRE 2016 - Forum for Information Retrieval Evaluation, Kolkata, India, December 7-10, 2016, CEUR Workshop Proceedings</source>
          . CEUR-WS.org,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Banerjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Naskar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Bandyopadhyay</surname>
          </string-name>
          .
          <article-title>The first cross-script code-mixed question answering corpus</article-title>
          .
          <source>In Modelling, Learning and mining for Cross/Multilinguality Workshop, 38th European Conference on Information Retrieval (ECIR)</source>
          , pages
          <fpage>56</fpage>
          -
          <lpage>65</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <article-title>Learning deep architectures for AI</article-title>
          .
          <source>Foundations and Trends in Machine Learning</source>
          ,
          <volume>2</volume>
          (
          <issue>1</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>127</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Choudhury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Banerjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Naskar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bandyopadhyay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Chittaranjan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Das</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Chakma</surname>
          </string-name>
          .
          <article-title>Overview of FIRE-2015 shared task on mixed script information retrieval</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Friedman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hastie</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Tibshirani</surname>
          </string-name>
          .
          <article-title>The elements of statistical learning</article-title>
          , volume
          <volume>1</volume>
          . Springer Series in Statistics. Springer, Berlin,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Bali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. E.</given-names>
            <surname>Banchs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Choudhury</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          .
          <article-title>Query expansion for mixed-script information retrieval</article-title>
          .
          <source>In The 37th Annual ACM SIGIR Conference</source>
          , pages
          <fpage>677</fpage>
          -
          <lpage>686</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Jamatia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Gambäck</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Das</surname>
          </string-name>
          .
          <article-title>Part-of-speech tagging for code-mixed English-Hindi Twitter and Facebook chat messages</article-title>
          .
          <source>In 10th Recent Advances of Natural Language Processing (RANLP)</source>
          , pages
          <fpage>239</fpage>
          -
          <lpage>248</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Roth</surname>
          </string-name>
          .
          <article-title>Learning question classifiers</article-title>
          .
          <source>In Proceedings of the 19th International Conference on Computational Linguistics - Volume 1</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          . Association for Computational Linguistics,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>