<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Modeling Classifier for Code Mixed Cross Script Questions</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rupal Bhargava</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shubham Khandelwal</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Akshit Bhatia</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yashvardhan Sharma</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>WiSoc Lab, Department of Computer Science, Birla Institute of Technology and Science</institution>
          ,
          <addr-line>Pilani Campus, Pilani-333031</addr-line>
        </aff>
      </contrib-group>
      <fpage>2</fpage>
      <lpage>7</lpage>
      <abstract>
        <p>With the growth of the internet, the volume of social media text has been increasing day by day, and user-generated content (such as tweets and blogs) in Indian languages is often written in Roman script for various socio-cultural and technological reasons. A majority of these posts are multilingual, and many involve code mixing, where lexical items and grammatical features from two languages appear in one sentence. In this multilingual scenario, code-mixed cross-script (i.e., non-native script) data gives rise to a new problem and presents serious challenges to automatic Question Answering (QA). Question classification, which detects the category of a question, is an important task of question analysis and therefore an important step towards QA. This paper proposes an approach to cross-script question classification.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        With the proliferation of social networks, large volumes of text are being generated daily. Traditional machine learning algorithms used for text analysis, such as Named Entity Recognition (NER), POS tagging, or parsing, are language dependent. These algorithms usually achieve their objective using co-occurrence patterns of features. Many studies have observed that this language dependence hinders a variety of tasks on social media text. One such task is Question Answering (QA). A classic application of NLP, QA has practical uses in domains such as education, health care, and personal assistance. QA is a retrieval task that is more challenging than the task of a common search engine, because its purpose is to find an accurate and concise answer to a question rather than just to retrieve relevant documents containing the answer [<xref ref-type="bibr" rid="ref6">6</xref>]. Recently, Banerjee et al. [<xref ref-type="bibr" rid="ref2">2</xref>] formally introduced the code-mixed cross-script QA problem. The first step in understanding a question is question analysis. Question classification is an important task of question analysis that detects the answer type of the question. It helps not only to filter out a wide range of candidate answers but also to determine answer selection strategies [<xref ref-type="bibr" rid="ref6">6</xref>]. Furthermore, it has been observed that the performance of question classification has a significant influence on the overall performance of a QA system.
      </p>
      <p>Subtask 1 of the shared task on Mixed Script Information Retrieval (MSIR) at FIRE 2016 addresses code-mixed cross-script question classification, where 'Q' represents a set of factoid questions written in Romanized Bengali along with English. The task is to classify each given question into one of the predefined coarse-grained classes. This paper proposes an algorithm for the question classification task posed by the MSIR, FIRE 2016 Subtask 1 organizers, using different machine learning algorithms.</p>
      <p>
        The rest of the paper is organized as follows. Section 2 reviews related work from the past few years. Section 3 presents an analysis of the data set provided by the MSIR 2016 task organizers [<xref ref-type="bibr" rid="ref1">1</xref>]. Section 4 explains the methodology followed for the task, with flowcharts to explain the flow. Section 5 describes the algorithm proposed for question classification. Section 6 presents the evaluation, experimental results, and error analysis. Section 7 concludes the paper and presents future work.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. RELATED WORK</title>
      <p>
        Today, social media platforms are flooded with millions of posts every day on various topics, which results in code mixing in multilingual countries like India. A lot of work was done in FIRE 2015 on language identification for cross-script information retrieval. Bhattu et al. [<xref ref-type="bibr" rid="ref4">4</xref>] proposed a two-stage algorithm in which sentence-level n-gram based classifiers were used in the first stage and word-level n-gram classifiers in the second. Bhargava et al. [<xref ref-type="bibr" rid="ref3">3</xref>] proposed a hybrid approach to query labeling that generates character n-grams as features and uses logistic regression for language labeling. For question analysis of such data, question classification is performed to understand the question; it allows determining constraints that the question imposes on a possible answer. Zhang et al. [<xref ref-type="bibr" rid="ref8">8</xref>] used bag of words and bag of n-grams as features, applied K-NN, SVM, and Naive Bayes to automate question classification, and concluded that with surface text features the SVM outperforms the other classifiers. Banerjee et al. [<xref ref-type="bibr" rid="ref2">2</xref>] proposed a QA system that takes cross-script (non-native) code-mixed questions and provides a list of information responses to automate question answering. Their work covered corpus acquisition from social media, question acquisition through a cloud-based service so as to avoid bias, corpus annotation, and an evaluation scheme suited to the corpus annotation. Li et al. [<xref ref-type="bibr" rid="ref6">6</xref>] proposed question classification using semantic information, developing a hierarchical classifier guided by a layered semantic hierarchy of answer types.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. DATA ANALYSIS</title>
      <p>
        The training data provided [<xref ref-type="bibr" rid="ref1">1</xref>] consisted of 330 questions, each labeled with its coarse-grained question type class. In total there are 9 different question type classes in the data set. As shown in Figure 1, the class types 'ORG' and 'TEMP' account for the largest shares of the instances. Each class represents a particular type of question related to specific entities. 'MNY' stands for money-related questions; its instances contain words like 'fare' and 'price' and helping words like 'koto' (bn) and 'how much'. 'PER' stands for person-related questions, mostly containing words like 'who' and 'whom', which imply that the subject of the sentence is a person. 'TEMP' denotes time-related questions, mainly containing words like 'when' and 'at'. 'OBJ' stands for entity/object, implying that the subject of the sentence is an entity; these questions mainly contain words like 'what' and 'kon'. 'NUM' stands for numeric-entity questions and mainly involves words like 'how many' and 'koto'. 'DIST' stands for distance and implies that the question is about the distance between places. 'LOC' stands for location and mainly contains words like 'where' and 'jabe'. 'ORG' stands for organization and covers questions centered on a particular organization, team, or other group of people; these questions mainly contain words like 'which', 'what', and 'team'. 'MISC' stands for miscellaneous; this class has the smallest representation in the data set and covers a variety of questions.
      </p>
      <p>The entire data set consists of code-mixed sentences whose words belong to either Bengali or English. The data set does not contain any code mixing at the word level. There is also no punctuation in the data set except the question mark (?), while there are many named entities (from both English and Bengali).</p>
    </sec>
    <sec id="sec-4">
      <title>4. PROPOSED TECHNIQUE</title>
      <p>A word-level n-gram based approach is used to classify code-mixed cross-script question records (comprising words from both English and Bengali) into nine different coarse-grained question type classes. The proposed methodology is a pipeline of different techniques, as shown in Figure 2. The proposed technique can be broadly divided into the following four phases:</p>
      <sec id="sec-4-1">
        <title>1. Pre-processing</title>
      </sec>
      <sec id="sec-4-2">
        <title>2. Named-entity recognition and removal</title>
      </sec>
      <sec id="sec-4-3">
        <title>3. Translation</title>
      </sec>
      <sec id="sec-4-4">
        <title>4. Classification</title>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4.1 Pre-processing</title>
      <p>Data is pre-processed by label separation and case conversion so that the classifiers can be applied efficiently. The pre-processing techniques were deployed as follows:</p>
      <p>1. Separation of class labels and training set entries: The data set comprised mixed-script question records, each labeled with a specific class type. The question entries and their class labels were separated on the basis of the position of the question mark symbol in the data entries. This was done so that separate feature vectors of question records and class labels could be formed, as required by the classifiers.</p>
      <p>2. Case conversion: For normalization, all data entries were converted to lower case. This involved identifying upper-case letters and replacing them with their lower-case counterparts by manipulating their ASCII code values.</p>
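      <p>The two pre-processing steps above can be sketched as follows. This is a minimal illustration, not the authors' code: the helper names separate_label and to_lower_ascii, the entry layout (question text, question mark, then the class label), and the label shown for the sample record are all assumptions made for the example.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def separate_label(entry):
    """Split a raw entry into (question, label) at the last question mark."""
    question, _, label = entry.rpartition("?")
    return question.strip() + "?", label.strip()


def to_lower_ascii(text):
    """Lower-case ASCII letters by shifting their code values by 32."""
    return "".join(
        chr(ord(ch) + 32) if "A" <= ch <= "Z" else ch
        for ch in text
    )


# Hypothetical entry layout: question text, question mark, then the class label.
raw_entry = "Hazarduari te koto dorja ache? NUM"
question, label = separate_label(raw_entry)
print(to_lower_ascii(question), "|", label)
```
]]></preformat>
      <p>Splitting at the last question mark relies on the observation that the question mark is the only punctuation in the data set, so it marks the end of the question text.</p>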
    </sec>
    <sec id="sec-6">
      <title>4.2 Named-entity recognition and removal</title>
      <p>
        The pre-processed data set comprised entries containing a large number of named entities. Named entities in a text belong to predefined categories such as the names of persons, organizations, and locations, expressions of time, quantities, monetary values, and percentages. In English, named entities occur in a certain manner at certain positions according to sentence structure, but in multilingual sentences the sentence structure varies a lot. Named entities are therefore identified using a dictionary-based approach. The dictionary used for NER mainly comprised entries from the FIRE 2015 Subtask 1 data set [<xref ref-type="bibr" rid="ref5">5</xref>], which contains entries in both Bengali and English. For classifying question records into class types, the presence of these named entities was irrelevant, as they did not contribute to the question structure that determines the class type, and hence they were removed.
      </p>
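      <p>A dictionary-based removal step of this kind might look like the sketch below. The function name remove_named_entities and the toy dictionary are hypothetical; the actual dictionary was built mainly from the FIRE 2015 Subtask 1 data and is not reproduced here.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def remove_named_entities(question, ne_dictionary):
    """Drop every token that appears in the named-entity dictionary."""
    kept = [
        tok for tok in question.split()
        if tok.strip("?") not in ne_dictionary
    ]
    return " ".join(kept)


# Toy dictionary for illustration only (not the FIRE 2015 entries).
ne_dictionary = {"hazarduari", "masjid"}
print(remove_named_entities("hazarduari te koto dorja ache?", ne_dictionary))
```
]]></preformat>
      <p>Because the lookup is an exact match, a spelling variant that is missing from the dictionary is left in the question, which is a known limitation of dictionary-based approaches.</p>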
    </sec>
    <sec id="sec-7">
      <title>4.3 Translation</title>
      <p>After the first two phases, the remaining Bengali words were transliterated into their native script and then translated to their English counterparts using the Google translation API<sup>1</sup>. This produced a monolingual, single-script data set from the mixed-script data set provided, so that the classifiers could be applied efficiently. With this approach, different code-mixed cross-script variants of the same question record (each variant using a different combination of Bengali and English words) were translated and thereby standardized to a single question record in English. For example, the records "Hazarduari te koto dorja ache?" and "Hazarduari te how many dorja ache?" both refer to the same question but use different combinations of words; standardizing them to the English translation would therefore lead to an increase in accuracy.</p>
    </sec>
    <sec id="sec-8">
      <title>4.4 Classification</title>
      <p>
        The proposed approach takes the data set produced by the translation phase and uses n-grams to form the feature vector for each record. It follows a word-level implementation of n-grams with 'n' varied in the range 2 to 4, generating a feature vector for each question record in the training set. The transposed matrix of these feature vectors, along with the numerically encoded class label matrix, is then used as input to the classifiers [<xref ref-type="bibr" rid="ref7">7</xref>]. For the three runs, the following classifiers are used:
      </p>
      <sec id="sec-8-1">
        <title>1. Gaussian Naive Bayes Classifier</title>
      </sec>
      <sec id="sec-8-2">
        <title>2. Logistic Regression Classifier</title>
      </sec>
      <sec id="sec-8-3">
        <title>3. Random Forest Classifier with Random State = 1</title>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>5. ALGORITHM</title>
      <p>Algorithm 1 explains the proposed technique. The input is a data set of mixed-script question records along with their respective class labels. This input is pre-processed by separating class labels from data entries (function: Label Separation()) and converting the data records to lower case (function: Case Conversion()). Named entities (NE) are removed from the pre-processed data (function: NE Removal()). The remaining Bengali words are then translated to their English equivalents using the Google Translation API<fn id="fn1"><label>1</label><p>https://translate.google.com/</p></fn> (function: Translation()). N-grams are then applied to this data set to form the corresponding feature vectors. First, a vectorizer that converts the textual entries into a matrix of n-gram token counts (word-level n-grams with n in the range 2 to 4) is created (function: Count Vectorizer()). This vectorizer is then used to generate a callable (function: Build Analyzer()) that produces n-gram tokens when called on each row of the data set (callable: Analyser()), and the word-level n-grams for each row are appended (function: append()) to the n-gram list. Feature vectors for the data entries are generated from this n-gram list (function: Create Feature Vector()). The class labels for training are numerically encoded (function: Encode Class()), and corresponding feature vectors for these class labels are generated. These two sets of feature vectors are then used as inputs to the classifier (function: Classifier()). The Classifier() function can be replaced by different classifiers such as GaussianNB, LogisticRegression, or RandomForestClassifier. The trained classifier is then used to predict the class labels, which are generated as output.</p>
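      <p>The n-gram and classification steps of the algorithm can be sketched with scikit-learn roughly as follows. This is an illustrative sketch rather than the submitted system: the four questions and their labels are made-up toy data, the earlier pipeline stages (named-entity removal and translation) are assumed to have run already, and the variable names are chosen only for the example.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import LabelEncoder

# Toy, already-translated questions with made-up labels (not the FIRE data).
questions = [
    "how many doors are there",
    "how many rooms are there",
    "how much is the bus fare",
    "how much is the ticket price",
]
labels = ["NUM", "NUM", "MNY", "MNY"]

# Word-level n-grams with n from 2 to 4, as described in the text.
vectorizer = CountVectorizer(analyzer="word", ngram_range=(2, 4))
analyzer = vectorizer.build_analyzer()
ngram_list = [analyzer(q) for q in questions]

# GaussianNB needs a dense array, so the sparse count matrix is converted.
X = vectorizer.fit_transform(questions).toarray()
encoder = LabelEncoder()
y = encoder.fit_transform(labels)

# Could be swapped for LogisticRegression() or
# RandomForestClassifier(random_state=1) for the other two runs.
classifier = GaussianNB()
classifier.fit(X, y)

test_X = vectorizer.transform(["how much is the train fare"]).toarray()
predicted = encoder.inverse_transform(classifier.predict(test_X))
print(ngram_list[0][:3], predicted[0])
```
]]></preformat>
      <p>In this sketch the analyzer is called only to show the n-grams produced for each row; fit_transform builds the same n-gram vocabulary internally when it creates the count matrix.</p>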
    </sec>
    <sec id="sec-10">
      <title>6. EXPERIMENTS</title>
      <p>MSIR, FIRE 2016, Subtask 1 involved the classification of mixed-script (Bengali and English) questions into nine different coarse-grained question type classes, as discussed in Section 3. The training data set comprised 330 records (along with class labels) and was used to classify a test data set of 180 mixed-script question records. In total, seven teams from different institutes of the country participated, and each team used three different approaches for classification, with the results shown in Figure 3. The approach proposed in this paper used machine learning for classification, and three runs were submitted. The runs differed in the classifier used (Gaussian Naive Bayes, Logistic Regression, and Random Forest). An accuracy of 81.12% was obtained with the Gaussian Naive Bayes classifier, 80% with Logistic Regression, and 72.78% with the Random Forest classifier. The detailed results, analysis, and comparison are discussed below.</p>
    </sec>
    <sec id="sec-11">
      <title>6.1 Evaluation and Discussion</title>
      <p>The MSIR, FIRE 2016, Subtask 1 organizers evaluated the results and compared the accuracy achieved by the 7 participating teams, as shown in Figure 3. The proposed approach (team BITS PILANI) ranked 2nd with an accuracy of 81.12% for run 1, while the highest accuracy achieved was 83.34% (by team IINTU). The Gaussian Naive Bayes classifier gave the highest accuracy of the three runs, as the proposed algorithm deals with a problem involving continuous attributes. Naive Bayes yields simple, highly scalable models that are fast and scale linearly with the number of predictors and rows. Building a Naive Bayes model is also highly parallelizable, even at the level of scoring. It was also observed from the results that the proposed algorithm generated the highest F-measure scores for the classes Organization (ORG), Money (MNY), and Miscellaneous (MISC). Figure 4 compares the F-measure scores obtained by the teams for the class Organization. The proposed algorithm (team BITS PILANI) obtained the highest score of 0.74418 using the Gaussian Naive Bayes approach. This implies that questions about a particular organization, mainly framed with words like "which" and "what", could be efficiently classified by this approach. These scores can be attributed to the fact that the class ORG had the most instances in the data set (67 out of 330, as discussed in Section 3). In addition, the proposed algorithm forms word-level n-grams, so that words and phrases like "which", "team", "series", and "sponsor" became associated, which might have contributed to the increase in the scores.</p>
      <p>Figure 4 also compares the F-measure scores obtained by the teams for the class Money. The proposed algorithm (team BITS PILANI) achieved the highest score of 1 using Logistic Regression as the classifier (run 2). Hence all the questions relating to money, framed with words like "how much", "price", and "fare", could be efficiently classified by the proposed approach. These high F-scores could be attributed to the word-level n-gram technique, which in a way linked words like "fare", "how", "much", and "price", and thus might have contributed to an increase in accuracy.</p>
      <p>The evaluated results also showed that only two teams (team BITS PILANI and team NLP-NITMZ) were able to identify instances belonging to the Miscellaneous (MISC) class. This can be attributed to the fact that there were only 5 instances of the MISC class among the 330 in the training data set. The proposed approach (team BITS PILANI) obtained the highest score of 0.2 using the Gaussian Naive Bayes classifier, which can again be attributed to the simplicity of the GaussianNB classifier and the word-level n-gram technique.</p>
      <p>
        Figure 5 compares the accuracy obtained (taking the best of the three runs for each team) for each of the nine classes. As evident from the figure, the proposed approach (team BITS PILANI) obtained satisfactory results in identifying the correct class labels, particularly for the MISC, ORG, MNY, NUM, and OBJ classes, with an F-measure score of 1 for the class Money. Table 1 shows the precision, recall, and F-measure for each of the nine classes, as evaluated by the FIRE 2016 task organizers [<xref ref-type="bibr" rid="ref1">1</xref>], for the three runs submitted with the proposed algorithm (team BITS PILANI).
      </p>
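      <p>For reference, the relation between the three per-class measures can be sketched as below. The sketch assumes that the reported F-measure is the balanced F-measure (the harmonic mean of precision and recall); the text does not state which variant the organizers used, and the input values are illustrative rather than taken from Table 1.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative values only; these are not scores from the evaluation.
print(f_measure(1.0, 1.0))
print(round(f_measure(0.8, 0.5), 5))
```
]]></preformat>
      <p>Under this definition a score of 1 is reached only when both precision and recall are 1, and a low value of either measure pulls the score down sharply.</p>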
      <p>There are a few phases in which the proposed approach could have contributed to the misclassification of some records. The proposed approach uses a dictionary-based method for named entity recognition, and the corpus used had only limited entries, so some entities might not have been recognized and removed. The data set also had a large number of named entities that referred to the same name with similar but different spellings. For instance, the words "masjid" and "mosjid" both referred to the same word, meaning "mosque", but were spelled differently. Since the proposed approach used a corpus for NER, these entities could not be removed unless all spellings of these words were added to the corpus.</p>
      <p>The proposed approach also uses a translation system (Google API<sup>1</sup>) to translate Bengali words to English. Since the translation system did not consider the semantics of the sentence in which a word was used, a particular Bengali word may have been translated incorrectly. The given data set also did not have a uniform distribution of class instances: as shown in Figure 1, MISC instances make up only 1.51% of the data set while the ORG class makes up 20% of the entries, so the trained model could be biased. As mentioned before, most of the teams could not identify even a single instance of the MISC class in the test data set, and even the proposed system obtained an F-measure score of only 0.2 because of the small number of instances of this class.</p>
    </sec>
    <sec id="sec-12">
      <title>7. CONCLUSIONS AND FUTURE WORK</title>
      <p>In this paper, a word-level n-gram based approach to classifying code-mixed cross-script question records into nine different coarse-grained question type classes has been presented for Subtask 1 of MSIR, FIRE 2016. The presented approach uses pipelined stages to classify questions using various machine learning algorithms (Gaussian Naive Bayes, Logistic Regression, and Random Forest). Among the three runs submitted, the proposed approach obtained its highest accuracy of 81.12% using the Gaussian Naive Bayes approach. Future work could improve the dictionaries used for named-entity recognition in Bengali and English. Different named entity recognizers and taggers, along with trained models for named entity recognition, could be deployed. It would be interesting to find approaches by which implicit features of the code-mixed cross-script data set could be efficiently learned using deep learning algorithms. Machine learning based models for language identification, along with appropriate transliteration and translation tools (which take the correct semantics into consideration), could be improved further.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Banerjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chakma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Naskar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bandyopadhyay</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Choudhury</surname>
          </string-name>
          .
          <article-title>Overview of the Mixed Script Information Retrieval (MSIR) at FIRE</article-title>
          .
          <source>In Working notes of FIRE 2016 - Forum for Information Retrieval Evaluation</source>
          , Kolkata, India, December 7-10,
          <year>2016</year>
          ,
          <source>CEUR Workshop Proceedings</source>
          . CEUR-WS.org,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Banerjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Naskar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Bandyopadhyay</surname>
          </string-name>
          .
          <article-title>The first cross-script code-mixed question answering corpus</article-title>
          .
          <source>In First Workshop on Modeling, Learning and Mining for Cross/Multilinguality (MultiLingMine 2016) co-located with the 38th European Conference on Information Retrieval (ECIR 2016)</source>
          , volume
          <volume>1589</volume>
          , pages
          <fpage>56</fpage>
          -
          <lpage>65</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Bhargava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Baid</surname>
          </string-name>
          .
          <article-title>Query labelling for indic languages using a hybrid approach</article-title>
          .
          <source>In Working notes of FIRE 2015 - Forum for Information Retrieval Evaluation</source>
          , Gandhinagar, India, December,
          <year>2015</year>
          , volume
          <volume>1587</volume>
          <source>of CEUR Workshop Proceedings</source>
          , pages
          <fpage>40</fpage>
          -
          <lpage>42</lpage>
          . CEUR-WS.org,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S. N.</given-names>
            <surname>Bhattu</surname>
          </string-name>
          and
          <string-name>
            <given-names>V.</given-names>
            <surname>Ravi</surname>
          </string-name>
          .
          <article-title>Language identification in mixed script social media text</article-title>
          .
          <source>In Working notes of FIRE 2015 - Forum for Information Retrieval Evaluation</source>
          , Gandhinagar, India, December,
          <year>2015</year>
          , volume
          <volume>1587</volume>
          <source>of CEUR Workshop Proceedings</source>
          , pages
          <fpage>37</fpage>
          -
          <lpage>39</lpage>
          . CEUR-WS.org,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Choudhury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Banerjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Naskar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bandyopadhyay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Chittaranjan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Das</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Chakma</surname>
          </string-name>
          .
          <article-title>Overview of FIRE-2015 shared task on mixed script information retrieval</article-title>
          .
          <source>In Working notes of FIRE 2015 - Forum for Information Retrieval Evaluation</source>
          , Gandhinagar, India, December,
          <year>2015</year>
          , pages
          <fpage>19</fpage>
          -
          <lpage>25</lpage>
          . CEUR-WS.org,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Roth</surname>
          </string-name>
          .
          <article-title>Learning question classifiers</article-title>
          .
          <source>In Proceedings of the 19th international conference on Computational linguistics-Volume</source>
          <volume>1</volume>
          , pages
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          . Association for Computational Linguistics,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>F.</given-names>
            <surname>Pedregosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Varoquaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gramfort</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Michel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Thirion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Grisel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Blondel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Prettenhofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Weiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Dubourg</surname>
          </string-name>
          , et al.
          <article-title>Scikit-learn: Machine learning in python</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>12</volume>
          (Oct):
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhang</surname>
          </string-name>
          and
          <string-name>
            <given-names>W. S.</given-names>
            <surname>Lee</surname>
          </string-name>
          .
          <article-title>Question classification using support vector machines</article-title>
          .
          <source>In Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval</source>
          , pages
          <fpage>26</fpage>
          -
          <lpage>32</lpage>
          . ACM,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>