<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The First Cross-Script Code-Mixed Question Answering Corpus</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Somnath Banerjee</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sudip Kumar Naskar</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Rosso</string-name>
          <email>prosso@dsic.upv.es</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sivaji Bandyopadhyay</string-name>
          <email>sbandyopadhyayg@cse.jdvu.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science and Engineering Department, Jadavpur University</institution>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>PRHLT Research Center, Universitat Politecnica de Valencia</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <fpage>56</fpage>
      <lpage>65</lpage>
      <abstract>
        <p>In this paper, we formally introduce the problem of cross-script code-mixed question answering (QA) and we elaborate on the corpus acquisition process and an evaluation strategy for the said problem. Today, social media platforms are flooded by millions of posts every day on various topics. This paper emphasizes the use of such ever-growing user-generated content to serve as an information collection source for the QA task on a low-resource language for the first time. A majority of these posts are multilingual in nature and many of them involve code-mixing. The multilingual aspect of social media content is reflected in the use of multilingual words as well as in the writing script. For ease of use, multilingual users often pose questions in a non-native script. Focusing on this current multilingual scenario, code-mixed cross-script (i.e., non-native script) data give rise to a new problem and present serious challenges to automatic QA. In the work presented in this paper, Bengali is considered as the native language while English is considered to be the non-native language. However, the dataset construction approach presented in this paper is generic in nature and could be used for any other language pair. Apart from introducing this novel problem, this paper highlights the corpus development process and a suitable evaluation framework.</p>
      </abstract>
      <kwd-group>
        <kwd>Question Answering</kwd>
        <kwd>Code Mixing</kwd>
        <kwd>Code Switching</kwd>
        <kwd>Cross-script</kwd>
        <kwd>social media</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction and Related Work</title>
      <p>Code-mixing refers to the phenomenon where lexical items and grammatical
features from two languages appear in one sentence. The use of code-mixing is
spreading widely in informal text communications such as newsgroups, tweets,
blogs, and other social media platforms. Sometimes it is used to refer to
relatively stable informal mixtures of two languages, such as Spanglish, Franponais
or Portuñol. Nowadays, people on social media tend to share everything under
the sun. Social media users often share their travel experiences as well as seek
travel suggestions from their social networks. Similarly, sports events are among
the most discussed topics in social media. People post live updates of ongoing
sports events such as the Football World Cup, the Champions League, T20 series, etc.
This results in potentially rich resources for languages which are less
computerized.</p>
      <p>In bilingual or multilingual countries like India, speakers often incorporate
lexical items, phrases, and clauses from more than one language into their spoken
or written communication. This results in words or phrases from different
languages appearing in the same sentence or utterance, a phenomenon referred to
as code-mixing. Although this phenomenon has been studied extensively in
formal and spoken contexts, the natural language processing (NLP) research
community has only recently started paying serious attention to code-mixing due to its
prevalent use in electronic communication, mainly in social media. English is
predominantly the most used language on the internet, and Indians also use English
extensively while surfing the internet. They even use the Roman script phonetically
instead of their own native scripts. Another important reason
behind the use of the English language and the Roman script may be that keyboards
come in the non-native Roman script, and Indian internet users are more
comfortable using them rather than on-screen native-script
keyboards or key combinations that generate native alphabets. Every natural
language is generally written using a particular script which is referred to as the
native script for that language. All other scripts which are not used in writing
the language can be referred to as the non-native script with respect to that
language. For example, the English language is written in the Roman script.
Thus, the Roman script is the native script for English, whereas the Bengali script is a
non-native script for English. We refer to the phenomenon of using a non-native
script phonetically for writing native words as cross-script. For example, if a
Bengali user writes Bengali words in Bengali script, that is considered as using
native script. However, if he writes Bengali words in Roman script or English
words in Bengali script, then he is making use of cross-script.</p>
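      <p>The native/non-native script distinction above can be made concrete with a small sketch. This is our illustration, not a method from the paper: it classifies the writing script of a token by its Unicode block, using the fact that the Bengali script occupies code points U+0980 through U+09FF.</p>

```python
# Illustrative sketch (not the paper's method): classify a token's
# writing script by Unicode block. Bengali occupies U+0980 to U+09FF.
def script_of(token):
    if any(ord(ch) in range(0x0980, 0x0A00) for ch in token):
        return "Bengali"
    if any(ch.isascii() and ch.isalpha() for ch in token):
        return "Roman"
    return "Other"
```

      <p>Under this sketch, a Bengali word typed phonetically in Roman letters (e.g., "bhalo") is tagged "Roman", so a Bengali question written entirely in Roman letters is cross-script.</p>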
      <p>
        Being a classic application of NLP, QA has practical applications in
various domains such as education, health care, personal assistance, etc. Presently,
QA is a well addressed research problem and several QA systems are available
with reasonable accuracy. A number of QA systems were developed for
European languages particularly for English ([
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ],[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ],[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]), Middle Eastern languages
([
        <xref ref-type="bibr" rid="ref5">5</xref>
        ],[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ],[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]) and Asian languages, e.g., Japanese ([
        <xref ref-type="bibr" rid="ref8">8</xref>
        ],[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]), Chinese ([
        <xref ref-type="bibr" rid="ref10">10</xref>
        ],[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]). In
this paper, we introduce a new research problem in the context of QA research:
cross-script code-mixed QA.
      </p>
      <p>The rest of the paper is organized as follows. Section 2 states the
code-mixed cross-script QA problem. We discuss corpus acquisition in Section 3. The
proposed corpus annotation process and corpus statistics are described in Section
4 and Section 5, respectively. We present the evaluation scheme in Section 6.
Section 7 concludes the paper.</p>
      <p>Problem Statement: Building a question answering system which takes
cross-script (non-native) code-mixed questions as information requests, processes a
cross-script code-mixed text corpus, and provides an exact answer (or a list of exact answers)
as the information response.</p>
      <p>We introduce this novel research problem for the following reasons:
1. Multilingual non-native English speakers predominantly use the Roman script
in social media platforms during their conversations even while the written
communication takes place entirely in a native language (i.e., not English).
2. To make written communication more engaging, borrowing foreign
words from different languages is very common in social media
and is a growing trend.
3. The ever increasing posts in many less-computerized languages could serve
as a potential source of digital content for language research.
4. The research community needs to move towards next-generation search
engines, which increases the need for developing QA systems for less-resourced
languages.</p>
      <p>
        This paper presents a cross-script code-mixed QA corpus for Bengali;
however, this context is very common with other non-English languages, e.g. Spanish,
French, etc. Despite the advances in QA research and the fact that Bengali is one
of the most spoken languages, very little work ([
        <xref ref-type="bibr" rid="ref12">12</xref>
        ],[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ],[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]) has been conducted
in QA for Bengali so far. Language identification in the code-mixing scenario has
been addressed extensively in shared tasks at EMNLP-2014 (http://emnlp2014.org/workshops/CodeSwitch/call.html) and FIRE-2014 (http://fire.irsi.res.in/fire/home)
and in a few other research works [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ],[
        <xref ref-type="bibr" rid="ref16">16</xref>
        ],[
        <xref ref-type="bibr" rid="ref17">17</xref>
        ],[
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. However, to the best of our
knowledge, no work has been conducted so far on the novel problem addressed
in this paper.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Corpus Acquisition</title>
      <p>We consider social media content for the code-mixed cross-script QA corpus
because of the following characteristics of social media:
i) Substantial and ever increasing user base.
ii) A sizable volume of informal text data is added on various domains on
a daily basis.
iii) Various APIs are available to access social media data.
iv) It is the most likely source of code-mixed data.</p>
      <p>Even though acquiring a sizable volume of code-mixed cross-script data
is not a tough task, our work on developing a QA system for code-mixed
cross-script data is at its initial stage. Therefore, we have collected a small set of data
which could be extended in future following a similar approach. Research
in QA primarily requires three data resources: (i) questions which are asked
to obtain a piece of information, (ii) answers to the asked questions as responses, and
(iii) potential sources of the answers from which a QA system can directly or
indirectly infer an answer to a question. We describe the acquisition of these
resources in this section. For the present study, we restricted our focus to the
tourism and sports domains, which are among the most popular domains in
social media. Social media data on other domains could be acquired with
the approach presented here. In the code-mixed cross-script QA scenario,
the resource development involves two separate processes: (i) collecting social
media text for the desired domains; and (ii) question acquisition and answer
annotation.</p>
      <sec id="sec-2-1">
        <title>Message(text) Acquisition</title>
        <p>
          For the document collection we consider social media, as it is the most likely
potential source of code-mixed cross-script data. We acquired messages
from different social media platforms, e.g., Twitter, blogs, forums, etc. For the
sports domain, we selected social media posts on 10 recently held exciting cricket
matches. Ten popular tourist spots in India were selected for the tourism domain.
The Tweepy API and an in-house focused crawler were employed for collecting tweets,
blogs, and forum posts. To collect only code-mixed data, we set a language
mixing ratio (i.e., non-native:native), computed by employing a language
identifier whose accuracy, as reported in [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ], is 92.4%. The language mixing
ratio (LMR) has been set to 0.2 after manually verifying a small set of crawled data.
Therefore, a message post is included in the corpus when at least 16.67% (i.e., 1
in 6) of its words belong to the non-native language.
        </p>
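      <p>The LMR filter described above can be sketched as follows. This is a hypothetical implementation: the language identifier is assumed to label each token "E" (non-native English) or "B" (native Bengali), with any other label ignored.</p>

```python
# Hypothetical sketch of the LMR filter; `identify` is assumed to label
# each token "E" (non-native English) or "B" (native Bengali).
def language_mixing_ratio(tokens, identify):
    english = sum(1 for t in tokens if identify(t) == "E")
    bengali = sum(1 for t in tokens if identify(t) == "B")
    return english / bengali if bengali else float("inf")

def keep_message(tokens, identify, threshold=0.2):
    # A post is kept when the English-to-Bengali ratio is at least 0.2,
    # i.e., at least 1 in 6 words is non-native.
    return language_mixing_ratio(tokens, identify) >= threshold
```

      <p>A message with 1 English word and 5 Bengali words yields LMR = 1/5 = 0.2 and is kept; a purely Bengali message yields LMR = 0 and is discarded.</p>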
        <p>Examples of valid messages (word subscripts nB, nE and nO denote native Bengali, non-native English and other tokens, respectively):</p>
        <p>a) Message: SA_nO jan_nB run_nE korechen_nB aj_nB BD_nO parben_nB ki_nB ?_nO</p>
        <p>LMR = #non-native / #native = #English words / #Bengali words = 1/5 = 0.2 (&gt;= 0.2)</p>
        <p>b) Message: Mashrafe_nO well_nE try_nE but_nE ki_nB r_nB kora_nB jabe_nB ..._nO captain_nE !!!_nO</p>
        <p>LMR = #non-native / #native = #English words / #Bengali words = 4/4 = 1 (&gt;= 0.2)</p>
        <p>
          The language identifier, as reported in [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ], does not identify named entities.
Considering the fact that the answer to a factoid question is always a named
entity, we filtered out, under human supervision, the messages which do not
contain any named entity. Thus, we finalized 299 posts as messages out of the 334
messages which were initially selected by the language identifier and the LMR
criterion.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Question Acquisition</title>
        <p>The question preparation task is more challenging than message
acquisition and requires more human involvement. Our prime target was to involve
as many question setters as possible to reduce bias. A cloud-based service was
built and requests were sent to the undergraduate students of the university. Two
groups, namely a sports-domain group (SG) and a tourism-domain group (TG), with
15 students each, were formed from the thirty students who agreed to take part in the question
annotation task. Ten topics on the sports domain were provided to each member
of SG and they were asked to submit at least 10 questions on each topic. The
submitted questions were stored on the web server along with the messages
associated with the topic. After receiving these questions, we kept only the questions
having a code-mixed nature and satisfying the LMR criterion. Subsequently, the
annotators were asked to find out the answers to their legitimate questions from
the stored messages. An analogous procedure was followed for TG.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Annotation</title>
      <p>For document management and storage, the Extensible Markup Language (XML)
was chosen because of its popularity and ease of understanding. The QA
annotation framework adopted in this work is depicted in Fig. 1. The tagset
defined in Table 1 was used for three purposes: document information, message
annotation and QA annotation. We will format the corpus in the Text Encoding
Initiative (TEI, http://www.tei-c.org/index.xml) format in future.</p>
      <p>Table 1. Tagset used for corpus annotation:
Question - Document body; CorpusID - Corpus id number;
Domain - Domain name; Topic - Topic name;
Data - Data section; Q - Question;
Q_id - Question unique number; Q_type - Question type, e.g., Factoid, Procedural;
Q_text - Code-mixed NL question; Q_Int - Interrogative class;
Ans - Answer; E_ans - Exact answer;
S_ans - Segment answer; M_ans - Message id of the message that contains the answer;
Msg - Public posts as messages</p>
      <p>A document in the corpus comprises a data section and a question section.
The data section contains the public posts collected from social media. Each
public post is referred to as a message and is described in the &lt;Msg&gt; tag.
Each message is assigned a unique number, i.e., msg_Id. The factoid questions
follow the data section. Each question is marked by the Q tag (i.e., &lt;Q&gt; and
&lt;/Q&gt;). Like each message, every question is also assigned a unique question
identifier. The question type (Q_type) denotes the type of a question, such as
factoid, procedural, etc. The code-mixed cross-script question is enclosed by the
Q_text tag.</p>
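      <p>A small sketch of how one annotated question might be assembled under this scheme, using Python's ElementTree. The tag names follow Table 1, but the element layout and the sample question and answer values are our assumptions for illustration only.</p>

```python
# Hypothetical illustration of the annotation scheme (tag names follow
# Table 1; element layout and sample values are assumptions).
import xml.etree.ElementTree as ET

q = ET.Element("Q", {"Q_id": "1", "Q_type": "Factoid", "Q_Int": "SI"})
ET.SubElement(q, "Q_text").text = "Mashrafe koto run korechen?"
ans = ET.SubElement(q, "Ans")
ET.SubElement(ans, "E_ans").text = "44"                       # exact answer
ET.SubElement(ans, "S_ans").text = "Mashrafe 44 run korechen" # supporting segment
ET.SubElement(ans, "M_ans").text = "msg_7"                    # supporting message id
```

      <p>Each question element thus carries its identifier, type and interrogative class as attributes, with the exact answer and its supporting evidence nested under the Ans element.</p>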
      <p>
        Interrogative types of questions are very useful for answer extraction
and validation. On the basis of syntactic structure, Bengali interrogatives are
classified into three categories: single interrogative (SI), dual interrogative (DI)
and compound interrogative (CI) [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. The interrogative type (i.e., SI, DI, and
CI) of a question gives a clue about the number information of the candidate
answer.
      </p>
      <p>The answer to a question is annotated by the Ans tag. The exact answer is
given in the E_ans tag. The segment answer (S_ans) tag refers to the portion or
segment of the message text which provides the answer. The message id from
which the exact answer can be found is given in the message answer (M_ans)
tag. The segment answer tag and the message tag can be thought of as supporting
information for the exact answer.</p>
    </sec>
    <sec id="sec-4">
      <title>Corpus Statistics</title>
      <p>The statistics of the messages, i.e., public posts and questions in the corpus
for the two different domains, namely Sports and Tourism, are given in Table
2. Altogether 299 code-mixed cross-script messages were collected of which 183
and 116 messages are from the tourism and sports domains respectively. 506
code-mixed cross-script questions were acquired of which 314 questions are from
the tourism domain and 192 questions belong to the sports domain. The average
number of messages per document (Avg. M/D in Table 2) is higher for the tourism
domain than for the sports domain, and accordingly the average number of questions generated
per document (Avg. Q/D in Table 2) is also higher for the tourism domain than for the
sports domain.</p>
    </sec>
    <sec id="sec-4a">
      <title>Evaluation</title>
      <p>State-of-the-art QA evaluation campaigns such as TREC (http://trec.nist.gov/) and CLEF (http://www.clef-initiative.eu/) use metrics such as accuracy, mean reciprocal rank (MRR) and c@1 for
monolingual and cross-lingual QA. In order to maintain consistency with
the state-of-the-art QA evaluation metrics, we also suggest the use of accuracy
and c@1 for the code-mixed cross-script QA task. As the prepared corpus
contains only one correct answer (as opposed to a list of exact answers) for every
question, MRR is not useful for evaluation on the said dataset. Just as in the
past ResPubliQA campaigns (http://nlp.uned.es/clef-qa/repository/resPubliQA.php), systems have the option of withholding the
answer to a question when they are not sufficiently confident that it is correct
(i.e., NoA). As per ResPubliQA, the inclusion of NoA improves system
performance by reducing the number of incorrect answers.</p>
      <p>Now, c@1 = (1/N) × (N_r + N_u × N_r/N)</p>
      <p>Accuracy = N_r / N</p>
      <p>c@1 = Accuracy, if N_u = 0; where N_r = number of right answers,
N_u = number of unanswered questions, and N = total number of questions.</p>
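      <p>The two metrics can be computed directly from these counts; a minimal sketch:</p>

```python
# Minimal sketch of the two metrics defined above:
# n_r right answers, n_u unanswered questions, n total questions.
def accuracy(n_r, n):
    return n_r / n

def c_at_1(n_r, n_u, n):
    # Unanswered questions earn partial credit proportional to the
    # system's accuracy, rewarding a withheld (NoA) answer over a wrong one.
    return (n_r + n_u * (n_r / n)) / n
```

      <p>For example, with 100 questions, 40 right and 20 unanswered, accuracy is 0.40 while c@1 is 0.48; when no questions are left unanswered the two metrics coincide.</p>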
      <p>Correct (C), partially-supported (P) and unsupported (U) answers all provide the exact
answer.</p>
      <p>Therefore, N_r = (#C + #U + #P).</p>
      <p>Considering the importance of the supporting segment, we introduce a new
metric, "answer-support performance" (ASP), which measures answer correctness
and is defined as follows:</p>
      <p>ASP = (1/N) × (c × 1.0 + p × 0.75 + i × 0.25)</p>
      <p>where c, p and i denote the total number of correct, partially-supported and
inexact answers, respectively.</p>
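      <p>The ASP metric above can be sketched as:</p>

```python
# Sketch of the proposed ASP metric: c correct, p partially-supported,
# i inexact answers, out of n questions in total.
def asp(c, p, i, n):
    return (c * 1.0 + p * 0.75 + i * 0.25) / n
```

      <p>For instance, with 20 questions of which 10 are answered correctly with correct support, 4 with partial support, and 2 inexactly, ASP = (10 + 3 + 0.5) / 20 = 0.675.</p>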
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>In this paper we presented a novel research problem: cross-script code-mixed
QA. Our major contributions include (i) proposing an annotation scheme, (ii)
creating a dataset which is the first resource of its kind, and (iii) proposing an
evaluation strategy suited to our corpus annotation. Bearing in mind
the small dataset size, the proposed evaluation methodology and the created dataset will
be helpful for the QA research and development community, particularly those
who want to address code-mixed cross-script QA.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgements</title>
      <p>We acknowledge the support of the Department of Electronics and Information
Technology (DeitY), Government of India, through the project "CLIA System
Phase II". The work of the third author was carried out in the framework of the SomEMBED
MINECO TIN2015-71147-C2-1-P research project.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Buscaldi</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gomez</surname>
            ,
            <given-names>J.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sanchis</surname>
          </string-name>
          , E.:
          <article-title>Answering Questions with an ngram based Passage Retrieval Engine</article-title>
          .
          <source>In: Journal of Intelligent Information Systems</source>
          ,
          <volume>34</volume>
          :
          <fpage>113</fpage>
          -
          <lpage>134</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Brill</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dumais</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Banko</surname>
            ,
            <given-names>M.:.</given-names>
          </string-name>
          <article-title>An analysis of the AskMSR question-answering system</article-title>
          .
          <source>In: Empirical methods in natural language processing-</source>
          Volume
          <volume>10</volume>
          , pp.
          <fpage>257</fpage>
          -
          <lpage>264</lpage>
          , Association for Computational Linguistics (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Zheng</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>AnswerBus question answering system</article-title>
          .
          <source>In: International conference on Human Language Technology Research</source>
          , pp.
          <fpage>399</fpage>
          -
          <lpage>404</lpage>
          , Morgan Kaufmann Publishers Inc. (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Ittycheriah</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Franz</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>W. J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ratnaparkhi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mammone</surname>
            ,
            <given-names>R. J.:</given-names>
          </string-name>
          <article-title>IBM's Statistical Question Answering System</article-title>
          . In: TREC (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Mohammed</surname>
            ,
            <given-names>F. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nasser</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harb</surname>
            ,
            <given-names>H. M.:</given-names>
          </string-name>
          <article-title>A knowledge based Arabic question answering system (AQAS)</article-title>
          .
          <source>In: ACM SIGART Bulletin</source>
          ,
          <volume>4</volume>
          (
          <issue>4</issue>
          ),
          <fpage>21</fpage>
          -
          <lpage>30</lpage>
          (
          <year>1993</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Kanaan</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hammouri</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Al-Shalabi</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Swalha</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>A new question answering system for the Arabic language</article-title>
          .
          <source>In: American Journal of Applied Sciences</source>
          ,
          <volume>6</volume>
          (
          <issue>4</issue>
          ),
          <volume>797</volume>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Hammo</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Abu-Salem</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lytinen</surname>
            ,
            <given-names>S.:</given-names>
          </string-name>
          <article-title>QARAB: A question answering system to support the Arabic language</article-title>
          . In: ACL-02 workshop on Computational approaches to semitic languages, pp.
          <fpage>1</fpage>
          -
          <lpage>11</lpage>
          , Association for Computational Linguistics (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Sakai</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Saito</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ichimura</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koyama</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kokubu</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manabe</surname>
          </string-name>
          , T.:
          <article-title>ASKMi: A Japanese Question Answering System based on Semantic Role Analysis</article-title>
          .
          <source>In: RIAO</source>
          , pp.
          <fpage>215</fpage>
          -
          <lpage>231</lpage>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Isozaki</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sudoh</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tsukada</surname>
          </string-name>
          , H.:
          <article-title>NTT's Japanese-English cross-language question answering system</article-title>
          .
          <source>In: NTCIR Workshop 5 Meeting</source>
          , pp.
          <fpage>186</fpage>
          -
          <lpage>193</lpage>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Yongkui</surname>
            ,
            <given-names>Z. H. A. N. G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zheqian</surname>
            ,
            <given-names>Z. H. A. O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lijun</surname>
            ,
            <given-names>B. A. I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xinqing</surname>
            ,
            <given-names>C. H. E. N.</given-names>
          </string-name>
          :
          <article-title>Internet-based Chinese Question-Answering System</article-title>
          . In: Computer Engineering,
          <volume>15</volume>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jiang</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>He</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yuan</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Chinese question answering based on syntax analysis and answer classification</article-title>
          .
          <source>In: Acta Electronica Sinica</source>
          ,
          <volume>36</volume>
          (
          <issue>5</issue>
          ) (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Banerjee</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bandyopadhyay</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Bengali Question Classification: Towards Developing QA System</article-title>
          . In: SANLP-COLING, IIT Mumbai, India (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Banerjee</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lohar</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Naskar</surname>
            ,
            <given-names>S. K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bandyopadhyay</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>The First Resource for Bengali Question Answering Research</article-title>
          . In: PolTAL 2014, Poland.
          <source>In: Advances in Natural Language Processing</source>
          , pp.
          <fpage>290</fpage>
          -
          <lpage>297</lpage>
          . Springer International Publishing (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Banerjee</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Naskar</surname>
            ,
            <given-names>S. K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bandyopadhyay</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>BFQA: A Bengali Factoid Question Answering System</article-title>
          .
          <source>In: Text, Speech and Dialogue (TSD)</source>
          , pp.
          <fpage>217</fpage>
          -
          <lpage>224</lpage>
          . Springer International Publishing, Czech Republic
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Gupta</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bali</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Banchs</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Choudhury</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Query Expansion for Mixed-script Information Retrieval</article-title>
          .
          <source>In: The 37th Annual ACM SIGIR Conference</source>
          , SIGIR-2014, Gold Coast, Australia, June 6-11, pp.
          <fpage>677</fpage>
          -
          <lpage>686</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>King</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Abney</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Labeling the languages of words in mixed-language documents using weakly supervised methods</article-title>
          .
          <source>In: NAACL-HLT</source>
          , pp.
          <fpage>1110</fpage>
          -
          <lpage>1119</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Barman</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wagner</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chrupala</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Foster</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Identification of languages and encodings in a multilingual document</article-title>
          .
          <source>In: EMNLP</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Choudhury</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chittaranjan</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gupta</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Das</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Overview of FIRE 2014 Track on Transliterated Search</article-title>
          . In: FIRE (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Banerjee</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuila</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roy</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Naskar</surname>
            ,
            <given-names>S. K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bandyopadhyay</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>A Hybrid Approach for Transliterated Word-Level Language Identification: CRF with Post Processing Heuristics</article-title>
          .
          <source>In: Forum for Information Retrieval Evaluation</source>
          , pp.
          <fpage>54</fpage>
          -
          <lpage>59</lpage>
          , ACM Digital Publication (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Peñas</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rodrigo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>A Simple Measure to Assess Non-response</article-title>
          .
          <source>In: 49th Annual Meeting of the Association for Computational Linguistics - Human Language Technologies (ACL-HLT 2011)</source>
          , Portland, Oregon, USA (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Voorhees</surname>
            ,
            <given-names>E.M.</given-names>
          </string-name>
          :
          <article-title>The TREC-8 question answering track report</article-title>
          .
          <source>In: 8th Text Retrieval Conference (TREC)</source>
          , Gaithersburg, Maryland, USA, pp.
          <fpage>77</fpage>
          -
          <lpage>82</lpage>
          (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>