<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>NELIS - Named Entity and Language Identification System: Shared Task System Description</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rampreeth Ethiraj</string-name>
          <email>ethirajrampreeth@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sampath Shanmugam</string-name>
          <email>sampath_shanmugam@outlook.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gowri Srinivasa</string-name>
          <email>gsrinivasa@pes.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>PES Center for Pattern Recognition, PESIT Bangalore South Campus, Bengaluru</institution>
          ,
          <addr-line>Karnataka</addr-line>
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Rochester Institute of Technology, Rochester</institution>
          ,
          <addr-line>New York</addr-line>
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <fpage>43</fpage>
      <lpage>46</lpage>
      <abstract>
        <p>This paper proposes a simple and elegant solution for language identification and named entity (NE) recognition at the word level, as a part of Subtask-1: Query Word Labeling of FIRE 2015. Given any query q = w1 w2 w3 … wn in Roman script, the task calls for labeling the words of the query as English (En) or as a member of L, where L = {Bengali (Bn), Gujarati (Gu), Hindi (Hi), Kannada (Kn), Malayalam (Ml), Marathi (Mr), Tamil (Ta), Telugu (Te)}. The approach presented in this paper combines a dictionary lookup with a Naïve Bayes classifier trained over character n-grams. We also devise an algorithm to resolve ambiguities between languages for any given word in a query. Our system achieved impressive f-measure scores of 85-90% in four languages and 74-80% in another four languages.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>The philosophy of this approach was inspired partially by how
humans identify languages of words. First, if the word is a part of
their vocabulary, then they know the language of the word. If the
word is unfamiliar to them, then they tend to make a guess, based
on the structure of the word. Finally, if they are given a sentence
and have managed to decode the language of a few words, then
they can make a fairly accurate guess about the language of the
unknown words as well. A close analogy can be drawn between
the above and the approach suggested in this paper; the human
language vocabulary is equivalent to the language dictionaries and
the guess made based on the features of the word is performed by
the Naïve Bayes classifier, using n-gram as features. A logical
method for disambiguation is suggested in this paper.</p>
    </sec>
    <sec id="sec-2">
      <title>2. DATASETS</title>
      <p>The core of the system was building strong dictionaries for each
language. The wordlists used to compile the dictionaries are listed
in Table 1.</p>
      <sec id="sec-2-1">
        <title>Table 1: Wordlists used to compile the dictionaries</title>
        <p>
          The dictionaries were compiled from Mieliestronk's word list
(http://www.mieliestronk.com/wordlist.html), the FIRE 2013 Dataset
[<xref ref-type="bibr" rid="ref2">2</xref>], the FIRE 2015 Dataset, and a
list of the most frequently used English words
[<xref ref-type="bibr" rid="ref3">3</xref>] that was translated and
transliterated into Bn, Gu, Kn, Ml, Mr, Ta and Te, covering the classes
En, MIX, NE and the eight Indian languages.
        </p>
        <sec id="sec-2-1-4">
          <title>Translation and Transliteration</title>
          <p>
            The most frequently used words in En were translated into their
respective Indian language equivalents, using Google's online
translation service (https://translate.google.com/). But the translated
words were all in their native scripts; these had to be transliterated
into their Roman equivalents. The process of phonetically representing
the words of a language in a non-native script is called transliteration [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ].
Baraha Software (http://www.baraha.com) was used to transliterate these
words into their Roman script equivalents.
          </p>
          <p>
            While this sufficed for En and Hi, the data collected was not
enough for accurate classification of other languages. Thus, in
addition to these word lists, mining of data from other sources was
necessary to account for various spelling variations [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ] and also to
capture the commonly used words of each language. These
secondary sources include song lyrics, common SMS messages,
and 'learn to speak' websites found online. Even shorthand
notations of various words were effectively captured from these
sources.
          </p>
          <p>For example, consider Gu: 'che' is also sometimes spelt as '6e'.
We manually extracted language words in Roman form from
these secondary sources, cleaned them, and keyed them into the
dictionaries. Table 2 lists these secondary sources.</p>
          <p>Comprehensive dictionaries were hence manually formed for each
language. Table 3 lists the final sizes of all language dictionaries.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. APPROACH</title>
      <p>Problem Statement: given a query in Roman script, the task is to
label each word as En or as a member of L. Two assumptions are made:</p>
      <p>The words of a single query usually come from one or two
languages, and very rarely from three.</p>
      <p>In the case of mixed-language queries, one of the languages
is either En or Hi.</p>
      <p>The approach is divided into two sections; Section 3.1 explains
the process of classification of tokens, while Section 3.2
elaborates on the process of disambiguation. Figure 1 depicts the
overall process.</p>
      <sec id="sec-3-1">
        <title>Table 2: Secondary sources used to build the dictionaries</title>
        <p>Song lyrics:
Kn, Mr, Te: http://www.hindilyrics.net/
Gu: http://songslyricsever.blogspot.com/p/blog-page_9289.html
Ml: http://www.malayalamsonglyrics.net
Bn: http://www.lyricsbangla.com
Ta: http://www.paadalvarigal.com/</p>
        <p>SMS messages and 'learn to speak' websites:
http://www.funbull.com/sms/sms-jokes.asp
http://www.omniglot.com/language/phrases/langs.html</p>
        <p>Commonly used SMS abbreviations:
http://www.connexin.net/internet-acronyms.html</p>
        <p>Common names of people, places, organizations and brands:
https://bitbucket.org/happyalu/corpus_indian_names/downloads
http://simhanaidu.blogspot.in/2013/01/text-list-of-indian-cities-alphabetical.html
http://www.elections.in/political-parties-in-india/
http://business.mapsofindia.com/top-brands-india/</p>
        <p>Final dictionary sizes (Table 3): 97271, 26094, 23992, 25472,
19573, 10564, 20729, 22219 and 32479 entries across the nine
language classes.</p>
        <p>Organization of dictionaries: the tokens of each language were
grouped by their first character to speed up the process of dictionary
lookup. For example, all tokens of a language that started with 'a'
would be grouped together.</p>
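        <p>The first-letter grouping described above can be sketched as a
simple index (a minimal illustration; the paper does not specify the
exact data structure, and the sample words are made up):</p>
        <preformat>
```python
from collections import defaultdict

def build_index(words):
    """Group a language's dictionary words by their first character."""
    index = defaultdict(set)
    for w in words:
        index[w[0]].add(w)
    return index

def lookup(index, token):
    """Membership test that only scans the token's first-letter bucket."""
    return token in index.get(token[0], set())

# A tiny, hypothetical Kn dictionary for illustration.
kn_index = build_index(["naanu", "neenu", "baruve", "bandaru"])
```
        </preformat>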
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3.1 Classification of Tokens</title>
      <p>
        The system built to demonstrate this approach was written entirely
in Python, using the NLTK package (http://www.nltk.org) for processing
and classification. The test file provided consisted of utterances
(sentences or queries). The system read the input file utterance by
utterance, and each utterance was tagged token (word) by token,
sequentially. Section 3.1.1 explains the tagging of X tokens with
regular expressions, and Section 3.1.2 explains the process of tagging
language tokens. At the end of the process, an annotated output
file was generated.
3.1.1 Regular Expression based Tagging
Regular expressions were used to match X tokens [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Table 4
shows the expressions used and their class. The X dictionary was
also referenced in case none of the expressions matched the token.
3.1.2 Language Tagging
To tag language tokens, a combination of dictionary lookup and
a Naïve Bayes classifier was used. The subsections below explain
the process of tagging language tokens; the techniques were
combined and applied sequentially.
3.1.2.1 Dictionary Lookup and Tagging
The dictionaries of all languages were looked up each time, if the
token had not already been tagged as X, MIX or NE. Three cases
could arise:
      </p>
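      <p>The regular-expression tagging can be sketched as follows. The
patterns below are hypothetical stand-ins: the exact expressions of
Table 4 are not recoverable, but every match is tagged with the class
X, as in the system:</p>
      <preformat>
```python
import re

# Hypothetical stand-ins for the Table 4 patterns; the exact
# expressions used by the system are not reproduced here.
X_PATTERNS = [
    re.compile(r"^\d+$"),            # plain numbers
    re.compile(r"^[\W_]+$"),         # punctuation and symbols
    re.compile(r"^\d+[a-zA-Z]+$"),   # alphanumeric tokens such as 12th
]

def tag_x(token):
    """Return 'X' when any pattern matches, else None so the token
    falls through to the dictionary lookup."""
    for pattern in X_PATTERNS:
        if pattern.match(token):
            return "X"
    return None
```
      </preformat>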
      <p>Case 1: The token belongs to exactly one language. Hence, tag it
as this language.
Case 2: The token belongs to more than one language. Tag it as
ambiguous, along with the set of languages causing the ambiguity.
Case 3: The token is not found in any of the language dictionaries.
Use the Naïve Bayes classifier to guess the language, as explained in
Section 3.1.2.2.</p>
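      <p>The three cases can be sketched as a single lookup function (the
toy dictionaries here are illustrative only):</p>
      <preformat>
```python
def dictionary_tag(token, dictionaries):
    """dictionaries maps a language tag to its set of known words.
    Case 1: exactly one language contains the token -> that language.
    Case 2: several languages contain it -> ambiguous, with the set.
    Case 3: no dictionary contains it -> None (defer to the classifier)."""
    matches = [lang for lang, words in dictionaries.items() if token in words]
    if len(matches) == 1:
        return matches[0]
    if len(matches) > 1:
        return ("AMBIGUOUS", set(matches))
    return None

# Toy dictionaries for illustration.
dictionaries = {
    "Hi": {"mera", "naam"},
    "Gu": {"mera", "che"},
}
```
      </preformat>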
      <p>
        After all tokens had been tagged by the dictionary, an aggregation
of the number of occurrences of each language tag was
performed. This is used later while trying to resolve ambiguity.
3.1.2.2 Naïve Bayes Classifier and Tagging
An inherently multiclass Naïve Bayes classifier from the NLTK
package was trained specifically for language identification. Each
language l in L is a class. While training, the frequencies of
co-occurrences of character n-grams in the language dictionaries
prepared in Section 2 were analyzed. An n-gram is an n-character
slice of a longer string [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. A frequency distribution of character
2-grams, 3-grams, 4-grams and 5-grams was studied and used for the
purpose of training the classifier.
      </p>
      <p> (P(t | l)P(l))
lang  arglmLax  P(t) 
where lang is the language of a given token, t is the token and l is
a language in L.</p>
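      <p>The paper uses NLTK's Naïve Bayes classifier over character
n-gram features; the training step and the argmax decision above can
be sketched self-containedly as follows. This is a simplified,
add-one-smoothed illustration, not the authors' exact feature set:</p>
      <preformat>
```python
import math
from collections import Counter, defaultdict

def char_ngrams(word, sizes=(2, 3, 4, 5)):
    """Character n-grams for n = 2..5, the sizes studied in the paper."""
    grams = []
    for n in sizes:
        grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams

class CharNgramNB:
    def __init__(self):
        self.gram_counts = defaultdict(Counter)  # language -> n-gram counts
        self.word_counts = Counter()             # language -> training words

    def train(self, labeled_words):
        for word, lang in labeled_words:
            self.word_counts[lang] += 1
            self.gram_counts[lang].update(char_ngrams(word))

    def classify(self, token):
        # lang = argmax over l of P(l) * product of P(gram | l),
        # computed in log space with add-one smoothing.
        total = sum(self.word_counts.values())
        best_lang, best_score = None, -math.inf
        for lang in self.word_counts:
            counts = self.gram_counts[lang]
            denom = sum(counts.values()) + len(counts) + 1
            score = math.log(self.word_counts[lang] / total)
            for gram in char_ngrams(token):
                score += math.log((counts[gram] + 1) / denom)
            if score > best_score:
                best_lang, best_score = lang, score
        return best_lang
```
      </preformat>
      <p>In the actual system, nltk.NaiveBayesClassifier plays this role.</p>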
      <p>Those tokens that were not tagged after the dictionary lookup
were tagged by the Naïve Bayes classifier. After all tokens had
been tagged by the classifier, an aggregation of the number of
occurrences of each language tag was performed. But this time,
the number of occurrences of each language was multiplied by a
specific weight, based on the accuracy of the classifier for that
particular language. These weighted counts were added to the
values previously computed for each language while performing
the language dictionary lookups in Section 3.1.2.1.</p>
      <sec id="sec-4-1">
        <title>Table 4: Regular expressions used to tag X tokens</title>
        <p>Every expression listed in Table 4 maps a matched token to the
class X.</p>
      </sec>
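      <p>The weighted aggregation can be sketched as follows; the
per-language accuracies here are hypothetical placeholders, not the
values used in the submitted run:</p>
      <preformat>
```python
def combine_scores(dict_counts, clf_counts, clf_accuracy):
    """Add classifier tag counts, each weighted by the classifier's
    accuracy for that language, to the dictionary-lookup counts."""
    scores = dict(dict_counts)
    for lang, count in clf_counts.items():
        scores[lang] = scores.get(lang, 0) + count * clf_accuracy.get(lang, 0.0)
    return scores
```
      </preformat>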
    </sec>
    <sec id="sec-5">
      <title>3.2 Further Processing and Disambiguation</title>
      <p>Disambiguation of words belonging to multiple languages tends to
be a challenge, unless the context of the utterance is known. In
cases where utterances were bilingual, based on observation of the
training set, we concluded that it is more probable for En to be a
part of the bilingual utterance.</p>
      <p>To begin the process, we perform yet another count, this time
exclusively for ambiguous tokens. A count of the number of
occurrences of each language was computed and multiplied by a
weight. Let this weight be size_l for any given language l in L,
where size_l is the size of the dictionary of l divided by the total
size of all the language dictionaries. En was not taken into account
while computing the total sum, because of the large size of its
dictionary. These newly computed scores were added to the scores
computed previously in Section 3.1.2.2 for each language and were
used to determine the language(s) of the utterance. The language
with the maximum score is ranked highest.</p>
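      <p>Reading the weight as each language's share of the combined
dictionary size, with En left out of the total as stated above, size_l
can be computed as follows (the formula itself was lost in extraction,
so this interpretation is an assumption):</p>
      <preformat>
```python
def size_weights(dictionary_sizes):
    """size_l = |dictionary of l| / sum of all dictionary sizes,
    where En is excluded from the sum (assumed reconstruction)."""
    total = sum(size for lang, size in dictionary_sizes.items()
                if lang != "En")
    return {lang: size / total
            for lang, size in dictionary_sizes.items() if lang != "En"}
```
      </preformat>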
      <p>Hence, the challenge was to identify either a language or a pair
of languages for each utterance. This was done by identifying the
most frequently occurring Indian language, say lang, in an utterance,
and the count of En in this utterance, as computed previously. The
steps involved in resolving ambiguity in an utterance are as
follows.</p>
      <p>Step 1: All unambiguous tokens that belonged to neither lang nor
En were converted to lang. This assumption was made given the
strength of the En dictionary, as the probability of a new word
belonging to En, given that it is not in the En dictionary, is low.</p>
      <p>Step 2: All ambiguous tokens, where the ambiguity was between En
and another language or a set of languages, and lang is absent, were
converted to En.</p>
      <p>Step 3: All ambiguous tokens, where the ambiguity was between
lang and another language or a set of languages, were converted to
lang.</p>
      <p>Step 4: For all ambiguous tokens that were not disambiguated in
the previous steps: if the token is not the first token in the
utterance and the previous token is a language token, then the token
is given the same language as the previous token; else, if the next
token is a language token, then the current token is given the same
language; else, it is tagged as En. This is repeated for each token i
till i = n.</p>
      <p>This scheme worked by identifying the overall language(s) of the
utterance and then narrowing it down to the language of the
individual token, for disambiguation.</p>
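      <p>Steps 1-4 can be sketched as follows; this is a simplified
illustration in which an ambiguous tag is represented as a set of
candidate languages:</p>
      <preformat>
```python
NON_LANGUAGE_TAGS = ("X", "NE", "MIX")

def is_language(tag):
    """A concrete language tag, as opposed to X/NE/MIX or an ambiguous set."""
    return isinstance(tag, str) and tag not in NON_LANGUAGE_TAGS

def disambiguate(tagged_tokens, lang):
    """Resolve the tags of one utterance, given its dominant Indian
    language `lang`. tagged_tokens is a list of (word, tag) pairs;
    an ambiguous tag is a set of candidate languages."""
    tags = [tag for _, tag in tagged_tokens]
    for i, tag in enumerate(tags):
        if isinstance(tag, set):
            if lang in tag:
                tags[i] = lang                # Step 3
            elif "En" in tag:
                tags[i] = "En"                # Step 2 (lang absent)
        elif is_language(tag) and tag not in (lang, "En"):
            tags[i] = lang                    # Step 1
    for i, tag in enumerate(tags):            # Step 4: inherit a neighbour
        if isinstance(tag, set):
            if i > 0 and is_language(tags[i - 1]):
                tags[i] = tags[i - 1]
            elif i + 1 != len(tags) and is_language(tags[i + 1]):
                tags[i] = tags[i + 1]
            else:
                tags[i] = "En"
    return tags
```
      </preformat>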
    </sec>
    <sec id="sec-6">
      <title>4. RESULTS</title>
      <p>A single run was submitted for the subtask; the results are
summarized in Table 5 and Table 6.</p>
      <sec id="sec-6-1">
        <title>Table 5: Strict precision, recall and f-measure per class</title>
        <p>Class: Strict Precision / Strict Recall / Strict F-measure
MIX: 0 / 0 / 0
NE: 0.645 / 0.326 / 0.433
X: 0.952 / 0.941 / 0.947
Bn: 0.795 / 0.921 / 0.853
En: 0.898 / 0.852 / 0.874
Gu: 0.270 / 0.490 / 0.349
Hi: 0.713 / 0.841 / 0.771
Kn: 0.937 / 0.814 / 0.871
Ml: 0.675 / 0.830 / 0.744
Mr: 0.808 / 0.774 / 0.791
Ta: 0.912 / 0.872 / 0.891
Te: 0.774 / 0.778 / 0.777</p>
      </sec>
      <sec id="sec-6-2">
        <title>Table 6: Overall measures for Run-1</title>
        <p>Tokens Accuracy: 82.715
Utterances Accuracy: 26.389
Average F-measure: 0.692
Weighted F-measure: 0.829</p>
        <p>The system fails to tag MIX words in the test dataset due to the
presence of MIX tokens in specific language dictionaries in the
training data. For example, account-la, where account is En and la is
Ta, is present in the Ta dictionary. This explains the low scores for
MIX.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>5. ERROR ANALYSIS</title>
      <p>This system yields very promising results for word-level language
identification and named entity recognition. Bn, En, Kn and Ta all
have f-measures above 85%. Similarly, the remaining languages, with
the exception of Gu, have f-measures above 74%. Errors during
translation and transliteration also have to be accounted for.</p>
      <p>The accuracy of Gu was comparatively low. Upon detailed analysis,
it was observed that various spelling variations could not be
accounted for, neither in the dictionaries nor while training. Also,
much ambiguity existed between Hi and Gu. Because Hi words occur more
frequently, the system is biased towards Hi in such ambiguous
situations. This made it particularly difficult to correctly identify
Gu in utterances of short length. For example, from the training set
provided: praan ni antim yatra. Here praan, antim and yatra are all
Hi words too, so the utterance is biased towards Hi.</p>
    </sec>
    <sec id="sec-8">
      <title>6. CONCLUSION AND FUTURE SCOPE</title>
      <p>In this paper, we present a brief synopsis of a methodology to
classify query words into their respective languages. The methodology
combines a dictionary lookup with a Naïve Bayes classifier to
accomplish the task. Usage of word-level n-grams as a feature for the
Naïve Bayes classifier can be experimented with. A new approach to
identify and tag MIX tokens will have to be devised. Furthermore, the
accuracy of Gu, and the overall accuracy of the system, can be
improved by devising a new technique to handle the ambiguity between
Hi and Gu.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Umair Z.</given-names>
            <surname>Ahmed</surname>
          </string-name>
          , Kalika Bali, Monojit Choudury, Sowmya VB.
          <article-title>Challenges in Designing Input Method Editors for Indian Languages: The Role of Word-Origin and Context</article-title>
          .
          <source>In Proceedings of the WTIM</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] FIRE 2013 Dataset. <article-title>Datasets for FIRE 2013</article-title>. URL: http://cse.iitkgp.ac.in/resgrp/cnerg/qa/fire13translit/index.html. Last accessed: October 5, <year>2015</year>.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] first20hours. google-10000-english/20k.txt. URL: https://github.com/first20hours/google-10000-english/blob/master/20k.txt. Last accessed: October 5, <year>2015</year>.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Kevin</given-names>
            <surname>Knight</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jonathan</given-names>
            <surname>Graehl</surname>
          </string-name>
          .
          <article-title>"Machine Transliteration"</article-title>
          .
          <source>Computational Linguistics</source>
          , pages
          <fpage>599</fpage>
          -
          <lpage>612</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] <string-name><given-names>Royal Denzil</given-names> <surname>Sequiera</surname></string-name>, Shashank S. Rao, Shambavi B R. <article-title>Word-Level Language Identification and Back Transliteration of Romanized Text: A Shared Task Report by BMSCE</article-title>. <source>Shared Task System Description in MSRI FIRE Working Notes</source>, <year>2014</year>.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Navneet</given-names>
            <surname>Sinha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Gowri</given-names>
            <surname>Srinivasa</surname>
          </string-name>
          .
          <article-title>Hindi-English Language Identification, Named Entity Recognition and Back Transliteration: Shared Task System Description</article-title>
          .
          <source>Shared Task System Description in MSRI FIRE Working Notes</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] <string-name><given-names>William B.</given-names> <surname>Cavnar</surname></string-name>, <string-name><given-names>John M.</given-names> <surname>Trenkle</surname></string-name>. <article-title>N-Gram-Based Text Categorization</article-title>. <source>In Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval</source>, pages <fpage>161</fpage>-<lpage>169</lpage>, <year>1994</year>.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>