=Paper=
{{Paper
|id=Vol-1587/T2-7
|storemode=property
|title=NELIS - Named Entity and Language Identification System: Shared Task System Description
|pdfUrl=https://ceur-ws.org/Vol-1587/T2-7.pdf
|volume=Vol-1587
|authors=Rampreeth Ethiraj,Sampath Shanmugam,Gowri Srinivasa,Navneet Sinha
|dblpUrl=https://dblp.org/rec/conf/fire/EthirajSSS15
}}
==NELIS - Named Entity and Language Identification System: Shared Task System Description==
Rampreeth Ethiraj, Sampath Shanmugam, Gowri Srinivasa
PES Center for Pattern Recognition, PESIT Bangalore South Campus, Bengaluru, Karnataka, India
ethirajrampreeth@gmail.com, sampath_shanmugam@outlook.com, gsrinivasa@pes.edu

Navneet Sinha
Rochester Institute of Technology, Rochester, New York, USA
navneet.sinha27@gmail.com

ABSTRACT
This paper proposes a simple and elegant solution for language identification and named entity (NE) recognition at the word level, as part of Subtask-1: Query Word Labeling of FIRE 2015. Given any query q: w1 w2 w3 ... wn in Roman script, the task calls for labeling each word of the query as English (En) or a member of L, where L = {Bengali (Bn), Gujarati (Gu), Hindi (Hi), Kannada (Kn), Malayalam (Ml), Marathi (Mr), Tamil (Ta), Telugu (Te)}. The approach presented in this paper combines a dictionary lookup with a Naive Bayes classifier trained over character n-grams. We also devise an algorithm to resolve ambiguities between languages for any given word in a query. Our system achieved f-measure scores of 85-90% for four languages and 74-80% for another four.

Keywords
Language Identification, N-grams, Naive Bayes classifier

1. INTRODUCTION
India's linguistic heritage is one of the richest in the world, and the country has been called a "Museum of Languages". India is a multi-language, multi-script country with 22 official languages, a large number of which are written using indigenous scripts. However, websites and user-generated content such as tweets and blogs in these languages are often written in Roman script [1] due to various social, cultural and technological reasons. This paper presents an approach to analyze a sentence written in En and a transliterated language from L, where L = {Bn, Gu, Hi, Kn, Ml, Mr, Ta, Te}, adopting the Roman script, from sources such as tweets, blogs and user-generated messages, and to pin down the language every word belongs to.

The philosophy of this approach was partially inspired by how humans identify the languages of words. First, if a word is part of their vocabulary, they know its language. If the word is unfamiliar, they tend to make a guess based on the structure of the word. Finally, if they are given a sentence and have managed to decode the language of a few words, they can make a fairly accurate guess about the language of the unknown words as well. A close analogy can be drawn between the above and the approach suggested in this paper: the human language vocabulary corresponds to the language dictionaries, and the guess based on the features of the word is performed by a Naive Bayes classifier using character n-grams as features. A logical method for disambiguation is also suggested in this paper.

2. DATASETS
The core of the system was building strong dictionaries for each language. The word lists used to compile the dictionaries are listed in Table 1.

Table 1. Primary sources used to prepare dictionaries.
Class                  | Source
Bn, Hi, Gu             | FIRE 2013 Dataset [2]
En                     | Mieliestronk's word list (http://www.mieliestronk.com/wordlist.html) + FIRE 2013 Dataset
MIX                    | FIRE 2015 Dataset
NE                     | FIRE 2015 Dataset
Bn, Gu, Kn, Ml, Ta, Te | List of the most frequently used English words [3], translated and transliterated

The most frequently used words in En were translated into their respective Indian language equivalents using Google's online translation service (https://translate.google.com/). However, the translated words were all in their native scripts and had to be transliterated into their Roman equivalents. The process of phonetically representing the words of a language in a non-native script is called transliteration [4]. Baraha Software (http://www.baraha.com) was used to transliterate these words into their Roman script equivalents.

While this sufficed for En and Hi, the data collected was not enough for accurate classification of the other languages. Thus, in addition to these word lists, data had to be mined from other sources to account for various spelling variations [5] and to capture the commonly used words of each language. These secondary sources include song lyrics, common SMS messages and 'learn to speak' websites found online. Even shorthand notations of various words were effectively captured from these sources. For example, in Gu, 'che' is also sometimes spelt as '6e'.

We manually extracted language words in Roman form from these secondary sources, cleaned them and keyed them into the dictionaries. Table 2 lists these secondary sources. Comprehensive dictionaries were hence manually formed for each language; Table 3 lists their final sizes.

Table 2. Secondary sources used to prepare dictionaries.
Song lyrics (Bn, Gu, Kn, Ml, Mr, Ta, Te):
  Kn, Mr, Te: http://www.hindilyrics.net/
  Gu: http://songslyricsever.blogspot.com/p/blog-page_9289.html
  Ml: http://www.malayalamsonglyrics.net
  Bn: http://www.lyricsbangla.com
  Ta: http://www.paadalvarigal.com/
SMS messages and 'learn to speak' websites (Bn, Gu, Kn, Ml, Mr, Ta, Te):
  http://www.funbull.com/sms/sms-jokes.asp
  http://www.omniglot.com/language/phrases/langs.html
Commonly used SMS abbreviations (X):
  http://www.connexin.net/internet-acronyms.html
Common names of people, places, organizations and brands (NE):
  https://bitbucket.org/happyalu/corpus_indian_names/downloads
  http://simhanaidu.blogspot.in/2013/01/text-list-of-indian-cities-alphabetical.html
  http://www.elections.in/political-parties-in-india/
  http://business.mapsofindia.com/top-brands-india/

Table 3. Final sizes of all language dictionaries.
Language | Dictionary size (in words)
En       | 97271
Hi       | 26094
Ta       | 23992
Te       | 25472
Bn       | 19573
Mr       | 10564
Gu       | 20729
Ml       | 22219
Kn       | 32479

Organization of dictionaries: each language dictionary was divided into sub-dictionaries based on the starting character, sorted alphabetically, to speed up the process of dictionary lookup. For example, all tokens of a language that start with 'a' are grouped together.
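As a purely illustrative sketch (not the authors' code), the first-character grouping and lookup described above could be organized in Python roughly as follows; the sample words and helper names are assumptions made for the example.

```python
from collections import defaultdict

def build_subdictionaries(words):
    """Group one language's word list into sub-dictionaries keyed by starting character."""
    subdicts = defaultdict(set)
    for word in words:
        word = word.strip().lower()
        if word:
            subdicts[word[0]].add(word)
    return subdicts

def in_dictionary(token, subdicts):
    """Look a token up only in the sub-dictionary of its starting character."""
    token = token.lower()
    return bool(token) and token in subdicts.get(token[0], set())

# Illustrative usage with a handful of Hi words (placeholders, not the real dictionary).
hi = build_subdictionaries(["aap", "antim", "kahan", "kyun", "praan", "yatra"])
print(in_dictionary("antim", hi))   # True
print(in_dictionary("yaava", hi))   # False
```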
3. APPROACH
Problem statement: given a query in Roman script, the task is to label each word as En or as a member of L. Two assumptions are made:
1. The words of a single query usually come from one or two languages, and very rarely from three.
2. In the case of mixed-language queries, one of the languages is either En or Hi.

The approach is divided into two parts: Section 3.1 explains the process of classifying tokens, while Section 3.2 elaborates on the process of disambiguation. Figure 1 depicts the overall process.

3.1 Classification of Tokens
The system built to demonstrate this approach was written entirely in Python, using the NLTK package (http://www.nltk.org) for processing and classification. The test file provided consisted of utterances (sentences or queries). The system read the input file utterance by utterance, and each utterance was tagged token (word) by token, sequentially. Section 3.1.1 explains the tagging of X tokens with regular expressions, and Section 3.1.2 explains the tagging of language tokens. At the end of the process, an annotated output file was generated.

3.1.1 Regular Expression Based Tagging
Regular expressions were used to match X tokens [6]. Table 4 shows the expressions used and their class. The X dictionary was also referenced in case none of the expressions matched the token.

Table 4. Regular expressions used to tag X.
Regular Expression | Class
r'[\.\=\:\;\,\#\@\(\)\`\~\$\*\!\?\"\+\-\\\/\|\{\}\[\]\_\<\>\%\&]+' | X
r'[0-9]+' | X
r'[a-zA-Z]+[\@]+[a-zA-Z\.]*' | X
r'http+' | X
r'www.[A-Za-z0-9]+.com' | X
r'[A-Za-z0-9]+.com' | X
r'[0-9]+[tT][hH]' | X
r'[0-9]*[1]+[sS][tT]' | X
r'[0-9]*[2][nN][dD]' | X
r'[0-9]*[3][rR][dD]' | X
r'[^a-zA-z]', with token length = 1 | X
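As an illustrative sketch only, the regular-expression pass could look roughly like the following in Python; a prefix match (re.match), the subset of patterns shown and the tiny fallback X dictionary are assumptions made for the example, not the authors' exact implementation.

```python
import re

# A few of the patterns from Table 4 (not the complete list).
X_PATTERNS = [
    r'[\.\=\:\;\,\#\@\(\)\`\~\$\*\!\?\"\+\-\\\/\|\{\}\[\]\_\<\>\%\&]+',
    r'[0-9]+',
    r'[a-zA-Z]+[\@]+[a-zA-Z\.]*',
    r'http+',
    r'www\.[A-Za-z0-9]+\.com',
    r'[0-9]+[tT][hH]',
]

X_DICTIONARY = {"lol", "omg", "brb"}  # placeholder SMS abbreviations, not the real X dictionary

def tag_x(token):
    """Return 'X' if the token matches an X pattern or the X dictionary, else None."""
    for pattern in X_PATTERNS:
        if re.match(pattern, token):
            return "X"
    if token.lower() in X_DICTIONARY:
        return "X"
    return None

print(tag_x("26th"))   # X (ordinal)
print(tag_x("yatra"))  # None (left for language tagging)
```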
3.1.2 Language Tagging
To tag language tokens, a combination of dictionary lookup and a Naive Bayes classifier was used. The subsections below explain the process; the two techniques were applied sequentially.

3.1.2.1 Dictionary Lookup and Tagging
The dictionaries of all languages were looked up for every token that had not already been tagged as X, MIX or NE. Three cases could arise:

Case 1: The token belongs to exactly one language. It is tagged with that language.
Case 2: The token belongs to more than one language. It is tagged as ambiguous, along with the set of languages causing the ambiguity.
Case 3: The token is not found in any of the language dictionaries. The Naive Bayes classifier is used to guess the language, as explained in Section 3.1.2.2.

After all tokens had been tagged by the dictionary, the number of occurrences of each language tag was aggregated. This count is used later while trying to resolve ambiguity.

The overall pipeline (Figure 1) is: read the input file; tokenize utterance[i], for i = 1 to n; tag X tokens of utterance[i] with regular expressions and the X dictionary; tag MIX and NE tokens with their respective dictionaries; tag language tokens by performing language dictionary lookups, tagging a token as ambiguous if it is present in multiple language dictionaries; tag all remaining untagged tokens using the Naive Bayes classifier; resolve ambiguity using the algorithm specified in Section 3.2; write the output file and repeat until i = n.
Figure 1. Overall process of tagging, from input to output.

3.1.2.2 Naive Bayes Classifier and Tagging
An inherently multiclass Naive Bayes classifier from the NLTK package was trained specifically for language identification. Each language l in L is a class. During training, the frequencies of co-occurrences of character n-grams in the language dictionaries prepared in Section 2 were analyzed. An n-gram is an n-character slice of a longer string [7]. A frequency distribution of character 2-grams, 3-grams, 4-grams and 5-grams was studied and used to train the classifier. A token t is assigned the language

  lang = argmax_{l in L} P(t | l) P(l) / P(t)

where lang is the language of a given token, t is the token and l is a language in L.

Those tokens that were not tagged after the dictionary lookup were tagged by the Naive Bayes classifier. After all tokens had been tagged by the classifier, the number of occurrences of each language tag was again aggregated, but this time the count for each language was multiplied by a specific weight based on the accuracy of the classifier for that particular language. These values were added to the counts previously computed for each language during the dictionary lookups of Section 3.1.2.1.
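For illustration, character n-gram features can be fed to NLTK's Naive Bayes classifier roughly as follows. The tiny training list and the boolean feature encoding are assumptions made for this sketch; the real system was trained on the full dictionaries of Section 2.

```python
import nltk

def char_ngram_features(word, n_values=(2, 3, 4, 5)):
    """Boolean presence features for the character n-grams of a word."""
    word = word.lower()
    features = {}
    for n in n_values:
        for i in range(len(word) - n + 1):
            features["ngram=" + word[i:i + n]] = True
    return features

# Placeholder (word, language) pairs; the real system used the dictionaries of Section 2.
train_words = [("namaskara", "Kn"), ("yaava", "Kn"), ("bartini", "Kn"),
               ("kahan", "Hi"), ("kyun", "Hi"), ("tumhara", "Hi"),
               ("hello", "En"), ("world", "En"), ("language", "En")]
train_set = [(char_ngram_features(w), lang) for w, lang in train_words]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print(classifier.classify(char_ngram_features("namaskara")))  # most likely 'Kn'
```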
3.2 Further Processing and Disambiguation
Disambiguating words that belong to multiple languages tends to be a challenge unless the context of the utterance is known. For bilingual utterances, based on observation of the training set, we concluded that it is more probable for En to be one of the two languages.

To begin the process, yet another count is performed, this time exclusively over the ambiguous tokens: the number of occurrences of each language is computed and multiplied by a weight. Let this weight be size_l for any given language l in L, where size_l is the size of the dictionary of language l divided by the total size of the language dictionaries; En was not taken into account while computing the total sum, because of the large size of its dictionary. These newly computed scores were added to the scores computed previously in Section 3.1.2.2 for each language and were used to determine the language(s) of the utterance; the language with the maximum score is ranked highest.

The challenge, then, was to identify either a single language or a pair of languages for each utterance. This was done by identifying the most frequently occurring Indian language, say lang, in the utterance, together with the count of En in the utterance, as computed previously. The steps involved in resolving ambiguity in an utterance are as follows.

Step 1: All unambiguous tokens that belonged to neither lang nor En were converted to lang. This decision was made given the strength of the En dictionary: the probability of a new word belonging to En, given that it is not in the En dictionary, is low.
Step 2: All ambiguous tokens where the ambiguity was between En and another language or set of languages, and lang was absent, were converted to En.
Step 3: All ambiguous tokens where the ambiguity was between lang and another language or set of languages were converted to lang.
Step 4: For all ambiguous tokens that were not disambiguated in the previous steps, the following rule was applied: if the token is not the first token in the utterance and the previous token is a language token, the token takes the language of the previous token; otherwise, if the next token is a language token, the current token takes that language; otherwise, it is tagged as En.

This scheme works by identifying the overall language(s) of the utterance and then narrowing down to the language of the individual token for disambiguation.
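For illustration only, the four disambiguation steps could be coded roughly as below. The token representation, the tag names and the 'AMB' marker are assumptions made for the sketch; the utterance-level scoring that selects lang is assumed to have been done beforehand.

```python
def resolve_ambiguity(tokens, lang):
    """tokens: list of dicts with keys 'word', 'tag' and, for tag 'AMB', 'candidates'
    (the set of languages causing the ambiguity). 'tag' is a language code such as
    'En', 'Hi', 'Gu', or one of 'AMB', 'X', 'MIX', 'NE'.
    lang: the most frequently occurring Indian language of the utterance."""
    non_language = {"AMB", "X", "MIX", "NE", None}

    def is_language(tag):
        return tag not in non_language

    for i, tok in enumerate(tokens):
        tag = tok["tag"]
        if tag in ("X", "MIX", "NE"):
            continue
        if tag != "AMB":
            # Step 1: unambiguous tokens belonging to neither lang nor En become lang.
            if tag not in ("En", lang):
                tok["tag"] = lang
            continue
        candidates = tok["candidates"]
        if "En" in candidates and lang not in candidates:
            tok["tag"] = "En"            # Step 2
        elif lang in candidates:
            tok["tag"] = lang            # Step 3
        else:
            # Step 4: copy a neighbouring language tag, otherwise default to En.
            prev_tag = tokens[i - 1]["tag"] if i > 0 else None
            next_tag = tokens[i + 1]["tag"] if i + 1 < len(tokens) else None
            if is_language(prev_tag):
                tok["tag"] = prev_tag
            elif is_language(next_tag):
                tok["tag"] = next_tag
            else:
                tok["tag"] = "En"
    return tokens

# Illustrative usage on a toy utterance.
utterance = [{"word": "praan", "tag": "Hi"},
             {"word": "ni", "tag": "AMB", "candidates": {"Gu", "Hi"}},
             {"word": "antim", "tag": "Hi"}]
print([t["tag"] for t in resolve_ambiguity(utterance, "Hi")])  # ['Hi', 'Hi', 'Hi']
```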
4. RESULTS
A single run was submitted for the subtask; the results are summarized in Table 5 and Table 6.

Table 5. Summary of the scores obtained for each class.
Class | Strict Precision | Strict Recall | Strict F-measure
MIX   | 0     | 0     | 0
NE    | 0.645 | 0.326 | 0.433
X     | 0.952 | 0.941 | 0.947
Bn    | 0.795 | 0.921 | 0.853
En    | 0.898 | 0.852 | 0.874
Gu    | 0.270 | 0.490 | 0.349
Hi    | 0.713 | 0.841 | 0.771
Kn    | 0.937 | 0.814 | 0.871
Ml    | 0.675 | 0.830 | 0.744
Mr    | 0.808 | 0.774 | 0.791
Ta    | 0.912 | 0.872 | 0.891
Te    | 0.774 | 0.778 | 0.777

Table 6. Summary of the overall scores obtained.
Measure             | Run-1
Tokens Accuracy     | 82.715
Utterances Accuracy | 26.389
Average F-measure   | 0.692
Weighted F-measure  | 0.829

5. ERROR ANALYSIS
The system yields promising results for word-level language identification and named entity recognition: Bn, En, Kn and Ta all have f-measures above 85%, and the remaining languages, with the exception of Gu, have f-measures above 74%.

Errors introduced during translation and transliteration must be accounted for. The accuracy for Gu was comparatively low. Upon detailed analysis, it was observed that various spelling variations could not be accounted for, either in the dictionaries or during training. Also, much ambiguity existed between Hi and Gu; because Hi words occur more frequently, the system is biased towards Hi in such ambiguous situations. This made it particularly difficult to identify Gu correctly in short utterances. For example, in the query 'praan ni antim yatra' from the training set, praan, antim and yatra are all Hi words too.

The system also fails to tag MIX words in the test dataset, because MIX tokens are present in specific language dictionaries in the training data. For example, 'account-la' (where 'account' is En and 'la' is Ta) is present in the Ta dictionary. This explains the low scores for MIX.

6. CONCLUSION AND FUTURE SCOPE
In this paper, we presented a brief synopsis of a methodology to classify query words into their respective languages. The methodology couples a dictionary lookup with a Naive Bayes classifier to accomplish the task. Using word-level n-grams as features for the Naive Bayes classifier could be experimented with, and a new approach to identify and tag MIX tokens will have to be devised. Furthermore, the accuracy for Gu, and the overall accuracy of the system, can be improved by devising a new technique to handle the ambiguity between Hi and Gu.

7. REFERENCES
[1] Umair Z. Ahmed, Kalika Bali, Monojit Choudhury, Sowmya VB. Challenges in Designing Input Method Editors for Indian Languages: The Role of Word-Origin and Context. In Proceedings of the WTIM, pages 1-9, 2011.
[2] FIRE 2013 Dataset. Datasets for FIRE 2013. URL: http://cse.iitkgp.ac.in/resgrp/cnerg/qa/fire13translit/index.html. Last accessed: October 5, 2015.
[3] first20hours. google-10000-english/20k.txt. URL: https://github.com/first20hours/google-10000-english/blob/master/20k.txt. Last accessed: October 5, 2015.
[4] Kevin Knight, Jonathan Graehl. Machine Transliteration. Computational Linguistics, pages 599-612, 1998.
[5] Royal Denzil Sequiera, Shashank S. Rao, Shambavi B R. Word-Level Language Identification and Back Transliteration of Romanized Text: A Shared Task Report by BMSCE. Shared Task System Description in MSRI FIRE Working Notes, 2014.
[6] Navneet Sinha, Gowri Srinivasa. Hindi-English Language Identification, Named Entity Recognition and Back Transliteration: Shared Task System Description. Shared Task System Description in MSRI FIRE Working Notes, 2014.
[7] William B. Cavnar, John M. Trenkle. N-Gram-Based Text Categorization. In Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval, pages 161-169, 1994.