Hierarchical Classification for Multilingual Language Identification and Named Entity Recognition

Saatvik Shah, CSE, MNIT, Jaipur, Rajasthan-302017, saatvikshah1994@gmail.com
Vaibhav Jain, CSE, MNIT, Jaipur, Rajasthan-302017, vj.25071996@gmail.com
Sarthak Jain, ECE, MNIT, Jaipur, Rajasthan-302017, sarthakssj@gmail.com
Anshul Mittal, EE, MNIT, Jaipur, Rajasthan-302017, anshulmittal712@gmail.com
Jatin Verma, EE, MNIT, Jaipur, Rajasthan-302017, jatin.verma.205@gmail.com
Shubham Tripathi, EE, MNIT, Jaipur, Rajasthan-302017, stripathi1770@gmail.com
Rajesh Kumar, EE, MNIT, Jaipur, Rajasthan-302017, rkumar.ee@gmail.com

ABSTRACT
This paper describes the approach for Subtask-1 of the FIRE-2015 Shared Task on Mixed Script Information Retrieval. The subtask involved multilingual language identification (including mixed words and anomalous foreign words), named entity recognition (NER) and subclassification. The proposed methodology starts with cleaning the data and then extracting structural and contextual features from the text for further processing. A subset of these features is selected (based on validation) for training supervised classifiers, separately for language identification and NER. Finally, the classifiers are applied hierarchically to annotate the entire text. The detected named entities are further subclassified by a novel unsupervised technique based on query refinement and keyword-based scoring. On the testing dataset of the shared task, the proposed approach showed promising results with a weighted F-measure of 0.8082. However, it is worth noting that the classifiers have been sub-optimal at discriminating between certain linguistically similar languages (e.g., Hindi and Gujarati pairs). The proposed approach is flexible and robust enough to handle additional languages for identification as well as anomalous foreign or extraneous words. The implementation of the approach has also been shared for future research usage.

Keywords
Language Identification; NER; Information Retrieval; Wikipedia; Query Refinement; Ensembling

1. INTRODUCTION
The ubiquitous use of indigenous languages in social media and search engines has led to an increasing need for techniques to tackle mixed script information retrieval. In such an environment, language and named entity identification has become both a necessary and challenging task. At this juncture, the shared task on mixed script information retrieval at the FIRE workshop serves as an excellent source of insights into the solution of such a problem. This paper addresses language identification (LI) and Named Entity Recognition (NER) for text in social media. Such text may involve, for example, switching of language within a sentence (code switching), or even words composed of mixed languages (code mixing). In addition, people use phonetic typing and irregular spellings in social media ('2nyt' in place of 'tonight' or 'y' in place of 'why'). All these challenges make the correct identification of language in social media data a tough task.

The task required rigorous preprocessing of both the training (validation) and testing datasets. Section 2 describes, in detail, the approach followed for it. The rest of the paper is organized as follows. Section 3 describes all the features extracted. Sections 4, 5 and 6 describe the hierarchical classification approach implemented for tagging tokens into their respective categories. Sections 7 and 8 cover the Results and Conclusion. A web tool has been developed for the implemented approach and can be accessed at https://mixscian.herokuapp.com. The source code has been shared at https://github.com/saatvikshah1994/hline for future research purposes.

2. PREPROCESSING
During training, the following preprocessing steps were performed (a minimal code sketch of steps 2-4 follows this section):

1. The data was prepared by skipping all the X's present in the input file, i.e., punctuation, numerals, emoticons, mentions (words starting with @), hashtags (words starting with #) and acronyms, directly with respect to the given annotation file.
2. A Unicode-based cleanup of words was then performed, mapping Unicode characters to their closest matching ASCII representatives (e.g., "Zine™" to "ZineTM").
3. After that, each utterance was formatted to remove any punctuation marks appearing in between letters.
4. Finally, erroneous utterances (with an unequal number of words and labels) were discarded.

During testing, the process started with the Unicode-based cleanup, followed by removal of punctuation or any other invalid characters in between letters from tokens.
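The following is a minimal, illustrative sketch of steps 2-4 above, not the authors' released implementation. It uses only the Python standard library; the helper names (`clean_token`, `strip_inner_punctuation`, `preprocess_utterance`) are hypothetical.

```python
import re
import unicodedata

def clean_token(token):
    """Unicode-based cleanup: map characters to their closest ASCII
    representatives, e.g. 'Zine\u2122' -> 'ZineTM' (step 2)."""
    decomposed = unicodedata.normalize("NFKD", token)
    return decomposed.encode("ascii", "ignore").decode("ascii")

def strip_inner_punctuation(token):
    """Remove punctuation marks appearing in between letters (step 3)."""
    return re.sub(r"(?<=\w)[^\w\s]+(?=\w)", "", token)

def preprocess_utterance(tokens, labels):
    """Discard erroneous utterances with unequal numbers of words and
    labels (step 4); clean the tokens of the remaining utterances."""
    if len(tokens) != len(labels):
        return None  # discarded as erroneous
    cleaned = [strip_inner_punctuation(clean_token(t)) for t in tokens]
    return cleaned, labels
```

The NFKD-then-ASCII idiom reproduces the paper's "Zine™" to "ZineTM" example exactly, since the compatibility decomposition of the trademark sign is the letter pair "TM".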
[Figure 1: Hierarchical Classification Workflow]

3. FEATURE EXTRACTION
The proposed technique uses a comprehensive set of features [1] for both language identification and NER. These are explained briefly below (a sketch of the normalization features, items 7 and 8, follows this section):

1. Word Context: features extracted from the token (word) being labeled and its surrounding context. Considering w0 as the current token, a subset of all features was taken for up to 2 context words on both the left and the right (w-2, w-1, w0, w+1, w+2).
2. Prefix and Suffix: prefixes and suffixes, defined as variable-length character sequences (here, of lengths 3 and 2), are stripped from each token and used as features for the classifier.
3. Stemming/Lemmatization: a step applying stemming, lemmatization or both to input tokens before extracting features was added to the feature extraction pipeline and tested.
4. Character-level n-grams: a character n-gram is a contiguous sequence of n characters extracted from a given word. Character n-grams of length two (bigram), three (trigram), four (quad-gram) and five (penta-gram) are extracted and used as features.
5. POS tags: POS tagging (or grammatical tagging) marks the part of speech of a word in the corpus based on both its definition and its context [2].
6. Relative Position: the relative position of the token within its utterance was added as a feature.
7. Word Normalization: word normalization captures similarity between words with common structure; e.g., "HahaHHa123" is normalized to "AaaaAAa000", with "A" denoting a capital letter, "a" a lowercase letter and "0" a numerical character.
8. Compressed Word Normalization: the output of Word Normalization is compressed to remove repetitions of identical adjacent characters; e.g., "AaaaAAa000" becomes "AaAa0".
9. Composition features: a feature was defined whose value depended on the letter-case and character composition of the word. The following four values were possible:
• AllCaps (the word is made up of only capital letters),
• AllSmall (the word is made up of only lowercase letters),
• WordDigit (the word contains both digits and letters),
• InitCaps (the first character is uppercase).

These features, having variable length for every token under consideration, had to be further processed into a Bag-of-Words form. A term frequency matrix was computed for them, resulting in sparse matrices which were finally passed on to the classifiers.
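As an illustration of items 7 and 8 above, here is a minimal sketch of the two normalization features; the function names are hypothetical, not taken from the released code.

```python
import re

def word_normalize(token):
    """Map each character to a class symbol: 'A' for uppercase,
    'a' for lowercase, '0' for digits (other characters kept as-is).
    E.g. 'HahaHHa123' -> 'AaaaAAa000'."""
    out = []
    for ch in token:
        if ch.isupper():
            out.append("A")
        elif ch.islower():
            out.append("a")
        elif ch.isdigit():
            out.append("0")
        else:
            out.append(ch)
    return "".join(out)

def compress_normalized(norm):
    """Collapse runs of the same class symbol.
    E.g. 'AaaaAAa000' -> 'AaAa0'."""
    return re.sub(r"(.)\1+", r"\1", norm)

assert word_normalize("HahaHHa123") == "AaaaAAa000"
assert compress_normalized("AaaaAAa000") == "AaAa0"
```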
4. PUNCTUATION (X) RECOGNITION
This section describes the methodology used for tagging X's, i.e., tokens which are punctuation marks, numerals, mentions, hashtags, acronyms or emoticons. The process begins with a rule-based approach for isolating the hashtags, mentions and numerals, by checking for tokens that start with '@' or '#', or tokens composed entirely of digits. The tokens are then checked for web URLs, email addresses and punctuation. A regular-expression-based classification was implemented to search for tokens starting with 'http://' or 'https://', or any token of the format 'abc@pqr.xyz'. Also, any token composed only of punctuation marks is isolated and tagged as X. Another regular expression check is applied to catch any emoticon such as ':-)', ';)', ':}', etc. A dictionary-based approach is then applied to check whether a token exists in a list of popular acronyms; if there is a match, the corresponding token is also classified as an X. Once these X's have been filtered out, a cleaning operation was performed on the remaining tokens. Cleaning is defined as converting Unicode characters to the nearest matching ASCII characters, for example converting "Zine™" to "ZineTM". After this preprocessing, the 'cleaned' tokens are sent for Named Entity Recognition. A rule-based sketch of this stage is given below.
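A minimal rule-based sketch of the X-recognition stage follows. The exact regular expressions and the acronym dictionary are assumptions for illustration; the paper does not publish its patterns.

```python
import re
import string

# Assumed patterns; the paper's actual expressions are not published.
URL_RE      = re.compile(r"^https?://\S+$")
EMAIL_RE    = re.compile(r"^\S+@\S+\.\S+$")          # 'abc@pqr.xyz' format
EMOTICON_RE = re.compile(r"^[:;=8][\-o*']?[)\](\[dDpP/}{@|\\]$")
ACRONYMS    = {"lol", "omg", "brb", "idk"}           # illustrative subset

def is_x(token):
    """Return True if the token should be tagged as X."""
    if token.startswith(("@", "#")):                  # mention / hashtag
        return True
    if token.isdigit():                               # numeral
        return True
    if URL_RE.match(token) or EMAIL_RE.match(token):  # URL / email address
        return True
    if all(ch in string.punctuation for ch in token): # pure punctuation
        return True
    if EMOTICON_RE.match(token):                      # emoticon like ':-)'
        return True
    return token.lower() in ACRONYMS                  # known acronym
```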
5. NAMED ENTITY RECOGNITION
This task focuses on tagging those tokens which are named entities, i.e., names of persons, locations, organizations, etc. in the given text. Each named entity can be:

1. classified as {NE}, OR
2. sub-classified as {NE_L, NE_P, NE_O, NE_PA, NE_LA, NE_OA, NE_Ls, NE_X, NE_XA}

Methodology: the tagging of named entities has been divided into two steps: (1) a supervised approach for binary classification and (2) an unsupervised approach for sub-classification.

5.1 Supervised Approach
The supervised approach includes training classifiers [3], namely a linear-kernel SVM (using liblinear), Logistic Regression and a Random Forest model, on the extracted features explained in Section 3. The input data for training included this year's training data along with the dataset of FIRE 2014. All of these were implemented using transform pipelines containing a Clean transform (to clean the data) and a Feature Extractor. Finally, a term frequency matrix was used to obtain numeric features. Based on individual feature cross-validation, the most important features that contributed to identifying the named entities were:

• Word (and Compressed) Normalization
• Local Context (with a smaller window)
• Relative Position
• Prefix/Suffix
• POS Tags
• Composition Features

An ensemble model for NER was developed which performs a logical OR on the predicted results of the individual classifiers to decide the final tag for a particular token. One (1) denotes that a token is a named entity and zero (0) a non-named entity:

Tag_out = 0 OR (Tag_classifier_1) OR (Tag_classifier_2) OR ... OR (Tag_classifier_n)    (1)

where Tag_out is the predicted result for each token, Tag_classifier_k is the tag predicted by classifier k, and n is the number of classifiers used. A sketch of this voting rule is given below.
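A minimal sketch of the logical-OR ensemble of equation (1), using scikit-learn-style classifiers. The three estimator choices mirror the paper, but the hyperparameters, class name and training interface are assumptions.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

class OrEnsembleNER:
    """Tag a token as NE (1) if ANY base classifier predicts 1,
    per equation (1); otherwise tag it as non-NE (0)."""

    def __init__(self):
        self.models = [LinearSVC(),              # linear-kernel SVM (liblinear)
                       LogisticRegression(),
                       RandomForestClassifier()]

    def fit(self, X, y):
        for model in self.models:
            model.fit(X, y)
        return self

    def predict(self, X):
        # Stack per-classifier 0/1 predictions column-wise, then take the
        # row-wise maximum, which is exactly a logical OR over binary tags.
        preds = np.column_stack([m.predict(X) for m in self.models])
        return preds.max(axis=1)
```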
5.2 Unsupervised Approach
An unsupervised approach based on parsing Wikipedia was employed to find out whether a predicted named entity is a person, location or organization [6]. The Wikipedia page of a named entity provides information comprising the title of the page, the summary, the section headings, the infobox contents, the categories section, etc., indicating whether it is a person, location or organization. The prediction is done using three reference sets, each containing keywords that are commonly used in the context of a person, location or organization. For a person, keywords can be 'born', 'Education', 'President', etc.; for a location, they can be 'place', 'nation', etc.; for an organization, they can be 'party', 'company', etc. While parsing the Wikipedia page of 'Narendra Modi', we will surely come across the word 'President', declaring 'Narendra Modi' as 'NE_P', i.e., a person.

However, it was found that for some named entities like AAP, Mark, etc., Wikipedia displays a disambiguation page, whereas some named entities like Sonia, Pradep, etc. did not even have a Wikipedia page. Thus, a systematic approach overcoming these shortcomings of plain Wikipedia parsing was formulated, as described below.

Assuming the named entity to be a single word, let the named entity that is to be sub-classified be denoted by the variable token. The following procedure is applied (a condensed sketch of steps 2-5 appears at the end of this section):

1. First, any punctuation marks, numerals or special characters are removed from token.
2. token is searched using the Wikipedia API for Python, which returns an object containing the information about token extracted from its Wikipedia page.
3. Using this object, all the above-mentioned information is concatenated to form a single string denoted wiki-extract. For convenience, the information is first converted to lowercase, and any punctuation marks, numerals, Unicode characters, etc. are removed. Each word in wiki-extract is lemmatized and, to counter the dependence on internet access and decrease processing time, the final wiki-extract is cached in a file against the corresponding token.
4. The reference-set keywords are searched for in wiki-extract. The number of times a keyword is spotted in wiki-extract is the score of the subclass ('NE_P', 'NE_L' or 'NE_O') to which the keyword belongs.
5. After all keyword searches are completed, each subclass has a final score. If the difference between the maximum subclass score and the second maximum is greater than or equal to a confidence threshold (5 in this paper), the token is annotated with the subclass of maximum score; otherwise it is annotated as 'en'. If the final scores of all subclasses are zero, it is likewise marked as 'en'.

If, in step 2, Wikipedia raises a disambiguation error, the token is searched on Bing using its Python API, 'py_bing_search'. As long as the data under process belongs to social networking sites covering popular topics like politics, movies, etc., the Bing results help in finding the most popular/probable meaning of the token in many ambiguous cases. The steps followed are:

i. Using 'py_bing_search', token appended with the word 'wikipedia' (e.g. 'AAP wikipedia') is searched on Bing so that it returns URLs of Wikipedia pages at top priority. Only the first three URLs are considered.
ii. Only those URLs that point to a Wikipedia page other than a disambiguation page are considered.
iii. As soon as the desired URL is obtained, the proper term that would not cause a disambiguation error and could replace token is extracted from that URL. For example, if the token was 'AAP', Bing will return the URL 'https://en.wikipedia.org/wiki/Aam_Aadmi_Party', and the term 'Aam Aadmi Party' extracted from this URL will replace token.
iv. The rest of the steps, searching the new token on Wikipedia, are the same as steps 2 to 5 above.

If, among the three URLs, none satisfies the condition of being a non-disambiguation Wikipedia page, then under the disambiguation error there is an option to fetch the alternatives offered by the Wikipedia disambiguation page. Only the first five options are considered in this paper. These five options are concatenated to form wiki-extract, and the rest of the procedure is the same as steps 3 to 5, keeping the confidence threshold at zero.

If, in step 2, Wikipedia raises a page error, then Bing search is used again to replace token with a proper term that would not cause the page error; the steps followed are the same as steps i to iv. Again, if among the three URLs obtained from 'py_bing_search' none satisfies the condition of being a non-disambiguation Wikipedia page, the token is annotated as 'en'.

Apart from the above algorithm, it is constantly checked whether the term succeeding token is also a named entity (provided token is not positioned last in the current utterance); if so, the two are treated as a single entity and searched on Wikipedia together. For example, suppose the tokens are 'White' and 'House', both marked as named entities. If we search 'White' on Wikipedia we will not get the desired results, but if we search 'White House' we may well get it as a location, and thus both 'White' and 'House' are annotated as 'NE_L'. If the tokens are misspelled or ambiguous, Wikipedia may raise a page or disambiguation error; in that case, the combined token is first searched on Bing to find a term that could replace it. If this fails to find a non-disambiguation Wikipedia page, the process is reverted and the two tokens are searched individually again, following the procedure discussed earlier.
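The condensed sketch of steps 2-5 referenced above follows. It assumes the `wikipedia` package (the "Wikipedia API for Python" named in step 2) and uses illustrative reference-set keywords; the function name and the keyword lists are assumptions, and the lemmatization and caching of step 3 are omitted for brevity.

```python
import re
import wikipedia  # the 'Wikipedia API for Python' used in step 2

# Illustrative reference sets; the paper's full keyword lists are larger.
REFERENCE_SETS = {
    "NE_P": ["born", "education", "president"],
    "NE_L": ["place", "nation"],
    "NE_O": ["party", "company"],
}

def subclassify(token, threshold=5):
    """Steps 2-5: build wiki-extract, score each subclass by keyword
    counts, and apply the confidence threshold (5 in the paper)."""
    page = wikipedia.page(token)  # may raise Disambiguation/Page errors
    extract = " ".join([page.title, page.summary, " ".join(page.categories)])
    extract = re.sub(r"[^a-z\s]", " ", extract.lower())  # lowercase, strip rest
    words = extract.split()

    scores = {sub: sum(words.count(kw) for kw in kws)
              for sub, kws in REFERENCE_SETS.items()}

    best, second = sorted(scores.values(), reverse=True)[:2]
    if best == 0 or best - second < threshold:
        return "en"  # all-zero scores or insufficient confidence
    return max(scores, key=scores.get)
```

The Bing-based fallbacks for disambiguation and page errors (steps i to iv) are not shown; in the paper they recover a replacement term from a Wikipedia URL and then re-run these same steps.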
6. LANGUAGE IDENTIFICATION
This section addresses the problem of identifying the language of origin of a query term in a code-mixed query [4]. For language identification, a linear-kernel SVM (using liblinear) classifier was used, with the important features being:

• Word n-grams (bi-grams, tri-grams, quad-grams and penta-grams)
• Local Knowledge
• Part-of-Speech Tags
• Composition Features

Parameter optimization was performed using cross-validation-based tuning. The supervised approach [5] was implemented with the classifier trained on the current year's training dataset and the previous year's dataset. A transform pipeline similar to the one used for NER was implemented for training the classifiers. A multilevel classification was performed, with the possible tags for each token being:

1. one of the languages in L = {(en), (hi), (bn), (gu), (ka), (ml), (mr), (ta), (te)};
2. a token made of two languages, {MIX}, and its subcategory {MIX_Lr_Ls}, where r and s are the languages of the root and the suffix respectively;
3. none of the above, {O}.

For the purpose of training, the datasets were cleaned, and tokens annotated as NE (or its subcategories) or as X were removed. The trained classifier was then applied on the testing set to obtain a prediction for each token.

7. RESULTS
In this section, results are reported for the proposed experiments. To investigate and find the most effective of the methods described in the preceding sections, results were initially obtained on the development (training) set, and the final model was selected using 4-fold cross-validation. Thereafter, a run was submitted for Subtask-1. The overall performance was measured in terms of Utterance, Token and NE accuracies, along with precision, recall and F-measure scores for individual languages and the sub-divided NEs. The Average F-measure and Weighted F-measure scores obtained were 0.654 and 0.808 respectively. In comparison, the Mean Average F-measure and Weighted F-measure scores were 0.5395 and 0.6990, and the Max Average F-measure and Weighted F-measure scores were 0.6917 and 0.8299. Per-label results are given in Table 1.

       Bn     En     Gu     Hi     Kn     Ml     Mr     Ta     Te     X      NE     NE_L   NE_P
P      0.878  0.958  0.097  0.817  0.575  0.394  0.705  0.937  0.431  0.961  0.368  0.722  0.2121
R      0.966  0.848  0.5    0.74   0.829  0.752  0.79   0.708  0.687  0.966  0.528  0.124  0.25
F-S    0.838  0.9    0.163  0.776  0.679  0.517  0.745  0.806  0.529  0.964  0.433  0.214  0.229

Table 1: Strict Precision (P), Recall (R) and F-scores (F-S) for languages and NEs.

8. CONCLUSION
In this paper, a system was described for Subtask-1, i.e., Query Word Labelling, in the FIRE Shared Task 2015 on Mixed Script Information Retrieval. The validation procedure indicated that the context of a token, its POS tag, its structure and its normalization play key roles in both NER and language identification. It is interesting to note that the performance on linguistically dissimilar languages such as English and Bengali has been top notch. On the other hand, similar language pairs such as Hindi and Marathi have invariably led to the classifier getting confused. The use of search engines to refine query terms, followed by the use of publicly available encyclopedias such as Wikipedia for subclassifying named entities, is a newly proposed technique that has given promising results.

9. REFERENCES
[1] Gupta, D. K., Kumar, S., & Ekbal, A. Machine Learning Approach for Language Identification & Transliteration: Shared Task Report of IITP-TS.
[2] Abhinaya, N., John, N., Anand Kumar, M., & Soman, K. P. (2014). Amrita @ FIRE-2014: Named Entity Recognition for Indian Languages.
[3] Dubey, S., Goel, B., Prabhakar, D. K., & Pal, S. ISM @ FIRE-2014: Named Entity Recognition Indian Languages.
[4] King, B., & Abney, S. P. (2013, June). Labeling the Languages of Words in Mixed-Language Documents using Weakly Supervised Methods. In HLT-NAACL (pp. 1110-1119).
[5] Das, A., & Gambäck, B. (2014). Code-Mixing in Social Media Text: The Last Language Identification Frontier? Traitement Automatique des Langues (TAL): Special Issue on Social Networks and NLP, TAL Volume 54 no 3/2013, Pages 41-64.
[6] Nothman, J., Ringland, N., Radford, W., Murphy, T., & Curran, J. R. (2013). Learning multilingual named entity recognition from Wikipedia. Artificial Intelligence, 194, 151-175.