Code Mixed Entity Extraction in Indian Languages using Neural Networks

Irshad Ahmad Bhat, LTRC, IIIT-H, Hyderabad, India (irshad.bhat@research.iiit.ac.in)
Manish Shrivastava, LTRC, IIIT-H, Hyderabad, India (m.shrivastava@iiit.ac.in)
Riyaz Ahmad Bhat, LTRC, IIIT-H, Hyderabad, India (riyaz.bhat@research.iiit.ac.in)

ABSTRACT
In this paper we present our submission for the FIRE 2016 Shared Task on Code Mixed Entity Extraction in Indian Languages. We describe a neural network system for entity extraction in Hindi-English code-mixed text. Our method uses distributed word representations as features for the neural network and can therefore easily be replicated across languages. Our system ranked first for Hindi-English with an F1-score of 68.24%.

1. INTRODUCTION
This paper describes our system for the FIRE 2016 Shared Task on Code Mixed Entity Extraction in Indian Languages. The shared task focuses on NLP approaches for identifying named entities such as person names, organization names, product names and location names in code-mixed text.

We present a simple feed-forward neural network for Named Entity Recognition (NER) that uses distributed word representations built with word2vec [2] and no language-specific resources other than unlabeled corpora.

The rest of the paper is organized as follows. In Section 2, we describe the data of the shared task. In Section 3, we discuss in detail the methodology we adopted to address the problem of NER. Experiments and results based on our methodology are discussed in Section 4. Finally, we conclude in Section 5.

2. DATA
The Entity Extraction in Code-Mixed (CM) Indian-language data shared task covers NER in two language pairs, namely Hindi-English (H-E) and Telugu-English (T-E). However, we received data only for the Hindi-English language pair.

2.1 Data Format
The training data is provided in two files: a raw-tweets file and an annotation file. (1) below shows the format of raw tweets in the train and test data, while (2) shows the format of named entity annotations in the training data.

(1) TweetID, UserID, Tweet
(2) TweetID, UserID, NE-Tag, NE, startIndex, Length

2.2 Data Statistics
Table 1 shows the train and test data statistics after tokenization. Table 2 shows the tag statistics in the training data. Note that the OTHER tag in Table 2 is not an NE tag; we introduced it for all non-NE tokens.

Data-set    # Tweets    # Tokens
Train          2,700      39,216
Test           7,429     150,056

Table 1: Data Statistics

Tag             Count    Tag          Count
ENTERTAINMENT     858    LOCOMOTIVE      11
PERSON            338    YEAR             9
LOCATION           88    MATERIALS        9
ORGANIZATION       72    TIME             8
COUNT              65    FACILITIES       8
PERIOD             59    LIVTHINGS        5
ARTIFACT           29    DISEASE          5
DATE               27    SDAY             3
MONEY              23    MONTH            3
DAY                18    OTHER       38,366

Table 2: Tag statistics in the training data

3. METHODOLOGY
We model the task as a classification problem in which each token has to be labelled with one of the 20 tags given in Table 2. For this classification task, we use a simple neural network architecture: a standard feed-forward neural network with a single layer of hidden units. The output layer uses a softmax function for probabilistic multi-class classification. The model is trained by minimizing the cross-entropy loss with L2-regularization over the entire training data. We use mini-batch Adagrad for optimization and apply dropout.

We explored various token-level and contextual features to build an optimal neural network using the provided training data. These features can be broadly grouped as described below; a feature-extraction sketch follows the list.

Contextual Word Features: the current word and the 2 words to either side of it.

Contextual Prefix Features: the prefix of the current word and the prefixes of the 2 words to either side of it. All prefixes are of length 3.

Contextual Suffix Features: the suffix of the current word and the suffixes of the 2 words to either side of it. All suffixes are of length 3.

Non-lexical Features: a capitalization feature and a length feature. The capitalization feature encodes whether a word is in upper case, lower case or title case. The length feature represents the token length in the form of bins: 1-5, 6-8 and the rest. The non-lexical features are added for the current word only.
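To make the feature set concrete, the following is a minimal sketch of how these token-level and contextual features could be extracted for a single token. The function names, the padding symbol and the bin labels are illustrative assumptions of this sketch, not part of the original system.

```python
# Sketch of the contextual and non-lexical feature extraction described above.
# Names (extract_features, PAD, bin labels) are illustrative assumptions.

PAD = "<PAD>"  # padding symbol for positions outside the tweet boundary

def cap_class(word):
    # capitalization feature: upper case, lower case or title case
    if word.isupper():
        return "UPPER"
    if word.istitle():
        return "TITLE"
    if word.islower():
        return "LOWER"
    return "MIXED"

def length_bin(word):
    # token length binned as 1-5, 6-8 and the rest
    n = len(word)
    if n <= 5:
        return "LEN_1_5"
    if n <= 8:
        return "LEN_6_8"
    return "LEN_9_PLUS"

def extract_features(tokens, i, window=2, affix_len=3):
    # pad the context so every token has `window` neighbours on each side
    padded = [PAD] * window + tokens + [PAD] * window
    context = padded[i:i + 2 * window + 1]
    return {
        "words":    context,                            # contextual word features
        "prefixes": [w[:affix_len] for w in context],   # length-3 prefixes
        "suffixes": [w[-affix_len:] for w in context],  # length-3 suffixes
        "cap":      cap_class(tokens[i]),               # current word only
        "length":   length_bin(tokens[i]),              # current word only
    }

# example: features for the third token of a code-mixed tweet
print(extract_features("Kal Shahrukh Khan ka movie dekha".split(), 2))
```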
We include the lexical features in the input layer of the neural network using the distributed word representations, while for the non-lexical features we use randomly initialized 3-dimensional vectors in the range -0.25 to +0.25. We use Hindi and English monolingual corpora to learn the distributed representations of the lexical units. The English monolingual data contains around 280M sentences, while the Hindi data is comparatively smaller and contains around 40M sentences. To learn the prefix and suffix embeddings, we simply created prefix and suffix corpora from the original monolingual corpora of Hindi and English and then used word2vec to learn their embeddings.

Instead of a single language-specific word embedding for each lexical feature, we use the concatenation of its Hindi and English word embeddings. This approach has three main benefits. First, we do not need a language identification system to choose the embedding space of a lexical item. Second, we do not depend on a joint word embedding space, which is usually trained using a costly bilingual lexicon. Third, named entities are usually shared between the languages, which provides two-way evidence for the model to learn them. We use the transliteration system of [1]¹ to transliterate Roman words to Devanagari, so that we can extract their embeddings from the Hindi embedding space. Apart from named entities, Hindi words will not be present in the English embedding space and English words will not be present in the Hindi embedding space.
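As an illustration of how this bilingual lookup could be wired together, the sketch below trains word2vec embeddings with gensim and represents a token by concatenating its English vector with the Hindi vector of its Devanagari transliteration, obtained with the indic-trans package from footnote 1. The corpus paths, the min_count setting, the zero-vector back-off for out-of-vocabulary items and the exact library calls are assumptions of this sketch rather than details reported in the paper; the 80- and 20-dimensional settings are taken from Section 4.

```python
# Sketch of the embedding setup: separate Hindi and English word2vec models,
# a suffix corpus derived from the monolingual data, and a concatenated
# English+Hindi representation per token. Paths and back-off are assumptions.
import numpy as np
from gensim.models import Word2Vec
from indictrans import Transliterator  # https://www.github.com/irshadbhat/indic-trans

def suffix_corpus(sentences, affix_len=3):
    # derive a suffix corpus from a tokenized monolingual corpus
    return [[w[-affix_len:] for w in sent] for sent in sentences]

eng_sents = [line.split() for line in open("english.tok")]  # hypothetical path
hin_sents = [line.split() for line in open("hindi.tok")]    # hypothetical path

# 80-dimensional word embeddings, 20-dimensional suffix embeddings (Section 4)
eng_vec = Word2Vec(eng_sents, vector_size=80, min_count=2).wv
hin_vec = Word2Vec(hin_sents, vector_size=80, min_count=2).wv
eng_suf = Word2Vec(suffix_corpus(eng_sents), vector_size=20, min_count=2).wv

# Roman -> Devanagari transliteration so Roman tokens can be looked up
# in the Hindi embedding space
trn = Transliterator(source="eng", target="hin")

def lookup(vectors, key, dim):
    # zero-vector back-off for out-of-vocabulary items (an assumption)
    return vectors[key] if key in vectors else np.zeros(dim)

def word_feature(token):
    # concatenated English + Hindi representation of one token
    eng_part = lookup(eng_vec, token.lower(), 80)
    hin_part = lookup(hin_vec, trn.transform(token), 80)
    return np.concatenate([eng_part, hin_part])

print(word_feature("Shahrukh").shape)  # (160,)
```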
4. EXPERIMENTS AND RESULTS
In any non-linear neural network model, a number of hyperparameters need to be tuned for optimal performance. These include the number of hidden units, the choice of activation function, the choice of optimizer, the learning rate, the dropout rate, the dimensionality of the input units, etc. We used 20% of the training data for tuning these parameters. The optimal parameters are: 200 hidden units, the Adagrad optimizer, the rectified linear activation function, a batch size of 200, a learning rate of 0.025, a dropout rate of 0.5 and 25 training iterations. We obtained the best development set accuracy with 80-dimensional word embeddings and 20-dimensional prefix and suffix embeddings. Development set results are given in Table 3; test set results are given in Table 4.

NE-TAG           Precision   Recall   F1-score   Support
ARTIFACT              1.00     0.10       0.18        10
COUNT                 0.67     0.46       0.55        13
DATE                  1.00     0.43       0.60         7
DAY                   1.00     1.00       1.00         4
DISEASE               0.00     0.00       0.00         1
ENTERTAINMENT         0.98     0.62       0.76       174
LOCATION              0.88     0.54       0.67        13
LOCOMOTIVE            0.00     0.00       0.00         3
MATERIALS             0.00     0.00       0.00         3
MONEY                 0.00     0.00       0.00         3
MONTH                 0.00     0.00       0.00         1
ORGANIZATION          1.00     0.07       0.13        14
PERIOD                0.64     0.78       0.70         9
PERSON                0.98     0.68       0.80        71
TIME                  0.00     0.00       0.00         1
YEAR                  1.00     1.00       1.00         2
avg / total           0.92     0.57       0.68       329

Table 3: Development set results

Precision   Recall   F1-score
    80.92    59.00      68.24

Table 4: Test set results

5. CONCLUSION
In this paper, we proposed a resource-light neural network architecture for entity extraction in Hindi-English code-mixed text. The neural network uses distributed representations of lexical features learned from monolingual corpora. Despite the simplicity of our architecture, we achieved the best results in the shared task.

6. REFERENCES
[1] I. A. Bhat, V. Mujadia, A. Tammewar, R. A. Bhat, and M. Shrivastava. IIIT-H system submission for FIRE2014 shared task on transliterated search. In Proceedings of the Forum for Information Retrieval Evaluation, FIRE '14, pages 48-53, New York, NY, USA, 2015. ACM.
[2] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

¹ https://www.github.com/irshadbhat/indic-trans