AMRITA_CEN@FIRE 2016: Code-Mix Entity Extraction for Hindi-English and Tamil-English Tweets

Remmiya Devi G, Veena P V, Anand Kumar M and Soman K P
Centre for Computational Engineering and Networking (CEN)
Amrita School of Engineering, Coimbatore
Amrita Vishwa Vidyapeetham, Amrita University, India

ABSTRACT
Social media text holds information on many important topics, and extracting it underpins one of the most fundamental tasks in Natural Language Processing: entity extraction. This work was submitted as part of the shared task on Code Mix Entity Extraction for Indian Languages (CMEE-IL) at the Forum for Information Retrieval Evaluation (FIRE) 2016. Three different methodologies are proposed in this paper for entity extraction from code-mix data: two approaches based on word embedding models and one based on hand-crafted features. Trigram embedding features are created and the data is converted to BIO tag format during feature extraction. The systems are evaluated using the machine learning classifier SVM-Light. Overall cross-validation accuracy shows that the proposed systems are effective even at classifying unknown tokens.

CCS Concepts
• Information systems → Information extraction; • Theory of computation → Support vector machines;

Keywords
Word embedding, Machine Learning, Support Vector Machine (SVM), Code-Mix, Entity extraction

1. INTRODUCTION
Entity extraction has always been one of the primary tasks in Natural Language Processing. It is defined as the task of extracting named entities from text. Typically, entities fall under categories such as name, person, and organization, and extend to date, time, period, month, etc. Entity extraction from social media text is viewed as an information extraction task. Social media text is generally unstructured, yet informative, and extracting that informative content from an unorganized format is highly challenging. In this task we deal with social media text, specifically a code-mix Twitter dataset. In multilingual societies, conversation in code-mixed language is prevalent. Code-mix language is the combination of English with any other language. An example of code-mixed language from the Hindi-English training data is given below:

    shaq hai ki humari notice ke bagair tere ghar ke secret route ki help se you met

In the above example, Hindi words are in italics and English words are in bold. Communication on social networking sites like Facebook and Twitter is often in code-mix language. The dataset for this task includes two subsets, in which the Indian languages Hindi and Tamil are mixed with Roman script. The main task is to develop a method to process such code-mix data and extract the entities it contains.

In recent years, a significant amount of research has been carried out on processing code-mix data. A language identification task was conducted for code-mix social media data [5]. Thematic knowledge discovery using topic modeling has been applied to chat messages in code-mixed language [4]. SVM-based classification for entity extraction was carried out previously for Indian languages [3]. Conditional Random Field (CRF) based entity extraction has been implemented, and rich features of Indian languages have also been used for Named Entity Recognition [10] [2]. Entity extraction using structured skip-gram based embedding features has been implemented for Malayalam [9].

Our submission includes three systems. The first system uses word embedding features obtained from the wang2vec tool [13]; word embedding features are vector representations of words. The second system uses word embedding features from the word2vec tool [7]. The major difference between wang2vec and word2vec lies in the skip-gram model used to produce the embeddings. In the third system, stylometric features are extracted from the training data. The features extracted by systems 1, 2 and 3 are used to build three separate models with the machine learning classifier SVM-Light [6].

An overview of the task is given in Section 2. Details of the dataset are given in Section 3. Section 4 discusses the proposed systems, Section 5 presents the experiments and their results, and Section 6 concludes the paper.
2. TASK DESCRIPTION
The task organizers provided a dataset obtained from Twitter and a few other microblogs. The training data consists of two sets of code-mixed tweets, Hindi-English and Tamil-English, and the task is to extract the entities from both. Named entities in the dataset include Person, Location, Organization, Entertainment and so on. With the growing number of social media platforms and the use of code-mix language on them, this task is highly relevant today.

3. DATASET DESCRIPTION
The task covers two code-mix datasets, Hindi-English and Tamil-English. The training data contains three fields: Tweet ID, User ID and the tweet text. Each training file has a corresponding annotation file containing the Tweet ID, User ID, length, index and entity tag of the entities present in the training data. The Hindi-English training data consists entirely of code-mix tweets, whereas the Tamil-English dataset also includes some tweets in pure Tamil. Since we proposed an embedding-based methodology, we needed additional data to train our word embedding models. The additional data for Hindi-English was collected from the Mixed Script Information Retrieval 2016 (MSIR) task [1], the International Conference on Natural Language Processing (ICON) 2015 POS tagging task [12] and some Twitter data. For Tamil-English, the dataset provided by Sentiment Analysis in Indian Languages (SAIL-2015) [8] [11] was used. The number of tweets and the average number of tokens per tweet in the training and test data are tabulated in Table 1. The additional dataset comprises 20671 tweets for Hindi-English and 1625 for Tamil-English.

Table 1: Number of tweets and average tokens per tweet for the train and test data of Hindi-English and Tamil-English

                                 Hindi-English   Tamil-English
  Train  Tweet count                  2700            3200
         Avg tokens per tweet        16.76           11.94
  Test   Tweet count                  7429            1376
         Avg tokens per tweet        16.49           12.11

4. METHODOLOGY
Our submission for the task of entity extraction in code-mix language includes three systems:

• System 1: Wang2vec based embedding features
• System 2: Word2vec based embedding features
• System 3: Stylometric features

The feature based model is illustrated in Figure 1, and the word embedding based models are illustrated in Figure 2.

[Figure 1: Methodology of the Proposed System for the feature based model]

[Figure 2: Methodology of the Proposed System for the word embedding models]

Social media text generally requires various preprocessing steps. The given Twitter data is first tokenized, and each token of the tokenized dataset is converted to the conventional BIO format, which yields a BIO tag for every word in the training data. BIO tags mark the Beginning, Inside and Outside of entities. For example, consider the sentence "Pranab Mukherjee is the President of India". The entity "Pranab Mukherjee" is a PERSON and "India" is a LOCATION. Since "Pranab Mukherjee" has two parts, its tokens are tagged as beginning and inside; words that are not part of any entity are tagged O, i.e. Outside. Using BIO tags, the proposed system therefore labels Pranab as B-PERSON, Mukherjee as I-PERSON and India as B-LOCATION. This BIO tag information is used by all three systems proposed in the paper.
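As an illustration, the following minimal Python sketch converts a tokenized sentence with entity span annotations into BIO labels. The helper name and the span input format are hypothetical; in our setting the spans would be recovered from the length and index fields of the annotation files:

    # Minimal BIO conversion sketch (illustrative helper, not the task's actual format).
    def to_bio(tokens, entities):
        """entities: list of (start_index, end_index_exclusive, label) over tokens."""
        tags = ["O"] * len(tokens)
        for start, end, label in entities:
            tags[start] = "B-" + label            # first token of the entity
            for i in range(start + 1, end):
                tags[i] = "I-" + label            # remaining tokens of the entity
        return list(zip(tokens, tags))

    sentence = "Pranab Mukherjee is the President of India".split()
    spans = [(0, 2, "PERSON"), (6, 7, "LOCATION")]
    print(to_bio(sentence, spans))
    # [('Pranab', 'B-PERSON'), ('Mukherjee', 'I-PERSON'), ('is', 'O'), ('the', 'O'),
    #  ('President', 'O'), ('of', 'O'), ('India', 'B-LOCATION')]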
This re- • System 2: Word2vec based embedding features sults in BIO tag information for each word in the train- ing data. BIO tag is defined as Beginning, Inside, Outside tag of entities. For example, consider the sentence, “Pranab • System 3: Stylometric features Mukherjee is the President of India”. In general, the en- Table 2: Features extracted from train and test data for System 3 Features Representation Lower case Represent word in lowercase P3/P4: first 3/4 characters Prefix-Suffix S3/S4: last 3/4 characters Starts with Hash,apostrophe 1 if word starts with #, ’ symbol Numbers, apostrophe,punctuation Marks 1/0 if present/absent Length & Index No of chars & Position of word Contain HTTP 1 if HTTP present First character Uppercase 1 if first char is in uppercase Full character Uppercase 1 if entire word is in uppercase Contain 4-digit numbers 1 if Token is a 4-digit number Gazetted Features Location, Person, Organization, Entertainment Table 3: Cross Validation Accuracy for Hindi-English and Tamil-English Hindi-English Tamil-English System1 System2 System3 System1 System2 System3 Known 92.9893 91.1001 94.2576 97.2717 97.378 97.4953 Ambiguous Known 83.0998 78.3239 86.5626 83.8063 83.839 85.9711 Unknown 91.0318 90.9519 86.9385 93.6683 93.4368 92.4647 Overall Accuracy 92.4688 91.0278 92.3718 96.1491 96.2534 95.9847 tity ‘Pranab Mukherjee’ indicates PERSON and ‘India’ in- 4.2 System 2: Word2vec based embedding fea- dicates LOCATION. Since Pranab Mukherjee has two parts, tures it is tagged as beginning and inside. Words other than en- Word2vec model provides the vector representation for tities are tagged as O i.e. Outside. So using BIO tag, the each word. Input for word2vec is sentences, as the major proposed system labels Pranab as B-PERSON, Mukherjee advantage of this model is that it provides vectors for each as I-PERSON and India as LOCATION. This BIO tag in- word based on the context. Vector representation for each formation is utilized in the three systems proposed in the word in the training data is obtained through Skip gram paper. model in word2vec. These vectors are used to develop a en- Illustration of proposed feature based method is shown in tity extraction system. Similar to system 1, this system also Figure 1 and word embedding based models are shown in includes trigram embedding feature set of word2vec embed- Figure 2. ding vectors. Each word from the training data is combined with its cor- 4.1 System 1: Wang2vec based embedding fea- responding BIO tag information and the trigram embedding tures feature set. This combined feature set is given for training Wang2vec model is the modified version of word2vec with machine learning based classifier, SVM-Light. After train- an improvement in the structure of skip gram model. This ing using SVM model, the test data is appended with the modification made wang2vec better than word2vec. The trigram embedding features and is given for testing. SVM major difference in these two embedding models is that the classifier uses the knowledge acquired from training data and skip gram model in word2vec becomes Structured skip gram performs recognition of entities in testing data. model in wang2vec. The significant modification in this model is the fact that the word order information is taken 4.3 System 3: Stylometric features into consideration. Wang2vec features are the word vectors Our third system is implemented using stylometric fea- obtained using wang2vec model. The size of the vector n is ture extraction. 
4.2 System 2: Word2vec based embedding features
The word2vec model provides a vector representation for each word. Its input is sentences, and its main advantage is that it produces a vector for each word based on its context. The vector representation of each word in the training data is obtained through the skip-gram model of word2vec, and these vectors are used to build an entity extraction system. As in system 1, this system also builds a trigram embedding feature set from the word2vec vectors. Each word in the training data is combined with its BIO tag and its trigram embedding features, and this combined feature set is used to train the machine learning classifier SVM-Light. After training the SVM model, the test data is likewise appended with trigram embedding features and given in for testing; the SVM classifier applies the knowledge acquired from the training data to recognize entities in the test data.

4.3 System 3: Stylometric features
Our third system is implemented using stylometric feature extraction. Features such as length, position, numbers, hashtags and punctuation are considered stylometric features; the full list used in our system is tabulated in Table 2. The stylometric features of each word in the training data are extracted, integrated with the BIO tag information, and used to build an SVM model. The same features are extracted from the test data and given in for testing (a sketch of this extraction follows Table 2).

Table 2: Features extracted from the train and test data for System 3

  Feature                            Representation
  Lower case                         Word in lowercase
  Prefix-Suffix                      P3/P4: first 3/4 characters; S3/S4: last 3/4 characters
  Starts with hash/apostrophe        1 if word starts with # or ' symbol
  Numbers, apostrophe, punctuation   1/0 if present/absent
  Length & Index                     Number of characters & position of word
  Contains HTTP                      1 if HTTP present
  First character uppercase          1 if first character is uppercase
  Full word uppercase                1 if entire word is uppercase
  Contains 4-digit number            1 if token is a 4-digit number
  Gazetteer features                 Location, Person, Organization, Entertainment
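A minimal Python sketch of the Table 2 features is given below. The gazetteer contents and the exact binary encodings are our assumptions about details the paper leaves open:

    import re
    import string

    # Hypothetical gazetteers; the paper does not specify their sources.
    GAZETTEERS = {
        "LOCATION": {"india", "chennai"},
        "PERSON": {"pranab", "mukherjee"},
        "ORGANIZATION": set(),
        "ENTERTAINMENT": set(),
    }

    def stylometric_features(token, index):
        """Extract the Table 2 style features for one token of a tweet."""
        lower = token.lower()
        return {
            "lower": lower,
            "p3": lower[:3], "p4": lower[:4],        # prefixes
            "s3": lower[-3:], "s4": lower[-4:],      # suffixes
            "starts_hash_apos": int(token[:1] in "#'"),
            "has_number": int(any(c.isdigit() for c in token)),
            "has_punct": int(any(c in string.punctuation for c in token)),
            "length": len(token),
            "index": index,                          # position of the word in the tweet
            "has_http": int("http" in lower),
            "init_upper": int(token[:1].isupper()),
            "all_upper": int(token.isupper()),
            "four_digit": int(bool(re.fullmatch(r"\d{4}", token))),
            **{f"gaz_{tag}": int(lower in words)
               for tag, words in GAZETTEERS.items()},
        }

    print(stylometric_features("India", index=6))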
The proposed methodology thus yields three SVM models for the three systems, using wang2vec features, word2vec features and stylometric features respectively.

5. EXPERIMENTS AND RESULTS
The experiments for systems 1 and 2 are similar with respect to extracting word embedding features. The main difference is that system 1 uses wang2vec embedding features, obtained with the structured skip-gram model, which takes word order into consideration, while system 2 uses word2vec features, obtained with the plain skip-gram model, which does not. Training a word embedding model requires additional data, so the input to the embedding models, i.e. word2vec and wang2vec, is the combination of the training data and the additional dataset. The size of the generated vectors is set to 50, and from these 50-dimensional vectors a trigram embedding feature set of size 150 is extracted. The training dataset is tokenized on whitespace and converted to BIO-formatted data. For each word in the training data, its BIO tag along with the 150 embedding features is given as input to the SVM classifier. The trigram embedding feature set of the test data is extracted in the same manner: after tokenization, the 150-dimensional trigram embedding feature vectors are given to the classifier for testing.

System 3 applies stylometric feature extraction to the code-mix data. The Hindi-English and Tamil-English training data are first tokenized, and the features listed in Table 2 are extracted for each token. The BIO tag information of these tokens is combined with the extracted features to form the stylometric feature set of the training data. For testing, the tokenized words and their corresponding feature sets are integrated and given to the classifier.
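For concreteness, SVM-Light consumes a sparse feature file and is trained and applied with its svm_learn and svm_classify binaries. The sketch below writes trigram embedding features in that format; treating the multi-class BIO labels with one binary model per label (one-vs-rest) is our assumption, since the paper does not describe how SVM-Light's binary classifier was applied to multiple entity tags:

    # Write one SVM-Light line per token: "<target> 1:<v1> 2:<v2> ... 150:<v150>".
    # Feature indices must be positive and increasing. The binary target is +1
    # for a chosen BIO label and -1 otherwise (one-vs-rest -- our assumption).
    def write_svmlight(path, feature_vectors, tags, positive_label):
        with open(path, "w") as f:
            for vec, tag in zip(feature_vectors, tags):
                target = "+1" if tag == positive_label else "-1"
                feats = " ".join(f"{i + 1}:{v:.6f}" for i, v in enumerate(vec))
                f.write(f"{target} {feats}\n")

    # write_svmlight("train.dat", train_feats, train_tags, "B-PERSON")
    # Then, on the command line (the actual SVM-Light binaries):
    #   svm_learn train.dat model.dat
    #   svm_classify test.dat model.dat predictions.txt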
The cross-validation results for the Hindi-English and Tamil-English datasets using systems 1, 2 and 3 are tabulated in Table 3. System 1, which uses wang2vec based features, shows the best results on unknown tokens. According to the results provided by the CMEE-IL organizers, we placed third for Hindi-English and second for Tamil-English. The precision, recall and F-measure of the top five teams for Hindi-English and Tamil-English are tabulated in Table 4 and Table 5 respectively.

Table 3: Cross-validation accuracy (%) for Hindi-English and Tamil-English

                          Hindi-English                    Tamil-English
                    System1  System2  System3      System1  System2  System3
  Known             92.9893  91.1001  94.2576      97.2717  97.3780  97.4953
  Ambiguous Known   83.0998  78.3239  86.5626      83.8063  83.8390  85.9711
  Unknown           91.0318  90.9519  86.9385      93.6683  93.4368  92.4647
  Overall Accuracy  92.4688  91.0278  92.3718      96.1491  96.2534  95.9847

Table 4: Results by the CMEE-IL task organizers for Hindi-English (P = Precision, R = Recall, F = F-measure)

                          RUN 1                 RUN 2                 RUN 3
  TEAM                P      R      F       P      R      F       P      R      F
  Irshad-IIT-Hyd     80.92  59.00  68.24    -      -      -       -      -      -
  Deepak-IIT-Patna   81.15  50.39  62.17    -      -      -       -      -      -
  Amrita CEN         75.19  29.46  42.33   75.00  29.17  42.00   79.88  41.37  54.51
  NLP CEN Amrita     76.34  31.15  44.25   77.72  31.84  45.17    -      -      -
  Rupal-BITS Pilani  58.66  32.93  42.18   58.84  35.32  44.14   59.15  34.62  43.68

Table 5: Results by the CMEE-IL task organizers for Tamil-English (P = Precision, R = Recall, F = F-measure)

                              RUN 1                 RUN 2                 RUN 3
  TEAM                    P      R      F       P      R      F       P      R      F
  Deepak-IIT-Patna       79.92  30.47  44.12    -      -      -       -      -      -
  Amrita CEN             77.38   8.72  15.67   74.74   9.93  17.53   79.51  21.88  34.32
  NLP CEN Amrita         77.70  15.43  25.75   79.56  19.59  31.44    -      -      -
  Rupal-BITS Pilani-R2   55.86  10.87  18.20   58.71  12.21  20.22   58.94  11.94  19.86
  CEN@Amrita             47.62  13.42  20.94    -      -      -       -      -      -

6. CONCLUSION
This work was submitted as part of the shared task on Code Mix Entity Extraction for Indian Languages at FIRE 2016. The use of native languages in Roman script on social media platforms is common today, and extracting entities such as person, location or organization from such text is a challenging task. The task organizers provided data from Twitter and a few other microblogs, and we submitted three systems. The first two systems use the word embedding features of wang2vec and word2vec for entity extraction; the training data, together with some additionally collected data, was used to train the word embedding models. The third system uses only stylometric features for classification. All three systems were trained and tested with the machine learning classifier Support Vector Machine. As future work, we plan to use regression based methods instead of the SVM based classifier.

7. ACKNOWLEDGMENT
We would like to thank the organizers of the Forum for Information Retrieval Evaluation 2016 for organizing the task. We would also like to thank the organizers of the CMEE-IL task.

8. REFERENCES
[1] Shared Task on Mixed Script Information Retrieval, https://msir2016.github.io, 2016.
[2] N. Abinaya, N. John, H. Barathi Ganesh, M. Anand Kumar, and K. P. Soman. AMRITA-CEN@FIRE-2014: Named entity recognition for Indian languages using rich features. ACM International Conference Proceeding Series, 05-07-Dec-2014:103-111, 2014.
[3] M. Anand Kumar, S. Shriya, and K. P. Soman. AMRITA-CEN@FIRE 2015: Extracting entities for social media texts in Indian languages. CEUR Workshop Proceedings, 1587:85-88, 2015.
[4] K. Asnani and J. D. Pawar. Discovering thematic knowledge from code-mixed chat messages using topic model. 2016.
[5] U. Barman, A. Das, J. Wagner, and J. Foster. Code mixing: A challenge for language identification in the language of social media. EMNLP 2014, 13, 2014.
[6] T. Joachims. SVMlight: Support vector machine. http://svmlight.joachims.org/, University of Dortmund, 19(4), 1999.
[7] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111-3119, 2013.
[8] B. G. Patra, D. Das, A. Das, and R. Prasath. Shared task on Sentiment Analysis in Indian Languages (SAIL) tweets - an overview. In International Conference on Mining Intelligence and Knowledge Exploration, pages 650-655. Springer, 2015.
[9] G. Remmiya Devi, P. V. Veena, M. Anand Kumar, and K. P. Soman. Entity extraction for Malayalam social media text using structured skip-gram based embedding features from unlabeled data. Procedia Computer Science, 93:547-553, 2016.
[10] S. Sanjay, M. Anand Kumar, and K. P. Soman. AMRITA-CEN-NLP@FIRE 2015: CRF based named entity extraction for Twitter microposts. CEUR Workshop Proceedings, 1587:96-99, 2015.
[11] S. Shriya, R. Vinayakumar, M. Anand Kumar, and K. P. Soman. AMRITA-CEN@SAIL2015: Sentiment analysis in Indian languages. In International Conference on Mining Intelligence and Knowledge Exploration, pages 703-710. Springer, 2015.
[12] Y. Vyas, S. Gella, J. Sharma, K. Bali, and M. Choudhury. POS tagging of English-Hindi code-mixed social media content. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 974-979. Association for Computational Linguistics, October 2014.
[13] W. Ling, C. Dyer, A. W. Black, and I. Trancoso. Two/too simple adaptations of word2vec for syntax problems. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2015.