=Paper=
{{Paper
|id=Vol-1737/T7-8
|storemode=property
|title=CEN@Amrita FIRE 2016: Context based Character Embeddings for Entity Extraction in Code-Mixed Text
|pdfUrl=https://ceur-ws.org/Vol-1737/T7-8.pdf
|volume=Vol-1737
|authors=Srinidhi Skanda V,Shivkaran Singh,Remmiya Devi G,Veena P V,Anand Kumar M,Soman K P
|dblpUrl=https://dblp.org/rec/conf/fire/VSGVMP16
}}
==CEN@Amrita FIRE 2016: Context based Character Embeddings for Entity Extraction in Code-Mixed Text==
<pdf width="1500px">https://ceur-ws.org/Vol-1737/T7-8.pdf</pdf>
<pre>
        CEN@Amrita FIRE 2016: Context based Character
       Embeddings for Entity Extraction in Code-Mixed Text
          Srinidhi Skanda V                                Shivkaran Singh                             Remmiya Devi G
      Center for Computational                        Center for Computational                    Center for Computational
  Engineering and Networking(CEN)                 Engineering and Networking(CEN)             Engineering and Networking(CEN)
   Amrita School of Engineering,                    Amrita School of Engineering,              Amrita School of Engineering,
            Coimbatore                                       Coimbatore                                  Coimbatore
   Amrita Vishwa Vidyapeetham,                      Amrita Vishwa Vidyapeetham                  Amrita Vishwa Vidyapeetham
       Amrita University,India                         Amrita University,India                     Amrita University,India
      skanda9051@gmail.com                         shivkaran.ssokhey@gmail.com


               Veena P V                                   Anand Kumar M                                   Soman K P
      Center for Computational                        Center for Computational                    Centre for Computational
  Engineering and Networking(CEN)                 Engineering and Networking(CEN)             Engineering and Networking(CEN)
   Amrita School of Engineering,                   Amrita School of Engineering,                Amrita School of Engineering
            Coimbatore                                      Coimbatore                                   Coimbatore
   Amrita Vishwa Vidyapeetham,                     Amrita Vishwa Vidyapeetham,                  Amrita Vishwa Vidyapeetham,
       Amrita University,India                         Amrita University,India                     Amrita University,India
                                                   m_anandkumar@cb.amrita.edu


ABSTRACT
This paper presents the methodology and results of our submission to the Code Mix Entity
Extraction in Indian Languages (CMEE-IL) shared task of FIRE-2016. The aim of the task is to
identify various entities such as person, organization, movie and location names in given
code-mixed tweets. The code-mixed tweets are written in English mixed with Hindi or Tamil. In
this work, an entity extraction system is implemented for both Hindi-English and Tamil-English
code-mixed tweets. The system employs context-based character embedding features to train a
Support Vector Machine (SVM) classifier. The training data was tokenized such that each line
contains a single word. These words were further split into characters. Embedding vectors of
these characters are appended with the I-O-B tags and used for training the system. During the
testing phase, we use context embedding features to predict the entity tags for characters in
the test data. We observed that cross-validation accuracy with character embeddings was higher
for the Hindi-English Twitter dataset than for the Tamil-English Twitter dataset.

CCS Concepts
• Information Retrieval ➝ Retrieval tasks and goals ➝ Information Extraction
• Machine Learning ➝ Machine Learning approaches ➝ Kernel Methods ➝ Support Vector Machines

Keywords
Named-entity recognition, Information extraction, Code-Mixed, Support vector machine (SVM),
Word embedding, Context-based character embedding.

1. INTRODUCTION
The explosion of social networking sites such as Facebook, Twitter and Instagram in a
linguistically diverse country like India has led many users to use multiple languages in
tweets, personal blogs, etc. The scarcity of proper input tools for Indian languages forced
users to write in the Roman script mixed with the native script, which led to code-mixed
language. This code-mixing further complicates the application of Natural Language Processing
(NLP) methods. One such application is Named Entity Recognition (NER), which identifies and
classifies entities into categories such as person, organization, date-time, etc. NER is used
in text mining applications, information retrieval, machine translation, etc. NER for Indian
languages is a complex task, and the use of code-mixing complicates it further.

The shared task on Code Mix Entity Extraction in Indian Languages (CMEE-IL)¹, held at the Forum
for Information Retrieval Evaluation (FIRE) 2016, focuses on extracting entities from
code-mixed Twitter data.

Much work has been done on entity recognition using word embeddings as features. Fewer works
address entity extraction using character embeddings, such as entity recognition using
character embeddings by Cicero dos Santos et al. [1] and text segmentation using
character-level text embeddings by Grzegorz Chrupała [2]. Other works on code-mixed data
include SVM-based classification for entity extraction in Indian languages [3] and entity
extraction based on Conditional Random Fields [4]. Identification and linking of Twitter
entities was also performed earlier [10].

We submitted a system based on a character embedding feature representation. To obtain
character embedding features, the word2vec model is used [5]. Word2vec converts each input
character to an n-dimensional vector. The feature representation obtained from word2vec is
used for training the system. The classifier used for training is a Support Vector Machine
(SVM), a machine learning method [6].

An outline of the task is given in Section 2. The implemented system is described in
Section 3. Section 4 describes the experiments and results. The conclusion is given in
Section 5.

¹ http://www.au-kbc.org/nlp/CMEE-FIRE2016/

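The preprocessing summarized in the abstract (one token per line, then a split into
characters) can be sketched as below. This is only an illustrative sketch: the paper does not
name its tokenizer, so plain whitespace splitting is an assumption, and the function names
tokenize_tweets and split_characters are hypothetical. The <S> boundary tag is the one the
paper describes in Section 3.

```python
TWEET_BOUNDARY = "<S>"  # boundary tag described in Section 3 of the paper


def tokenize_tweets(tweets):
    """Return one token per list entry, with a boundary tag after each tweet.

    Whitespace tokenization is an assumption made for this sketch.
    """
    lines = []
    for tweet in tweets:
        lines.extend(tweet.split())
        lines.append(TWEET_BOUNDARY)
    return lines


def split_characters(token):
    """Split a single token into its characters."""
    return list(token)


tweets = ["I liked your pinned tweet"]
lines = tokenize_tweets(tweets)

# Pair every real token with its characters; boundary tags are skipped.
char_rows = [(tok, split_characters(tok)) for tok in lines if tok != TWEET_BOUNDARY]

for tok, chars in char_rows:
    print(tok, chars)
```

Each character in the rows produced here would then be looked up in the character embedding
model; that lookup is left out because it depends on the trained word2vec model.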
2. CODE MIX ENTITY EXTRACTION (CMEE-IL) TASK
Hindi-English and Tamil-English code-mixed data was given as training data for the CMEE-IL
shared task. The task is to extract the named entities along with their named entity tags. The
training and testing datasets had raw tweets and annotations in different files for the
code-mixed Hindi-English and Tamil-English languages. There were 18 types of entities present
in the training dataset, of which entities such as person, location, and entertainment were
more numerous than other entity types. The number of tweets and number of tokens in the
training and testing corpora for both code-mixed languages are shown in Table 1.

                      Table 1. Dataset Count
   Tweet Type          # of Tweets            # of Tokens
                      Train     Test        Train      Test
   Hindi-English       2700     7429       130866    130868
   Tamil-English       3200     1376        41466     20190

Sample tweets from the training dataset containing code-mixed Hindi-English and Tamil-English
can be seen in Table 2. We noticed that some tweets in the Tamil-English code-mixed corpus were
written completely or partially in the Tamil script. No such tweets were noticed in the
Hindi-English code-mixed corpus.

            Table 2. Examples of code-mixed Tweets
 English-Tamil   @_madhumitha_ next time im in chennaidef going over like
                 "hello ennakuongapethitheriyum"

                 @thoatta நான் சொல்ல வரது டீஸர் டிரைலர் நல்லா இருக்குனு சொன்ன
                 படங்கள எல்லாம் வெற்றி அடையல.over expectations nala flop
                 airumnu STD சொல்லுது

 English-Hindi   @aixo_anjum sister me aesy bkwas nhi krta na muje shoq he I liked
                 your pinned tweet at that time too just wait I shall give you proof

Further, we analyzed the corpus for the counts of major entities. The counts of the major
entity tags in the training corpus can be seen in Table 3. It can be observed from Table 3
that Entertainment was the most frequent entity tag in the Hindi-English corpus, while Person
was the most frequent in the Tamil-English corpus.

                Table 3. Most frequent entity tags
 Code-Mix Lang.                      Entity Tags
                   Person   Location   Organization   Entertainment
 Hindi-English        712        194            109             810
 Tamil-English        661        188             68             260

The given raw text file contains the Tweet-ID, User-ID and corresponding tweet. The annotation
file was arranged in columns, with the columns representing Tweet-ID, User-ID, Named Entity
(NE) type, Named Entity String, Start Index and Length of the string. The participants were
asked to submit a similar annotation file after extracting all the entity-related information
from the test data file.

3. SYSTEM DESCRIPTION
In the proposed system, feature extraction plays a vital role, as the accuracy of the system
relies mainly on the extracted features. Before extracting any features, the dataset should be
in an appropriate format for the learning algorithm. The I-O-B tagging format was used for this
format conversion. The first step in format conversion was to tokenize the given data and
arrange one word (token) per line. To identify the start and end of a tweet, we added a tag <S>
to make further processing easier. The proposed system is based on character-based context
embedding. The basic intuition behind character-based context embedding is shown in Figure 1.
In the figure, the word W0 (LIKE²) is split into its corresponding characters. Each character
is represented by a 50-dimensional vector (the vector length is user defined). These character
embeddings are then concatenated to form a word vector.

² The capitalization of characters is just for representation purposes.

                 Figure 1. Character Based Context Embedding

This unique vector is associated with the corresponding word. This 50-dimensional
representation is converted to a 150-dimensional one by appending the vectors of the
neighboring context words. W-1 represents the previous context word of the word W0 and W+1
represents the next context word of W0. Appending the vectors of W-1, W0 and W+1 is called
context appending.

The overall methodology is split into two modules; each module is described in a separate
figure. Figure 2 shows the method
followed for feature representation. Figure 3 shows the steps involved in the classification
task.

The first module, depicted in Figure 2, covers the steps involved in feature representation.
We collected additional datasets from different data sources to improve the accuracy of the
system. The additional dataset for Tamil-English was collected from Sentiment Analysis in
Indian Languages (SAIL-2015) [9]. Datasets from the International Conference on Natural
Language Processing (ICON) 2015 POS Tagging task [8], Mixed Script Information Retrieval 2016
(MSIR) [7] and some Twitter data obtained by web scraping were used for Hindi-English.

This large file, comprising the training data as well as the additional datasets collected,
was then tokenized. The tokenized words are further split into their corresponding characters.
These characters are fed to the vector embedding module. The output of the module is a vector
of size 1 × 50 for every character in our corpus. These vectors are concatenated back into
words using an appending module. The embedding features for each word are then subjected to
context appending, resulting in a vector of size 1 × 150. These vectors contain information
about the word as well as its neighboring word contexts.

                   Figure 2. Feature extraction

For format conversion, the raw tweets from the given training dataset and the annotation file
are given to the I-O-B module, which converts the training data into the I-O-B format. The
I-O-B formatted training set consists of the training data along with the corresponding I-O-B
tags. This formatted training set is appended to the feature vectors to form the training data
that is fed to the classifier.

Figure 3 describes the procedure involved in classification. During the training phase, the
classification model is built from the feature and label pairs. During the testing phase, the
test data along with its embedding features are fed to the model to predict entity tags for the
test data. In Figure 3, context vectors represent the feature and label pairs that are fed to
the SVM module. Here we used the SVM Light tool to build an SVM-based classifier model. The
classifier takes features and labels as input, trains itself, and finally builds a
classification model. From the learned classification model, the system predicts the entity
tags for the test data. Predicted tags extracted from the output file are converted to the
required annotated format.

           Figure 3. Architecture of the implemented system

4. RESULT AND ANALYSIS
This paper describes an entity extraction system for code-mixed Twitter data using a character
embedding technique. In the character embedding approach, the raw code-mixed training data is
tokenized into one word per line and further split into characters. Each character is embedded
as an n-dimensional vector to produce feature vectors. The feature vectors are appended with
the I-O-B tags to form feature-label sets. This set is fed to a Support Vector Machine (SVM)
classifier to train a classification model. During the testing phase, the test data undergoes
character embedding so that it is represented as feature vectors. These feature vectors are
fed to the classification model that was already trained. The classification model predicts
the entity tags for the test dataset. The output for the test dataset is converted to the
annotated format, which contains the Tweet ID, User ID, the word with its corresponding
predicted tag, Index and Length. Table 4 shows the cross-validation results for both the
Hindi-English and Tamil-English datasets. The overall accuracy of character embedding for the
Hindi-English system is 95.9962%. For the Tamil-English dataset, the overall accuracy of
character embedding is 94.3451%.

     Table 4. Cross-Validation results for character embedding
 Tweet Type                             Result
                  Overall accuracy    Known    Ambiguous Known    Unknown
 Hindi-English         95.9962      97.6447        83.8945        92.1155
 Tamil-English         94.3451      95.8483        91.7017        89.6086

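The context appending step described in Section 3 (joining the vectors of W-1, W0 and W+1 into
one 150-dimensional feature vector) can be sketched as follows. This is a minimal sketch under
stated assumptions, not the authors' implementation: seeded random vectors stand in for the
learned word2vec embeddings, positions outside the token sequence are zero-padded (the paper
does not say how boundaries are handled), and the names toy_vectors and context_append are
hypothetical.

```python
import random

DIM = 50  # per-token vector length reported in the paper


def toy_vectors(tokens, dim=DIM, seed=0):
    """Hypothetical stand-in for learned embeddings: one fixed random vector per token."""
    rng = random.Random(seed)
    table = {}
    for tok in tokens:
        if tok not in table:
            table[tok] = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    return table


def context_append(tokens, table, dim=DIM):
    """Concatenate the vectors of W-1, W0 and W+1 for every token.

    Missing neighbours at the sequence edges are zero-padded (an assumption).
    """
    pad = [0.0] * dim
    vecs = [table[tok] for tok in tokens]
    features = []
    for i, current in enumerate(vecs):
        previous = vecs[i - 1] if i > 0 else pad
        following = vecs[i + 1] if i + 1 < len(vecs) else pad
        features.append(previous + current + following)
    return features


tokens = ["I", "like", "this", "movie"]
table = toy_vectors(tokens)
features = context_append(tokens, table)
print(len(features), len(features[0]))  # 4 150
```

Every token yields one 150-dimensional row, so each row can be paired directly with that
token's I-O-B tag to form the feature and label pairs given to the classifier.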
Tables 5 and 6 display the results of the teams that participated in the shared task. The
implemented system ranked eighth on the Hindi-English Twitter dataset with Precision 48.17 and
F-measure 32.83. For Tamil-English, our system ranked fifth with Precision 47.62 and F-measure
20.94.

         Table 5. Result of Hindi-English twitter data
 Rank    Team                               Best run
                               Precision   Recall      F-Measure
 1       Irshad-IIIT-Hyd       80.92       59.00       68.24
 2       Deepak-IIT-Patna      81.15       50.39       62.17
 3       Veena-Amrita-T1       79.88       41.37       54.51
 8       CEN@Amrita            48.17       24.90       32.83

         Table 6. Result of Tamil-English twitter data
 Rank    Team                               Best run
                               Precision   Recall      F-Measure
 1       Deepak-IITPatna       79.92       30.47       44.12
 2       VeenaAmrita-T1        79.51       21.88       34.32
 3       BharathiAmritha-T2    79.56       19.59       31.44
 5       CEN@Amrita            47.62       13.42       20.94

5. CONCLUSION AND FUTURE SCOPE
The proposed system for code-mix entity extraction was submitted as part of the shared task on
code-mix entity extraction in Indian languages (CMEE-IL) conducted by FIRE 2016. The task
involves extracting entities from code-mixed tweets. The given dataset consists of
Hindi-English and Tamil-English code-mixed tweets. In our work, we implemented a character
embedding system. We conclude that the overall accuracy of character embedding for
Hindi-English is better than for Tamil-English.
A source of error that reduced the performance of the system, and a possible improvement, are:
    1.   While splitting words into characters, some characters are not properly
         represented because of encoding problems and noise present in the dataset.
    2.   The performance of the implemented system could be improved by using RNN or CNN
         based models.

6. ACKNOWLEDGMENT
We would like to thank the task organizer, the Forum for Information Retrieval Evaluation. We
also thank the organizers of the CMEE-IL task.

7. REFERENCES
[1] Dos Santos, Cícero, and Victor Guimarães. 2015. Boosting named entity recognition with
    neural character embeddings. In Proceedings of NEWS 2015 The Fifth Named Entities
    Workshop, p. 25.
[2] Chrupała, Grzegorz. 2013. Text segmentation with character-level text embeddings. arXiv
    preprint arXiv:1309.4628
[3] Anand Kumar, M., Shriya, S., and Soman, K.P. 2015. AMRITA-CEN@FIRE 2015: Extracting
    entities for social media texts in Indian languages. CEUR Workshop Proceedings,
    1587:85–88.
[4] Sanjay, S., Anand Kumar, M., and Soman, K.P. 2015. AMRITA-CEN-NLP@FIRE 2015: CRF based
    named entity extraction for Twitter microposts. CEUR Workshop Proceedings, 1587:96–99.
[5] Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., and Dean, J. 2013. Distributed
    representations of words and phrases and their compositionality. In Advances in neural
    information processing systems, pages 3111–3119.
[6] Joachims, Thorsten. 1999. SVM-light: Support vector machine. SVM-Light Support Vector
    Machine http://svmlight.joachims.org/, University of Dortmund, 19(4).
[7] Shared task on mixed script information retrieval, https://msir2016.github.io, 2016.
[8] Vyas, Y., Gella, S., Sharma, J., Bali, K., and Choudhury, M. October 2014. POS tagging of
    English-Hindi code-mixed social media content. In Proceedings of the 2014 Conference on
    Empirical Methods in Natural Language Processing (EMNLP), pages 974–979.
[9] Patra, B.G., Das, D., Das, A., and Prasath, R. 2015. Shared task on sentiment analysis in
    Indian languages (SAIL) tweets - an overview. In International Conference on Mining
    Intelligence and Knowledge Exploration, pages 650–655.
[10] Barathi Ganesh, H. B., Abinaya, N., Anand Kumar, M., Vinayakumar, R., and Soman, K.P.
     2015. AMRITA-CEN@NEEL: Identification and Linking of Twitter Entities. Making Sense of
     Microposts (#Microposts2015)
[11] Saha, Sujan Kumar, Sanjay Chatterji, Sandipan Dandapat, Sudeshna Sarkar, and Pabitra
     Mitra. 2008. A hybrid approach for named entity recognition in Indian languages. In
     Proceedings of the IJCNLP-08 Workshop on NER for South and South East Asian Languages,
     pp. 17-24.
[12] Abinaya, N., John, N., Barathi Ganesh, H. B., Anand Kumar, M., and Soman, K.P. 2014.
     AMRITA_CEN@FIRE-2014: Named Entity Recognition for Indian Languages using Rich Features.
     In Proceedings of the Forum for Information Retrieval Evaluation, pp. 103-111.
[13] Xue, Bai, Chen Fu, and Zhan Shaobin. 2014. A study on sentiment computing and
     classification of Sina Weibo with word2vec. In 2014 IEEE International Congress on Big
     Data, pp. 358-363.
[14] Le, Quoc V., and Tomas Mikolov. 2014. Distributed Representations of Sentences and
     Documents. In ICML, vol. 14, pp. 1188-1196.

</pre>