Conditional Random Fields for Code Mixed Entity Recognition [NLP_CEN_AMRITA@CMEE-IL-FIRE-2016]

Barathi Ganesh HB
Artificial Intelligence Practice, Tata Consultancy Services, Kochi - 682 042, India
barathiganesh.hb@tcs.com

Anand Kumar M and Soman KP
Center for Computational Engineering and Networking (CEN), Amrita School of Engineering, Coimbatore, Amrita Vishwa Vidyapeetham, Amrita University, India
m_anandkumar@cb.amrita.edu, kp_soman@amrita.edu

ABSTRACT
Entity Recognition is an essential part of Information Extraction, where explicitly available information and the relations between entities are extracted from the text. A plethora of information is available as text in social media, and its free-style representation introduces much complexity into mining information from it. This complexity is increased further when the text is written in more than one language and uses transliterated words. In this work we apply a sequential modeling algorithm with hybrid features to perform Entity Recognition on the corpus provided by the CMEE-IL (Code Mixed Entity Extraction - Indian Language) organizers. The approach performed well on both the Tamil-English and Hindi-English tweet corpora, attaining nearly 95% precision on the training corpora and F-measures of 45.17% (Hindi-English) and 31.44% (Tamil-English) on the test corpora.

Keywords
Entity Recognition; Sequential Modeling; Conditional Random Fields

1. INTRODUCTION
The information shared by people in this digital era grows continuously (Facebook: www.facebook.com, Twitter: www.twitter.com). Mining information from these social media texts has become essential for both the government and industrial sectors. Moreover, these texts serve as an information source for different text applications [13].

Entity Recognition is one of the key components in information extraction applications; it can be used to extract the implicitly and explicitly available information and the relations between pieces of information [8, 11]. Entity Recognition is the task of assigning words or phrases in a text to a predefined set of real-world entity types such as person, location, organization, etc. [10].

Due to the constraints imposed by social media platforms (number of words and formats) and the absence of proper constraints on the shared text itself (grammar and improper words), mining information from social media text has become complex. When the shared text incorporates multiple languages and transliterated words, building a fully automated analytics system becomes considerably more complex [7]. So far, text analytics applications have focused on English text alone, but in recent works it can be observed that researchers have started contributing towards code mixed text analytics applications [12, 5, 3, 1].

Observing the above, we experiment with a sequential modeling algorithm, Conditional Random Fields (CRF), together with hybrid features, to perform entity extraction on code mixed social media texts (i.e. tweets). A set of corpus-based lexicon features is extracted from the words in the tweets to build a Random Forest Tree based binary classifier (Entity, Non-Entity), which predicts whether a given word is an entity. Along with this binary result, other common lexicon features are utilized to build the CRF based entity recognizer.

The remainder of the paper details the CRF for entity recognition in Section 2 and the Random Forest Tree binary classifier in Section 3; Section 4 covers the feature engineering, the experiments and the observations on the results achieved, and Section 5 concludes.

2. SEQUENTIAL MODELING WITH CONDITIONAL RANDOM FIELDS
Over the last few years, CRF has become the leading algorithm in sequential modeling applications such as Part Of Speech tagging and Named Entity Recognition [9, 2]. CRF is a discriminative, undirected probabilistic graphical model, generally used in structured prediction applications. Unlike ordinary classification methods, CRF can classify a whole sequence of samples (i.e. it loads context from the neighbouring words).

The advantages of CRF over other sequential modeling algorithms are that it avoids the label biasing problem; that the conditional probability distribution is defined over the target label sequence (i.e. the sequence of tags) given an input sequence (i.e. the sequence of words); and that it can easily include a wide variety of arbitrary, non-independent features of the input words [6]. Let x_{1:N} be the word sequence and y_{1:N} the output label sequence; then the CRF can be mathematically represented as

    p(y_{1:N} \mid x_{1:N}) = \frac{1}{Z} \exp\Big( \sum_{i=1}^{N} \Big[ \sum_{j} \lambda_j t_j(y_{i-1}, y_i, x, i) + \sum_{k} \mu_k s_k(y_i, x, i) \Big] \Big)    (1)

    Z = \sum_{y_{1:N}} \exp\Big( \sum_{i=1}^{N} \Big[ \sum_{j} \lambda_j t_j(y_{i-1}, y_i, x, i) + \sum_{k} \mu_k s_k(y_i, x, i) \Big] \Big)    (2)

In the above equations, x represents the input word sequence (e.g. [Vijay, acted, in, a, film, Sura]) and y the output label sequence ([Actor, other, other, other, other, Entertainment]). t_j(y_{i-1}, y_i, x, i) is a transition function constrained by the feature function as given in equation (3), i.e. the probability of the label changing from one label to another, learned from the training corpus, for the change of label between positions i-1 and i in the test sequence. s_k(y_i, x, i) is similar to the emission probability in a Hidden Markov Model, but is constrained by a feature function in the same way as t_j(y_{i-1}, y_i, x, i). Z is the normalization factor, and \lambda_j, \mu_k are the optimization parameters learned from the training corpus.

The transition function t_j(y_{i-1}, y_i, x, i) and emission function s_k(y_i, x, i) take on non-zero values only if b(x, i) is greater than 0. b(x, i) will be greater than 0 if the current state (in the case of the emission functions), or the previous and current states (in the case of the transition functions), take on particular values with respect to the training corpus. An example b(x, i) activation function is given below:

    t_j(y_{i-1}, y_i, x, i) = \begin{cases} b(x, i) & \text{if } y_{i-1} = \text{other and } y_i = \text{Entertainment} \\ 0 & \text{otherwise} \end{cases}    (3)

In the above equation, b(x, i) will be greater than 0 only if the two labels (other, Entertainment) occur consecutively in the training set. From this it is clear that the transition and emission functions are constrained by the feature function b(x, i). Incorporating relevant features from the training set leads to a high-performance sequential modeling system. The nominal and binary features utilized in the proposed approach are given in Table 1.

Feature                                                      Binary   Nominal
Type of the word (all upper, all digit, alphanumeric word,     X
  all symbols, all letter, first letter capital)
Shape of the word, e.g. Vijay -> Uuuuu,                        X        X
  11-12-1991 -> nnsnnsnnnn
Part of speech tag                                                      X
Prefix of length 1 to 4, e.g. Parking -> P, Pa, Par, Park               X
Suffix of length 1 to 4, e.g. Parking -> g, ng, ing, king               X
Length of word                                                          X
Entity or not (decision from Random Forest Tree classifier)    X

Table 1: CRF Features
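To make the feature functions concrete, the following is a minimal sketch of how Table 1 style lexical features could be fed to sklearn-crfsuite (the CRF implementation used in Section 4). The feature names, the toy sentence and the hyper-parameter values are illustrative assumptions, not the exact configuration used in this work.

    # Minimal sketch: Table 1 style lexical features fed to sklearn-crfsuite.
    # Feature names and the toy example are illustrative; the exact feature
    # set and hyper-parameters used in this work may differ.
    import sklearn_crfsuite

    def word2features(sent, i):
        word = sent[i]
        return {
            'word.lower': word.lower(),
            'word.isupper': word.isupper(),   # "all upper" binary feature
            'word.isdigit': word.isdigit(),   # "all digit" binary feature
            'word.istitle': word.istitle(),   # "first letter capital"
            'prefix1': word[:1], 'prefix2': word[:2],
            'prefix3': word[:3], 'prefix4': word[:4],
            'suffix1': word[-1:], 'suffix2': word[-2:],
            'suffix3': word[-3:], 'suffix4': word[-4:],
            'length': len(word),
        }

    # Toy training pair mirroring the running example in the text.
    X_train = [[word2features(s, i) for i in range(len(s))]
               for s in [['Vijay', 'acted', 'in', 'a', 'film', 'Sura']]]
    y_train = [['Actor', 'other', 'other', 'other', 'other', 'Entertainment']]

    crf = sklearn_crfsuite.CRF(algorithm='lbfgs', max_iterations=100)
    crf.fit(X_train, y_train)
    print(crf.predict(X_train))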
3. ENTITY SELECTION WITH RANDOM FOREST TREE
The "Entity or not" feature in Table 1 is a binary feature derived through a Random Forest Tree classifier. More than the other features in Table 1, this binary feature constrains the feature function in the CRF when finding the distribution over the output labels.

Random Forest Tree is a classification algorithm in which the final class is the most frequently occurring class among a set of weak decision trees [4]. In this approach, lexicon-based features are extracted from the entity words, and with these features as attributes the Random Forest Tree classifier predicts the class (entity, not an entity) of a given word.

Given a training set W = w_1, w_2, w_3, ..., w_n (words) with output labels Y = y_1, y_2, y_3, ..., y_n (entity, not an entity) and feature set F = f_1, f_2, f_3, ..., f_n, bagging is performed repeatedly (B times, the number of trees) by selecting random samples and attributes from the training set and building a decision tree for each set. The prediction for the test words \hat{W} is then found by averaging the predictions of all the individual decision trees built from the training set. This can be written as

    f_b = f(W_b, Y_b, F_b)    (4)

    \hat{Y} = \frac{1}{B} \sum_{b=1}^{B} f_b(\hat{W}, \hat{F})    (5)

Corpus-based lexicon features are extracted in order to train the above classifier. Initially, a feature set is built from the entity words available in the Tamil-English and Hindi-English corpora. Taking these features as a vocabulary, a Term-Document Matrix (TDM) is built over the words. This matrix, along with the binary labels (entity, not an entity), is fed to the Random Forest Tree to make the decision. The feature set of the TDM includes the prefixes and suffixes of length 1 to 3 of the words, the length of the words, and the position of the word in its tweet.
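As an illustration, the following sketch builds such a word-level entity / not-an-entity classifier with scikit-learn. Character n-grams from CountVectorizer stand in for the prefix-suffix vocabulary, and the word list, labels and n_estimators value are placeholder assumptions rather than the actual training data or settings.

    # Sketch of the binary (entity / not-an-entity) word classifier described
    # above: prefix-suffix style character n-grams form the TDM vocabulary
    # and a Random Forest votes over it. The tiny word list is a placeholder.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.ensemble import RandomForestClassifier

    words = ['Vijay', 'Sura', 'acted', 'film', 'the']   # placeholder corpus
    labels = [1, 1, 0, 0, 0]                            # 1 = entity word

    # char_wb n-grams of length 1-3 approximate the prefix/suffix features.
    vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(1, 3))
    tdm = vectorizer.fit_transform(words)               # Term-Document Matrix

    forest = RandomForestClassifier(n_estimators=100)
    forest.fit(tdm, labels)

    print(forest.predict(vectorizer.transform(['Surya'])))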
4. EXPERIMENT AND OBSERVATIONS

[Figure 1: Model Diagram of Proposed Approach]

The overall approach was run on a system with the following specification: Linux operating system, Python 3.4, 16 GB RAM and an 8-core processor. CRF is performed with sklearn-crfsuite (pypi.python.org/pypi/sklearn-crfsuite), the TDM matrix is built with the CountVectorizer from the sklearn library (scikit-learn.org), the Random Forest Tree classifier is also from sklearn, part of speech tagging is done with the NLTK library (www.nltk.org), and preprocessing uses tweet-preprocessor (github.com/s/preprocessor).

The statistics of both data-sets are given in Table 2 and Table 3. Initially, the raw tweets are tagged with the corresponding entities of Table 3 according to the annotation file provided by the Code Mixed Entity Extraction - Indian Language task organizers.

Description              Tamil-English   Hindi-English
# Tweets                 3184            2701
# Unique tweets          2821            2669
# Tags                   1624            2413
# Unique tags            21              21
# Entity words           1624            2413
# Unique entity words    1016            1200
# Words                  32142           43766
Avg # words / tweet      10.1            16.2
Entity-word ratio        5.1%            5.5%

Table 2: Data-set Statistics

Entity           Tamil-English   Hindi-English
ARTIFACT         18              25
LIVTHINGS        16              7
DISEASE          5               7
COUNT            94              132
DATE             14              33
FACILITIES       23              10
PERSON           661             712
DISTANCE         4               -
SDAY             6               23
MONTH            25              10
DAY              15              67
PLANTS           3               1
MATERIALS        28              24
TIME             18              22
MONEY            66              25
ENTERTAINMENT    260             810
LOCATION         188             194
LOCOMOTIVE       5               13
ORGANIZATION     68              109
PERIOD           53              44
YEAR             54              143
QUANTITY         -               2

Table 3: Entity Tags Statistics

Since the given data-sets consist of tweets, the tendency for noise is high, and unwanted text and non-text information would yield a sequential model with low performance. This unwanted information, namely web links and emoticons, is removed from the tweets with the twitter preprocessor.
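This cleaning step could be reproduced with the tweet-preprocessor package along the following lines; the sample tweet and the particular option set chosen here are illustrative assumptions.

    # Sketch of the noise-removal step using the tweet-preprocessor package
    # (github.com/s/preprocessor); the sample tweet is invented.
    import preprocessor as p

    # Strip only web links, emojis and smileys, keeping the remaining text.
    p.set_options(p.OPT.URL, p.OPT.EMOJI, p.OPT.SMILEY)

    tweet = 'Vijay acted in film Sura :) http://t.co/xyz'
    print(p.clean(tweet))   # -> 'Vijay acted in film Sura'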
Following the preprocessing step, a set of corpus-based features is extracted from the entity words in a tweet to build the Random Forest Tree based binary classifier. For this extraction, all the entities in the training corpus are first re-tagged as 'Entity' and all other words as 'not an Entity'. From the entity words present in the training corpus, the corresponding prefixes and suffixes of length 1 to 4 are taken to build the vocabulary for the TDM matrix using CountVectorizer.

The TDM matrix is built from the presence of this prefix and suffix information within the words. Along with the TDM matrix, the length of the word, the position of the word in its tweet and the total number of times the prefix or suffix is present in the corpus are taken as attributes to train the Random Forest Tree classifier. √N trees are used to build the Random Forest, where N is the total number of attributes. The testing corpus is passed through the same steps to decide whether a given word is an entity. To measure the training performance, 10-fold cross-validation is carried out, obtaining nearly 96% and 97% respectively for the Tamil-English and Hindi-English corpora.

With this binary feature obtained, the other features mentioned in Table 1 are extracted from the training corpus. A window of length 5 is taken to capture the context of a word as well as its features, by taking the previous two words and the following two words around the current word. Using these features as the constraint function, the CRF sequential model is built for the entity recognition task. Features are extracted in the same way for testing, and output labels are predicted for the input test word sequences. Finally, words with consecutive output labels are concatenated together to form phrases with a single tag. To assess the training performance, cross-validation is carried out as for the Random Forest Tree, obtaining nearly 94% precision for both corpora.
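A minimal sketch of this window-of-5 context loading is given below; the feature names and the padding token are illustrative assumptions rather than the exact scheme used in this work.

    # Sketch of the window-of-length-5 context loading described above:
    # each word's feature dict also carries features of the two words
    # before and after it. Feature names are illustrative.
    def sent2features(sent):
        feats = []
        for i, word in enumerate(sent):
            f = {'word': word.lower(), 'istitle': word.istitle()}
            for off in (-2, -1, 1, 2):           # previous two / next two words
                j = i + off
                if 0 <= j < len(sent):
                    f['%+d:word' % off] = sent[j].lower()
                else:
                    f['%+d:word' % off] = 'PAD'  # sentence boundary marker
            feats.append(f)
        return feats

    print(sent2features(['Vijay', 'acted', 'in', 'a', 'film', 'Sura'])[2])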
The performance of the top 5 teams against the test sets is given in Table 4 and Table 5. It can be observed that the precision of the proposed system is within about 3.5 percentage points of the top score on the Hindi-English corpus and almost equal to it on the Tamil-English corpus. The problem arises with the recall, which affects the final F-measure; hence our future work will focus on improving the recall of the proposed system.

Team                     Precision   Recall   F
Deepak-IIT-Patna         79.92       30.47    44.12
Veena-Amritha-T1         79.51       21.88    34.32
Bharathi-Amrita-T2       79.56       19.59    31.44
Rupal-BITS-Pilani-R2     58.71       12.21    20.22
Shivkaran-Amritha-T3     47.62       13.42    20.94

Table 4: Results: Tamil-English

Team                     Precision   Recall   F
Irshad-IIIT-Hyd          80.92       59.00    68.24
Deepak-IIT-Patna         81.15       50.39    62.17
Veena-Amritha-T1         79.88       41.37    54.51
Bharathi-Amrita-T2       77.72       31.84    45.17
Rupal-BITS-Pilani        58.84       35.32    44.14

Table 5: Results: Hindi-English

5. CONCLUSION
Conditional Random Field based Entity Recognition with hybrid features was experimented on the CMEE-IL (Code Mixed Entity Extraction - Indian Language) corpora and attained good performance. The approach performed well on both the Tamil-English and Hindi-English tweet corpora, attaining nearly 95% precision on the training corpora and F-measures of 45.17% (Hindi-English) and 31.44% (Tamil-English) on the test corpora. Pre-processing of social media text is an essential part of the pipeline: it improves the feature engineering (reducing sparsity) and boosts the performance of the proposed system. Hence, future work will focus on incorporating the necessary pre-processing steps along with the proposed approach.

6. REFERENCES
[1] N. Abinaya, N. John, H. B. Barathi Ganesh, M. Anand Kumar, and K. Soman. Amrita-CEN@FIRE-2014: Named entity recognition for Indian languages using rich features. Pages 103-111, December 2014.
[2] H. B. Barathi Ganesh, N. Abinaya, M. Anand Kumar, R. Vinayakumar, and K. Soman. Amrita-CEN@NEEL: Identification and linking of Twitter entities. 2015.
[3] U. Barman, A. Das, J. Wagner, and J. Foster. Code mixing: A challenge for language identification in the language of social media. Volume 13, 2014.
[4] L. Breiman. Random forests. Machine Learning, 45(1):5-32, October 2001.
[5] A. Das and B. Gamback. Code-mixing in social media text.
[6] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of ICML, pages 282-289, June 2001.
[7] D. Maynard, K. Bontcheva, and D. Rout. Challenges in developing opinion mining tools for social media. Pages 15-22, 2012.
[8] J. Piskorski and R. Yangarber. Information extraction: Past, present and future. 2013.
[9] A. PVS and G. Karthik. Part-of-speech tagging and chunking using conditional random fields and transformation based learning. Volume 21, 2007.
[10] A. Ritter, S. Clark, and O. Etzioni. Named entity recognition in tweets: An experimental study. In Proceedings of EMNLP, pages 1524-1534, July 2011.
[11] J. Tang, M. Hong, D. Zhang, L. B, and L. J. Information extraction: Methodologies and applications. October 2007.
[12] Y. Vyas, S. Gella, J. Sharma, K. Bali, and M. Choudhury. POS tagging of English-Hindi code-mixed social media content. In Proceedings of EMNLP, pages 974-979, October 2014.
[13] D. Westerman, P. Spence, and B. Van Der Heide. Social media as information source: Recency of updates and credibility of information. Volume 19, pages 171-183, January 2014.