<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Conditional Random Fields for Code Mixed Entity Recognition</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Barathi Ganesh HB</string-name>
          <email>barathiganesh.hb@tcs.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anand Kumar M</string-name>
          <string-name>Soman KP</string-name>
          <email>m_anandkumar@cb.amrita.edu</email>
          <email>kp_soman@.amrita.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Artificial Intelligence Practice, Tata Consultancy Services</institution>
          ,
          <addr-line>Kochi - 682 042</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Center for Computational Engineering and</institution>
          ,
          <addr-line>Networking (CEN)</addr-line>
          ,
          <institution>Amrita School of Engineering</institution>
          ,
          <addr-line>Coimbatore</addr-line>
          ,
          <institution>Amrita Vishwa Vidyapeetham, Amrita University</institution>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Entity Recognition is an essential part of Information Extraction, where explicitly available information and the relations between entities are extracted from the text. A plethora of information is available in social media in the form of text, and its free-style representation introduces much complexity while mining information out of it. This complexity is increased further when the text is represented in more than one language and contains transliterated words. In this work we utilized a sequential modeling algorithm with hybrid features to perform Entity Recognition on the corpus given by the CMEE-IL (Code Mixed Entity Extraction - Indian Language) organizers. The experimented approach performed well on both the Tamil-English and Hindi-English tweet corpora, attaining nearly 95% against the training corpus and 45.17% and 31.44% respectively against the testing corpus.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        The information shared by people in this digital era
grows continuously, e.g. on Facebook (www.facebook.com)
and Twitter (www.twitter.com). Mining
information from these social media texts has become
essential for both the government and industrial sectors.
Moreover, these texts serve as an information source for different
text applications [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <p>
        Entity Recognition is one of the major components of
information extraction applications; it can be used to
extract the implicitly and explicitly available information
and the relations between pieces of information [
        <xref ref-type="bibr" rid="ref11 ref8">8, 11</xref>
        ]. Entity
Recognition is the task of assigning words or phrases in a text to
a predefined set of real-world entity types such as person,
location, and organization [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>
        Due to the constraints imposed by social media
platforms (limits on length and format) and the absence of
proper constraints on the shared text (grammar and
improper words), mining information from social media
text has become complex. When the shared text
incorporates multiple languages and transliterated words,
building a fully automated analytics system becomes
considerably harder [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. So far, text analytics applications have
focused mostly on English text alone. In recent work,
researchers have started contributing towards
code-mixed text analytics applications [
        <xref ref-type="bibr" rid="ref1 ref12 ref3 ref5">12, 5, 3, 1</xref>
        ].
      </p>
      <p>Motivated by the above observations, we experimented with a
sequential modeling algorithm - Conditional Random Fields
(CRF) - along with hybrid features for performing entity
extraction on code-mixed social media texts (i.e. tweets).
A set of corpus-based lexicon features is extracted from the
words in the tweets to build a Random Forest based
binary classifier (Entity, Non-Entity). This classifier predicts
whether a given word is an entity or not. Along with this binary
result, other common lexicon features are utilized to build the
CRF-based entity recognizer.</p>
      <p>The remainder of the paper details CRF for entity
recognition in Section 2 and the Random Forest binary
classifier in Section 3; Section 4 covers the feature
engineering, experimentation and observations on the
achieved results, and Section 5 concludes.</p>
    </sec>
    <sec id="sec-2">
      <title>2. SEQUENTIAL MODELING WITH CONDITIONAL RANDOM FIELDS</title>
      <p>
        Over the last few years, CRF has become the pioneering
algorithm for sequential modeling applications (Part Of Speech
tagging, Named Entity Recognition) [
        <xref ref-type="bibr" rid="ref2 ref9">9, 2</xref>
        ]. CRF is a
discriminative, undirected probabilistic graphical model,
which is generally used in structured prediction applications.
Unlike ordinary classification methods, CRF has the
capability of classifying a sequence of samples (i.e. context
loading with respect to the neighbouring words).
      </p>
      <p>
        The advantages of CRF over other sequential modeling
algorithms are that it avoids the label biasing problem; that the
conditional probability distribution is defined over the target label
sequences (i.e. sequences of tags) given the input sequences
(i.e. sequences of words); and that it can easily include a wide
variety of arbitrary, non-independent features of
the input words [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Let x1:N be the word sequence and
y1:N the output label sequence; then the CRF can be
mathematically represented as,
      </p>
      <disp-formula id="eq1">
        <tex-math>p(y_{1:N} \mid x_{1:N}) = \frac{1}{Z}\exp\left(\sum_{i=1}^{N}\sum_{j}\lambda_{j}\, t_{j}(y_{i-1}, y_{i}, x, i) + \sum_{i=1}^{N}\sum_{k}\mu_{k}\, s_{k}(y_{i}, x, i)\right) \quad (1)</tex-math>
      </disp-formula>
      <disp-formula id="eq2">
        <tex-math>Z = \sum_{y'_{1:N}}\exp\left(\sum_{i=1}^{N}\sum_{j}\lambda_{j}\, t_{j}(y'_{i-1}, y'_{i}, x, i) + \sum_{i=1}^{N}\sum_{k}\mu_{k}\, s_{k}(y'_{i}, x, i)\right) \quad (2)</tex-math>
      </disp-formula>
      <p>In the above equations, x represents the input word sequence
([Vijay, acted, in, a, film, Sura]) and y represents the output
label sequence ([Actor, other, other, other, other,
Entertainment]). tj(yi-1, yi, x, i) is a transition function constrained
by the feature function as given in equation (3), i.e. the
probability of the label changing from one label to another,
learned from the training corpus, for a change of label from
position i-1 to i in the test sequence. sk(yi, x, i) is similar to
the emission probability in a Hidden Markov Model, but
constrained by a feature function in the same way as
tj(yi-1, yi, x, i). Z is the normalization factor, and the
lambda-j, mu-k are the optimization parameters learned
from the training corpus.</p>
      <p>The transition function tj(yi-1, yi, x, i) and emission
function sk(yi, x, i) take on non-zero values only if b(x, i) is greater
than 0. b(x, i) will be greater than 0 if the current state
(in the case of the emission functions), or the previous and
current states (in the case of the transition functions), take on
particular values with respect to the training corpus. An
example b(x, i) activation function is given below:</p>
      <disp-formula id="eq3">
        <tex-math>t_{j}(y_{i-1}, y_{i}, x, i) = \begin{cases} b(x, i) &amp; \text{if } y_{i-1} = \text{other and } y_{i} = \text{Entertainment} \\ 0 &amp; \text{otherwise} \end{cases} \quad (3)</tex-math>
      </disp-formula>
      <p>In the above equation, b(x, i) will be greater than 0 only
if the two labels (other, Entertainment) occur
consecutively in the training set. From the above, it is
clear that the transition and emission functions are constrained
by the feature function b(x, i). Incorporating relevant features
from the training set leads to a high-performance sequential
modeling system. A few of the nominal and binary features
utilized in the proposed approach are given in Table 1.</p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption>
          <p>Nominal and binary features utilized in the proposed approach</p>
        </caption>
        <table>
          <thead>
            <tr><th>Feature Type</th><th>Example</th><th>Type</th></tr>
          </thead>
          <tbody>
            <tr><td>Type of the Word</td><td>First letter capital, All upper, All digit, Alphanumeric word, All symbols, All letter</td><td>Binary</td></tr>
            <tr><td>Shape of the Word</td><td>Vijay - Uuuuu, 11-12-1991 - nnsnnsnnnn</td><td>Nominal</td></tr>
            <tr><td>Part of Speech Tag</td><td/><td>Nominal</td></tr>
            <tr><td>Prefix of length 1 to 4</td><td>Parking - P, Pa, Par, Park</td><td>Nominal</td></tr>
            <tr><td>Suffix of length 1 to 4</td><td>Parking - g, ng, ing, king</td><td>Nominal</td></tr>
            <tr><td>Length of Word</td><td/><td>Nominal</td></tr>
            <tr><td>Entity or not</td><td>Decision from Random Forest Tree classifier</td><td>Binary</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <sec id="sec-2-10">
        <title>3. ENTITY SELECTION WITH RANDOM FOREST TREE</title>
        <p>
          The feature mentioned in Table 1 as "Entity or not" is a
binary feature derived through a Random Forest
classifier. More than the other features mentioned in Table
1, this binary feature provides a strong constraint to the feature
function in CRF for finding the distribution over the output labels.
Random Forest is a classification algorithm in which the final
prediction is the most frequent class among a set of weak
decision trees [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. In this approach,
lexicon-based features are extracted from the entity words, and
by considering these features as the attributes for the
Random Forest classifier, the class (entity, not an entity)
of a given word is predicted.
        </p>
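        <p>As a concrete, minimal sketch (our own illustration, not the authors' released code), the word-type, word-shape and affix features of Table 1 can be computed with a few short functions; the function names are ours:</p>
        <preformat>
```python
def word_shape(word):
    # "Shape of the Word" from Table 1: letters map to u/U, digits to n,
    # everything else to s (e.g. the shape of "Vijay" is "Uuuuu")
    out = []
    for ch in word:
        if ch.isdigit():
            out.append("n")
        elif ch.isalpha():
            out.append("U" if ch.isupper() else "u")
        else:
            out.append("s")
    return "".join(out)

def affixes(word, max_len=4):
    # prefixes and suffixes of length 1 to 4, as in Table 1
    n = min(max_len, len(word))
    return ([word[:i] for i in range(1, n + 1)],
            [word[-i:] for i in range(1, n + 1)])

def word_type(word):
    # binary "Type of the Word" checks from Table 1
    return {
        "all_upper": word.isupper(),
        "all_digit": word.isdigit(),
        "all_letter": word.isalpha(),
        "init_cap": word[:1].isupper(),
        "alphanumeric": word.isalnum() and not word.isalpha()
                        and not word.isdigit(),
    }
```
        </preformat>
        <p>For example, word_shape("11-12-1991") yields nnsnnsnnnn, and affixes("Parking") yields the prefixes P, Pa, Par, Park and the suffixes g, ng, ing, king.</p>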
        <p>Given a training set W = w1, w2, w3, ..., wn (words) with
output labels Y = y1, y2, y3, ..., yn (entity, not an entity)
and a feature set F = f1, f2, f3, ..., fn, bagging is repeatedly (B
times - the number of trees) done by selecting random samples
and attributes from the training set and building a decision
tree for each set. The predictions for the test words are then
found by aggregating the predictions of all the individual
decision trees built from the training set. This can be
written as:</p>
        <disp-formula id="eq4">
          <tex-math>f_{b} = f(W_{b}, Y_{b}, F_{b}) \quad (4)</tex-math>
        </disp-formula>
        <disp-formula id="eq5">
          <tex-math>\hat{Y} = \frac{1}{B}\sum_{b=1}^{B} f_{b}(\hat{W}, \hat{F}) \quad (5)</tex-math>
        </disp-formula>
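        <p>Equations (4) and (5) amount to bootstrap sampling followed by vote aggregation. A small self-contained sketch, with a deliberately weak "memorizing" learner standing in for a decision tree (our own simplification, not the scikit-learn implementation):</p>
        <preformat>
```python
import random
from collections import Counter

def train_memorizer(sample):
    # a deliberately weak learner: memorize the majority label
    # for each feature tuple seen in its bootstrap sample
    table = {}
    for features, label in sample:
        table.setdefault(features, Counter())[label] += 1
    return lambda f, t=table: (t[f].most_common(1)[0][0]
                               if f in t else "not a entity")

def bagging_fit(data, n_trees, seed=42):
    # B bootstrap samples (drawn with replacement), one weak learner each
    rng = random.Random(seed)
    return [train_memorizer([rng.choice(data) for _ in data])
            for _ in range(n_trees)]

def bagging_predict(trees, features):
    # aggregate the individual predictions by majority vote, eq. (5)
    votes = Counter(tree(features) for tree in trees)
    return votes.most_common(1)[0][0]
```
        </preformat>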
        <p>Corpus-based lexicon features are extracted in order to
train the above classifier. Initially a feature set is built
from the entity words available in the Tamil-English and
Hindi-English corpora. Then, taking these features as a
vocabulary, a Term-Document Matrix (TDM) is built against
the words. This matrix, along with the binary labels
(entity, not an entity), is fed to the Random Forest to
make the decision. The feature set of the TDM includes
prefixes and suffixes of length 1 to 3 of the words, the length
of the word, and the position of the word in its tweet.</p>
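        <p>Under the assumption that only affix presence is encoded (the paper builds this with scikit-learn's CountVectorizer), constructing the affix vocabulary and one TDM row might look as follows; the function names are ours:</p>
        <preformat>
```python
def affix_vocabulary(entity_words, max_len=3):
    # vocabulary of prefixes and suffixes (length 1 to 3)
    # taken from the known entity words in the training corpus
    vocab = set()
    for w in entity_words:
        for i in range(1, min(max_len, len(w)) + 1):
            vocab.add("pre:" + w[:i])
            vocab.add("suf:" + w[-i:])
    return sorted(vocab)

def tdm_row(word, vocab, max_len=3):
    # one Term-Document-Matrix row: which vocabulary affixes
    # the given word carries
    affixes = set()
    for i in range(1, min(max_len, len(word)) + 1):
        affixes.add("pre:" + word[:i])
        affixes.add("suf:" + word[-i:])
    return [1 if term in affixes else 0 for term in vocab]
```
        </preformat>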
        <table-wrap id="tab2">
          <label>Table 2</label>
          <caption>
            <p>Corpus statistics</p>
          </caption>
          <table>
            <thead>
              <tr><th>Description</th><th>Tamil-English</th><th>Hindi-English</th></tr>
            </thead>
            <tbody>
              <tr><td># Tweets</td><td>3184</td><td>2701</td></tr>
              <tr><td># Unique Tweets</td><td>2821</td><td>2669</td></tr>
              <tr><td># Tags</td><td>1624</td><td>2413</td></tr>
              <tr><td># Unique Tags</td><td>21</td><td>21</td></tr>
              <tr><td># Entity words</td><td>1624</td><td>2413</td></tr>
              <tr><td># Unique Entity words</td><td>1016</td><td>1200</td></tr>
              <tr><td># Words</td><td>32142</td><td>43766</td></tr>
              <tr><td>Avg # words / tweet</td><td>10.1</td><td>16.2</td></tr>
              <tr><td>Entity-word ratio</td><td>5.1%</td><td>5.5%</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>The overall approach is performed on a system with the
following specification: Linux operating system, Python 3.4, 16
GB RAM and an 8 core processor. To perform CRF,
sklearn-crfsuite (pypi.python.org/pypi/sklearn-crfsuite) is
utilized; the TDM is built using the CountVectorizer from the
scikit-learn library (scikit-learn.org), which also provides the
Random Forest classifier; part of speech tagging is done using
the NLTK library (www.nltk.org); and preprocessing uses
tweet-preprocessor (github.com/s/preprocessor).</p>
        <p>Statistics for both data-sets are given in
Table 2 and Table 3. Initially the raw tweets are tagged with the
corresponding entities given in Table 3, with respect to the
annotation file provided by the Code Mixed Entity Extraction
- Indian Language task organizers.</p>
        <p>Since the given data-sets consist of tweets, the tendency for
noise is higher, and unwanted textual and non-textual information
would lead to a sequential model with low performance.
Such unwanted information, web links and emoticons are
removed from the tweets using the tweet preprocessor.</p>
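        <p>The cleaning step can be approximated with two regular expressions (a minimal stand-in for the tweet-preprocessor library; the patterns below are our own and cover only links and a few common emoticons):</p>
        <preformat>
```python
import re

# strip web links and a few common emoticons from a tweet
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMOTICON_RE = re.compile(r"[:;=][-~]?[)(DPp]")

def clean_tweet(text):
    text = URL_RE.sub("", text)
    text = EMOTICON_RE.sub("", text)
    return " ".join(text.split())  # normalize whitespace
```
        </preformat>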
        <p>Following the preprocessing step, a set of corpus-based
features is extracted from the entity words in a tweet to
build the Random Forest based binary classifier. For this
extraction, initially all the entities in the training corpus are
re-tagged as 'Entity' and the others as 'not an Entity'. From
the entity words present in the training corpus, their
corresponding prefixes and suffixes of length 1 to 4 are taken
to build the vocabulary for the TDM using CountVectorizer.</p>
        <p>The TDM is built based upon the presence of the
prefix and suffix information within the words. Along with
this TDM, the length of the word, the position of the word
within its tweet, and the total number of times the prefix or
suffix is present in the corpus are taken as attributes to train
the Random Forest classifier. Approximately the square root
of N trees are utilized to build the Random Forest, where N is
the total number of attributes. Similarly, the testing corpus is
passed through the above steps to determine whether a given
word is an entity or not. In order to measure the training
performance, 10-fold cross validation is carried out, obtaining
nearly 96% and 97% respectively for the Tamil-English and
Hindi-English corpora.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. EXPERIMENTATION AND OBSERVATIONS</title>
      <p>With the above obtained binary feature, the other features
mentioned in Table 1 are extracted from the training
corpus. A window of length 5 is taken to capture the context
of a word, forming features from the two previous words and
the two following words around the current word. Using these
features as the constraint functions, a CRF sequential model is
built for the entity recognition task. Similarly, features are
extracted for testing, and output labels are predicted for the
input test word sequences. Finally, words with consecutive
identical output labels are concatenated together to form
phrases with a single tag. To verify the training performance,
cross validation is carried out as for the Random Forest,
obtaining nearly 94% precision for both corpora.</p>
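      <p>A window-of-5 feature extractor of this kind can be sketched as follows (a simplified illustration in the token-dictionary style used by sklearn-crfsuite; the feature names are ours):</p>
      <preformat>
```python
def token_features(words, i):
    # features for the word at position i, with context from a window of
    # length 5 (two previous and two following words)
    w = words[i]
    feats = {
        "word": w,
        "lower": w.lower(),
        "init_cap": w[:1].isupper(),
        "length": len(w),
    }
    for offset in (-2, -1, 1, 2):
        j = i + offset
        if j in range(len(words)):  # skip positions outside the sentence
            feats["word@%+d" % offset] = words[j].lower()
    return feats

def sentence_features(words):
    # one feature dictionary per token, as a CRF toolkit would expect
    return [token_features(words, i) for i in range(len(words))]
```
      </preformat>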
      <p>The performance of the top 5 teams against the test set is
given in Table 4 and Table 5. It can be observed that the
precision of the proposed system varies only by about 2% from
the top score on the Hindi-English corpus and is almost equal
to it on the Tamil-English corpus. The problem arises with the
recall, which affects the final F-measure. Hence our future work
will focus on improving the recall of the proposed system.</p>
      <table-wrap id="tab4">
        <label>Table 4</label>
        <caption>
          <p>Precision of the top 5 teams against the test set</p>
        </caption>
        <table>
          <thead>
            <tr><th>Team</th><th>Precision</th></tr>
          </thead>
          <tbody>
            <tr><td>Irshad-IIIT-Hyd</td><td>80.92</td></tr>
            <tr><td>Deepak-IIT-Patna</td><td>81.15</td></tr>
            <tr><td>Veena-Amritha-T1</td><td>79.88</td></tr>
            <tr><td>Bharathi-Amrita-T2</td><td>77.72</td></tr>
            <tr><td>Rupal-BITS-Pilani</td><td>58.84</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <sec id="sec-3-3">
        <title>5. CONCLUSION</title>
        <p>Conditional Random Field based Entity Recognition with
hybrid features was experimented on the CMEE-IL (Code
Mixed Entity Extraction - Indian Language) corpus and
attained good performance. The experimented approach
performed well on both the Tamil-English and Hindi-English
tweet corpora, attaining nearly 95% against the training
corpus and 45.17% and 31.44% respectively against the
testing corpus. Preprocessing of social media text is an
essential part; it improves the feature engineering (reducing
sparsity) and boosts the performance of the proposed system.
Hence future work will focus on incorporating the necessary
pre-processing steps along with the proposed approach.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>N.</given-names>
            <surname>Abinaya</surname>
          </string-name>
          ,
          <string-name>
            N. John,
            <given-names>H. B.</given-names>
            <surname>Barathi Ganesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Anand</given-names>
            <surname>Kumar</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Soman</surname>
          </string-name>
          . Amrita-CEN@FIRE-2014:
          <article-title>Named entity recognition for indian languages using rich features</article-title>
          . pages
          <fpage>103</fpage>
          -
          <lpage>111</lpage>
          ,
          <year>December 2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>H. B.</given-names>
            <surname>Barathi Ganesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Abinaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Anand</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Vinayakumar</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Soman</surname>
          </string-name>
          .
          <article-title>Amrita-CEN-NEEL: Identification and linking of twitter entities</article-title>
          .
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>U.</given-names>
            <surname>Barman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wagner</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Foster</surname>
          </string-name>
          .
          <article-title>Code mixing: A challenge for language identification in the language of social media</article-title>
          . volume
          <volume>13</volume>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Breiman</surname>
          </string-name>
          .
          <article-title>Random forests</article-title>
          . volume
          <volume>1</volume>
          , pages
          <fpage>5</fpage>
          -
          <lpage>32</lpage>
          ,
          <year>October 2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Das</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Gamback</surname>
          </string-name>
          .
          <article-title>Code-mixing in social media text</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lafferty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>McCallum</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Pereira</surname>
          </string-name>
          .
          <article-title>Conditional random fields: Probabilistic models for segmenting and labeling sequence data</article-title>
          . volume
          <volume>1</volume>
          , pages
          <fpage>282</fpage>
          -
          <lpage>289</lpage>
          ,
          <year>June 2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D.</given-names>
            <surname>Maynard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Bontcheva</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Rout</surname>
          </string-name>
          .
          <article-title>Challenges in developing opinion mining tools for social media</article-title>
          .
          pages
          <fpage>15</fpage>
          -
          <lpage>22</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Piskorski</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Yangarber</surname>
          </string-name>
          . Information extraction: Past, present and future.
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>PVS</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Karthik</surname>
          </string-name>
          .
          <article-title>Part-of-speech tagging and chunking using conditional random fields and transformation based learning</article-title>
          . volume
          <volume>21</volume>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ritter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Clark</surname>
          </string-name>
          , and
          <string-name>
            <given-names>O.</given-names>
            <surname>Etzioni</surname>
          </string-name>
          .
          <article-title>Named entity recognition in tweets: an experimental study</article-title>
          .
          pages
          <fpage>1524</fpage>
          -
          <lpage>1534</lpage>
          ,
          <year>July 2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Liang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <article-title>Information extraction: Methodologies and applications</article-title>
          .
          <year>October 2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Vyas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Bali</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Choudhury</surname>
          </string-name>
          .
          <article-title>Pos tagging of english-hindi code-mixed social media content</article-title>
          . volume
          <volume>14</volume>
          , pages
          <fpage>974</fpage>
          -
          <lpage>979</lpage>
          ,
          <year>October 2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>D.</given-names>
            <surname>Westerman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Spence</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B. Van Der</given-names>
            <surname>Heide</surname>
          </string-name>
          .
          <article-title>Social media as information source: Recency of updates and credibility of information</article-title>
          . volume
          <volume>19</volume>
          , pages
          <fpage>171</fpage>
          -
          <lpage>183</lpage>
          ,
          <year>January 2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>