=Paper= {{Paper |id=Vol-1737/T7-3 |storemode=property |title=A Hybrid Approach for Entity Extraction in Code-Mixed Social Media Data |pdfUrl=https://ceur-ws.org/Vol-1737/T7-3.pdf |volume=Vol-1737 |authors=Deepak Gupta,Shweta,Shubham Tripathi,Asif Ekbal,Pushpak Bhattacharyya |dblpUrl=https://dblp.org/rec/conf/fire/GuptaSTEB16 }} ==A Hybrid Approach for Entity Extraction in Code-Mixed Social Media Data== https://ceur-ws.org/Vol-1737/T7-3.pdf
     A Hybrid Approach for Entity Extraction in Code-Mixed
                     Social Media Data

                 Deepak Gupta                              Shweta                       Shubham Tripathi
            Comp. Sc. & Engg. Deptt.              Comp. Sc.& Engg. Deptt.              Electrical Engg. Deptt.
                IIT Patna, India                     IIT Patna, India                    MNIT Jaipur, India
          deepak.pcs16@iitp.ac.in               shweta.pcs14@iitp.ac.in     stripathi1770@gmail.com
                                       Asif Ekbal                Pushpak Bhattacharyya
                               Comp. Sc. & Engg. Deptt.          Comp. Sc. & Engg. Deptt.
                                   IIT Patna, India                  IIT Patna, India
                                    asif@iitp.ac.in                     pb@iitp.ac.in

ABSTRACT
Entity extraction is one of the important tasks in many natural language processing (NLP) applications. There has been a significant amount of work on entity extraction, but mostly for a few languages (such as English, some European languages and a few Asian languages) and domains such as newswire. Nowadays social media has become a convenient and powerful way to express one's opinions and sentiment. India is a diverse country with a great deal of linguistic and cultural variation. Texts written on social media are informal in nature, and people often use more than one script while writing. User-generated content such as tweets, blogs and personal websites is written using the Roman script, or sometimes using both Roman and indigenous scripts. Entity extraction is, in general, more challenging for such informal text, and the mixing of codes further complicates the process. In this paper, we propose a hybrid approach for entity extraction from code-mixed language pairs, namely English-Hindi and English-Tamil. We use a rich linguistic feature set to train a Conditional Random Field (CRF) classifier, whose output is post-processed with a carefully hand-crafted set of rules and dictionaries. The proposed system achieves F-scores of 62.17% and 44.12% for the English-Hindi and English-Tamil language pairs, respectively. Our system attains the best F-score among all the systems submitted to the FIRE 2016 shared task for the English-Tamil language pair.

CCS Concepts
•Computing methodologies → Natural Language Processing; •Information Systems → Information Extraction; •Algorithms → Conditional Random Field (CRF);

Keywords
Code-mixing, Entity Extraction, Named Entity Recognition, Conditional Random Field (CRF), Indian Languages, Social Media Data

1.   INTRODUCTION
   Code-mixing refers to the mixing of two or more languages or language varieties; the terms code-switching and code-mixing are often used interchangeably. With the availability of easy internet access, people's involvement in social media has increased considerably. Over the past decade, Indian-language content on media such as blogs, email, websites and chats has grown significantly, and with the advent of smartphones more people are using social media platforms such as WhatsApp, Twitter and Facebook to share their opinions on people, products, services, organizations and governments. This abundance of social media data has created many new opportunities for information access, but also many new challenges, and addressing these challenges has become one of the prime present-day research areas. Non-English speakers, especially Indians, do not always use Unicode to write in Indian languages on social media. Instead, they use Roman-script transliteration, frequently insert English words or phrases through code-mixing, and often mix multiple languages in addition to anglicisms to express their thoughts. Although English is the principal language of social media communication, there is a need to develop mechanisms for other languages, including Indian languages. According to the Indian constitution there are 22 official languages in India. The Census of India 2001, however, reported that India has 122 major languages and 1599 other languages; it recorded 30 languages spoken by more than a million native speakers and 122 spoken by more than 10,000 people. Language diversity and dialect changes instigate frequent code-mixing in India. Hence, Indians are multi-lingual by adaptation and necessity, and frequently switch and mix languages in social media contexts, which poses additional difficulties for the automatic processing of Indian-language social media text. The growth of Indian-language content is expected to increase by more than 70% every year, so there is a great need to process this huge volume of data automatically.
   Named Entity Recognition (NER) is one of the key information extraction tasks, concerned with identifying names of entities such as people, locations, organizations and products. It can be divided into two main phases: entity detection and entity typing (also called classification)[7]. Recently, information extraction over micro-blogs has become an active research topic [4], following early experiments which showed this genre to be extremely challenging for state-of-the-art algorithms[5, 2]. For instance, named entity recognition methods typically achieve 85-90% accuracy on longer texts, but only 30-50% on tweets[16]. First, the shortness of micro-blogs (maximum 140 characters for tweets) makes them hard to
interpret. Consequently, ambiguity is a major problem, since semantic annotation methods cannot easily make use of coreference information. Unlike longer news articles, there is little discourse information per microblog document, and threaded structure is fragmented across multiple documents, flowing in multiple directions. Second, micro-texts exhibit much more language variation, tend to be less grammatical than longer posts, contain unorthodox capitalization, and make frequent use of emoticons, abbreviations and hashtags, which can form an important part of the meaning. To combat these problems, research has focused on microblog-specific information extraction algorithms (e.g. named entity recognition for Twitter using CRFs[16] or hybrid methods[18]). Particular attention has been given to micro-text normalization[8] as a way of removing some of the linguistic noise prior to part-of-speech tagging and entity recognition.
   In the literature, primarily machine learning and rule-based approaches have been used for named entity recognition (NER). Machine learning (ML) based techniques for NER make use of a large amount of NE-annotated training data to acquire higher-level knowledge by extracting relevant features from the labeled data. Several ML techniques have already been applied to NER, such as Support Vector Machine classifiers[9], Maximum Entropy[3, 10], Hidden Markov Models (HMM)[1] and Conditional Random Fields (CRF)[12]. Rule-based techniques have also been explored for this task[6, 13, 19]. Hybrid approaches that combine different machine learning methods have also been used, e.g. by Rohini et al.[17], who combined Maximum Entropy, Hidden Markov Models and handcrafted rules to build an NER system. Entity extraction has been actively researched for over 20 years. Most of the research has, however, focused on resource-rich languages such as English, French and Spanish. Entity extraction and recognition from social media text for Indian languages was introduced at the FIRE-15 workshop[15], and code-mixed entity extraction from social media text for Indian language mixes was introduced at FIRE 2016. Entity extraction from code-mixed social media text poses some key challenges, which are as follows:

   1. The released data set contains code-mixed as well as uni-language utterances.

   2. The set of entities is not limited to the traditional types, e.g. person name, location name, organization name etc.; there are 22 different types of entities to extract from the text.

   3. There is a lack of resources/tools for Indian languages, and code-mixing makes the pre-processing tasks required for NER, such as sentence splitting, tokenization, part-of-speech tagging and chunking, even more difficult.

2.   PROBLEM DEFINITION
   The problem of code-mixed entity extraction comprises two sub-problems: entity extraction and entity classification. Mathematically, it can be described as follows: let S be a code-mixed sentence having n tokens t1 , t2 . . . tn , and let E = {E1 , E2 , . . . Ek } be the set of k pre-defined entity types.

   1. Entity extraction step: extract the set of tokens TE = {ti , tj . . . tk } from S whose characteristics are similar to any of the entity types in E.

   2. Entity classification step: classify each token in TE into one of the entity types in E.

3.   DATASET
   Two language-pair data sets were made available to evaluate system performance. The data were crawled from Twitter; the crawled tweets are mainly in the English-Hindi and English-Tamil language mixes. There are 22 types of entities present in the training data, of which the majority are of type 'Entertainment', 'Person', 'Location' and 'Organization'. The statistics of the training data are shown in Table-1, and some sample tweets from both language pairs are shown in Table-2. The English-Tamil data also contains some tweets written only in Tamil. The English-Hindi data set contains a total of 2700 tweets from 2699 Twitter users; similarly, the English-Tamil data set contains a total of 2183 tweets from 1866 Twitter users.

       Entities          English-Hindi    English-Tamil
                            # Entity         # Entity
       COUNT                   132               94
       PLANTS                    1                3
       PERIOD                   44               53
       LOCOMOTIVE               13                5
       ENTERTAINMENT           810              260
       MONEY                    25               66
       TIME                     22               18
       LIVTHINGS                 7               16
       DISEASE                   7                5
       ARTIFACT                 25               18
       MONTH                    10               25
       FACILITIES               10               23
       PERSON                  712              661
       MATERIALS                24               28
       LOCATION                194              188
       YEAR                    143               54
       DATE                     33               14
       ORGANIZATION            109               68
       QUANTITY                  2                0
       DAY                      67               15
       SDAY                     23                6
       DISTANCE                  0                4
       Total                  2413             1624

Table 1: Training data set statistics for both language pairs. The ENTERTAINMENT entity type has the maximum number of entities in both language pairs.

3.1   Conditional Random Field (CRF)
   Lafferty et al.[11] define the probability of a particular label sequence y given an observation sequence x to be a normalized product of potential functions, each of the form

        exp( Σj λj tj (yi−1 , yi , x, i) + Σk µk sk (yi , x, i) )      (1)
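As a concrete aside, the unnormalized potential above can be evaluated for a toy label sequence. The sketch below uses one hypothetical transition feature and one hypothetical state feature with hand-picked weights; it is for illustration only and is not the system's actual feature set (described in Section 4).

```python
import math

def t_feature(y_prev, y_cur, x, i):
    # Transition feature: fires on a B-ORG -> I-ORG label transition.
    return 1.0 if (y_prev, y_cur) == ("B-ORG", "I-ORG") else 0.0

def s_feature(y_cur, x, i):
    # State feature: fires when a token tagged B-ORG starts with a capital letter.
    return 1.0 if y_cur == "B-ORG" and x[i][0].isupper() else 0.0

def potential(y, x, lam=0.8, mu=0.5):
    # exp( sum_i lam * t(y_{i-1}, y_i, x, i) + sum_i mu * s(y_i, x, i) )
    score = sum(lam * t_feature(y[i - 1], y[i], x, i) for i in range(1, len(x)))
    score += sum(mu * s_feature(y[i], x, i) for i in range(len(x)))
    return math.exp(score)

x = ["Reserve", "Bank", "of", "India"]
y = ["B-ORG", "I-ORG", "I-ORG", "I-ORG"]
print(potential(y, x))  # one transition and one state feature fire: exp(0.8 + 0.5)
```

Normalizing such potentials over all label sequences (the Z(x) term below) is what makes the CRF a proper conditional distribution.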
where tj (yi−1 , yi , x, i) is a transition feature function of the entire observation sequence and the labels at positions i and i − 1 in the label sequence; sk (yi , x, i) is a state feature function of the label at position i and the observation sequence; and λj and µk are parameters to be estimated from the training data. When defining feature functions, we construct a set of real-valued features b(x, i) of the observation that express some characteristic of the empirical distribution of the training data which should also hold in the model distribution. An example of such a feature is

   b(x, i) = 1 if the observation word at position i is 'religion'; 0 otherwise

Each feature function takes on the value of one of these real-valued observation features b(x, i) if the current state (in the case of a state function) or the previous and current states (in the case of a transition function) take on particular values. All feature functions are therefore real-valued. For example, consider the following transition function:

   tj (yi−1 , yi , x, i) = b(x, i) if yi−1 = B-ORG and yi = I-ORG; 0 otherwise

This allows the probability of a label sequence y given an observation sequence x to be written as

   P (y|x, λ) = (1/Z(x)) exp( Σj λj Fj (y, x) )      (2)

where Fj (y, x) can be written as follows:

   Fj (y, x) = Σ(i=1..n) fj (yi−1 , yi , x, i)      (3)

where each fj (yi−1 , yi , x, i) is either a state function s(yi−1 , yi , x, i) or a transition function t(yi−1 , yi , x, i).

 Language Pair      Sample Tweet
 English-Hindi      @YOUniqueDoc Nandu,muje shaq hai ki humari notice ke bagair
                    tere ghar ke secret route ki help se you met kya.
                    My intution is never wrong
                    A RiftWood Productions presents to you the season finale
                    of Le' Bill & Giddy,La Muje'r
 English-Tamil      Ungala nenachu neengale romba proud ah feel panra vishayam
                    enna ?! Post ur comments. Will be read on sun music at 5pm live ;)
                    IruMugan will be a Class + Mass movie like Thani Oruvan.
                    The biggest plus is the screenplay - Thambi Ramaiyah..

Table 2: Sample tweets from both language pairs

4.   FEATURE EXTRACTION
   The proposed system uses an exhaustive set of features for NE recognition. These features are described below.

   1. Context words: Local contextual information is useful in determining the type of the current word. We use the previous two and next two words as context features.

   2. Character n-grams: A character n-gram is a contiguous sequence of n characters extracted from a given word. The set of n-grams for a given token is the result of moving a window of n characters along the text. We extracted character n-grams of length one (unigrams), two (bigrams) and three (trigrams), and use these as features of the classifier.

   3. Word normalization: Words are normalized in order to capture the similarity between two different words that share some common properties. Each uppercase letter is replaced by 'A', each lowercase letter by 'a' and each digit by '0'.

          Words      Normalization
          NH10          AA00
          Maine         Aaaaa
          NCR           AAA

   4. Prefix and suffix: Prefixes and suffixes of a fixed length (here, 3 characters) are stripped from each token and used as features of the classifier.

   5. Word class feature: This feature was defined to ensure that words having similar structures belong to the same class. In the first step we normalize all the words following the process mentioned above; thereafter, consecutive identical characters are squeezed into a single character. For example, the normalized word AAAaaa is converted to Aa. We found this feature to be effective in the biomedical domain, and we adapted it directly without any modification.

   6. Word position: In order to capture word context within the sentence, we use a numeric value to indicate the position of the word in the sentence. The normalized position of the word is used as a feature; its value lies in the range between 0 and 1.

   7. Number of uppercase characters: This feature takes into account the number of uppercase letters in the word. The feature is relative in nature and ranges between 0 and 1.

   8. Test word probability: This feature captures the probability of a test word being labeled with the same tag as in the training data. The length of this feature vector is the total number of labels or output tags, where each bit represents an output tag and is initialized to 0. If the test word does not appear in the training data, every bit retains its initial value of 0. Based on the probability values, we derive two features:

      (a) Top@1-Probability: for the output tag with the highest probability, the corresponding bit in the feature vector is set to 1. All other bits remain 0.

      (b) Top@2-Probability: for the output tags with the highest and second-highest probabilities, the corresponding bits are set to 1 in the feature vector. All other bits remain 0.

   9. Binary features: These binary features were identified after a thorough analysis of the training data.
      (a) isSufficientLength: Most of the entities in the training data have a significant length, so we set a binary feature that fires when the length of the token is greater than a specific threshold. A threshold value of 4 is used for this feature.

      (b) isAllCapital: This feature is set when all the characters of the current token are uppercase.

      (c) isFirstCharacterUpper: This feature is set when the first character of the current token is uppercase.

      (d) isInitCap: This feature checks whether the current token starts with a capital letter, which provides evidence for the target word being of an NE type in English.

      (e) isInitPunDigit: We define a binary-valued feature that checks whether the current token starts with a punctuation mark or a digit, which indicates that the respective word does not belong to any language. A few such examples are 4u, :D, :P etc.

      (f) isDigit: This feature fires when the current token is numeric.

      (g) isDigitAlpha: This feature checks whether the current token is alphanumeric. A word for which this feature is true has a tendency not to be labeled as any named entity type.

      (h) isHashTag: Since we are dealing with Twitter data, we encounter many hashtags in tweets. We define a binary feature that checks whether the current token starts with #.

5.   EXPERIMENTAL SETUP
   To extract entities from the code-mixed data we followed a three-step approach, described in this section. Fig-1 shows the architecture diagram of our proposed approach.

5.1   Pre-processing
   Pre-processing is an important task before applying any classifier. The released data set consisted of raw text sentences together with the list of entities. Two steps were performed as part of pre-processing.

   1. Tokenization: Since the data sets were crawled from Twitter, a tokenizer suited to social media data is needed. We used the CMU PoS tagger[14] for tokenization and PoS tagging of tweets.

   2. Token encoding: We used the IOB encoding for tagging tokens. The IOB format (Inside, Outside, Beginning) is a common tagging format for tokens in chunking tasks in computational linguistics (e.g. named entity recognition). The B- prefix before a tag indicates that the tag is the beginning of a chunk, and an I- prefix indicates that the tag is inside a chunk. An O tag indicates that a token belongs to no chunk.

5.2   Sequence Labeling
   In the literature, primarily HMMs, MEMMs and CRFs have been used for sequence labeling tasks. Here we use a CRF classifier to label the sequence of tokens. The feature set described in Section-4 was formulated as input to our CRF classifier[11]. We used the CRF++ implementation of CRF (https://taku910.github.io/crfpp/), with the default parameter settings.

5.3   Post-processing
   Rule- and dictionary-based post-processing was performed on the labeling obtained from the CRF classifier. The details are as follows.

English-Hindi
For the English-Hindi code-mixed data we used dictionaries of disease names, living things and special days. A list of 250 disease names was obtained from a wiki page (https://simple.wikipedia.org/wiki/List of diseases). A list of 668 living things was created manually from different web page sources. Similarly, a manual list of 92 special days was obtained from www.drikpanchang.com/calendars/indian/indiancalendar.html. The dictionary entries were applied in the order given in Algorithm-1. Finally, regular expressions were formed to correct the PERIOD, MONEY and TIME labels in the post-processed output.

 Input : Disease Name list as DN;
         Living Things list as LT;
         Special Days list as SD;
         List of (token, label) pairs obtained from CRF as L(W,C)
 Output: Post-processed list of (token, label) pairs as PL(W,C')
 PL(W,C') = L(W,C)
 while L(W,C) is non-empty do
    if DN contains Wi then
        C = DISEASE; C' = C;
    else if LT contains Wi then
        C = LIVINGTHINGS; C' = C;
    else if SD contains Wi then
        C = SPECIALDAYS; C' = C;
    else
        C' = C;
 end
 return PL(W,C');
Algorithm 1: Post-processing algorithm for the code-mixed data sets

English-Tamil
Since none of our team members is a native Tamil speaker, we could not perform a deep error analysis of the CRF-predicted output. We used the same dictionaries as for the English-Hindi data set, because they contain only English words. Finally, regular expressions are formed to correct
                                           Run-1                   Run-2                 Run-3                 Best-Run
 S. No.    Team
                                     P       R     F          P      R      F       P      R     F       P        R       F
 1         Irshad-IIITHyd          80.92     59  68.24              NA                    NA           80.92     59.00  68.24
 2         Deepak-IITPatna         81.15   50.39 62.17              NA                    NA           81.15    50.39 62.17
 3         VeenaAmritha-T1         75.19   29.46 42.33         75  29.17   42.00   79.88 41.37 54.51   79.88     41.37  54.51
 4         BharathiAmritha-T2      76.34   31.15 44.25       77.72 31.84   45.17          NA           77.72     31.84  45.17
 5         Rupal-BITSPilani        58.66   32.93 42.18       58.84 35.32   44.14   59.15 34.62 43.68   58.84     35.32  44.14
 6         SomnathJU               37.49   40.28 38.83              NA                    NA           37.49     40.28  38.83
 7         Nikhil-BITSHyd          59.28   19.64 29.50        61.8 26.39   36.99          NA           61.80     26.39  36.99
 8         ShivkaranAmritha-T3     48.17    24.9 32.83              NA                    NA           48.17     24.90  32.83
 9         AnujSaini               72.24   18.85 29.90              NA                    NA           72.24     18.85  29.90

Table 3: Official results obtained by the various teams that participated in the CMEE-IL task at FIRE 2016 for the code-mixed English-Hindi language pair. Here P, R and F denote precision, recall and F-score respectively.

                                           Run-1                   Run-2                 Run-3                 Best-Run
 S. No.    Team
                                     P       R     F          P      R     F        P      R     F       P        R       F
 1         Deepak-IITPatna         79.92   30.47 44.12              NA                    NA           79.92    30.47 44.12
 2         VeenaAmritha-T1         77.38    8.72 15.67       74.74 9.93 17.53      79.51 21.88 34.32   79.51     21.88  34.32
 3         BharathiAmritha-T2       77.7   15.43 25.75       79.56 19.59 31.44            NA           79.56     19.59  31.44
 4         RupalBITSPilani-R2      58.66   10.87 18.20       58.71 12.21 20.22     58.94 11.94 19.86   58.71     12.21  20.22
 5         ShivkaranAmritha-T3     47.62   13.42 20.94              NA                    NA           47.62     13.42  20.94

Table 4: Official results obtained by the various teams that participated in the CMEE-IL task at FIRE 2016 for the code-mixed English-Tamil language pair. Here P, R and F denote precision, recall and F-score respectively.
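The F-scores reported in these tables are the harmonic mean of the corresponding precision and recall. A quick sanity check on the two Deepak-IITPatna best runs (a minimal sketch, nothing system-specific):

```python
def f_score(p, r):
    # Harmonic mean of precision and recall (both given in %).
    return 2 * p * r / (p + r)

# Best runs reported in the tables above:
print(round(f_score(81.15, 50.39), 2))  # English-Hindi -> 62.17
print(round(f_score(79.92, 30.47), 2))  # English-Tamil -> 44.12
```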


the PERIOD, MONEY and TIME entities on the post-processed output.

Figure 1: Proposed model architecture for code-mixed entity extraction

6.   RESULT & ANALYSIS
   Entity extraction models for the English-Hindi and English-Tamil language pairs were trained using CRF as the base classifier. We then tested the system on the test data of each language pair, using the proposed approach to extract entities from both data sets.
The developed entity extraction and identification system has been evaluated using precision (P), recall (R) and F-measure (F). The organizers of the CMEE-IL task at FIRE 2016 released the data in two phases: in the first phase, the training data was released along with the corresponding NE annotation file; in the second phase, the test data was released without any NE annotation file. The NE annotation file extracted for the test data was then sent to the organizers for evaluation. The organizers evaluated the different runs submitted by the various teams and sent the official results to the participating teams.
The official results for the English-Hindi language pair are shown in Table-3. The performance of our system (Deepak-IITPatna) is shown in bold font. Our system obtained the highest precision, 81.15%, among all the submitted systems, and the proposed approach achieved an F-score of 62.17% on the English-Hindi language pair. Table-4 shows the official results on the English-Tamil language pair data set, where the performance of our system (Deepak-IITPatna) is again shown in bold font. Our system is the best performing system among all the submitted systems, achieving 79.92% precision, 30.47% recall and 44.12% F-score. The lower F-score on English-Tamil could be due to the lack of good features that help recognize a Tamil word as a named entity.

7.   CONCLUSION & FUTURE WORK
   This paper describes code-mixed named entity recognition from social media text in the English-Hindi and English-Tamil language pairs. Our proposed approach is a hybrid of a machine learning model and a rule-based system. The experimental results show that our system is the best performer among the systems that participated in the CMEE-IL task for the code-mixed English-Tamil language pair. For the English-Hindi language pair, our system achieved the highest precision, 81.15%, among all the submitted systems. In future we would like to build a more robust code-mixed NER system using deep learning.

8.   REFERENCES
 [1] D. M. Bikel, S. Miller, R. Schwartz, and R. Weischedel. Nymble: a high-performance learning name-finder. In Proceedings of the Fifth Conference on Applied Natural Language Processing, pages 194–201. Association for Computational Linguistics, 1997.
 [2] K. Bontcheva, L. Derczynski, and I. Roberts. Crowdsourcing named entity recognition and entity linking corpora. The Handbook of Linguistic Annotation (to appear), 2014.
 [3] A. Borthwick. A maximum entropy approach to named entity recognition. PhD thesis, New York University, 1999.
 [4] A. E. Cano Basave, A. Varga, M. Rowe, M. Stankovic, and A.-S. Dadzie. Making sense of microposts (#MSM2013) concept extraction challenge. 2013.
 [5] L. Derczynski, D. Maynard, G. Rizzo, M. van Erp, G. Gorrell, R. Troncy, J. Petrak, and K. Bontcheva. Analysis of named entity recognition and linking for tweets. Information Processing & Management, 51(2):32–49, 2015.
 [6] R. Grishman. The NYU system for MUC-6 or where's the syntax? In Proceedings of the 6th Conference on Message Understanding, pages 167–175. Association for Computational Linguistics, 1995.
 [7] R. Grishman and B. Sundheim. Message Understanding Conference-6: A brief history. In Proceedings of the 16th Conference on Computational Linguistics - Volume 1, COLING '96, pages 466–471, Stroudsburg, PA, USA, 1996. Association for Computational Linguistics.
 [8] B. Han and T. Baldwin. Lexical normalisation of short text messages: Makn sens a #twitter. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pages 368–378. Association for Computational Linguistics, 2011.
 [9] H. Isozaki and H. Kazawa. Efficient support vector classifiers for named entity recognition. In Proceedings of the 19th International Conference on Computational Linguistics - Volume 1, pages 1–7. Association for Computational Linguistics, 2002.
[10] N. Kumar and P. Bhattacharyya. Named entity recognition in Hindi using MEMM. Technical Report, IIT Mumbai, 2006.
[11] J. Lafferty, A. McCallum, and F. C. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the International Conference on Machine Learning (ICML), 2001.
[12] W. Li and A. McCallum. Rapid development of Hindi named entity recognition using conditional random fields and feature induction. ACM Transactions on Asian Language Information Processing (TALIP), 2(3):290–294, 2003.
[13] D. McDonald. Internal and external evidence in the identification and semantic categorization of proper names. Corpus Processing for Lexical Acquisition, pages 21–39, 1996.
[14] O. Owoputi, B. O'Connor, C. Dyer, K. Gimpel, N. Schneider, and N. A. Smith. Improved part-of-speech tagging for online conversational text with word clusters. Association for Computational Linguistics, 2013.
[15] P. R. Rao, C. Malarkodi, and S. L. Devi. ESM-IL: Entity extraction from social media text for Indian languages @ FIRE 2015, an overview.
[16] A. Ritter, S. Clark, O. Etzioni, et al. Named entity recognition in tweets: An experimental study. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1524–1534. Association for Computational Linguistics, 2011.
[17] R. Srihari, C. Niu, and W. Li. A hybrid approach for named entity and sub-type tagging. In Proceedings of the Sixth Conference on Applied Natural Language Processing, pages 247–254. Association for Computational Linguistics, 2000.
[18] M. Van Erp, G. Rizzo, and R. Troncy. Learning with the web: Spotting named entities on the intersection of NERD and machine learning. In #MSM, pages 27–30, 2013.
[19] T. Wakao, R. Gaizauskas, and Y. Wilks. Evaluation of an algorithm for the recognition and classification of proper names. In Proceedings of the 16th Conference on Computational Linguistics - Volume 1, pages 418–423. Association for Computational Linguistics, 1996.