=Paper=
{{Paper
|id=Vol-1587/T4-4
|storemode=property
|title=A Hidden Markov Model Based System for Entity Extraction from Social Media English Text at FIRE 2015
|pdfUrl=https://ceur-ws.org/Vol-1587/T4-4.pdf
|volume=Vol-1587
|authors=Kamal Sarkar
|dblpUrl=https://dblp.org/rec/conf/fire/Sarkar15
}}
==A Hidden Markov Model Based System for Entity Extraction from Social Media English Text at FIRE 2015==
A Hidden Markov Model Based System for Entity Extraction from Social Media English Text at FIRE 2015

Kamal Sarkar
Computer Science & Engineering Dept., Jadavpur University
Kolkata-700032, India
jukamal2001@yahoo.com

ABSTRACT

This paper presents the experiments carried out by us at Jadavpur University as part of our participation in the FIRE 2015 task: Entity Extraction from Social Media Text - Indian Languages (ESM-IL). The tool that we have developed for the task is based on a Trigram Hidden Markov Model that utilizes information such as a gazetteer list, POS tags and some other word-level features to enhance the observation probabilities of known as well as unknown tokens. We submitted runs for English only. A statistical HMM (Hidden Markov Model) based model has been used to implement our system. The system has been trained and tested on the datasets released for the FIRE 2015 ESM-IL task. Our system is the best performer for the English language, obtaining precision, recall and F-measure of 61.96, 39.46 and 48.21 respectively.

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval; H.3.4 Systems and Software; H.2.3 [Database Management]: Languages - Query Languages

General Terms

Languages, Performance, Experimentation

Keywords

Named Entity Recognition, Entity Extraction, Social Media, HMM.

1. INTRODUCTION

The objective of named entity recognition (NER) is to identify and classify every word/term in a document into predefined categories such as person name, location name, organization name, and miscellaneous names (date, time, percentage, monetary expressions etc.). NER is an important task, having applications in Information Extraction, Question Answering, Machine Translation, Summarization, Cross-lingual Information Access and other NLP applications.

This paper presents a description of an HMM (Hidden Markov Model) based system for Entity Extraction from Social Media Text in Indian Languages. This named entity recognition system considers a variety of entity types: artifact, entertainment, facilities, location, locomotive, materials, organization, person, plants, count, distance, money, quantity, date, day, period, time, year, month, living thing and Sday.

The task "Entity Extraction from Social Media Text - Indian Languages (ESM-IL)" was defined to build NER systems for four languages - English, Malayalam, Tamil and Hindi - for which training data and test data were provided. We participated for the English language only.

The earliest works on NER primarily used two major approaches: rule-based (linguistic) approaches and machine learning (ML) based approaches. The rule-based approaches typically use a set of hand-crafted rules [1][2][3]. Machine learning based techniques for NER make use of a large amount of NE-annotated training data to acquire higher-level language knowledge from the labeled data. Several ML techniques have already been applied to NER tasks, such as the Hidden Markov Model (HMM) [4], Maximum Entropy (MaxEnt) [5][6] and Conditional Random Fields (CRF) [7]. Hybrid approaches that combine different ML approaches are also used: Srihari et al. (2000) [8] combine MaxEnt, HMM and hand-crafted rules to build an NER system. NER systems also use gazetteer lists for identifying names; both the linguistic approaches [1][3] and the ML-based approaches [5][8] may use gazetteer lists. NER tasks for Hindi have been presented in [9][10][11].

A discussion on the training data is given in Section 2. The HMM-based NER system is described in Section 3. The various features used in NER are then discussed. Next we present the experimental results and related discussion in Section 5. Finally, Section 6 concludes the paper.
Over the past decade, Indian language content on various social media (Twitter, Facebook etc.) has been increasing rapidly. Since companies are interested in ascertaining public views on their products and services, they need natural language processing software systems which identify entities and the relations among them. So, there is a need for automatic entity extraction systems.

2. TRAINING DATA PREPARATION

The training data released for the FIRE shared task contains two files: one file contains the raw text and the other contains the NE annotations, in which each row has 6 columns: tweet-id, user-id, NE-tag, NE raw string, NE-start index and NE-length. The index column is the starting character position of the NE, calculated for each tweet. The participants are instructed to produce output in the same format after testing the system on the test data.

Our system uses the two files supplied as training data and converts them into the IOB format before training; the data converted into IOB (Inside, Outside and Beginning) format (the format used for the CoNLL-2003 shared task on NER) is used for training. The IOB format uses a B-XXX tag that indicates the first word of an entity of type XXX and an I-XXX tag for the subsequent words of an entity. The tag "O" indicates that the word is outside of an NE (i.e., not a part of a named entity).

3. HMM BASED NAMED ENTITY TAGGING

A named entity recognizer based on a Hidden Markov Model (HMM) finds the sequence of NE tags $t_1^n$ that is optimal for a given observation sequence $o_1^n$. By the application of Bayes' law, the tagging problem becomes equivalent to searching for the tag sequence that we need to compute:

$\hat{t}_1^n = \arg\max_{t_1^n} P(o_1^n \mid t_1^n)\, P(t_1^n)$   (1)

where $t_1^n$ is a tag sequence and $o_1^n$ is an observation sequence, $P(t_1^n)$ is the prior probability of the tag sequence and $P(o_1^n \mid t_1^n)$ is the likelihood of the word sequence.

Figure 1. Architecture for our developed HMM based NE extraction system. (The figure shows the pipeline: the raw tweet corpus and the NE annotation file are converted to IOB format, passed through the POS tagger, assigned special tags (meta tags and gazetteer tags), and used to train the HMM-based NE tagger; at the testing phase, test tweets follow the same tagging steps before decoding produces NE-tagged tweets in IOB format.)

In general, HMM-based sequence labeling tasks such as POS tagging use the words in a sentence as the observation sequence [12][13]. But we use MontyTagger [14] to assign POS tags to the data released for the task, so that some additional information, such as the POS of each token in a tweet, becomes available. We also use some other information, such as whether the token contains any digit, whether the token contains a hash tag etc. We use this information in the form of a meta tag (details are presented in the subsequent sections). We also use gazetteer information: if a token is found in a specific gazetteer list, we use the gazetteer tag in place of the POS tag (details are presented in the subsequent sections).

Unlike the traditional HMM-based NER system, to use this additional information for the named entity recognition task we consider a triplet <word, X-tag, meta-tag> as an observation symbol. This is a pseudo token used as an observed symbol; that is, for a tweet of n words, the corresponding observation sequence will be (<w_1, X-tag_1, meta-tag_1>, <w_2, X-tag_2, meta-tag_2>, ..., <w_n, X-tag_n, meta-tag_n>). Here an observation symbol $o_i$ corresponds to <w_i, X-tag_i, meta-tag_i>, and the X-tag can be either a POS tag or a gazetteer tag.

Since Equation (1) is too hard to compute directly, HMM taggers follow the Markov assumption, according to which the probability of a tag depends only on a short memory (a small, fixed number of previous tags). For example, a bigram tagger assumes that the probability of a tag depends only on the previous tag. For our proposed trigram model, the probability of a tag depends on the two previous tags, and thus $P(t_1^n)$ is computed as:

$P(t_1^n) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1}, t_{i-2})$   (2)

Under the assumption that the probability of a word appearing depends only on its own tag, $P(o_1^n \mid t_1^n)$ can be computed as:

$P(o_1^n \mid t_1^n) \approx \prod_{i=1}^{n} P(o_i \mid t_i)$   (3)

Plugging equations (2) and (3) into (1) results in the following equation, by which a bigram tagger estimates the most probable tag sequence:

$\hat{t}_1^n = \arg\max_{t_1^n} P(t_1^n \mid o_1^n) \approx \arg\max_{t_1^n} \prod_{i=1}^{n} P(o_i \mid t_i)\, P(t_i \mid t_{i-1})$   (4)

where the tag transition probabilities $P(t_i \mid t_{i-1})$ represent the probability of a tag given the previous tag, and $P(o_i \mid t_i)$ represents the observation probability of a symbol given its tag.

Considering a special tag $t_{n+1}$ to indicate the end-of-sentence boundary and two special tags $t_{-1}$ and $t_0$ at the starting boundary of the sentence, and adding these three special tags to the tag set [15], gives the following equation for NE tagging:

$\hat{t}_1^n = \arg\max_{t_1^n} P(t_1^n \mid o_1^n) \approx \arg\max_{t_1^n} \left[\prod_{i=1}^{n} P(o_i \mid t_i)\, P(t_i \mid t_{i-1}, t_{i-2})\right] P(t_{n+1} \mid t_n)$   (5)

Equation (5) is still computationally expensive because we need to consider all possible tag sequences of length n. We have used the Viterbi algorithm presented in [16] for finding the most likely tag sequence for a given observation sequence. After assigning the tag sequence to the observation sequence, the X-tag and meta-tag information is removed from the output, and thus the output for an input sentence is converted into an NE-tagged sentence. One of the important problems in applying the Viterbi decoding algorithm is how to handle unknown triplets in the input: triplets which are not present in the training set and whose observation probabilities are therefore not known.
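As a rough illustration of how equation (5) can be decoded by dynamic programming, the following Python sketch runs a trigram Viterbi search over states that are pairs of adjacent tags. This is our own minimal re-creation, not the paper's code: the dictionary-based probability tables, the "<s>" boundary markers and the 1e-12 floor for unseen events are illustrative assumptions (the paper instead estimates unknown-triplet probabilities by suffix analysis, as described below).

```python
from math import log

def viterbi_trigram(obs, tags, p_trans, p_obs, p_end):
    """Trigram Viterbi search for the tag sequence maximizing
    [prod_i P(o_i|t_i) * P(t_i|t_{i-2},t_{i-1})] * P(end|t_n), cf. equation (5).
    States are pairs (t_{i-1}, t_i); "<s>" plays the roles of t_-1 and t_0.
    p_trans[(t2, t1, t)] = P(t | t2, t1); p_obs[(o, t)] = P(o | t)."""
    FLOOR = 1e-12                             # assumed floor for unseen events
    best = {("<s>", "<s>"): (0.0, None)}      # state -> (log-prob, backpointer)
    history = []
    for o in obs:
        new_best = {}
        for (t2, t1), (score, _) in best.items():
            for t in tags:
                p = p_trans.get((t2, t1, t), FLOOR) * p_obs.get((o, t), FLOOR)
                cand = score + log(p)
                if cand > new_best.get((t1, t), (float("-inf"),))[0]:
                    new_best[(t1, t)] = (cand, (t2, t1))
        history.append(new_best)
        best = new_best
    # fold in the end-of-sentence transition P(t_{n+1} | t_n), then trace back
    key = max(best, key=lambda k: best[k][0] + log(p_end.get(k[1], FLOOR)))
    seq = []
    for table in reversed(history):
        seq.append(key[1])          # tag t_i of the current state pair
        key = table[key][1]         # follow the backpointer to (t_{i-2}, t_{i-1})
    return list(reversed(seq))

# Toy example with hypothetical probabilities; plain words stand in here for
# the <word, X-tag, meta-tag> triplets used as observation symbols in the paper.
p_obs = {("john", "B-PER"): 0.9, ("john", "O"): 0.1,
         ("runs", "O"): 0.9, ("runs", "B-PER"): 0.1}
p_trans = {("<s>", "<s>", "B-PER"): 0.5, ("<s>", "<s>", "O"): 0.5,
           ("<s>", "B-PER", "O"): 0.8, ("<s>", "B-PER", "B-PER"): 0.2,
           ("<s>", "O", "O"): 0.7, ("<s>", "O", "B-PER"): 0.3}
p_end = {"O": 0.8, "B-PER": 0.2}
print(viterbi_trigram(["john", "runs"], ["O", "B-PER"], p_trans, p_obs, p_end))
# -> ['B-PER', 'O']
```

Keeping pair states makes each step consider at most |T|^2 states times |T| successor tags, so the whole search is O(n·|T|^3), the standard cost of a trigram HMM decoder.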
So, a dynamic programming approach is used to compute equation (5). To handle the unknown triplets, we estimate the observation probability of an unknown triplet by analyzing the X-tag, the meta-tag and the suffix of the word associated with the triplet, in the following way: the observation probabilities of an unknown triplet <word, X-tag, meta-tag> corresponding to a word in the input sentence are decided according to the suffix of a pseudo word formed by adding the X-tag and meta-tag to the end of the word. We find the observation probabilities of such unknown pseudo words using suffix analysis of all rare pseudo words (frequency <= 2) in the training corpus for the concerned language [13][15].

At the training phase of HMM-based NE tagging, the observation probability matrix and the tag transition probability matrix are created. The architecture of our developed NE tagger is shown in Figure 1.

3.1 Computing Tag Transition Probabilities

As we can see from equation (4), to find the most likely tag sequence for an observation sequence we need to compute two kinds of probabilities: tag transition probabilities and word likelihoods (observation probabilities). Our developed trigram HMM tagger requires the tag trigram probability, $P(t_i \mid t_{i-1}, t_{i-2})$, which is computed by the maximum likelihood estimate from tag trigram counts. To overcome the data sparseness problem, the tag trigram probability is smoothed using the deleted interpolation technique [13][15], which combines the maximum likelihood estimates from counts of tag trigrams, tag bigrams and tag unigrams.

3.2 Computing Observation Probabilities

The observation probability of an observed triplet <word, X-tag, meta-tag>, which is the observed symbol in our case, is computed using the following equation [12][13]:

$P(o \mid t) = \dfrac{C(o, t)}{C(t)}$   (7)

3.3 Viterbi Decoding

The task of a decoder is to find the best hidden state sequence given an input HMM and a sequence of observations. The Viterbi algorithm is the most common decoding algorithm used for HMM-based tagging tasks; it is a standard application of the classic dynamic programming paradigm [16]. Given the tag transition probability matrix and the observation probability matrix, Viterbi decoding (used at the testing phase) accepts a tweet, which is also X-tagged and meta-tagged, and finds the most likely tag sequence for it. Here a tweet is submitted to the Viterbi decoder as the observation sequence of triplets (<w_1, X-tag_1, meta-tag_1>, ..., <w_n, X-tag_n, meta-tag_n>).

4. SPECIAL TAGS

4.1 Meta Tag

Each token has some properties by which it differs from other tokens. For example, a token may consist only of digits, or it may contain a hash. To capture such token-specific information, we use a meta tag. For example, if a token consists only of digits, the meta tag that we assign to the token is ALLDIGITS, which we write ALDT in short. The various meta tags that we use for our task are assigned by rules which are fired in the following order:

    Meta-tag = "YYYY" (default)

    If the first letter of the token is a capital letter Then
        metatag = "ICAP"
    End If

    If the token is an abbreviation Then
        metatag = "ABBR"
    End If

    If the token contains "#" at the beginning and the first character after the hash is a capital letter Then
        metatag = "CHAS"
    ElseIf the token contains "#" at the beginning Then
        metatag = "HASH"
    End If

    If the token contains "@" at the beginning Then
        metatag = "ATSY"
    End If

    If the last character is a colon (":") and the first letter is capital Then
        metatag = "CCOL"
    ElseIf the last character is a colon (":") Then
        metatag = "COLN"
    End If

    If the token contains a hyphen and the first character is capital Then
        metatag = "CHYP"
    ElseIf a hyphen occurs after 3 characters from the beginning Then
        metatag = "HYPH"
    End If

    If the token is 4 digits Then
        metatag = "DFOR"
    ElseIf the token is two digits Then
        metatag = "DTWO"
    ElseIf the token is one digit Then
        metatag = "DONE"
    ElseIf the token contains at least one digit Then
        metatag = "DIGT"
    End If

    If the token contains one comma and at least one digit Then
        metatag = "DCOM"
    ElseIf the last character is a comma and the first character is capital Then
        metatag = "CLCO"
    ElseIf the token contains one comma at the end Then
        metatag = "LCOM"
    ElseIf the token contains more than one comma and the first character is capital Then
        metatag = "CMCO"
    End If

    If the token contains all dots Then
        metatag = "ALDT"
    End If

4.2 Gazetteer Tag

In earlier sections we mentioned that the POS tag of a token is replaced by a gazetteer tag if the token is found in a particular gazetteer list. If the length of a raw word is greater than or equal to 2, before searching in the gazetteer lists we remove from the token symbols such as ",", ".", ":", "#" and "@". The gazetteer lists are described in Table 1.

Table 1: Description of Gazetteer lists

| Gazetteer name | Description | Number of entries in the list |
|---|---|---|
| Bperson | List of first names separated from a list of person names | 657 |
| Iperson | List of words representing second names, third names and last names extracted from a list of person names | 563 |
| Blocation | A list of first words extracted from a list of location names | 1243 |
| Ilocation | A list of words extracted from a list of location names, where a word is not the first word of the location name | 257 |
| facilities | A list of facility names such as school, college etc. | 14 |
| months | A list of English month names | 12 |
| days | A list of English day names | 7 |
| period | A list of words indicating "period", such as "month", "year" etc. | 34 |
| Count expressions | A list of words indicating "count" | 58 |
| Monetary expressions | A list of words indicating monetary expressions, such as lakh, crore etc. | 18 |

We follow these rules for assigning this type of tag to a token:

    X-tag = POS-tag (default tag)

    If the token is found in the Bperson list Then
        X-tag = "BPER"
    ElseIf the token is found in the Iperson list Then
        X-tag = "IPER"
    ElseIf the token is found in the Blocation list Then
        X-tag = "BLOC"
    ElseIf the token is found in the Ilocation list Then
        X-tag = "ILOC"
    ElseIf the token is found in the list of facilities Then
        X-tag = "FACI"
    ElseIf the token is found in the list of month names Then
        X-tag = "MONT"
    ElseIf the token is found in the list of day names Then
        X-tag = "DAYS"
    ElseIf the token is found in the list of period-indicating expressions Then
        X-tag = "PERD"
    ElseIf the token is found in the list of count-denoting expressions Then
        X-tag = "COUN"
    ElseIf the token is found in the list of monetary expressions Then
        X-tag = "MONY"
    End If
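The two rule cascades above can be sketched in Python. This is our own illustration of a representative subset of the rules, not the paper's implementation (which is in Visual Basic): the abbreviation pattern, the edge-stripping of punctuation and the shortened gazetteer set are assumptions.

```python
import re

def meta_tag(token):
    """Sketch of part of the Section 4.1 meta-tag cascade, fired in order.
    The exact conditions (e.g. what counts as an abbreviation) are assumed."""
    tag = "YYYY"                                  # default meta tag
    if token[:1].isupper():
        tag = "ICAP"                              # initial capital
    if re.fullmatch(r"[A-Za-z]\.([A-Za-z]\.)+", token):
        tag = "ABBR"                              # abbreviation, assumed pattern
    if token.startswith("#"):
        tag = "CHAS" if token[1:2].isupper() else "HASH"
    if token.startswith("@"):
        tag = "ATSY"
    if token.endswith(":"):
        tag = "CCOL" if token[:1].isupper() else "COLN"
    if token.isdigit():
        tag = {4: "DFOR", 2: "DTWO", 1: "DONE"}.get(len(token), "DIGT")
    elif any(c.isdigit() for c in token):
        tag = "DIGT"
    return tag

def x_tag(token, pos_tag, gazetteers):
    """X-tag: a gazetteer tag replaces the POS tag when the token is listed.
    Punctuation stripping from the token edges is an assumption."""
    cleaned = token.strip(",.:#@") if len(token) >= 2 else token
    for list_name, tag in [("Bperson", "BPER"), ("Iperson", "IPER"),
                           ("Blocation", "BLOC"), ("Ilocation", "ILOC"),
                           ("months", "MONT"), ("days", "DAYS")]:
        if cleaned in gazetteers.get(list_name, set()):
            return tag
    return pos_tag

# Hypothetical, tiny gazetteer for demonstration only
gaz = {"Blocation": {"New"}, "months": {"January"}}
print(meta_tag("#Delhi"))             # CHAS
print(x_tag("January,", "NNP", gaz))  # MONT
```

Because later rules overwrite earlier ones, the firing order matters: a token such as "#Delhi" is first marked ICAP-eligible material but ends up as CHAS, mirroring the cascade described above.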
5. EVALUATION AND RESULTS

We train our developed named entity recognizer on the training data and tune the parameters of our system on the training data for the English language. After learning the tuning parameters, we test our system on the test data for the concerned language. The description of the data for the English language is shown in Table 2.

Table 2. The description of the data for English language

| Language | Training data (tweets) | Test data (tweets) | NE Types |
|---|---|---|---|
| English | 11003 | 9641 | 21 |

After getting the NE-tagged output in IOB format from the HMM model, we observed that it contains some occurrences of sequences of I-XXX tags whose left boundary is a transition from the tag "O" to I-XXX (whereas, according to the IOB format, the left boundary of a named entity is a transition from any tag to B-XXX). We have also observed that the word sequences to which this type of tag sequence is assigned are not really named entities. So, considering these as errors of the model, we replace each such sequence of I-XXX tags in the output by a sequence of "O" tags. After applying this post-processing to the output produced by the HMM model, the final output file is generated.

Our developed NER system has been evaluated using the traditional precision (P), recall (R) and F-measure (F). For training, tuning and testing our system, we have used the dataset for the English language released by the organizers of the ESM-IL task at FIRE 2015. The organizers released the data in two phases: in the first phase, the training data was released along with the corresponding NE annotation file; in the second phase, the test data was released and no NE annotation file was provided. The contestants were instructed to generate the NE annotation file for the test data using their developed systems, and this file was finally sent to the organizers for evaluation. The organizers evaluated the different runs submitted by the various teams and sent the official results to the participating teams.

We show in Table 3 the results obtained by our submitted run, indicated by team id "KSarkar - JU". As we can see from the table, our system outperforms the other systems that participated in the ESM-IL task. Table 3 shows the FIRE 2015 official results for the English language only; the overall FIRE 2015 official results for the ESM-IL task, including all languages, are shown in Table 4.

Table 3. Official results obtained by the various systems participated in the ESM-IL task at FIRE 2015 for the English language

| Teams | Run | P | R | F |
|---|---|---|---|---|
| Shriya - Amritha | Run1 | 0.08 | 0.064 | 0.071 |
| Sanjay - Amritha | Run1 | 0.057 | 0.028 | 0.038 |
| Sanjay - Amritha | Run2 | 0.043 | 0.021 | 0.028 |
| Chintak - LDRP | Run1 | 7.30 | 4.17 | 5.31 |
| Chintak - LDRP | Run2 | 5.35 | 5.67 | 5.50 |
| KSarkar - JU | Run1 | 61.96 | 39.46 | 48.21 |
| Vira - Charotar Univ | Run1 | 4.13 | 3.39 | 3.72 |
| Pallavi - HITS | Run1 | 50.48 | 32.03 | 39.19 |
| Pallavi - HITS | Run2 | 50.21 | 37.06 | 42.64 |
| Pallavi - HITS | Run3 | - | - | - |
| Sombuddha - JU | Run1 | 46.92 | 32.41 | 38.34 |
| Sombuddha - JU | Run2 | 58.09 | 31.85 | 41.15 |
| Sombuddha - JU | Run3 | 49.10 | 31.59 | 38.45 |
| Sombuddha - JU | Run4 | 46.50 | 30.20 | 36.61 |
| Sombuddha - JU | Run5 | 58.09 | 31.85 | 41.15 |

6. CONCLUSION

This paper describes a named entity recognition system for entity extraction from social media text in the English language. Features such as gazetteer lists, POS tags and some other word-level features have been introduced into the HMM model. The experimental results show that our system is the best performer among the systems that participated in the ESM-IL task for the English language. The named entity recognition system has been developed on the Visual Basic platform so that a suitable user interface can be designed for novice users. The system has been designed in such a way that simply changing the training corpus file can make the system portable to a new Indian language.
Table 4. Language wise official results obtained by the various systems participated in the ESM-IL task at FIRE 2015

| Teams | Run | Hindi P | Hindi R | Hindi F | Tamil P | Tamil R | Tamil F | Malayalam P | Malayalam R | Malayalam F | English P | English R | English F |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Shriya - Amritha | Run1 | 2.61 | 1.44 | 1.86 | 0.19 | 0.066 | 0.09 | 0.23 | 0.29 | 0.26 | 0.08 | 0.064 | 0.071 |
| Sanjay - Amritha | Run1 | 2.27 | 0.16 | 0.29 | 0.30 | 0.07 | 0.12 | 0.075 | 0.036 | 0.04 | 0.057 | 0.028 | 0.038 |
| Sanjay - Amritha | Run2 | - | - | - | 0.250 | 0.077 | 0.11 | - | - | - | 0.043 | 0.021 | 0.028 |
| Chintak - LDRP | Run1 | 67.11 | 0.76 | 1.51 | - | - | - | - | - | - | 7.30 | 4.17 | 5.31 |
| Chintak - LDRP | Run2 | 74.73 | 46.84 | 57.59 | - | - | - | - | - | - | 5.35 | 5.67 | 5.50 |
| KSarkar - JU | Run1 | - | - | - | - | - | - | - | - | - | 61.96 | 39.46 | 48.21 |
| Vira - Charotar Univ | Run1 | 25.65 | 16.14 | 19.82 | - | - | - | - | - | - | 4.13 | 3.39 | 3.72 |
| Pallavi - HITS | Run1 | 81.21 | 44.57 | 57.55 | 70.42 | 14.13 | 23.54 | - | - | - | 50.48 | 32.03 | 39.19 |
| Pallavi - HITS | Run2 | 80.86 | 44.25 | 57.20 | 64.52 | 22.14 | 32.97 | - | - | - | 50.21 | 37.06 | 42.64 |
| Pallavi - HITS | Run3 | 81.49 | 41.58 | 55.06 | - | - | - | - | - | - | - | - | - |
| Sombuddha - JU | Run1 | err | - | - | - | - | - | - | - | - | 46.92 | 32.41 | 38.34 |
| Sombuddha - JU | Run2 | err | - | - | - | - | - | - | - | - | 58.09 | 31.85 | 41.15 |
| Sombuddha - JU | Run3 | err | - | - | - | - | - | - | - | - | 49.10 | 31.59 | 38.45 |
| Sombuddha - JU | Run4 | - | - | - | - | - | - | - | - | - | 46.50 | 30.20 | 36.61 |
| Sombuddha - JU | Run5 | - | - | - | - | - | - | - | - | - | 58.09 | 31.85 | 41.15 |

References

[1] Grishman, R. 1995. The New York University System MUC-6 or Where's the syntax? In Proceedings of the Sixth Message Understanding Conference.

[2] McDonald, D. D. 1996. Internal and external evidence in the identification and semantic categorization of proper names. In B. Boguraev and J. Pustejovsky, editors, Corpus Processing for Lexical Acquisition, 21-39.

[3] Wakao, T., Gaizauskas, R. and Wilks, Y. 1996. Evaluation of an algorithm for the recognition and classification of proper names. In Proceedings of COLING-96.

[4] Bikel, D. M., Miller, S., Schwartz, R. and Weischedel, R. 1997. Nymble: A High Performance Learning Name-finder. In Proceedings of the Fifth Conference on Applied Natural Language Processing, 194-201.

[5] Borthwick, A. 1999. A Maximum Entropy Approach to Named Entity Recognition. Ph.D. thesis, Computer Science Department, New York University.

[6] Kumar, N. and Bhattacharyya, P. 2006. Named Entity Recognition in Hindi using MEMM. Technical Report, IIT Bombay, India.

[7] Li, Wei and McCallum, A. 2004. Rapid Development of Hindi Named Entity Recognition using Conditional Random Fields and Feature Induction (Short Paper). ACM Transactions on Computational Logic.

[8] Srihari, R., Niu, C. and Li, Wei. 2000. A Hybrid Approach for Named Entity and Sub-Type Tagging. In Proceedings of the Sixth Conference on Applied Natural Language Processing.

[9] Cucerzan, S. and Yarowsky, D. 1999. Language Independent Named Entity Recognition Combining Morphological and Contextual Evidence. In Proceedings of the Joint SIGDAT Conference on EMNLP and VLC, 90-99.

[10] Li, Wei and McCallum, A. 2004. Rapid Development of Hindi Named Entity Recognition using Conditional Random Fields and Feature Induction (Short Paper). ACM Transactions on Computational Logic.

[11] Ekbal, A. and Bandyopadhyay, S. 2009. A conditional random field approach for named entity recognition in Bengali and Hindi. Linguistic Issues in Language Technology, 2(1).

[12] Sarkar, K. and Gayen, V. 2012. A practical part-of-speech tagger for Bengali. In Proceedings of the Third International Conference on Emerging Applications of Information Technology (EAIT), Kolkata, pp. 36-40.

[13] Gayen, V. and Sarkar, K. 2014. An HMM based named entity recognition system for Indian languages: the JU system at ICON 2013. arXiv preprint arXiv:1405.7397.

[14] Liu, H. 2004. MontyLingua: an end-to-end natural language processor with common sense. Retrieved in 2005 from web.media.mit.edu/~hugo/montylingua

[15] Brants, T. 2000. TnT - A statistical part-of-speech tagger. In Proceedings of the 6th Applied NLP Conference, pp. 224-231.

[16] Jurafsky, D. and Martin, J. H. 2002. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Pearson Education Series.