=Paper=
{{Paper
|id=Vol-1587/T4-4
|storemode=property
|title=A Hidden Markov Model Based System for Entity Extraction from Social Media English Text at FIRE 2015
|pdfUrl=https://ceur-ws.org/Vol-1587/T4-4.pdf
|volume=Vol-1587
|authors=Kamal Sarkar
|dblpUrl=https://dblp.org/rec/conf/fire/Sarkar15
}}
==A Hidden Markov Model Based System for Entity Extraction from Social Media English Text at FIRE 2015==
A Hidden Markov Model Based System for Entity Extraction from Social Media English Text at FIRE 2015

Kamal Sarkar
Computer Science & Engineering Dept., Jadavpur University
Kolkata-700032, India
jukamal2001@yahoo.com

ABSTRACT

This paper presents the experiments carried out by us at Jadavpur University as part of our participation in the FIRE 2015 task: Entity Extraction from Social Media Text - Indian Languages (ESM-IL). The tool that we have developed for the task is based on a Trigram Hidden Markov Model that utilizes information such as a gazetteer list, POS tags and some other word-level features to enhance the observation probabilities of known as well as unknown tokens. We submitted runs for English only. A statistical HMM (Hidden Markov Model) based model has been used to implement our system. The system has been trained and tested on the datasets released for the FIRE 2015 ESM-IL task. Our system is the best performer for the English language, obtaining precision, recall and F-measure of 61.96, 39.46 and 48.21 respectively.

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval; H.3.4 Systems and Software; H.2.3 [Database Management]: Languages - Query Languages

General Terms

Languages, Performance, Experimentation

Keywords

Named Entity Recognition, Entity Extraction, Social Media, HMM.

1. INTRODUCTION

The objective of named entity recognition (NER) is to identify and classify every word/term in a document into predefined categories such as person name, location name, organization name, and miscellaneous names (date, time, percentage, monetary expressions etc.). NER is an important task, having applications in Information Extraction, Question Answering, Machine Translation, Summarization, Cross-lingual Information Access and other NLP applications.

This paper presents a description of an HMM (Hidden Markov Model) based system for Entity Extraction from Social Media Text in Indian Languages. This named entity recognition system considers a variety of entity types: artifact, entertainment, facilities, location, locomotive, materials, organization, person, plants, count, distance, money, quantity, date, day, period, time, year, month, living thing and Sday.

The task "Entity Extraction from Social Media Text - Indian Languages (ESM-IL)" was defined to build NER systems for four languages - English, Malayalam, Tamil and Hindi - for which training data and test data were provided. We participated for the English language only.

The earliest works on NER primarily used two major approaches: rule-based (linguistic) approaches and machine learning (ML) based approaches. The rule-based approaches typically use a set of hand-crafted rules [1][2][3]. Machine learning based techniques for NER make use of a large amount of NE-annotated training data to acquire higher-level language knowledge from the labeled data. Several ML techniques have already been applied to NER tasks, such as the Hidden Markov Model (HMM) [4], Maximum Entropy (MaxEnt) [5][6] and Conditional Random Fields (CRF) [7]. Hybrid approaches that combine different ML approaches are also used: Srihari et al. (2000) [8] combine MaxEnt, HMM and hand-crafted rules to build an NER system. NER systems also use gazetteer lists for identifying names; both the linguistic approaches [1][3] and the ML-based approaches [5][8] may use gazetteer lists. NER tasks for Hindi have been presented in [9][10][11].

A discussion on the training data is given in Section 2. The HMM-based NER system is described in Section 3. The various features used in NER are then discussed. Next we present the experimental results and related discussion in Section 5. Finally, Section 6 concludes the paper.
Over the past decade, Indian language content on various social media (Twitter, Facebook etc.) has been increasing rapidly. Since companies are interested in ascertaining public views on their products and services, they need natural language processing software systems which identify entities and the relations among them. So, there is a need for automatic entity extraction systems.

2. TRAINING DATA PREPARATION

The training data released for the FIRE shared task contains two files: one file contains the raw text and the other contains the NE annotations, in which each row has 6 columns: tweet-id, user-id, NE-tag, NE raw string, NE-start index and NE-length. The index column is the starting character position of the NE, calculated for each tweet. The participants are instructed to produce output in the same format after testing the system on the test data.

Our system uses the two files supplied as training data and converts them into the IOB format before training; the data converted into IOB (Inside, Outside and Beginning) format (the format used for the CoNLL-2003 shared task on NER) is used for training. The IOB format uses a B-XXX tag that indicates the first word of an entity of type XXX and an I-XXX tag for the subsequent words of an entity. The tag "O" indicates that the word is outside of an NE (i.e., not a part of a named entity).

3. HMM BASED NAMED ENTITY TAGGING

A named entity recognizer based on a Hidden Markov Model (HMM) finds the sequence of NE tags $t_1^n$ that is optimal for a given observation sequence $o_1^n$. By the application of Bayes' law, the tagging problem becomes equivalent to searching for the tag sequence that we need to compute:

$\hat{t}_1^n = \arg\max_{t_1^n} P(o_1^n \mid t_1^n)\, P(t_1^n)$   (1)

where $t_1^n$ is a tag sequence and $o_1^n$ is an observation sequence, $P(t_1^n)$ is the prior probability of the tag sequence and $P(o_1^n \mid t_1^n)$ is the likelihood of the word sequence.

Figure 1. Architecture for our developed HMM based NE extraction system. (The figure shows the pipeline: the raw tweet corpus and the NE annotation file are converted to IOB format, passed through the POS tagger, assigned special tags (meta tags and gazetteer tags), and used to train the HMM-based NE tagger; at the testing phase, test tweets follow the same tagging steps before decoding produces NE-tagged tweets in IOB format.)

In general, HMM-based sequence labeling tasks such as POS tagging use the words in a sentence as the observation sequence [12][13]. But we use MontyTagger [14] to assign POS tags to the data released for the task, so that some additional information, such as the POS of each token in a tweet, becomes available. We also use some other information, such as whether the token contains any digit, whether the token contains a hash tag etc. We use this information in the form of a meta tag (details are presented in the subsequent sections). We also use gazetteer information: if a token is found in a specific gazetteer list, we use the gazetteer tag in place of the POS tag (details are presented in the subsequent sections).

Unlike the traditional HMM-based NER system, to use this additional information for the named entity recognition task we consider a triplet <word, X-tag, meta-tag> as an observation symbol. This is a pseudo token used as an observed symbol; that is, for a tweet of n words, the corresponding observation sequence will be (<w_1, X-tag_1, meta-tag_1>, <w_2, X-tag_2, meta-tag_2>, ..., <w_n, X-tag_n, meta-tag_n>). Here an observation symbol $o_i$ corresponds to <w_i, X-tag_i, meta-tag_i>, and the X-tag can be either a POS tag or a gazetteer tag.

Since Equation (1) is too hard to compute directly, HMM taggers follow the Markov assumption, according to which the probability of a tag depends only on a short memory (a small, fixed number of previous tags). For example, a bigram tagger assumes that the probability of a tag depends only on the previous tag. For our proposed trigram model, the probability of a tag depends on the two previous tags, and thus $P(t_1^n)$ is computed as:

$P(t_1^n) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1}, t_{i-2})$   (2)

Under the assumption that the probability of a word appearing depends only on its own tag, $P(o_1^n \mid t_1^n)$ can be computed as:

$P(o_1^n \mid t_1^n) \approx \prod_{i=1}^{n} P(o_i \mid t_i)$   (3)

Plugging equations (2) and (3) into (1) results in the following equation, by which a bigram tagger estimates the most probable tag sequence:

$\hat{t}_1^n = \arg\max_{t_1^n} P(t_1^n \mid o_1^n) \approx \arg\max_{t_1^n} \prod_{i=1}^{n} P(o_i \mid t_i)\, P(t_i \mid t_{i-1})$   (4)

where the tag transition probabilities $P(t_i \mid t_{i-1})$ represent the probability of a tag given the previous tag, and $P(o_i \mid t_i)$ represents the observation probability of a symbol given its tag.

Considering a special tag $t_{n+1}$ to indicate the end-of-sentence boundary and two special tags $t_{-1}$ and $t_0$ at the starting boundary of the sentence, and adding these three special tags to the tag set [15], gives the following equation for NE tagging:

$\hat{t}_1^n = \arg\max_{t_1^n} P(t_1^n \mid o_1^n) \approx \arg\max_{t_1^n} \left[\prod_{i=1}^{n} P(o_i \mid t_i)\, P(t_i \mid t_{i-1}, t_{i-2})\right] P(t_{n+1} \mid t_n)$   (5)

Equation (5) is still computationally expensive because we need to consider all possible tag sequences of length n. We have used the Viterbi algorithm presented in [16] for finding the most likely tag sequence for a given observation sequence. After assigning the tag sequence to the observation sequence, the X-tag and meta-tag information is removed from the output, and thus the output for an input sentence is converted into an NE-tagged sentence. One of the important problems in applying the Viterbi decoding algorithm is how to handle unknown triplets in the input: triplets which are not present in the training set and whose observation probabilities are therefore not known.
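As a rough illustration of how equation (5) can be decoded by dynamic programming, the following Python sketch runs a trigram Viterbi search over states that are pairs of adjacent tags. This is our own minimal re-creation, not the paper's code: the dictionary-based probability tables, the "<s>" boundary markers and the 1e-12 floor for unseen events are illustrative assumptions (the paper instead estimates unknown-triplet probabilities by suffix analysis, as described below).

```python
from math import log

def viterbi_trigram(obs, tags, p_trans, p_obs, p_end):
    """Trigram Viterbi search for the tag sequence maximizing
    [prod_i P(o_i|t_i) * P(t_i|t_{i-2},t_{i-1})] * P(end|t_n), cf. equation (5).
    States are pairs (t_{i-1}, t_i); "<s>" plays the roles of t_-1 and t_0.
    p_trans[(t2, t1, t)] = P(t | t2, t1); p_obs[(o, t)] = P(o | t)."""
    FLOOR = 1e-12                             # assumed floor for unseen events
    best = {("<s>", "<s>"): (0.0, None)}      # state -> (log-prob, backpointer)
    history = []
    for o in obs:
        new_best = {}
        for (t2, t1), (score, _) in best.items():
            for t in tags:
                p = p_trans.get((t2, t1, t), FLOOR) * p_obs.get((o, t), FLOOR)
                cand = score + log(p)
                if cand > new_best.get((t1, t), (float("-inf"),))[0]:
                    new_best[(t1, t)] = (cand, (t2, t1))
        history.append(new_best)
        best = new_best
    # fold in the end-of-sentence transition P(t_{n+1} | t_n), then trace back
    key = max(best, key=lambda k: best[k][0] + log(p_end.get(k[1], FLOOR)))
    seq = []
    for table in reversed(history):
        seq.append(key[1])          # tag t_i of the current state pair
        key = table[key][1]         # follow the backpointer to (t_{i-2}, t_{i-1})
    return list(reversed(seq))

# Toy example with hypothetical probabilities; plain words stand in here for
# the <word, X-tag, meta-tag> triplets used as observation symbols in the paper.
p_obs = {("john", "B-PER"): 0.9, ("john", "O"): 0.1,
         ("runs", "O"): 0.9, ("runs", "B-PER"): 0.1}
p_trans = {("<s>", "<s>", "B-PER"): 0.5, ("<s>", "<s>", "O"): 0.5,
           ("<s>", "B-PER", "O"): 0.8, ("<s>", "B-PER", "B-PER"): 0.2,
           ("<s>", "O", "O"): 0.7, ("<s>", "O", "B-PER"): 0.3}
p_end = {"O": 0.8, "B-PER": 0.2}
print(viterbi_trigram(["john", "runs"], ["O", "B-PER"], p_trans, p_obs, p_end))
# -> ['B-PER', 'O']
```

Keeping pair states makes each step consider at most |T|^2 states times |T| successor tags, so the whole search is O(n·|T|^3), the standard cost of a trigram HMM decoder.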
So, a dynamic programming approach is used to compute equation (5). To handle the unknown triplets, we estimate the observation probability of an unknown triplet by analyzing the X-tag, the meta-tag and the suffix of the word associated with the triplet, in the following way: the observation probabilities of an unknown triplet <word, X-tag, meta-tag> corresponding to a word in the input sentence are decided according to the suffix of a pseudo word formed by adding the X-tag and meta-tag to the end of the word. We find the observation probabilities of such unknown pseudo words using suffix analysis of all rare pseudo words (frequency <= 2) in the training corpus for the concerned language [13][15].

At the training phase of HMM-based NE tagging, the observation probability matrix and the tag transition probability matrix are created. The architecture of our developed NE tagger is shown in Figure 1.

3.1 Computing Tag Transition Probabilities

As we can see from equation (4), to find the most likely tag sequence for an observation sequence we need to compute two kinds of probabilities: tag transition probabilities and word likelihoods (observation probabilities). Our developed trigram HMM tagger requires the tag trigram probability, $P(t_i \mid t_{i-1}, t_{i-2})$, which is computed by the maximum likelihood estimate from tag trigram counts. To overcome the data sparseness problem, the tag trigram probability is smoothed using the deleted interpolation technique [13][15], which combines the maximum likelihood estimates from counts of tag trigrams, tag bigrams and tag unigrams.

3.2 Computing Observation Probabilities

The observation probability of an observed triplet <word, X-tag, meta-tag>, which is the observed symbol in our case, is computed using the following equation [12][13]:

$P(o \mid t) = \dfrac{C(o, t)}{C(t)}$   (7)

3.3 Viterbi Decoding

The task of a decoder is to find the best hidden state sequence given an input HMM and a sequence of observations. The Viterbi algorithm is the most common decoding algorithm used for HMM-based tagging tasks; it is a standard application of the classic dynamic programming paradigm [16]. Given the tag transition probability matrix and the observation probability matrix, Viterbi decoding (used at the testing phase) accepts a tweet, which is also X-tagged and meta-tagged, and finds the most likely tag sequence for it. Here a tweet is submitted to the Viterbi decoder as the observation sequence of triplets (<w_1, X-tag_1, meta-tag_1>, ..., <w_n, X-tag_n, meta-tag_n>).

4. SPECIAL TAGS

4.1 Meta Tag

Each token has some properties by which it differs from other tokens. For example, a token may consist only of digits, or it may contain a hash. To capture such token-specific information, we use a meta tag. For example, if a token consists only of digits, the meta tag that we assign to the token is ALLDIGITS, which we write ALDT in short. The various meta tags that we use for our task are assigned by rules which are fired in the following order:

    Meta-tag = "YYYY" (default)

    If the first letter of the token is a capital letter Then
        metatag = "ICAP"
    End If

    If the token is an abbreviation Then
        metatag = "ABBR"
    End If

    If the token contains "#" at the beginning and the first character after the hash is a capital letter Then
        metatag = "CHAS"
    ElseIf the token contains "#" at the beginning Then
        metatag = "HASH"
    End If

    If the token contains "@" at the beginning Then
        metatag = "ATSY"
    End If

    If the last character is a colon (":") and the first letter is capital Then
        metatag = "CCOL"
    ElseIf the last character is a colon (":") Then
        metatag = "COLN"
    End If

    If the token contains a hyphen and the first character is capital Then
        metatag = "CHYP"
    ElseIf a hyphen occurs after 3 characters from the beginning Then
        metatag = "HYPH"
    End If

    If the token is 4 digits Then
        metatag = "DFOR"
    ElseIf the token is two digits Then
        metatag = "DTWO"
    ElseIf the token is one digit Then
        metatag = "DONE"
    ElseIf the token contains at least one digit Then
        metatag = "DIGT"
    End If

    If the token contains one comma and at least one digit Then
        metatag = "DCOM"
    ElseIf the last character is a comma and the first character is capital Then
        metatag = "CLCO"
    ElseIf the token contains one comma at the end Then
        metatag = "LCOM"
    ElseIf the token contains more than one comma and the first character is capital Then
        metatag = "CMCO"
    End If

    If the token contains all dots Then
        metatag = "ALDT"
    End If

4.2 Gazetteer Tag

In earlier sections we mentioned that the POS tag of a token is replaced by a gazetteer tag if the token is found in a particular gazetteer list. If the length of a raw word is greater than or equal to 2, before searching in the gazetteer lists we remove from the token symbols such as ",", ".", ":", "#" and "@". The gazetteer lists are described in Table 1.

Table 1: Description of Gazetteer lists

| Gazetteer name | Description | Number of entries in the list |
|---|---|---|
| Bperson | List of first names separated from a list of person names | 657 |
| Iperson | List of words representing second names, third names and last names extracted from a list of person names | 563 |
| Blocation | A list of first words extracted from a list of location names | 1243 |
| Ilocation | A list of words extracted from a list of location names, where a word is not the first word of the location name | 257 |
| facilities | A list of facility names such as school, college etc. | 14 |
| months | A list of English month names | 12 |
| days | A list of English day names | 7 |
| period | A list of words indicating "period", such as "month", "year" etc. | 34 |
| Count expressions | A list of words indicating "count" | 58 |
| Monetary expressions | A list of words indicating monetary expressions, such as lakh, crore etc. | 18 |

We follow these rules for assigning this type of tag to a token:

    X-tag = POS-tag (default tag)

    If the token is found in the Bperson list Then
        X-tag = "BPER"
    ElseIf the token is found in the Iperson list Then
        X-tag = "IPER"
    ElseIf the token is found in the Blocation list Then
        X-tag = "BLOC"
    ElseIf the token is found in the Ilocation list Then
        X-tag = "ILOC"
    ElseIf the token is found in the list of facilities Then
        X-tag = "FACI"
    ElseIf the token is found in the list of month names Then
        X-tag = "MONT"
    ElseIf the token is found in the list of day names Then
        X-tag = "DAYS"
    ElseIf the token is found in the list of period-indicating expressions Then
        X-tag = "PERD"
    ElseIf the token is found in the list of count-denoting expressions Then
        X-tag = "COUN"
    ElseIf the token is found in the list of monetary expressions Then
        X-tag = "MONY"
    End If
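The two rule cascades above can be sketched in Python. This is our own illustration of a representative subset of the rules, not the paper's implementation (which is in Visual Basic): the abbreviation pattern, the edge-stripping of punctuation and the shortened gazetteer set are assumptions.

```python
import re

def meta_tag(token):
    """Sketch of part of the Section 4.1 meta-tag cascade, fired in order.
    The exact conditions (e.g. what counts as an abbreviation) are assumed."""
    tag = "YYYY"                                  # default meta tag
    if token[:1].isupper():
        tag = "ICAP"                              # initial capital
    if re.fullmatch(r"[A-Za-z]\.([A-Za-z]\.)+", token):
        tag = "ABBR"                              # abbreviation, assumed pattern
    if token.startswith("#"):
        tag = "CHAS" if token[1:2].isupper() else "HASH"
    if token.startswith("@"):
        tag = "ATSY"
    if token.endswith(":"):
        tag = "CCOL" if token[:1].isupper() else "COLN"
    if token.isdigit():
        tag = {4: "DFOR", 2: "DTWO", 1: "DONE"}.get(len(token), "DIGT")
    elif any(c.isdigit() for c in token):
        tag = "DIGT"
    return tag

def x_tag(token, pos_tag, gazetteers):
    """X-tag: a gazetteer tag replaces the POS tag when the token is listed.
    Punctuation stripping from the token edges is an assumption."""
    cleaned = token.strip(",.:#@") if len(token) >= 2 else token
    for list_name, tag in [("Bperson", "BPER"), ("Iperson", "IPER"),
                           ("Blocation", "BLOC"), ("Ilocation", "ILOC"),
                           ("months", "MONT"), ("days", "DAYS")]:
        if cleaned in gazetteers.get(list_name, set()):
            return tag
    return pos_tag

# Hypothetical, tiny gazetteer for demonstration only
gaz = {"Blocation": {"New"}, "months": {"January"}}
print(meta_tag("#Delhi"))             # CHAS
print(x_tag("January,", "NNP", gaz))  # MONT
```

Because later rules overwrite earlier ones, the firing order matters: a token such as "#Delhi" is first marked ICAP-eligible material but ends up as CHAS, mirroring the cascade described above.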
5. EVALUATION AND RESULTS

We train our developed named entity recognizer on the training data and tune the parameters of our system on the training data for the English language. After learning the tuning parameters, we test our system on the test data for the concerned language. The description of the data for the English language is shown in Table 2.

Table 2. The description of the data for English language

| Language | Training data (tweets) | Test data (tweets) | NE Types |
|---|---|---|---|
| English | 11003 | 9641 | 21 |

After getting the NE-tagged output in IOB format from the HMM model, we observed that it contains some occurrences of sequences of I-XXX tags whose left boundary is a transition from the tag "O" to I-XXX (whereas, according to the IOB format, the left boundary of a named entity is a transition from any tag to B-XXX). We have also observed that the word sequences to which this type of tag sequence is assigned are not really named entities. So, considering these as errors of the model, we replace each such sequence of I-XXX tags in the output by a sequence of "O" tags. After applying this post-processing to the output produced by the HMM model, the final output file is generated.

Our developed NER system has been evaluated using the traditional precision (P), recall (R) and F-measure (F). For training, tuning and testing our system, we have used the dataset for the English language released by the organizers of the ESM-IL task at FIRE 2015. The organizers released the data in two phases: in the first phase, the training data was released along with the corresponding NE annotation file; in the second phase, the test data was released and no NE annotation file was provided. The contestants were instructed to generate the NE annotation file for the test data using their developed systems, and this file was finally sent to the organizers for evaluation. The organizers evaluated the different runs submitted by the various teams and sent the official results to the participating teams.

We show in Table 3 the results obtained by our submitted run, indicated by team id "KSarkar - JU". As we can see from the table, our system outperforms the other systems that participated in the ESM-IL task. Table 3 shows the FIRE 2015 official results for the English language only; the overall FIRE 2015 official results for the ESM-IL task, including all languages, are shown in Table 4.

Table 3. Official results obtained by the various systems participated in the ESM-IL task at FIRE 2015 for the English language

| Teams | Run | P | R | F |
|---|---|---|---|---|
| Shriya - Amritha | Run1 | 0.08 | 0.064 | 0.071 |
| Sanjay - Amritha | Run1 | 0.057 | 0.028 | 0.038 |
| Sanjay - Amritha | Run2 | 0.043 | 0.021 | 0.028 |
| Chintak - LDRP | Run1 | 7.30 | 4.17 | 5.31 |
| Chintak - LDRP | Run2 | 5.35 | 5.67 | 5.50 |
| KSarkar - JU | Run1 | 61.96 | 39.46 | 48.21 |
| Vira - Charotar Univ | Run1 | 4.13 | 3.39 | 3.72 |
| Pallavi - HITS | Run1 | 50.48 | 32.03 | 39.19 |
| Pallavi - HITS | Run2 | 50.21 | 37.06 | 42.64 |
| Pallavi - HITS | Run3 | - | - | - |
| Sombuddha - JU | Run1 | 46.92 | 32.41 | 38.34 |
| Sombuddha - JU | Run2 | 58.09 | 31.85 | 41.15 |
| Sombuddha - JU | Run3 | 49.10 | 31.59 | 38.45 |
| Sombuddha - JU | Run4 | 46.50 | 30.20 | 36.61 |
| Sombuddha - JU | Run5 | 58.09 | 31.85 | 41.15 |

6. CONCLUSION

This paper describes a named entity recognition system for entity extraction from social media text in the English language. Features such as gazetteer lists, POS tags and some other word-level features have been introduced into the HMM model. The experimental results show that our system is the best performer among the systems that participated in the ESM-IL task for the English language. The named entity recognition system has been developed on the Visual Basic platform so that a suitable user interface can be designed for novice users. The system has been designed in such a way that simply changing the training corpus file can make the system portable to a new Indian language.
Table 4. Language wise official results obtained by the various systems participated in the ESM-IL task at FIRE 2015

| Teams | Run | Hindi P | Hindi R | Hindi F | Tamil P | Tamil R | Tamil F | Malayalam P | Malayalam R | Malayalam F | English P | English R | English F |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Shriya - Amritha | Run1 | 2.61 | 1.44 | 1.86 | 0.19 | 0.066 | 0.09 | 0.23 | 0.29 | 0.26 | 0.08 | 0.064 | 0.071 |
| Sanjay - Amritha | Run1 | 2.27 | 0.16 | 0.29 | 0.30 | 0.07 | 0.12 | 0.075 | 0.036 | 0.04 | 0.057 | 0.028 | 0.038 |
| Sanjay - Amritha | Run2 | - | - | - | 0.250 | 0.077 | 0.11 | - | - | - | 0.043 | 0.021 | 0.028 |
| Chintak - LDRP | Run1 | 67.11 | 0.76 | 1.51 | - | - | - | - | - | - | 7.30 | 4.17 | 5.31 |
| Chintak - LDRP | Run2 | 74.73 | 46.84 | 57.59 | - | - | - | - | - | - | 5.35 | 5.67 | 5.50 |
| KSarkar - JU | Run1 | - | - | - | - | - | - | - | - | - | 61.96 | 39.46 | 48.21 |
| Vira - Charotar Univ | Run1 | 25.65 | 16.14 | 19.82 | - | - | - | - | - | - | 4.13 | 3.39 | 3.72 |
| Pallavi - HITS | Run1 | 81.21 | 44.57 | 57.55 | 70.42 | 14.13 | 23.54 | - | - | - | 50.48 | 32.03 | 39.19 |
| Pallavi - HITS | Run2 | 80.86 | 44.25 | 57.20 | 64.52 | 22.14 | 32.97 | - | - | - | 50.21 | 37.06 | 42.64 |
| Pallavi - HITS | Run3 | 81.49 | 41.58 | 55.06 | - | - | - | - | - | - | - | - | - |
| Sombuddha - JU | Run1 | err | - | - | - | - | - | - | - | - | 46.92 | 32.41 | 38.34 |
| Sombuddha - JU | Run2 | err | - | - | - | - | - | - | - | - | 58.09 | 31.85 | 41.15 |
| Sombuddha - JU | Run3 | err | - | - | - | - | - | - | - | - | 49.10 | 31.59 | 38.45 |
| Sombuddha - JU | Run4 | - | - | - | - | - | - | - | - | - | 46.50 | 30.20 | 36.61 |
| Sombuddha - JU | Run5 | - | - | - | - | - | - | - | - | - | 58.09 | 31.85 | 41.15 |

References

[1] Grishman, R. 1995. The New York University System MUC-6 or Where's the syntax? In Proceedings of the Sixth Message Understanding Conference.

[2] McDonald, D. D. 1996. Internal and external evidence in the identification and semantic categorization of proper names. In B. Boguraev and J. Pustejovsky, editors, Corpus Processing for Lexical Acquisition, 21-39.

[3] Wakao, T., Gaizauskas, R. and Wilks, Y. 1996. Evaluation of an algorithm for the recognition and classification of proper names. In Proceedings of COLING-96.

[4] Bikel, D. M., Miller, S., Schwartz, R. and Weischedel, R. 1997. Nymble: A High Performance Learning Name-finder. In Proceedings of the Fifth Conference on Applied Natural Language Processing, 194-201.

[5] Borthwick, A. 1999. A Maximum Entropy Approach to Named Entity Recognition. Ph.D. thesis, Computer Science Department, New York University.

[6] Kumar, N. and Bhattacharyya, P. 2006. Named Entity Recognition in Hindi using MEMM. Technical Report, IIT Bombay, India.

[7] Li, Wei and McCallum, A. 2004. Rapid Development of Hindi Named Entity Recognition using Conditional Random Fields and Feature Induction (Short Paper). ACM Transactions on Computational Logic.

[8] Srihari, R., Niu, C. and Li, Wei. 2000. A Hybrid Approach for Named Entity and Sub-Type Tagging. In Proceedings of the Sixth Conference on Applied Natural Language Processing.

[9] Cucerzan, S. and Yarowsky, D. 1999. Language Independent Named Entity Recognition Combining Morphological and Contextual Evidence. In Proceedings of the Joint SIGDAT Conference on EMNLP and VLC, 90-99.

[10] Li, Wei and McCallum, A. 2004. Rapid Development of Hindi Named Entity Recognition using Conditional Random Fields and Feature Induction (Short Paper). ACM Transactions on Computational Logic.

[11] Ekbal, A. and Bandyopadhyay, S. 2009. A conditional random field approach for named entity recognition in Bengali and Hindi. Linguistic Issues in Language Technology, 2(1).

[12] Sarkar, K. and Gayen, V. 2012. A practical part-of-speech tagger for Bengali. In Proceedings of the Third International Conference on Emerging Applications of Information Technology (EAIT), Kolkata, pp. 36-40.

[13] Gayen, V. and Sarkar, K. 2014. An HMM based named entity recognition system for Indian languages: the JU system at ICON 2013. arXiv preprint arXiv:1405.7397.

[14] Liu, H. 2004. MontyLingua: an end-to-end natural language processor with common sense. Retrieved in 2005 from web.media.mit.edu/~hugo/montylingua

[15] Brants, T. 2000. TnT - A statistical part-of-speech tagger. In Proceedings of the 6th Applied NLP Conference, pp. 224-231.

[16] Jurafsky, D. and Martin, J. H. 2002. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Pearson Education Series.