Feature Based Approach to Named Entity Recognition and Linking for Tweets

Souvick Ghosh, Promita Maitra, Dipankar Das
Jadavpur University, Kolkata, West Bengal 700032, India
souvick.gh@gmail.com, promita.maitra@gmail.com, dipankar.dipnil2005@gmail.com

ABSTRACT

In this paper, we describe our approach to the Named Entity rEcognition and Linking (NEEL) Challenge at #Microposts2016. The task is to automatically recognize entities and their types in English microposts and to link them to the corresponding DBpedia 2015 entries; where no such resource exists, we assign NIL identifiers instead. The task is challenging because Twitter data is informal in nature, with non-standard spellings, random contractions and various other kinds of noise. For this task, we developed our system using a hybrid model: we used several existing named entity recognition (NER) systems and combined them with our own classifier to improve the results.

Keywords

Named Entity Extraction; Named Entity Linking; Social Media; DBpedia; Twitter

1. INTRODUCTION

In the present-day world, the relevance and importance of various social media platforms are immeasurable. Microposts such as tweets are limited in their number of characters; however, the conciseness of the text is barely a pointer to its usefulness. From opinion mining during political campaigns to live feeds during sports events, from product reviews to vacation posts, Twitter is almost ubiquitous. Twitter promotes instant communication. Most celebrities use it to build their own digital presence, and it also serves as a common forum where people can rise from obscurity to prominence by sharing their opinions.

If we compare microposts to standard long documents such as blog posts or news articles, there are a number of differences. Long articles are usually well written: they follow a definite structure, include headings and obey the rules of English grammar. Microposts, on the other hand, are short, noisy and show hardly any adherence to formal grammar. The presence of extraneous characters such as hashtags and abbreviations, together with the lack of structure and context, makes it difficult to extract relevant information. Due to this complexity, existing named entity recognition (NER) systems do not perform very well on tweet data.

In the NEEL challenge [8] of #Microposts2016 [6], we were required to automatically identify named entities and their types in Twitter data and link them to the corresponding URIs of the DBpedia 2015-04 dataset (http://wiki.dbpedia.org/dbpedia-data-set-2015-04). Identifying named entities and linking them to an existing knowledge base enriches the text with more contextual and semantic information. Mentions which could not be linked to any existing DBpedia resource page were recognized as NIL mentions. These mentions were clustered to ensure that the same entity, when it has no corresponding entry in DBpedia, is referenced with the same NIL identifier.

We have developed three systems for the NEEL challenge, the major difference between them being the features used in each run. Our system follows a hybrid approach in which the Stanford Named Entity Recognition system is used to identify entity mentions. In the next step, we run the ARK Twitter Part-of-Speech Tagger to identify the mentions that were missed earlier. We use our own classifier to detect the types of the mentions, and the named entity linking to DBpedia resources is done using Babelfy (http://babelfy.org/). It must be noted that we followed a feature-based approach for the NEEL challenge, and that we combined the existing tools for Named Entity Recognition and Linking. Each of these tools, such as the Stanford NER, the ARK Part-of-Speech Tagger and Babelfy, is state-of-the-art; we explored their strengths and weaknesses in our work.

2. OUR SYSTEM

Our system follows four steps in a pipeline, as shown in Figure 1: mention detection in two stages, followed by mention type classification, mention linking and NIL clustering.

[Figure 1: Workflow of the system.]

2.1 Preprocessing

From the training data, the mentions referring to the 7 types of entities were extracted to form 7 bags of words. Using these initial words as seeds, the Wikipedia dumps were crawled to expand the sets of words. These lists represent potential candidates for named entity mentions.

2.2 Detection of Entity Mentions

In this step, the named entity mentions in the given tweets are identified using two different approaches.

2.2.1 Using Stanford Named Entity Recognizer

The Stanford Named Entity Recognizer (http://nlp.stanford.edu/software/CRF-NER.shtml) was used to extract named entities. It is a CRF classifier implementing a linear-chain Conditional Random Field. We use the 3-class model to extract named entities belonging to the classes Location, Person and Organization. While the recall of Stanford NER was very low, its precision was quite good.
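To make this step concrete, the following is a minimal Java sketch of how the 3-class model can be queried through the Stanford NER API; the model path and the example tweet are illustrative placeholders rather than our exact setup.

```java
import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.util.Triple;

import java.util.List;

public class StanfordMentionDetector {
    public static void main(String[] args) {
        // Load the 3-class model (Location, Person, Organization) shipped
        // with the Stanford NER distribution; the path is a placeholder.
        CRFClassifier<CoreLabel> classifier = CRFClassifier.getClassifierNoExceptions(
                "classifiers/english.all.3class.distsim.crf.ser.gz");

        String tweet = "Obama meets Putin in Paris"; // placeholder input

        // classifyToCharacterOffsets returns one (label, begin, end) triple
        // per detected entity span, with the end offset exclusive.
        List<Triple<String, Integer, Integer>> spans =
                classifier.classifyToCharacterOffsets(tweet);
        for (Triple<String, Integer, Integer> span : spans) {
            String mention = tweet.substring(span.second(), span.third());
            System.out.println(mention + " -> " + span.first());
        }
    }
}
```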
2.2.2 Using ARK Twitter Part-of-Speech Tagger

The tweets were tokenized and assigned part-of-speech tags using the ARK Twitter Part-of-Speech Tagger [1]. We used the Twitter POS model with its 25-tag tagset. The proper nouns (corresponding to NNP and NNPS, tagged ^) and possessive proper nouns (tagged Z), along with hashtags (tagged #) and at-mentions (tagged @), were extracted as probable candidates for named entity mentions. The mentions that were already identified by Stanford NER are not considered in the classification step, as they are already typed by that tagger. The rest of the mentions are classified by our own classifier in the next step.
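This candidate extraction can be sketched as follows, assuming the ark-tweet-nlp Java API (cmu.arktweetnlp.Tagger) and the model file bundled with its distribution; the helper class and method names are ours.

```java
import cmu.arktweetnlp.Tagger;
import cmu.arktweetnlp.Tagger.TaggedToken;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class ArkCandidateExtractor {
    // Tags marking probable entity candidates in the 25-tag tagset:
    // ^ = proper noun, Z = proper noun + possessive, # = hashtag, @ = at-mention.
    private static final String CANDIDATE_TAGS = "^Z#@";

    public static List<String> candidates(String tweet) throws IOException {
        Tagger tagger = new Tagger();
        // Model file shipped inside the ark-tweet-nlp jar.
        tagger.loadModel("/cmu/arktweetnlp/model.20120919");

        List<String> mentions = new ArrayList<>();
        for (TaggedToken tt : tagger.tokenizeAndTag(tweet)) {
            if (CANDIDATE_TAGS.contains(tt.tag)) {
                mentions.add(tt.token);
            }
        }
        return mentions;
    }
}
```

In practice the tagger would be loaded once and reused across tweets rather than per call.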
2.3 Classification of Entity Types

Using the machine learning toolkit WEKA [2], we formed a feature set from the features listed below and used the Random Forest classifier, an ensemble of decision trees, for the 7-way classification of the named entity mentions into Thing, Event, Character, Location, Organization, Person and Product, providing the noun entities identified in the previous steps as input. We compared the accuracy of various classifiers, such as Naïve Bayes, k-Nearest Neighbour and Support Vector Machine, on the training data with 10-fold cross-validation; Random Forest gave the best results.

2.3.1 Features for Run 1

The twelve features used for Run 1 were as follows:
- Length of the mention string
- Whether the mention is all capitalized
- Whether the mention contains mixed case
- Whether the mention contains digits
- Whether an internal period is present in the mention string
- Whether the mention is present in the list of Persons
- Whether the mention is present in the list of Things
- Whether the mention is present in the list of Events
- Whether the mention is present in the list of Characters
- Whether the mention is present in the list of Locations
- Whether the mention is present in the list of Organizations
- Whether the mention is present in the list of Products

The above-mentioned lists are the bags of words produced from the training data in the preprocessing step.

2.3.2 Features for Run 2

In Run 1 we made use of various text-based features and the bags of words. In Run 2, we explored contextual features in addition to the features of Run 1, combining ten new features with the previous twelve. The ten additional features were as follows:
- Context score for the Person class
- Context score for the Location class
- Context score for the Character class
- Context score for the Organization class
- Context score for the Event class
- Context score for the Thing class
- Context score for the Product class
- Frequency of the part-of-speech tag of the mention
- Frequency of the previous part-of-speech tag
- Frequency of the next part-of-speech tag

The context score of a particular mention is calculated over a three-word window around the mention. For each class, we have the number of occurrences of each word of that window in the class's bag of words; the feature value we assign is the sum of these frequencies over the words forming the window (a sketch of this computation follows below).
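The following minimal Java sketch shows the context-score computation as we have described it; the per-class frequency map, which would be built from the bags of words of Section 2.1, and the lowercasing of window words are simplifying assumptions.

```java
import java.util.List;
import java.util.Map;

public class ContextScore {
    /**
     * Sums, over the words in the context window of a mention, the frequency
     * of each word in one entity class's bag of words. Computing this once
     * per class yields the seven context-score features listed above.
     *
     * @param window    the words in the three-word window around the mention
     * @param classFreq word -> frequency map for one entity class (assumed
     *                  lower-cased), built in the preprocessing step
     */
    public static int contextScore(List<String> window, Map<String, Integer> classFreq) {
        int score = 0;
        for (String word : window) {
            score += classFreq.getOrDefault(word.toLowerCase(), 0);
        }
        return score;
    }
}
```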
2.3.3 Run 3

We wanted to apply a feed-forward neural network (also called a back-propagation network or multilayer perceptron) to our feature set and see how it performs, since this kind of artificial neural network is useful for constructing a function when the complexity of the feature values makes building such a function by hand almost impossible. We took the same features as in Run 2 and employed a feed-forward neural network based regression model with 5 hidden layers.

For the previous two runs, i.e. Run 1 and Run 2, the tags from Stanford NER were considered the primary influence over our classifier's tags, as its accuracy was quite good. For Run 3, however, we omitted the Stanford NER influence and let only the neural network model do the tagging, in order to check the efficiency of our classifier.

2.4 Linking Mentions to DBpedia

We used the Babelfy Java API service [3] to address the task of entity linking to DBpedia 2015-04 resources. Babelfy is a unified, multilingual, graph-based approach to Entity Linking and Word Sense Disambiguation, based on a loose identification of candidate meanings coupled with a densest-subgraph heuristic that selects high-coherence semantic interpretations [4]. The Babelfy parameters that we tuned according to our preferences are:
- setAnnotationType was set to identify both concepts and named entities;
- setMatchingType was set to exact matching;
- setMultiTokenExpression was switched on to identify multi-word tokens;
- setScoredCandidates was set so that only the top-scored candidate is obtained from the disambiguation list.
The rest of the parameters were kept at their default values.

The named entities identified by both Babelfy and the ARK Tagger were passed to the linking stage. Initially, we provided the original tweet texts as input to Babelfy, and observed that the number of named entities and concepts recognized and linked solely by the Babelfy service was quite low: the named entity recognition suffered because of the noisy nature of the tweet text. However, the accuracy of the linked resources was satisfactory. So we modified our system by altering the tweets slightly: we removed the # and @ characters and considered only the letters of an entity already recognized (tagged) by the ARK tagger. After successfully linking such named entities, we searched for further entities that were syntactically similar to the previously known ones, linked these new entities to the corresponding DBpedia resources and also obtained their disambiguation scores.
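A minimal sketch of the configuration above, assuming the babelfy-online Java client; the package and enum names follow that client's documented examples and may differ across versions.

```java
import it.uniroma1.lcl.babelfy.commons.BabelfyParameters;
import it.uniroma1.lcl.babelfy.commons.BabelfyParameters.MatchingType;
import it.uniroma1.lcl.babelfy.commons.BabelfyParameters.ScoredCandidates;
import it.uniroma1.lcl.babelfy.commons.BabelfyParameters.SemanticAnnotationType;
import it.uniroma1.lcl.babelfy.commons.annotation.SemanticAnnotation;
import it.uniroma1.lcl.babelfy.core.Babelfy;
import it.uniroma1.lcl.jlt.util.Language;

import java.util.List;

public class BabelfyLinker {
    public static void link(String tweet) {
        BabelfyParameters params = new BabelfyParameters();
        params.setAnnotationType(SemanticAnnotationType.ALL);   // concepts + named entities
        params.setMatchingType(MatchingType.EXACT_MATCHING);    // exact matching
        params.setMultiTokenExpression(true);                   // multi-word mentions
        params.setScoredCandidates(ScoredCandidates.TOP);       // keep top candidate only

        Babelfy babelfy = new Babelfy(params);
        List<SemanticAnnotation> annotations = babelfy.babelfy(tweet, Language.EN);
        for (SemanticAnnotation a : annotations) {
            // Each annotation carries a DBpedia URL (possibly empty) and scores.
            System.out.println(a.getDBpediaURL() + "\t" + a.getScore());
        }
    }
}
```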
2.5 Clustering of NIL Mentions

The entities which could not be linked to any existing DBpedia resource are given NIL identifiers, and each NIL identifier is reused when multiple mentions in the text represent the same entity. We considered only a spelling-based approach to calculate the similarity between entities: two unlinked entities are taken to be the same if one of them contains the other (comparing letters only). In that case, the new entity is assigned the same NIL identifier as the previous one.

3. RESULTS

We evaluated our approach on the development set of 100 tweets made available by the organizers. Table 1 reports the official metrics for entity detection, tagging, clustering and linking. The precision, recall and F1 scores for the three runs show that Run 3 produces the best results for the task, with F1 scores of 0.674, 0.380, 0.252 and 0.646 in the categories Strong Mention Match, Strong Typed Mention Match, Strong Link Match and Mention CEAF respectively.

Table 1: Summary of Experimental Results

Run 1                       Precision  Recall  F1
Strong Mention Match        0.729      0.626   0.674
Strong Typed Mention Match  0.301      0.259   0.278
Strong Link Match           0.586      0.161   0.252
Mention CEAF                0.699      0.600   0.646

Run 2                       Precision  Recall  F1
Strong Mention Match        0.729      0.626   0.674
Strong Typed Mention Match  0.144      0.124   0.133
Strong Link Match           0.586      0.161   0.252
Mention CEAF                0.699      0.600   0.646

Run 3                       Precision  Recall  F1
Strong Mention Match        0.729      0.626   0.674
Strong Typed Mention Match  0.411      0.353   0.380
Strong Link Match           0.586      0.161   0.252
Mention CEAF                0.699      0.600   0.646

While all runs yield the same scores in the other categories, in Strong Typed Mention Match we observe a better result for our feed-forward neural network model. Our systems for the three runs differ only in the entity type classification module, while all other subtasks are handled identically in all three cases. This explains the identical scores in the last two categories, which are mainly the evaluation metrics for linking and NIL clustering.

4. CONCLUSION

In this paper, we have described our approach to the #Microposts2016 Named Entity rEcognition and Linking (NEEL) challenge. We have developed a hybrid system that uses existing named entity recognizers and a Twitter-specific part-of-speech tagger in conjunction with a classifier developed by us. The named entity linking was done mainly with Babelfy, which functions as a multilingual encyclopedic dictionary and semantic network. The performance of our system suffered because of certain time restrictions: the classification module was slightly biased, and the accuracy of classification suffered as a result. Identifying and selecting better features would have improved the results, and a disambiguation module to treat overlapping classes would also have been useful. The accuracy of the linking would likewise improve by taking a semantic similarity approach, using synonym sets for the mentions or the overlap of context words from those sets during NIL clustering.

5. REFERENCES

[1] K. Gimpel, N. Schneider, B. O'Connor, D. Das, D. Mills, J. Eisenstein, M. Heilman, D. Yogatama, J. Flanigan, and N. A. Smith. Part-of-speech tagging for Twitter: Annotation, features, and experiments. In ACL 2011, pages 42–47, 2011.

[2] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The WEKA data mining software: An update. SIGKDD Explorations, 11:231–244, 2009.

[3] A. Moro, F. Cecconi, and R. Navigli. Multilingual word sense disambiguation and entity linking for everybody. In 13th International Semantic Web Conference, Posters and Demonstrations (ISWC 2014), pages 25–28, Riva del Garda, Italy, 2014.

[4] A. Moro, A. Raganato, and R. Navigli. Entity linking meets word sense disambiguation: A unified approach. Transactions of the Association for Computational Linguistics (TACL), 2:231–244, 2014.

[5] D. Nadeau. Semi-Supervised Named Entity Recognition: Learning to Recognize 100 Entity Types with Little Supervision. PhD thesis, Ottawa-Carleton Institute for Computer Science, School of Information Technology and Engineering, 2007.

[6] D. Preoţiuc-Pietro, D. Radovanović, A. E. Cano-Basave, K. Weller, and A.-S. Dadzie, editors. Proceedings of the 6th Workshop on Making Sense of Microposts (#Microposts2016): Big things come in small packages, Montréal, Canada, April 11, 2016.

[7] A. Ritter, S. Clark, Mausam, and O. Etzioni. Named entity recognition in tweets: An experimental study. In EMNLP 2011, pages 1524–1534, 2011.

[8] G. Rizzo, M. van Erp, J. Plu, and R. Troncy. Making Sense of Microposts (#Microposts2016) Named Entity rEcognition and Linking (NEEL) Challenge. In Preoţiuc-Pietro et al. [6], pages 50–59.

[9] M. A. Yosef, S. Bauer, J. Hoffart, M. Spaniol, and G. Weikum. HYENA: Hierarchical type classification for entity names. In Proceedings of COLING 2012: Posters, pages 1361–1370, Mumbai, 2012.