Named Entity Extraction and Linking in #Microposts

Priyanka Sinha
TCS Innovation Lab Kolkata
Indian Institute of Technology Kharagpur
priyanka27.s@tcs.com

Biswanath Barik
TCS Innovation Lab Kolkata
Tata Consultancy Services Limited
biswanath.barik@tcs.com

ABSTRACT
The Named Entity Extraction and Linking (NEEL) challenge 2015 [5] is treated as two successive tasks: Named Entity Extraction (NEE) from tweets and Named Entity Linking (NEL) to DBpedia. For the NEE task we use CRF++ [1] to train a sequence labeling model on the given training data. For entity linking, we use DBpedia Spotlight [2].

Categories and Subject Descriptors
H.4 [Information Systems Applications]: Miscellaneous

General Terms
Experiment

Keywords
Twitter, Entity, Linking, Social Media, DBpedia

1. INTRODUCTION
Information Extraction (IE) from short messages or microblogs such as tweets is an emerging field of research, driven by commercial applications such as e-commerce and recommendation, and by social-administration applications such as social security. Entity linking (or entity resolution) is one such task: it identifies and extracts the named entities mentioned in tweets and disambiguates them by linking each to the correct reference entity in a knowledge base.

The entity linking problem is well explored on normal text. However, existing entity linking techniques do not work well on short messages, as microblogs do not provide enough context to classify (or disambiguate) the mentions. In this work we identify mentions by training an entity recognition model on the given training data and link them to DBpedia using DBpedia Spotlight [2].

The rest of the paper is organized as follows. Section 2 describes our proposed approach, covering data preparation and feature selection for building the named entity recognition model, and the entity linking method. Section 3 describes the setup, including web access. Section 4 discusses the results of our work. Section 5 outlines the future scope of our work, followed by the references.

2. METHODOLOGY
We divide the Named Entity Extraction and Linking (NEEL) [5] task into two consecutive subtasks, namely Named Entity Extraction and Named Entity Linking.

2.1 Named Entity Extraction
The NER task is viewed here as a sequence labeling problem. Given an input tweet, this step aims to identify the word sequences that constitute a named entity and to classify each such entity into one of the predefined classes. For the recognition and classification task, we train a model on the given training data using Conditional Random Fields (CRFs), an undirected graphical model used mainly for sequence labeling.

As noted above, tweets are short, often noisy and informal, so their syntactic structure is not always comparable to that of normal text. [4] showed that Part-of-Speech (POS) features of surface tokens, shallow parsing (chunking) information, capitalization indicators, etc. are useful for improving NE recognition from tweets, provided these modules are trained on Twitter data. In this experiment, we add POS tag information to the training data using the CMU Twitter NLP tagger [3], and use word features together with binary features (punctuation, digits, dots, hashtags, @, capitalization indicators, presence of URLs, underscores, hyphens, etc.) as indicators for or against named entities when training the NE recognition model. We were motivated to use [3] because it tokenizes tweets well and reliably distinguishes nouns from punctuation and tweet-specific artefacts. We used [1] because it was relatively simple to adapt to our task.

2.1.1 Data Preparation
In the data preparation step, we identify the word sequences referring to a named entity (NE) in the training data using the gold standard. The training data is tokenized, part-of-speech (POS) tagged using the CMU Twitter NLP tagger [3], and converted into BIO format. For example, the NEs identified in tweet ID 100678378755067904, "RT @HadleyFreeman: NOTHING on US news networks about London riots. Can you imagine the BBC ignoring, say, riots in NYC? #americanewsfail", are tagged as follows (token, POS tag, BIO label):

RT                ~  O
@HadleyFreeman    @  B-Person
:                 ~  O
NOTHING           N  O
on                P  O
US                ^  B-Location
news              N  O
networks          N  O
about             P  O
London            ^  B-Event
riots             N  I-Event
.                 ,  O
Can               V  O
you               O  O
imagine           V  O
the               D  O
BBC               ^  B-Organization
ignoring          V  O
,                 ,  O
say               V  O
,                 ,  O
riots             N  O
in                P  O
NYC               ^  B-Location
?                 ,  O
#americanewsfail  #  O

2.1.2 Feature Selection
We experimented with various feature types, context window lengths, and their combinations, and arrived at the feature set below, which gave us good results; of the context window lengths we tried, five worked best. (A CRF++ template sketch encoding these features follows the list.)

• Contextual (word) features: a context window of size five: Wi-2, Wi-1, Wi, Wi+1, Wi+2
• Part-of-Speech (POS) features: a context window of size five: Pi-2, Pi-1, Pi, Pi+1, Pi+2
• Word has capitalization: binary feature
• Word has punctuation: binary feature
• Word is a digit: binary feature
• Word has a dot: binary feature
• Word has a hashtag: binary feature
• Word has @: binary feature
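CRF++ consumes features through a template file rather than hand-written code. The following is a minimal template sketch capturing the feature set above; it assumes the training file lays out the token in column 0, the POS tag in column 1, and the six binary features in columns 2-7, with the BIO label last (this column layout is illustrative, not necessarily the exact one we used):

```
# Token window of +/-2 (column 0 = token)
U00:%x[-2,0]
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]
# POS window of +/-2 (column 1 = POS tag)
U10:%x[-2,1]
U11:%x[-1,1]
U12:%x[0,1]
U13:%x[1,1]
U14:%x[2,1]
# Binary features of the current token
# (columns 2-7: capitalization, punctuation, digit, dot, hashtag, @)
U20:%x[0,2]
U21:%x[0,3]
U22:%x[0,4]
U23:%x[0,5]
U24:%x[0,6]
U25:%x[0,7]
# Bigram over adjacent output labels
B
```

With such a template, training and tagging reduce to CRF++'s standard commands, crf_learn template train.data model and crf_test -m model test.data.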
2.2 Named Entity Linking
For linking, we use the annotations returned by the DBpedia Spotlight REST API as candidates and look for the longest matching surface forms.

We take the output of the NEE task and collect the extracted named entities together with their categories. To identify the correct start position of each mention we check for # and @, and using the B/I tags we find the longest run of consecutive tokens that makes up a single entity; in the tweet above, for example, "London riots" is treated as a single entity. For each tweet, the DBpedia Spotlight annotate endpoint is accessed once, with confidence and support both set to 0 and the response accepted as XML, so that all candidate links are obtained at once. For each entity returned by DBpedia Spotlight, if its surface form is found to be a substring of any of our extracted entities, the corresponding URI is returned. For extracted entities with no such match, if the entity is an already known NIL entity its NIL id is returned; otherwise the NIL counter is incremented and a new NIL id is assigned and returned.
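The following is a minimal sketch of this linking step in Python (our actual pipeline uses JSP, Perl and curl, as described in Section 3). The endpoint URL, the XML attribute names, and the nil_ids/nil_counter bookkeeping are illustrative assumptions:

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

SPOTLIGHT = "http://localhost:2222/rest/annotate"   # assumed local endpoint

def spotlight_candidates(tweet):
    """Query Spotlight's annotate endpoint with confidence and support 0,
    request XML back, and return (surfaceForm, URI) candidate pairs."""
    query = urllib.parse.urlencode(
        {"text": tweet, "confidence": 0, "support": 0})
    req = urllib.request.Request(SPOTLIGHT + "?" + query,
                                 headers={"Accept": "text/xml"})
    with urllib.request.urlopen(req) as resp:
        root = ET.fromstring(resp.read())
    return [(r.get("surfaceForm"), r.get("URI"))
            for r in root.iter("Resource")]

def link(entity, candidates, nil_ids, nil_counter):
    """Pick the URI whose surface form is the longest substring match of
    the extracted entity; fall back to an existing or fresh NIL id."""
    best = None
    for surface, uri in candidates:
        if surface and surface in entity:
            if best is None or len(surface) > len(best[0]):
                best = (surface, uri)
    if best is not None:
        return best[1], nil_counter
    if entity not in nil_ids:            # unseen entity: mint a new NIL id
        nil_counter += 1
        nil_ids[entity] = "NIL%d" % nil_counter
    return nil_ids[entity], nil_counter  # known NIL entity: reuse its id
```

For each tweet, spotlight_candidates would be called once and every entity extracted by the NEE step then linked against the returned candidates.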
3. SETUP
We used Perl for transforming the data, the CMU Twitter NLP package [3] for generating POS tags, the CRF++ package [1] for sequence labeling, and the DBpedia Spotlight [2] REST API for linking.

3.1 Web Access
We use JSP to create our own REST API; the JSP layer invokes Perl, which in turn uses curl to connect to the DBpedia Spotlight [2] REST endpoints.

4. EVALUATION
On the training set itself, strong link match precision is 30.49%, recall 30.29%, and F1 30.39%; for tagging the correct entity type, precision is 82.89%, recall 82.35%, and F1 82.62%.

On the development set, strong link match precision is 14.82%, recall 7.97%, and F1 10.37%; for tagging the correct entity type, precision is 41.65%, recall 22.41%, and F1 29.14%.

5. FUTURE WORK
As the results show, the combination of the CMU POS tagger [3] and CRF++ [1] discovers entities well, but the way we do linking needs more work.

6. REFERENCES
[1] CRF++: Yet Another CRF Toolkit.
[2] J. Daiber, M. Jakob, C. Hokamp, and P. N. Mendes. Improving efficiency and accuracy in multilingual entity extraction. In Proceedings of the 9th International Conference on Semantic Systems (I-Semantics), 2013.
[3] O. Owoputi, B. O'Connor, C. Dyer, K. Gimpel, N. Schneider, and N. A. Smith. Improved part-of-speech tagging for online conversational text with word clusters. In Proceedings of NAACL, 2013.
[4] A. Ritter, S. Clark, Mausam, and O. Etzioni. Named entity recognition in tweets: An experimental study. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 1524-1534, Edinburgh, Scotland, UK, July 2011.
[5] G. Rizzo, A. E. Cano Basave, B. Pereira, and A. Varga. Making Sense of Microposts (#Microposts2015) Named Entity rEcognition and Linking (NEEL) Challenge. In M. Rowe, M. Stankovic, and A.-S. Dadzie, editors, 5th Workshop on Making Sense of Microposts (#Microposts2015), pages 44-53, 2015.