Named Entity Extraction and Linking in #Microposts

Priyanka Sinha
TCS Innovation Lab Kolkata
Indian Institute of Technology Kharagpur
priyanka27.s@tcs.com

Biswanath Barik
TCS Innovation Lab Kolkata
Tata Consultancy Services Limited
biswanath.barik@tcs.com

ABSTRACT
The Named Entity Extraction and Linking (NEEL) challenge 2015 [5] is treated as two successive tasks: Named Entity Extraction (NEE) from tweets and Named Entity Linking (NEL) to DBpedia. For the NEE task we use CRF++ [1] to train a sequence labeling model on the given training data. For entity linking, we use DBpedia Spotlight [2].

Categories and Subject Descriptors
H.4 [Information Systems Applications]: Miscellaneous

General Terms
Experiment

Keywords
Twitter, Entity, Linking, Social Media, DBpedia

1. INTRODUCTION
Information Extraction (IE) from short messages or microblogs such as tweets is an emerging field of research, driven by commercial applications such as e-commerce and recommendation, and by social-administration applications such as social security. Entity linking (or entity resolution) is one such task: it identifies and extracts the named entities mentioned in tweets and disambiguates them by linking each to the correct reference entity in a knowledge base.

The entity linking problem is well explored on normal text. However, existing entity linking techniques do not work well on short messages, as microblogs do not provide enough context to classify (or disambiguate) the mentions. In this work we identify mentions by training an entity recognition model on the given training data and link them to DBpedia using DBpedia Spotlight [2].

The rest of the paper is organized as follows. Section 2 describes our proposed approach, covering data preparation and feature selection for building the named entity recognition model, and the entity linking method. Section 3 describes the setup, including web access. Section 4 discusses the results of our work. Section 5 outlines the future scope of our work, followed by the references.

2. METHODOLOGY
We divide the Named Entity Extraction and Linking (NEEL) [5] task into two consecutive subtasks, namely Named Entity Extraction and Named Entity Linking.

2.1 Named Entity Extraction
The NER task is viewed here as a sequence labeling problem. Given an input tweet, this step aims to identify the word sequences that constitute a named entity and to classify each such entity into one of the predefined classes. For the recognition and classification task, we train a model on the given training data using Conditional Random Fields (CRFs), an undirected graphical model used mainly for sequence labeling.

As noted above, tweets are short, often noisy and informal, so their syntactic structure is not always comparable to that of normal text. [4] showed that Part-of-Speech (POS) features of surface tokens, shallow parsing (chunking) information, capitalization indicators, etc. are useful for improving NE recognition from tweets, provided these modules are trained on Twitter data. In this experiment, we add POS tag information to the training data using the CMU Twitter NLP tagger [3], and use word features together with binary features (punctuation, digits, dots, hashtags, @, capitalization indicators, presence of URLs, underscores, hyphens, etc.) as indicators for or against named entities when training the NE recognition model. We were motivated to use [3] because it tokenizes tweets well and reliably distinguishes nouns from punctuation and tweet-specific artefacts. We used [1] because it was relatively simple to adapt to our task.

2.1.1 Data Preparation
In the data preparation step, we identify the word sequences referring to a named entity (NE) in the training data using the gold standard. The training data is tokenized, part-of-speech (POS) tagged using the CMU Twitter NLP tagger [3], and converted into BIO format. For example, the NEs identified in tweet ID 100678378755067904, "RT @HadleyFreeman: NOTHING on US news networks about London riots. Can you imagine the BBC ignoring, say, riots in NYC? #americanewsfail", are tagged as follows (token, POS tag, BIO label):

RT                ~  O
@HadleyFreeman    @  B-Person
:                 ~  O
NOTHING           N  O
on                P  O
US                ^  B-Location
news              N  O
networks          N  O
about             P  O
London            ^  B-Event
riots             N  I-Event
.                 ,  O
Can               V  O
you               O  O
imagine           V  O
the               D  O
BBC               ^  B-Organization
ignoring          V  O
,                 ,  O
say               V  O
,                 ,  O
riots             N  O
in                P  O
NYC               ^  B-Location
?                 ,  O
#americanewsfail  #  O

2.1.2 Feature Selection
We experimented with various feature types, context window lengths, and their combinations, and arrived at the feature set below, which gave us good results; of the context window lengths we tried, five worked best. (A CRF++ template sketch encoding these features follows the list.)

• Contextual (word) features: a context window of size five: Wi-2, Wi-1, Wi, Wi+1, Wi+2
• Part-of-Speech (POS) features: a context window of size five: Pi-2, Pi-1, Pi, Pi+1, Pi+2
• Word has capitalization: binary feature
• Word has punctuation: binary feature
• Word is a digit: binary feature
• Word has a dot: binary feature
• Word has a hashtag: binary feature
• Word has @: binary feature
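CRF++ consumes features through a template file rather than hand-written code. The following is a minimal template sketch capturing the feature set above; it assumes the training file lays out the token in column 0, the POS tag in column 1, and the six binary features in columns 2-7, with the BIO label last (this column layout is illustrative, not necessarily the exact one we used):

```
# Token window of +/-2 (column 0 = token)
U00:%x[-2,0]
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]
# POS window of +/-2 (column 1 = POS tag)
U10:%x[-2,1]
U11:%x[-1,1]
U12:%x[0,1]
U13:%x[1,1]
U14:%x[2,1]
# Binary features of the current token
# (columns 2-7: capitalization, punctuation, digit, dot, hashtag, @)
U20:%x[0,2]
U21:%x[0,3]
U22:%x[0,4]
U23:%x[0,5]
U24:%x[0,6]
U25:%x[0,7]
# Bigram over adjacent output labels
B
```

With such a template, training and tagging reduce to CRF++'s standard commands, crf_learn template train.data model and crf_test -m model test.data.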
2.2 Named Entity Linking
For linking, we use the annotations returned by the DBpedia Spotlight REST API as candidates and look for the longest matching surface forms.

We take the output of the NEE task and collect the extracted named entities together with their categories. To identify the correct start position of each mention we check for # and @, and using the B/I tags we find the longest run of consecutive tokens that makes up a single entity; in the tweet above, for example, "London riots" is treated as a single entity. For each tweet, the DBpedia Spotlight annotate endpoint is accessed once, with confidence and support both set to 0 and the response accepted as XML, so that all candidate links are obtained at once. For each entity returned by DBpedia Spotlight, if its surface form is found to be a substring of any of our extracted entities, the corresponding URI is returned. For extracted entities with no such match, if the entity is an already known NIL entity its NIL id is returned; otherwise the NIL counter is incremented and a new NIL id is assigned and returned.
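The following is a minimal sketch of this linking step in Python (our actual pipeline uses JSP, Perl and curl, as described in Section 3). The endpoint URL, the XML attribute names, and the nil_ids/nil_counter bookkeeping are illustrative assumptions:

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

SPOTLIGHT = "http://localhost:2222/rest/annotate"   # assumed local endpoint

def spotlight_candidates(tweet):
    """Query Spotlight's annotate endpoint with confidence and support 0,
    request XML back, and return (surfaceForm, URI) candidate pairs."""
    query = urllib.parse.urlencode(
        {"text": tweet, "confidence": 0, "support": 0})
    req = urllib.request.Request(SPOTLIGHT + "?" + query,
                                 headers={"Accept": "text/xml"})
    with urllib.request.urlopen(req) as resp:
        root = ET.fromstring(resp.read())
    return [(r.get("surfaceForm"), r.get("URI"))
            for r in root.iter("Resource")]

def link(entity, candidates, nil_ids, nil_counter):
    """Pick the URI whose surface form is the longest substring match of
    the extracted entity; fall back to an existing or fresh NIL id."""
    best = None
    for surface, uri in candidates:
        if surface and surface in entity:
            if best is None or len(surface) > len(best[0]):
                best = (surface, uri)
    if best is not None:
        return best[1], nil_counter
    if entity not in nil_ids:            # unseen entity: mint a new NIL id
        nil_counter += 1
        nil_ids[entity] = "NIL%d" % nil_counter
    return nil_ids[entity], nil_counter  # known NIL entity: reuse its id
```

For each tweet, spotlight_candidates would be called once and every entity extracted by the NEE step then linked against the returned candidates.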
3. SETUP
We used Perl for transforming the data, the CMU Twitter NLP package [3] for generating POS tags, the CRF++ package [1] for sequence labeling, and the DBpedia Spotlight [2] REST API for linking.

3.1 Web Access
We use JSP to create our own REST API; the JSP layer invokes Perl, which in turn uses curl to connect to the DBpedia Spotlight [2] REST endpoints.

4. EVALUATION
On the training set itself, strong link match precision is 30.49%, recall 30.29%, and F1 30.39%; for tagging the correct entity type, precision is 82.89%, recall 82.35%, and F1 82.62%.

On the development set, strong link match precision is 14.82%, recall 7.97%, and F1 10.37%; for tagging the correct entity type, precision is 41.65%, recall 22.41%, and F1 29.14%.

5. FUTURE WORK
As the results show, the combination of the CMU POS tagger [3] and CRF++ [1] discovers entities well, but the way we do linking needs more work.

6. REFERENCES
[1] CRF++: Yet Another CRF Toolkit.
[2] J. Daiber, M. Jakob, C. Hokamp, and P. N. Mendes. Improving efficiency and accuracy in multilingual entity extraction. In Proceedings of the 9th International Conference on Semantic Systems (I-Semantics), 2013.
[3] O. Owoputi, B. O'Connor, C. Dyer, K. Gimpel, N. Schneider, and N. A. Smith. Improved part-of-speech tagging for online conversational text with word clusters. In Proceedings of NAACL, 2013.
[4] A. Ritter, S. Clark, Mausam, and O. Etzioni. Named entity recognition in tweets: An experimental study. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 1524-1534, Edinburgh, Scotland, UK, July 2011.
[5] G. Rizzo, A. E. Cano Basave, B. Pereira, and A. Varga. Making Sense of Microposts (#Microposts2015) Named Entity rEcognition and Linking (NEEL) Challenge. In M. Rowe, M. Stankovic, and A.-S. Dadzie, editors, 5th Workshop on Making Sense of Microposts (#Microposts2015), pages 44-53, 2015.