Feature Based Approach to Named Entity Recognition and Linking for Tweets

Souvick Ghosh, Promita Maitra, Dipankar Das
Jadavpur University, Kolkata, West Bengal 700032, India
souvick.gh@gmail.com, promita.maitra@gmail.com, dipankar.dipnil2005@gmail.com

ABSTRACT

In this paper, we describe our approach to the Named Entity rEcognition and Linking (NEEL) Challenge at #Microposts2016. The task is to automatically recognize entities and their types in English microposts and to link them to the corresponding DBpedia 2015 entries; where no such resource exists, we assign NIL identifiers instead. The task is challenging because Twitter data is informal in nature, with non-standard spellings, random contractions and various other kinds of noise. For this task, we developed our system using a hybrid model: we used several existing named entity recognition (NER) systems and combined them with our own classifier to improve the results.

Keywords

Named Entity Extraction; Named Entity Linking; Social Media; DBpedia; Twitter

1. INTRODUCTION

In the present-day world, the relevance and importance of various social media platforms are immeasurable. Microposts such as tweets are limited in their number of characters; however, the conciseness of the text is barely a pointer to its usefulness. From opinion mining during political campaigns to live feeds during sports events, from product reviews to vacation posts, Twitter is almost ubiquitous. Twitter promotes instant communication. Most celebrities use it to build their own digital presence, and it also serves as a common forum where people can rise from obscurity to prominence by sharing their opinions.

If we compare microposts to standard long documents such as blog posts or news articles, there are a number of differences. Long articles are usually well written: they follow a definite structure, include headings and obey the rules of English grammar. Microposts, on the other hand, are short, noisy and show hardly any adherence to formal grammar. The presence of extraneous characters such as hashtags and abbreviations, together with the lack of structure and context, makes it difficult to extract relevant information. Due to this complexity, existing named entity recognition (NER) systems do not perform very well on tweet data.

In the NEEL challenge [8] of #Microposts2016 [6], we were required to automatically identify named entities and their types in Twitter data and link them to the corresponding URIs of the DBpedia 2015-04 dataset (http://wiki.dbpedia.org/dbpedia-data-set-2015-04). Identifying named entities and linking them to an existing knowledge base enriches the text with more contextual and semantic information. Mentions which could not be linked to any existing DBpedia resource page were recognized as NIL mentions. These mentions were clustered to ensure that the same entity, when it has no corresponding entry in DBpedia, is referenced with the same NIL identifier.

We have developed three systems for the NEEL challenge, the major difference between them being the features used in each run. Our system follows a hybrid approach in which the Stanford Named Entity Recognition system is used to identify entity mentions. In the next step, we run the ARK Twitter Part-of-Speech Tagger to identify the mentions that were missed earlier. We use our own classifier to detect the types of the mentions, and the named entity linking to DBpedia resources is done using Babelfy (http://babelfy.org/). It must be noted that we followed a feature-based approach for the NEEL challenge, and that we combined the existing tools for Named Entity Recognition and Linking. Each of these tools, such as the Stanford NER, the ARK Part-of-Speech Tagger and Babelfy, is state-of-the-art; we explored their strengths and weaknesses in our work.

2. OUR SYSTEM

Our system follows four steps in a pipeline, as shown in Figure 1: mention detection in two stages, followed by mention type classification, mention linking and NIL clustering.

[Figure 1: Workflow of the system.]

2.1 Preprocessing

From the training data, the mentions referring to the 7 types of entities were extracted to form 7 bags of words. Using these initial words as seeds, the Wikipedia dumps were crawled to expand the sets of words. These lists represent potential candidates for named entity mentions.

2.2 Detection of Entity Mentions

In this step, the named entity mentions in the given tweets are identified using two different approaches.

2.2.1 Using Stanford Named Entity Recognizer

The Stanford Named Entity Recognizer (http://nlp.stanford.edu/software/CRF-NER.shtml) was used to extract named entities. It is a CRF classifier implementing a linear-chain Conditional Random Field. We use the 3-class model to extract named entities belonging to the classes Location, Person and Organization. While the recall of Stanford NER was very low, its precision was quite good.
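To make this step concrete, the following is a minimal Java sketch of how the 3-class model can be queried through the Stanford NER API; the model path and the example tweet are illustrative placeholders rather than our exact setup.

```java
import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.util.Triple;

import java.util.List;

public class StanfordMentionDetector {
    public static void main(String[] args) {
        // Load the 3-class model (Location, Person, Organization) shipped
        // with the Stanford NER distribution; the path is a placeholder.
        CRFClassifier<CoreLabel> classifier = CRFClassifier.getClassifierNoExceptions(
                "classifiers/english.all.3class.distsim.crf.ser.gz");

        String tweet = "Obama meets Putin in Paris"; // placeholder input

        // classifyToCharacterOffsets returns one (label, begin, end) triple
        // per detected entity span, with the end offset exclusive.
        List<Triple<String, Integer, Integer>> spans =
                classifier.classifyToCharacterOffsets(tweet);
        for (Triple<String, Integer, Integer> span : spans) {
            String mention = tweet.substring(span.second(), span.third());
            System.out.println(mention + " -> " + span.first());
        }
    }
}
```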
2.2.2 Using ARK Twitter Part-of-Speech Tagger

The tweets were tokenized and assigned part-of-speech tags using the ARK Twitter Part-of-Speech Tagger [1]. We used the Twitter POS model with its 25-tag tagset. The proper nouns (corresponding to NNP and NNPS, tagged ^) and possessive proper nouns (tagged Z), along with hashtags (tagged #) and at-mentions (tagged @), were extracted as probable candidates for named entity mentions. The mentions that were already identified by Stanford NER are not considered in the classification step, as they are already typed by that tagger. The rest of the mentions are classified by our own classifier in the next step.
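This candidate extraction can be sketched as follows, assuming the ark-tweet-nlp Java API (cmu.arktweetnlp.Tagger) and the model file bundled with its distribution; the helper class and method names are ours.

```java
import cmu.arktweetnlp.Tagger;
import cmu.arktweetnlp.Tagger.TaggedToken;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class ArkCandidateExtractor {
    // Tags marking probable entity candidates in the 25-tag tagset:
    // ^ = proper noun, Z = proper noun + possessive, # = hashtag, @ = at-mention.
    private static final String CANDIDATE_TAGS = "^Z#@";

    public static List<String> candidates(String tweet) throws IOException {
        Tagger tagger = new Tagger();
        // Model file shipped inside the ark-tweet-nlp jar.
        tagger.loadModel("/cmu/arktweetnlp/model.20120919");

        List<String> mentions = new ArrayList<>();
        for (TaggedToken tt : tagger.tokenizeAndTag(tweet)) {
            if (CANDIDATE_TAGS.contains(tt.tag)) {
                mentions.add(tt.token);
            }
        }
        return mentions;
    }
}
```

In practice the tagger would be loaded once and reused across tweets rather than per call.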
2.3 Classification of Entity Types

Using the machine learning toolkit WEKA [2], we formed a feature set from the features listed below and used the Random Forest classifier, an ensemble of decision trees, for the 7-way classification of the named entity mentions into Thing, Event, Character, Location, Organization, Person and Product, providing the noun entities identified in the previous steps as input. We compared the accuracy of various classifiers, such as Naïve Bayes, k-Nearest Neighbour and Support Vector Machine, on the training data with 10-fold cross-validation; Random Forest gave the best results.

2.3.1 Features for Run 1

The twelve features used for Run 1 were as follows:
- Length of the mention string
- Whether the mention is all capitalized
- Whether the mention contains mixed case
- Whether the mention contains digits
- Whether an internal period is present in the mention string
- Whether the mention is present in the list of Persons
- Whether the mention is present in the list of Things
- Whether the mention is present in the list of Events
- Whether the mention is present in the list of Characters
- Whether the mention is present in the list of Locations
- Whether the mention is present in the list of Organizations
- Whether the mention is present in the list of Products

The above-mentioned lists are the bags of words produced from the training data in the preprocessing step.

2.3.2 Features for Run 2

In Run 1 we made use of various text-based features and the bags of words. In Run 2, we explored contextual features in addition to the features of Run 1, combining ten new features with the previous twelve. The ten additional features were as follows:
- Context score for the Person class
- Context score for the Location class
- Context score for the Character class
- Context score for the Organization class
- Context score for the Event class
- Context score for the Thing class
- Context score for the Product class
- Frequency of the part-of-speech tag of the mention
- Frequency of the previous part-of-speech tag
- Frequency of the next part-of-speech tag

The context score of a particular mention is calculated over a three-word window around the mention. For each class, we have the number of occurrences of each word of that window in the class's bag of words; the feature value we assign is the sum of these frequencies over the words forming the window (a sketch of this computation follows below).
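The following minimal Java sketch shows the context-score computation as we have described it; the per-class frequency map, which would be built from the bags of words of Section 2.1, and the lowercasing of window words are simplifying assumptions.

```java
import java.util.List;
import java.util.Map;

public class ContextScore {
    /**
     * Sums, over the words in the context window of a mention, the frequency
     * of each word in one entity class's bag of words. Computing this once
     * per class yields the seven context-score features listed above.
     *
     * @param window    the words in the three-word window around the mention
     * @param classFreq word -> frequency map for one entity class (assumed
     *                  lower-cased), built in the preprocessing step
     */
    public static int contextScore(List<String> window, Map<String, Integer> classFreq) {
        int score = 0;
        for (String word : window) {
            score += classFreq.getOrDefault(word.toLowerCase(), 0);
        }
        return score;
    }
}
```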
2.3.3 Run 3

We wanted to apply a feed-forward neural network (also called a back-propagation network or multilayer perceptron) to our feature set and see how it performs, since this kind of artificial neural network is useful for constructing a function when the complexity of the feature values makes building such a function by hand almost impossible. We took the same features as in Run 2 and employed a feed-forward neural network based regression model with 5 hidden layers.

For the previous two runs, i.e. Run 1 and Run 2, the tags from Stanford NER were considered the primary influence over our classifier's tags, as its accuracy was quite good. For Run 3, however, we omitted the Stanford NER influence and let only the neural network model do the tagging, in order to check the efficiency of our classifier.

2.4 Linking Mentions to DBpedia

We used the Babelfy Java API service [3] to address the task of entity linking to DBpedia 2015-04 resources. Babelfy is a unified, multilingual, graph-based approach to Entity Linking and Word Sense Disambiguation, based on a loose identification of candidate meanings coupled with a densest-subgraph heuristic that selects high-coherence semantic interpretations [4]. The Babelfy parameters that we tuned according to our preferences are:
- setAnnotationType was set to identify both concepts and named entities;
- setMatchingType was set to exact matching;
- setMultiTokenExpression was switched on to identify multi-word tokens;
- setScoredCandidates was set so that only the top-scored candidate is obtained from the disambiguation list.
The rest of the parameters were kept at their default values.

The named entities identified by both Babelfy and the ARK Tagger were passed to the linking stage. Initially, we provided the original tweet texts as input to Babelfy, and observed that the number of named entities and concepts recognized and linked solely by the Babelfy service was quite low: the named entity recognition suffered because of the noisy nature of the tweet text. However, the accuracy of the linked resources was satisfactory. So we modified our system by altering the tweets slightly: we removed the # and @ characters and considered only the letters of an entity already recognized (tagged) by the ARK tagger. After successfully linking such named entities, we searched for further entities that were syntactically similar to the previously known ones, linked these new entities to the corresponding DBpedia resources and also obtained their disambiguation scores.
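A minimal sketch of the configuration above, assuming the babelfy-online Java client; the package and enum names follow that client's documented examples and may differ across versions.

```java
import it.uniroma1.lcl.babelfy.commons.BabelfyParameters;
import it.uniroma1.lcl.babelfy.commons.BabelfyParameters.MatchingType;
import it.uniroma1.lcl.babelfy.commons.BabelfyParameters.ScoredCandidates;
import it.uniroma1.lcl.babelfy.commons.BabelfyParameters.SemanticAnnotationType;
import it.uniroma1.lcl.babelfy.commons.annotation.SemanticAnnotation;
import it.uniroma1.lcl.babelfy.core.Babelfy;
import it.uniroma1.lcl.jlt.util.Language;

import java.util.List;

public class BabelfyLinker {
    public static void link(String tweet) {
        BabelfyParameters params = new BabelfyParameters();
        params.setAnnotationType(SemanticAnnotationType.ALL);   // concepts + named entities
        params.setMatchingType(MatchingType.EXACT_MATCHING);    // exact matching
        params.setMultiTokenExpression(true);                   // multi-word mentions
        params.setScoredCandidates(ScoredCandidates.TOP);       // keep top candidate only

        Babelfy babelfy = new Babelfy(params);
        List<SemanticAnnotation> annotations = babelfy.babelfy(tweet, Language.EN);
        for (SemanticAnnotation a : annotations) {
            // Each annotation carries a DBpedia URL (possibly empty) and scores.
            System.out.println(a.getDBpediaURL() + "\t" + a.getScore());
        }
    }
}
```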
2.5 Clustering of NIL Mentions

The entities which could not be linked to any existing DBpedia resource are given NIL identifiers, and each NIL identifier is reused when multiple mentions in the text represent the same entity. We considered only a spelling-based approach to calculate the similarity between entities: two unlinked entities are taken to be the same if one of them contains the other (comparing letters only). In that case, the new entity is assigned the same NIL identifier as the previous one.

3. RESULTS

We evaluated our approach on the development set of 100 tweets made available by the organizers. Table 1 reports the official metrics for entity detection, tagging, clustering and linking. The precision, recall and F1 scores for the three runs show that Run 3 produces the best results for the task, with F1 scores of 0.674, 0.380, 0.252 and 0.646 in the categories Strong Mention Match, Strong Typed Mention Match, Strong Link Match and Mention CEAF respectively.

Table 1: Summary of Experimental Results

Run 1                       Precision  Recall  F1
Strong Mention Match        0.729      0.626   0.674
Strong Typed Mention Match  0.301      0.259   0.278
Strong Link Match           0.586      0.161   0.252
Mention CEAF                0.699      0.600   0.646

Run 2                       Precision  Recall  F1
Strong Mention Match        0.729      0.626   0.674
Strong Typed Mention Match  0.144      0.124   0.133
Strong Link Match           0.586      0.161   0.252
Mention CEAF                0.699      0.600   0.646

Run 3                       Precision  Recall  F1
Strong Mention Match        0.729      0.626   0.674
Strong Typed Mention Match  0.411      0.353   0.380
Strong Link Match           0.586      0.161   0.252
Mention CEAF                0.699      0.600   0.646

While all runs yield the same scores in the other categories, in Strong Typed Mention Match we observe a better result for our feed-forward neural network model. Our systems for the three runs differ only in the entity type classification module, while all other subtasks are handled identically in all three cases. This explains the identical scores in the last two categories, which are mainly the evaluation metrics for linking and NIL clustering.

4. CONCLUSION

In this paper, we have described our approach to the #Microposts2016 Named Entity rEcognition and Linking (NEEL) challenge. We have developed a hybrid system that uses existing named entity recognizers and a Twitter-specific part-of-speech tagger in conjunction with a classifier developed by us. The named entity linking was done mainly with Babelfy, which functions as a multilingual encyclopedic dictionary and semantic network. The performance of our system suffered because of certain time restrictions: the classification module was slightly biased, and the accuracy of classification suffered as a result. Identifying and selecting better features would have improved the results, and a disambiguation module to treat overlapping classes would also have been useful. The accuracy of the linking would likewise improve by taking a semantic similarity approach, using synonym sets for the mentions or the overlap of context words from those sets during NIL clustering.

5. REFERENCES

[1] K. Gimpel, N. Schneider, B. O'Connor, D. Das, D. Mills, J. Eisenstein, M. Heilman, D. Yogatama, J. Flanigan, and N. A. Smith. Part-of-speech tagging for Twitter: Annotation, features, and experiments. In ACL 2011, pages 42–47, 2011.

[2] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The WEKA data mining software: An update. SIGKDD Explorations, 11:231–244, 2009.

[3] A. Moro, F. Cecconi, and R. Navigli. Multilingual word sense disambiguation and entity linking for everybody. In 13th International Semantic Web Conference, Posters and Demonstrations (ISWC 2014), pages 25–28, Riva del Garda, Italy, 2014.

[4] A. Moro, A. Raganato, and R. Navigli. Entity linking meets word sense disambiguation: A unified approach. Transactions of the Association for Computational Linguistics (TACL), 2:231–244, 2014.

[5] D. Nadeau. Semi-Supervised Named Entity Recognition: Learning to Recognize 100 Entity Types with Little Supervision. PhD thesis, Ottawa-Carleton Institute for Computer Science, School of Information Technology and Engineering, 2007.

[6] D. Preoţiuc-Pietro, D. Radovanović, A. E. Cano-Basave, K. Weller, and A.-S. Dadzie, editors. Proceedings of the 6th Workshop on Making Sense of Microposts (#Microposts2016): Big things come in small packages, Montréal, Canada, April 11, 2016.

[7] A. Ritter, S. Clark, Mausam, and O. Etzioni. Named entity recognition in tweets: An experimental study. In EMNLP 2011, pages 1524–1534, 2011.

[8] G. Rizzo, M. van Erp, J. Plu, and R. Troncy. Making Sense of Microposts (#Microposts2016) Named Entity rEcognition and Linking (NEEL) Challenge. In Preoţiuc-Pietro et al. [6], pages 50–59.

[9] M. A. Yosef, S. Bauer, J. Hoffart, M. Spaniol, and G. Weikum. HYENA: Hierarchical type classification for entity names. In Proceedings of COLING 2012: Posters, pages 1361–1370, Mumbai, 2012.