=Paper=
{{Paper
|id=Vol-1691/paper_11
|storemode=property
|title=A Reverse Approach to Named Entity Extraction and Linking in Microposts
|pdfUrl=https://ceur-ws.org/Vol-1691/paper_11.pdf
|volume=Vol-1691
|authors=Kara Greenfield,Rajmonda Caceres,Michael Coury,Kelly Geyer,Youngjune Gwon,Jason Matterer,Alyssa Mensch,Cem Sahin,Olga Simek
|dblpUrl=https://dblp.org/rec/conf/msm/GreenfieldCCGGM16
}}
==A Reverse Approach to Named Entity Extraction and Linking in Microposts==
A Reverse Approach to Named Entity Extraction and Linking in Microposts*
Kara Greenfield, Rajmonda Caceres, Michael Coury, Kelly Geyer, Youngjune Gwon,
Jason Matterer, Alyssa Mensch, Cem Sahin, Olga Simek
MIT Lincoln Laboratory, 244 Wood St, Lexington MA, United States
{kara.greenfield, rajmonda.caceres, michael.coury, kelly.geyer, gyj, jason.matterer,
alyssa.mensch, cem.sahin, osimek}@ll.mit.edu
*This work was sponsored by the Defense Advanced Research Projects Agency under Air Force Contract FA8721-05-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the United States Government.

Copyright © 2016 held by author(s)/owner(s); copying permitted only for private and academic purposes. Published as part of the #Microposts2016 Workshop proceedings, available online as CEUR Vol-1691 (http://ceur-ws.org/Vol-1691). #Microposts2016, Apr 11th, 2016, Montréal, Canada.

ABSTRACT
In this paper, we present a pipeline for named entity extraction and linking that is designed specifically for noisy, grammatically inconsistent domains where traditional named entity techniques perform poorly. Our approach leverages a large knowledge base to improve entity recognition, while maintaining the use of traditional NER to identify mentions that are not co-referent with any entities in the knowledge base.

Keywords
Named entity recognition; entity linking; Twitter; DBpedia; social media
1. INTRODUCTION
This paper describes the MIT Lincoln Laboratory submission to the Named Entity Extraction and Linking (NEEL) challenge at #Microposts2016 [1]. While named entity recognition is a well-studied problem in traditional natural language processing domains such as newswire, maintaining high precision and recall when adapting it to micropost genres continues to prove difficult [2]. In traditional named entity extraction and linking systems, named entity recognition is performed before entity linking and clustering, so any misses in the named entity recognition are not recoverable by later steps in the pipeline.

In this system, we build upon the work developed in [3], leveraging a knowledge base which contains entities corresponding to many of the named mentions we wish to extract, thus reducing our reliance on named entity recognition. Our end-to-end system has parallel pipelines: one for entity mentions that are linkable to the knowledge base and one for those which are not.
2. SYSTEM ARCHITECTURE
Our overall system architecture is shown in Figure 1. For entities which are in the knowledge base (DBpedia), we began by hand-curating an ontology mapping from the DBpedia class ontology to the named entity ontology used in the NEEL evaluation (Person, Organization, Location, Fictional Character, Thing, Product, Event).

For each DBpedia entry that mapped to one of the named entity classes of interest, we generated a set of candidate names for that entity, corresponding to the ways in which an author might reference that entity when writing a micropost. We then searched the tweets for those candidate names. Finally, we down-selected from the found instances of candidate names, resolving overlaps and false alarms in the candidate name generation.

We fused several named entity recognition systems in order to extract named entity mentions that do not have corresponding entities in DBpedia. We filtered out any named mentions that were previously identified as linked named entity mentions, leaving a set of typed NIL named entity mentions. We then applied clustering to the NIL mentions.

Figure 1 System Architecture
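To make the flow of Figure 1 concrete, the following is a toy, runnable Python sketch of the two parallel pipelines. Every helper is a trivial stand-in for the corresponding component described in Section 3; only the control flow mirrors our system.

# Toy sketch of the Figure 1 pipelines. Each helper is a trivial
# stand-in for the real component described in Section 3.

def generate_candidate_names(ontology):              # Section 3.2 (stub)
    return {name: entity for entity, names in ontology.items() for name in names}

def search_tweets(tweets, candidates):               # Section 3.3 (stub)
    return [(tweet, name) for tweet in tweets for name in candidates
            if name in tweet.lower()]

def link_mentions(found, candidates):                # Section 3.4 (stub)
    return {(tweet, name): candidates[name] for tweet, name in found}

def fuse_ner_systems(tweets):                        # Section 3.5 (stub)
    return [w for tweet in tweets for w in tweet.split() if w.istitle()]

def cluster_nil_mentions(mentions):                  # Section 3.6 (stub)
    return {m: i for i, m in enumerate(sorted(set(mentions)))}

tweets = ["Watching the superbowl in Boston", "gr8 game"]
ontology = {"dbpedia:Super_Bowl": ["superbowl"], "dbpedia:Boston": ["boston"]}

# Linkable pipeline: candidate names -> search -> link.
candidates = generate_candidate_names(ontology)
found = search_tweets(tweets, candidates)
linked = link_mentions(found, candidates)

# NIL pipeline: fused NER, minus mentions already linked above.
linked_names = {name for _, name in found}
nil = [m for m in fuse_ner_systems(tweets) if m.lower() not in linked_names]

print(linked)
print(cluster_nil_mentions(nil))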
3. SYSTEM COMPONENTS
3.1 Ontology Mapping
Our goal for the ontology mapping was to achieve as high a recall as possible for each of the entity types, while optimizing for precision only so much as needed to avoid computational bottlenecks in later steps of the pipeline.
We experienced high variance between entity types in the degree of difficulty of manually creating the ontology mapping. As seen in Table 1, this resulted in vastly different levels of recall for the different entity types. Our mapping contained 100% of the linked Person entities in the dev set, but only 11% of the Fictional Character entities. In future work, we would like to explore either automating or crowdsourcing a more comprehensive ontology mapping.
Table 1 Ontology Mapping Recall

Entity Type           Recall
Person                1
Organization          .6364
Location              .8667
Product               .8889
Thing                 .5
Fictional Character   .1111
Event                 .5
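As an illustration of what such a hand-curated mapping looks like, the excerpt below pairs a few real DBpedia ontology classes with NEEL types; the particular pairings shown are a hypothetical excerpt, not our actual mapping table.

# Hypothetical excerpt of a DBpedia-class -> NEEL-type lookup table.
# The class names are real DBpedia ontology classes, but the selection
# shown here is illustrative only.
DBPEDIA_TO_NEEL = {
    "dbo:Person":             "Person",
    "dbo:Organisation":       "Organization",
    "dbo:Place":              "Location",
    "dbo:Device":             "Product",
    "dbo:FictionalCharacter": "Fictional Character",
    "dbo:SocietalEvent":      "Event",
}

def neel_type(dbpedia_classes):
    # Return the NEEL type of the first mapped class, or None if the
    # entity falls outside the NEEL entity types of interest.
    for cls in dbpedia_classes:
        if cls in DBPEDIA_TO_NEEL:
            return DBPEDIA_TO_NEEL[cls]
    return None

print(neel_type(["dbo:Agent", "dbo:Person"]))  # Person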
3.2 Candidate Name Generation
In writing microposts, authors are constrained in the number of characters that they can write. This has led authors to shorten their words (often as much as possible) while maintaining understandability for a human reader. Spelling mistakes and the existence of multiple standard spellings of named entities are two means by which variation in mention spelling can occur, but in the micropost genre, deliberately shortened alternate spellings are a much more common form of spelling variation. In order to address this, we examined the mentions in all of the named entity classes of interest and attempted to identify rules by which authors shorten entity names. We then applied these rules to all of the entities in our mapped ontology in order to generate candidate name spellings.
Authors use different rules when shortening a name depending on the context: using the name as part of plain text versus as part of a hash-tag or at-mention. The main difference is that entity mentions which are hash-tags or at-mentions often contain characters from descriptive words in addition to characters from the canonical form of the entity name. We also found that authors follow different rules depending on what type of entity is being mentioned. For example, abbreviating the canonical form of a Person entity is very common, but abbreviating a Thing entity is very rare. On the other hand, the canonical forms of Location entities are often partially abbreviated (i.e., abbreviating only the words which occur after a comma in the canonical spelling). Our candidate name generation computes various abbreviations and shortenings of the canonical name, as illustrated in the sketch below.
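The sketch below shows one way such type-dependent rules could be encoded; the three rules shown (initialisms for Persons and Organizations, dropping post-comma qualifiers for Locations, and concatenated hash-tag / at-mention forms) are simplified examples of the rule families described above, not our full rule set.

def candidate_names(canonical, entity_type):
    # Toy, simplified rules; illustrative of the rule families described
    # in the text, not the system's actual rule set.
    names = {canonical.lower()}
    words = canonical.split()

    # Person and Organization names are commonly abbreviated to initials.
    if entity_type in ("Person", "Organization") and len(words) > 1:
        names.add("".join(w[0] for w in words).lower())

    # Location names are often partially abbreviated: the words after a
    # comma in the canonical spelling are dropped.
    if entity_type == "Location" and "," in canonical:
        names.add(canonical.split(",")[0].lower())

    # Hash-tag / at-mention forms: whitespace and punctuation removed.
    compact = "".join(c for c in canonical.lower() if c.isalnum())
    names.update({"#" + compact, "@" + compact})
    return names

print(candidate_names("Lexington, Massachusetts", "Location"))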
Table 2 Candidate Name Generation Recall

Entity Type           Recall
Person                .8961
Organization          .32
Location              .5625
Product               .4273
Thing                 0
Fictional Character   .1538
Event                 0
Finally, events are often written very differently from their canonical spellings, rendering candidate name generation a poor choice for this entity type. In future work, we would like to train an event nugget detector on the micropost genre in order to extract the Event entities. Our system was unable to correctly generate candidate names for any of the Thing mentions that were included in our ontology mapping, although the candidate generation did work for many of the Thing mentions that were not included in the ontology.

3.3 Linkable Mention Detection
We searched all of the tweets for all of our generated candidate mentions. Search results were limited to mentions which were either bounded on both ends by white space, punctuation, or the beginning / end of the tweet, or which were part of an at-mention or hash-tag. For results that were part of an at-mention or hash-tag, we expanded the returned result to encompass the entire at-mention or hash-tag; a sketch of this matching follows.
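The following is a minimal sketch of this boundary-aware search, assuming a simple regular-expression formulation; the exact matching logic in our system may differ.

import re

def find_candidate_mentions(tweet, candidate_names):
    # Pre-compute the spans of all at-mentions and hash-tags in the tweet.
    tag_spans = [m.span() for m in re.finditer(r'[@#]\w+', tweet)]
    hits = []
    for name in candidate_names:
        for m in re.finditer(re.escape(name), tweet, re.IGNORECASE):
            s, e = m.span()
            # If the match sits inside an at-mention or hash-tag, expand
            # the hit to cover the entire token.
            tag = next(((ts, te) for ts, te in tag_spans
                        if ts <= s and e <= te), None)
            if tag is not None:
                hits.append((tag[0], tag[1], tweet[tag[0]:tag[1]]))
                continue
            # Otherwise the match must be bounded on both ends by white
            # space, punctuation, or the beginning / end of the tweet.
            left_ok = s == 0 or not tweet[s - 1].isalnum()
            right_ok = e == len(tweet) or not tweet[e].isalnum()
            if left_ok and right_ok:
                hits.append((s, e, tweet[s:e]))
    return hits

print(find_candidate_mentions("Go #patsnation, patriots!", ["patriots", "pats"]))
# [(16, 24, 'patriots'), (3, 14, '#patsnation')]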
3.4 Entity Linking
We experimented with two methods of entity linking. The first method was a random forest trained on several features of each (mention, entity) pair. The features used were COMMONNESS, IDF_anchor, TEN, TCN, TF_sentence, TF_paragraph, and REDIRECT [4]. The random forest classifier attempts to detect whether or not a given mention corresponds to a given entity. We then perform consistency resolution in order to ensure that each mention resolves to at most a single entity. Results can be seen in Table 5.

We also experimented with leveraging AIDA [5] for entity linking. This method was able to correctly recall 25% of the Location mentions and 26% of the Person mentions, but did not perform well on the other entity types. We hypothesize that this is due to a combination of cascaded performance degradation from earlier steps in the pipeline and the fact that the current version of AIDA is based on an older version of DBpedia, which does not contain more recent entities.
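As a sketch of how the random forest method described above can be assembled, the snippet below trains a scikit-learn random forest on toy (mention, entity) feature vectors laid out in the order listed above, and applies a simple highest-probability consistency resolution; the feature values and the argmax resolution rule are illustrative assumptions, not our exact implementation.

from sklearn.ensemble import RandomForestClassifier

# Toy feature vectors for (mention, entity) pairs, in the order
# [COMMONNESS, IDF_anchor, TEN, TCN, TF_sentence, TF_paragraph, REDIRECT].
# All values below are fabricated for illustration.
X_train = [
    [0.95, 2.1, 1, 1, 0.30, 0.12, 0],  # a correct link
    [0.02, 0.4, 0, 1, 0.01, 0.00, 0],  # a spurious candidate
    [0.80, 1.7, 0, 1, 0.22, 0.09, 1],  # a correct link via a redirect
    [0.10, 0.2, 0, 0, 0.00, 0.00, 0],  # a spurious candidate
]
y_train = [1, 0, 1, 0]

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Consistency resolution (illustrative): for a given mention, keep only
# the candidate entity with the highest positive-class probability.
candidates = {
    "dbpedia:Boston":        [0.90, 1.9, 1, 1, 0.25, 0.10, 0],
    "dbpedia:Boston_(band)": [0.05, 0.3, 0, 1, 0.02, 0.00, 0],
}
scores = {e: clf.predict_proba([f])[0][1] for e, f in candidates.items()}
print(max(scores, key=scores.get))  # dbpedia:Boston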
3.5 Named Entity Recognition
We experimented with several different named entity recognition systems: Stanford NER [6], MITIE [7], twitter_nlp [8], and TwitIE [9]. For MITIE, we used both the off-the-shelf model and a model custom trained on the NEEL training data (for all of the NEEL entity types); the custom training improved F1 scores on all entity types. Ultimately, we fused the results from all of the systems by applying a majority vote. The results presented in Table 3 are in the format: precision; recall; F1.

Table 3 Named Entity Recognition Precision, Recall, and F1

NER System                     Person          Location        Organization
Stanford                       .84; .27; .41   .81; .76; .78   .57; .12; .2
MITIE                          .48; .1; .17    .33; .18; .24   .1; .06; .06
MITIE (trained on NEEL data)   .78; .5; .61    .29; .24; .26   .33; .15; .21
Twitter_NLP                    .56; .08; .14   .5; .18; .26    .5; .06; .11
TwitIE                         .41; .06; .11   .5; .29; .37    .62; .15; .25
Fused System                   .72; .67; .69   .44; .65; .52   .19; .18; .19
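A minimal sketch of the majority-vote fusion described above, assuming all systems' outputs have been aligned to a shared tokenization; token-level voting is one reasonable reading of the fusion step, not necessarily our exact logic.

from collections import Counter

def majority_vote(system_outputs):
    # Fuse aligned per-token NER tags from several systems: a token keeps
    # a tag only if more than half of the systems agree on it.
    fused = []
    for token_tags in zip(*system_outputs):
        tag, count = Counter(token_tags).most_common(1)[0]
        fused.append(tag if count > len(system_outputs) / 2 else "O")
    return fused

# Three systems tagging the tokens of "gr8 game by Brady":
stanford    = ["O", "O", "O", "PER"]
mitie       = ["O", "O", "O", "PER"]
twitter_nlp = ["O", "ORG", "O", "O"]
print(majority_vote([stanford, mitie, twitter_nlp]))  # ['O', 'O', 'O', 'PER']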
Even when considering multiple state-of-the-art named entity recognition systems and employing in-domain training, performance on the micropost genre remains low. In future work, we would like to experiment with more advanced methods of system fusion, and with bootstrapping in order to obtain a much larger in-domain training corpus.

3.6 Entity Clustering
We use the normalized Damerau–Levenshtein (DL) distance metric [10] to measure the similarity between two unlinked entities. This metric helps us create clusters that are tolerant of spelling errors, while at the same time capturing the slight local word variations often observed in microposts.
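A sketch of this clustering, using the optimal-string-alignment variant of DL distance normalized by the longer string's length, with a single-link merge rule; the threshold value is an illustrative assumption.

def dl_distance(a, b):
    # Damerau-Levenshtein distance (optimal string alignment variant).
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

def normalized_dl(a, b):
    return dl_distance(a, b) / max(len(a), len(b), 1)

def cluster_mentions(mentions, threshold=0.3):
    # Single-link clustering: a mention joins the first cluster containing
    # any member within the distance threshold (threshold is illustrative).
    clusters = []
    for m in mentions:
        home = next((c for c in clusters
                     if any(normalized_dl(m, x) < threshold for x in c)), None)
        if home is not None:
            home.append(m)
        else:
            clusters.append([m])
    return clusters

print(cluster_mentions(["beyonce", "beyoncé", "beyonsay", "superbowl", "superb owl"]))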
As an alternative method, we used the Brown clusters produced by Percy Liang's implementation [11] of the Brown clustering algorithm [12] on 56,345,753 English tweets, as described in [13]. Mentions that belonged to the same Brown cluster were clustered together.
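A sketch of this grouping step, assuming cluster assignments in the "paths"-style output format of Liang's implementation (one bit-string, word, count triple per line); the file name and the lower-casing of mentions are illustrative assumptions.

from collections import defaultdict

def load_brown_clusters(paths_file):
    # Read Liang-style 'paths' output: <bit-string>\t<word>\t<count>.
    word_to_cluster = {}
    with open(paths_file, encoding="utf-8") as f:
        for line in f:
            bits, word, _count = line.rstrip("\n").split("\t")
            word_to_cluster[word] = bits
    return word_to_cluster

def cluster_by_brown(mentions, word_to_cluster):
    # Mentions in the same Brown cluster are grouped together; mentions
    # absent from the vocabulary become singleton clusters.
    clusters = defaultdict(list)
    for m in mentions:
        clusters[word_to_cluster.get(m.lower(), "UNK:" + m.lower())].append(m)
    return list(clusters.values())

# e.g. word_to_cluster = load_brown_clusters("paths")
w2c = {"gr8": "1110", "great": "1110", "brady": "0101"}
print(cluster_by_brown(["Gr8", "great", "Brady", "Gisele"], w2c))
# [['Gr8', 'great'], ['Brady'], ['Gisele']]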
Table 4 gives the results on our NIL entity clustering task. We report performance scores with gold standard named entity mentions. Since the NIL entity clustering step is the last step in our system, we expect propagated errors from the other tasks to have the biggest impact here. Of note is that the small number of mentions in the evaluation dev set means that these numbers may not be representative of algorithm performance on a larger corpus.

In future work, we would like to experiment with word embedding based methods for clustering. We performed some early exploration into this line of research, but more work is needed on how to map between different word embeddings.
Table 4 Mention_CEAF of Clustering Algorithms

Clustering Method      Gold Standard NER mentions (NIL and non-NIL)
Damerau-Levenshtein    .587
Brown                  .531
4. EXPERIMENTAL RESULTS
Our top performing systems on the dev data used a random forest for entity linking and either Brown clustering or Damerau-Levenshtein clustering for clustering the NIL mentions. While Brown clustering and Damerau-Levenshtein clustering returned slightly different clusters when run on the dev set, the mention_ceaf was the same for both methods. Results are shown in Table 5.
Table 5 Overall System Results

Metric                       Precision   Recall   F1
strong typed mention match   .587        .287     .386
strong link match            .799        .418     .549
mention ceaf                 .375        .766     .504
5. CONCLUSIONS
In this paper, we described the MIT Lincoln Laboratory submission to the NEEL 2016 challenge. In this work, we have expanded upon the linking-first approach to named entity extraction and linking first developed in [3]. We introduced methods of candidate name generation which are specifically tailored to microposts. We also experimented with multiple approaches to named entity recognition, entity linking, and entity clustering, and presented comparisons of the performance of the different methods.

6. ACKNOWLEDGEMENTS
We would like to thank Bernadette Johnson and Joseph Campbell for their ongoing support and guidance. We would also like to thank Michael Yee and Arjun Majumdar for their support with MITIE.

7. REFERENCES
[1] G. Rizzo, M. van Erp, J. Plu and R. Troncy, "Making Sense of Microposts (#Microposts2016) Named Entity rEcognition and Linking (NEEL) Challenge," in #Microposts2016, pp. 50-59, 2016.
[2] A. Ritter, S. Clark, Mausam and O. Etzioni, "Named Entity Recognition in Tweets: An Experimental Study," in EMNLP '11, 2011.
[3] I. Yamada, H. Takeda and Y. Takefuji, "An End-to-End Entity Linking Approach for Tweets," in #Microposts2015, 2015.
[4] E. Meij, W. Weerkamp and M. de Rijke, "Adding Semantics to Microblog Posts," in Proceedings of the Fifth ACM International Conference on Web Search and Data Mining, 2012.
[5] J. Hoffart, M. A. Yosef, I. Bordino, H. Fürstenau, M. Pinkal, M. Spaniol, B. Taneva, S. Thater and G. Weikum, "Robust Disambiguation of Named Entities in Text," in Conference on Empirical Methods in Natural Language Processing (EMNLP), 2011.
[6] J. R. Finkel, T. Grenager and C. Manning, "Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling," in Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), 2005.
[7] D. King, "MITLL/MITIE," [Online]. Available: https://github.com/mit-nlp/MITIE.
[8] A. Ritter, S. Clark, Mausam and O. Etzioni, "Named Entity Recognition in Tweets: An Experimental Study," in EMNLP, 2011.
[9] K. Bontcheva, L. Derczynski, A. Funk, M. A. Greenwood, D. Maynard and N. Aswani, "TwitIE: An Open-Source Information Extraction Pipeline for Microblog Text," in Proceedings of the International Conference on Recent Advances in Natural Language Processing, ACL, 2013.
[10] G. V. Bard, "Spelling-Error Tolerant, Order-Independent Pass-Phrases via the Damerau-Levenshtein String-Edit Distance Metric," in Proceedings of the Fifth Australasian Symposium on ACSW Frontiers, vol. 68, Ballarat, pp. 117-124, 2007.
[11] P. Liang, "Semi-Supervised Learning for Natural Language," Master's thesis, Massachusetts Institute of Technology, 2005.
[12] P. F. Brown, P. V. deSouza, R. L. Mercer, V. J. Della Pietra and J. C. Lai, "Class-Based n-gram Models of Natural Language," Computational Linguistics, vol. 18, no. 4, pp. 467-479, 1992.
[13] O. Owoputi, B. O'Connor, C. Dyer, K. Gimpel and N. Schneider, "Part-of-Speech Tagging for Twitter: Word Clusters and Other Advances," Technical Report, School of Computer Science, Carnegie Mellon University, 2012.