Named Entity Recognition in 140 Characters or Less*

Kelly Geyer, Kara Greenfield, Alyssa Mensch, Olga Simek
MIT Lincoln Laboratory, 244 Wood St, Lexington MA, United States
{kelly.geyer, kara.greenfield, alyssa.mensch, osimek}@ll.mit.edu

*This work was sponsored by the Defense Advanced Research Projects Agency under Air Force Contract FA8721-05-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the United States Government.

ABSTRACT
In this paper, we explore the problem of recognizing named entities in microposts, a genre with notoriously little context surrounding each named entity and inconsistent use of grammar, punctuation, capitalization, and spelling conventions by authors. In spite of the challenges associated with information extraction from microposts, it remains an increasingly important genre. This paper presents the MIT Information Extraction Toolkit (MITIE) and explores its adaptability to the micropost genre.

CCS Concepts
• Information systems ➝ Information extraction

Keywords
Named entity recognition, re-training, social media, Twitter

1. INTRODUCTION
Named entity recognition (NER) is a subtask of information retrieval concerned with the automatic extraction of mentions of named entities, where the set of possible entity types originally consisted of people, organizations, and locations. Even the original work on NER from MUC-6 recognized the need for systems to extract varying sets of named entity types from varying genres [1]. Since then, NER has been used to extract diverse entity types (such as diseases and products) from diverse genres (such as speech transcripts and microposts). Practitioners of NER in such diverse domains have been forced to accept that their systems must be re-trained on in-domain data in order to obtain optimal results. The ability to retrain systems has enabled the success of NER, but knowing the quantity of in-domain training data that is required is often more of an art than a science. In this paper, we examine the requirements for successfully retraining an NER system to extract an expanded set of entities from the micropost genre, a notoriously hard genre for NER [2].

2. MITIE
The MIT Information Extraction Toolkit (MITIE) [3] is a free, open-source software library of state-of-the-art NLP tools developed at MIT Lincoln Laboratory. MITIE enables the automated extraction of named entities and of binary relations (for example, a person's place of birth) from unstructured text in English and Spanish. MITIE utilizes distributional word embeddings [4] to reduce dimensionality and improve performance, Conditional Random Fields and structured support vector machines for learning syntactic relationships [5], and automated hyperparameter optimization to facilitate user customization. MITIE is built on the high-performance Dlib machine learning library [6][7], includes interfaces to C, C++, Java, R, and Python, and is easy to train on new data sets [8], such as microposts.

One of the goals in developing MITIE was to enable fast named entity recognition. To this end, MITIE is capable of processing 53,600 words per second when run single-threaded on a 2.4 GHz Intel Xeon processor. Even at this speed, accuracy is not compromised: MITIE achieves an F1 score of 88.1 on the CoNLL 2003 NER task [3][9].
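To illustrate this interface, the following is a minimal Python sketch of entity extraction with MITIE's bindings. The model path refers to the pre-trained English model distributed alongside MITIE; both the path and the example sentence are placeholders to be adapted to a local installation.

    # Minimal sketch of entity extraction with MITIE's Python bindings.
    # The model path is a placeholder for a local MITIE installation.
    from mitie import named_entity_extractor, tokenize

    ner = named_entity_extractor("MITIE-models/english/ner_model.dat")

    tokens = tokenize("MITIE was developed at MIT Lincoln Laboratory in Lexington.")
    # Depending on the MITIE version, tokens may come back as bytes under
    # Python 3, so decode defensively before printing.
    words = [t.decode() if isinstance(t, bytes) else t for t in tokens]

    # Each detected entity is a range of token indices plus a predicted
    # tag such as PERSON, LOCATION, or ORGANIZATION.
    for entity in ner.extract_entities(tokens):
        token_range, tag = entity[0], entity[1]
        print(tag, "->", " ".join(words[i] for i in token_range))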
3. RETRAINING MITIE FOR MICROPOSTS
We utilized the training data from the NEEL 2016 Challenge Data Set [10] for our experiments. This corpus consists of 5,991 tweets that have been annotated for named entity mentions of seven types: person, organization, location, event, product, character, and thing.

Our experiments consisted of varying the number of training documents and testing on the remainder of the documents, utilizing 5-fold cross validation. No out-of-domain training data was used to supplement the in-domain data. Each document corresponded to a single tweet, and documents were not guaranteed to contain any mentions of named entities. For each experiment, we trained a single MITIE model to simultaneously classify all seven of the entity types under consideration.
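Retraining uses MITIE's standard training interface, sketched below for a single annotated tweet; one such training instance would be constructed per tweet-level document as described above. The tweet, its entity offsets, and the file paths are illustrative placeholders rather than records from the NEEL corpus.

    # Sketch of retraining MITIE on micropost entity types. The tweet,
    # offsets, and file paths are illustrative placeholders only.
    from mitie import ner_trainer, ner_training_instance

    # One pre-tokenized training document per tweet.
    tokens = ["Watching", "the", "Super", "Bowl", "with", "Alyssa", "in", "Boston"]
    sample = ner_training_instance(tokens)
    sample.add_entity(range(2, 4), "event")     # "Super Bowl"
    sample.add_entity(range(5, 6), "person")    # "Alyssa"
    sample.add_entity(range(7, 8), "location")  # "Boston"

    # The trainer is seeded with MITIE's distributional word embeddings
    # (the "total word feature extractor" shipped with the distribution).
    trainer = ner_trainer("MITIE-models/english/total_word_feature_extractor.dat")
    trainer.add(sample)      # in practice, add one instance per training tweet
    trainer.num_threads = 4
    ner = trainer.train()    # hyperparameters are optimized automatically
    ner.save_to_disk("micropost_ner_model.dat")

Because a single model covers all seven types, labels such as event and character are passed directly as entity tags on the training instances.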
4. RESULTS
Across all of the entity types other than character and thing, training with in-domain data began to show diminishing (but still positive) returns at 500 training documents. This pattern held whether we measured F1 or measured precision and recall independently, as shown in Figures 1, 2, and 3. Also of note, increasing the number of in-domain training documents benefited precision significantly more than recall for all entity types.

[Figure 1: F1 Score]
[Figure 2: Precision]
[Figure 3: Recall]

4.1 Comparison Between Entity Types
We considered the hypothesis that the difference in performance in correctly recognizing different types of entity mentions was due to the number of times that each entity type appeared in the training data. This hypothesis, however, proved to be false. Of particular interest is the performance in recognizing event mentions: despite being a particularly rare entity type in this corpus, MITIE excelled at recognizing event mentions, particularly with regard to precision.

[Figure 4: Entity Type Distribution]

While not a sufficient condition, there is a threshold quantity of mentions of a given entity type that is necessary for NER accuracy to rise significantly above chance performance. As seen in Figure 4, the character entity type is extremely rare, and it correspondingly begins showing large performance gains only at 3,000 training documents. Also of note was the consistently poor performance in recognizing mentions of thing entities. We hypothesize that this is due to thing being a poorly defined entity type, but we have not yet tested that hypothesis.

5. CONCLUSIONS
In this paper, we presented an exploratory analysis relating the number of in-domain training documents to named entity recognition performance in the micropost genre. In this analysis, we also compared performance on recognizing different entity types. Additionally, we presented the MIT Information Extraction Toolkit, an open-source structural SVM approach to named entity recognition and binary relation extraction.

6. FUTURE WORK
In future work, we would like to explore other dimensions of retraining NER systems. Some particular questions of interest are whether the patterns seen in the quantity of in-domain micropost training documents required are mirrored in other domains, and what causal factor explains the varying performance in recognizing entities of different types.

7. ACKNOWLEDGMENTS
We would like to thank Davis King, Arjun Majumdar, and Michael Yee for their work on developing MITIE.

8. REFERENCES
[1] R. Grishman and B. Sundheim, "Message Understanding Conference-6: A Brief History," COLING, vol. 96, pp. 466-471, 1996.
[2] A. Ritter, S. Clark and O. Etzioni, "Named Entity Recognition in Tweets: An Experimental Study," in EMNLP '11, 2011.
[3] D. King, "MITLL/MITIE," [Online]. Available: https://github.com/mit-nlp/MITIE.
[4] P. Dhillon, D. Foster and L. Ungar, "Eigenwords: Spectral Word Embeddings," Journal of Machine Learning Research (JMLR), vol. 16, 2015.
[5] T. Joachims, T. Finley and C.-N. Yu, "Cutting-Plane Training of Structural SVMs," Machine Learning, vol. 77, no. 1, pp. 27-59, 2009.
[6] D. King, "davisking/dlib," [Online]. Available: https://github.com/davisking/dlib.
[7] D. E. King, "Dlib-ml: A Machine Learning Toolkit," Journal of Machine Learning Research, vol. 10, pp. 1755-1758, 2009.
[8] S. Haleen and A. Halterman, "mitie-trainer," [Online]. Available: https://github.com/Sotera/mitie-trainer.
[9] E. F. Tjong Kim Sang and F. De Meulder, "Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition," in Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL, 2003.
[10] G. Rizzo et al., "NEEL Challenge Data Set," 2016.