<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Reverse Approach to Named Entity Extraction and Linking in Microposts</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kara Greenfield</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rajmonda Caceres</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael Coury</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kelly Geyer</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Youngjune Gwon</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jason Matterer</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alyssa Mensch</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cem Sahin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Olga Simek</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>MIT Lincoln Laboratory</institution>
          ,
          <addr-line>244 Wood St, Lexington MA</addr-line>
          ,
          <country country="US">United States</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>[11] P. Liang, "Semi-supervised Learning for Natural Language," Massachusetts Institute of Technology</institution>
          ,
          <addr-line>2005</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>[13] O. Owoputi, B. O'Connor, C. Dyer, K. Gimpel and N. Schneider, "Part-of-speech tagging for Twitter: Word clusters and other advances," in School of Computer Science</institution>
          ,
          <addr-line>2012</addr-line>
        </aff>
      </contrib-group>
      <fpage>67</fpage>
      <lpage>69</lpage>
      <abstract>
        <p>In this paper, we present a pipeline for named entity extraction and linking that is designed specifically for noisy, grammatically inconsistent domains where traditional named entity techniques perform poorly. Our approach leverages a large knowledge base to improve entity recognition, while maintaining the use of traditional NER to identify mentions that are not co-referent with any entities in the knowledge base.</p>
      </abstract>
      <kwd-group>
        <kwd>Named entity recognition</kwd>
        <kwd>entity linking</kwd>
        <kwd>twitter</kwd>
        <kwd>DBpedia</kwd>
        <kwd>social media</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1.   INTRODUCTION</title>
      <p>
        This paper describes the MIT Lincoln Laboratory submission to
the Named Entity Extraction and Linking (NEEL) challenge at
#Microposts2016 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. While named entity recognition is a
well-studied problem in traditional natural language processing
domains such as newswire, maintaining high precision and recall
when adapting it to micropost genres continues to prove difficult
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In traditional named entity extraction and linking systems,
named entity recognition is done before entity linking and
clustering. Any misses in the named entity recognition aren’t
recoverable by later steps in the pipeline.
      </p>
      <p>
        In this system, we build upon the work developed in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ],
leveraging the existence of a knowledge base that contains
entities corresponding to many of the named mentions we wish to
extract, which allows us to reduce our reliance on named entity
recognition. Our end-to-end system has parallel pipelines for
entity mentions that are linkable to the knowledge base and those
that are not.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2.   SYSTEM ARCHITECTURE</title>
      <p>Our overall system architecture is shown in Figure 1. For entities
which are in the knowledge base (DBpedia), we began by hand-</p>
      <p>This work was sponsored by the Defense Advanced Research Projects Agency
under Air Force Contract FA8721-05-C-0002. Opinions, interpretations,
conclusions, and recommendations are those of the authors and are not
necessarily endorsed by the United States Government.</p>
      <p>Copyright © 2016 held by author(s)/owner(s); copying permitted
only for private and academic purposes.</p>
      <p>Published as part of the #Microposts2016 Workshop proceedings,
available online as CEUR Vol-1691 (http://ceur-ws.org/Vol-1691)</p>
      <p>We fused several named entity recognition systems in order to
extract named entity mentions that do not have corresponding
entities in DBpedia. We filtered out any named mentions that
were previously identified as linked named entity mentions,
leaving a set of typed NIL named entity mentions. We then
applied clustering to the NIL mentions.</p>
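      <p>The filtering step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes mentions are (start, end, type) character spans within a single tweet and treats any span overlap as "previously identified", which is an assumption. The function name nil_mentions is hypothetical.</p>
      <preformat><![CDATA[
```python
def nil_mentions(ner_mentions, linked_mentions):
    """Return the NER mentions that do not overlap any linked mention.

    Each mention is a (start, end, entity_type) tuple with an exclusive end.
    Overlap-based filtering is an assumption; exact-span matching would also fit.
    """
    def overlaps(a, b):
        # Two half-open spans overlap when each starts before the other ends.
        return a[0] < b[1] and b[0] < a[1]

    return [
        mention
        for mention in ner_mentions
        if not any(overlaps(mention, linked) for linked in linked_mentions)
    ]
```
]]></preformat>
      <p>The surviving mentions keep the entity type assigned by the NER systems and form the input to NIL clustering.</p>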
    </sec>
    <sec id="sec-3">
      <title>3.   SYSTEM COMPONENTS</title>
    </sec>
    <sec id="sec-4">
      <title>3.1   Ontology Mapping</title>
      <p>Our goal for the ontology mapping was to achieve as high a recall
as possible for each of the entity types.</p>
    </sec>
    <sec id="sec-5">
      <title>3.2   Candidate Name Generation</title>
      <p>In writing microposts, authors are constrained in the number of
characters that they can write. This has led authors to shorten
their words (often as much as possible) while maintaining
understandability by a human reader. Spelling mistakes and the
existence of multiple standard spellings of named entities are two
means by which variation in mention spelling can occur, but in the
micropost genre, deliberately shortened alternate spellings are a
much more common form of spelling variation. In order to address
this, we examined the mentions in all of the named entity classes
of interest and attempted to identify rules by which authors
shorten entity names. We then applied these rules to all of the
entities in our mapped ontology in order to generate candidate
name spellings.</p>
      <p>Authors use different rules when shortening a name depending on
the context: using the name as part of plain text versus using the
name as part of a hash-tag or at-mention. The main difference is
that the text of a hash-tag or at-mention often contains characters
from descriptive words in addition to characters from the canonical
form of the entity name. We also found that authors follow
different rules depending on the type of the entity being mentioned.
For example, abbreviating the canonical form of a Person entity is
very common, but abbreviating a Thing entity is very rare. On the
other hand, the canonical forms of Location entities are often
partially abbreviated (i.e., abbreviating only the words which occur
after a comma in the canonical spelling). Our candidate name
generation computes various abbreviations and shortenings of the
canonical name.</p>
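      <p>A minimal sketch of type-dependent candidate name generation is given here. The paper does not list its shortening rules, so the three rules used (a space-free form for hash-tags and at-mentions, initials for Person entities, and abbreviating only the words after a comma for Location entities) are illustrative assumptions, and the function names are hypothetical.</p>
      <preformat><![CDATA[
```python
def initials(words):
    """Return the upper-case acronym built from the first letter of each word."""
    return "".join(word[0].upper() for word in words if word)


def candidate_names(canonical, entity_type):
    """Generate shortened spellings of a canonical entity name.

    The rules are illustrative assumptions, not the authors' rule set.
    """
    candidates = {canonical}
    words = canonical.replace(",", " ").split()
    # Hash-tags and at-mentions contain no spaces, so add a squashed form.
    candidates.add("".join(words))
    if entity_type == "Person":
        # Full abbreviation: the initials of every part of the name.
        candidates.add(initials(words))
    elif entity_type == "Location" and "," in canonical:
        # Partial abbreviation: keep the text before the first comma and
        # abbreviate only the words that follow it.
        head, tail = canonical.split(",", 1)
        candidates.add(f"{head.strip()}, {initials(tail.split())}")
    # Thing entities are rarely abbreviated, so no extra rule is applied.
    return candidates
```
]]></preformat>
      <p>For example, under these assumed rules the Location "Cambridge, United Kingdom" yields the candidates "Cambridge, UK" and "CambridgeUnitedKingdom" in addition to the canonical spelling.</p>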
      <sec id="sec-5-1">
        <title>Event</title>
        <p>
Finally, events are often written very differently from their
canonical spellings, rendering candidate name generation a poor
choice for this entity type. In future work, we would like to train
an event nugget detector on the micropost genre in order to extract
the Event entities. Our system was unable to correctly generate
candidate names for any of the Thing mentions that were included
in our ontology mapping, although the candidate generation did
work for many of the Thing mentions that were not included in
the ontology.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>3.3   Linkable Mention Detection</title>
      <p>We searched all of the tweets for all of our generated candidate
mentions. Search results were limited to mentions which were
either bounded on both ends by white space, punctuation, or the
beginning / end of the tweet, or which were part of an at-mention
or hash-tag. For results that were part of an at-mention or
hash-tag, we expanded the returned result to encompass the entire
at-mention or hash-tag.</p>
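      <p>The search and expansion rule described in this section can be sketched with regular expressions. This is an illustration under stated assumptions: matching is case-insensitive, a hash-tag or at-mention is taken to be "#" or "@" followed by word characters, and any non-alphanumeric character counts as a boundary. The function name find_mentions is hypothetical.</p>
      <preformat><![CDATA[
```python
import re


def find_mentions(tweet, candidate):
    """Return (start, end, text) spans where candidate is a linkable mention."""
    spans = []
    # Assumed definition of a hash-tag / at-mention token.
    tag_spans = [m.span() for m in re.finditer(r"[#@]\w+", tweet)]
    for match in re.finditer(re.escape(candidate), tweet, flags=re.IGNORECASE):
        start, end = match.span()
        enclosing = next(
            ((s, e) for s, e in tag_spans if s <= start and end <= e), None
        )
        if enclosing is not None:
            # Expand the hit to cover the whole hash-tag or at-mention.
            start, end = enclosing
        else:
            # Otherwise require a boundary (or tweet edge) on both sides.
            before_ok = start == 0 or not tweet[start - 1].isalnum()
            after_ok = end == len(tweet) or not tweet[end].isalnum()
            if not (before_ok and after_ok):
                continue
        span = (start, end, tweet[start:end])
        if span not in spans:
            spans.append(span)
    return spans
```
]]></preformat>
      <p>With this sketch, searching "#TeamObama rocks, Obama!" for "Obama" returns the whole hash-tag and the plain-text mention, while "Obamacare debate" returns nothing because the match is not bounded on the right.</p>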
    </sec>
    <sec id="sec-7">
      <title>3.4   Entity Linking</title>
      <p>
        We experimented with two methods of entity linking. The first
method was a random forest trained on several features of each
(mention, entity) pair. The features used were: COMMONNESS,
IDF<sub>anchor</sub>, TEN, TCN, TF<sub>sentence</sub>, TF<sub>paragraph</sub>, and REDIRECT
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The random forest classifier attempts to detect whether or not
a given mention corresponds to a given entity. We then perform
consistency resolution in order to ensure that each mention
resolves to at most a single entity. Results can be seen in Table 5.
We also experimented with leveraging AIDA [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] for entity
linking. This method was able to correctly recall 25% of the
Location mentions and 26% of the Person mentions, but did not
perform well on the other entity types. We hypothesize that this is
due to a combination of cascaded performance degradation from
earlier steps in the pipeline and the fact that the current version of
AIDA is based on an older version of DBpedia, which doesn’t
contain more recent entities.
      </p>
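      <p>The consistency resolution step can be illustrated as follows. The paper does not say how conflicts are resolved, so this sketch assumes that each (mention, entity) pair carries a classifier confidence score and that the highest-scoring entity above a threshold wins. The function name resolve_links, the threshold value, and the example DBpedia identifiers are hypothetical.</p>
      <preformat><![CDATA[
```python
def resolve_links(scored_pairs, threshold=0.5):
    """Keep at most one entity per mention.

    scored_pairs: iterable of (mention, entity, score) triples, where score is
    assumed to be the classifier's confidence that the pair is a correct link.
    Pairs scoring below the threshold are discarded; among the rest, the
    highest-scoring entity is kept for each mention (first one wins on ties).
    """
    best = {}
    for mention, entity, score in scored_pairs:
        if score < threshold:
            continue
        if mention not in best or score > best[mention][1]:
            best[mention] = (entity, score)
    return {mention: entity for mention, (entity, _) in best.items()}
```
]]></preformat>
      <p>Mentions for which no pair passes the threshold receive no link, so they remain available to the NIL pipeline.</p>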
    </sec>
    <sec id="sec-8">
      <title>3.5   Named Entity Recognition</title>
      <p>
        We experimented with several different named entity recognition
systems: Stanford NER [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], MITIE [7], twitter_nlp [8], and
TwitIE [9]. For MITIE, we used both the off-the-shelf model and
a model that was custom trained on the NEEL training data (for
all of the NEEL entity types); the custom training improved F1
scores on all entity types. Ultimately we fused the results from all
of the systems by applying a majority vote. The results presented
in Table 3 are in the format: precision; recall; F1.
Even when considering multiple state-of-the-art named entity
recognition systems and in-domain training, performance on the
micropost genre is low. In future work, we would like to
experiment with more advanced methods of system fusion and
bootstrapping in order to gain a much larger in-domain training
corpus.
      </p>
      <p>
        In creating the ontology mapping (Section 3.1), we optimized
for precision only so much as to avoid computational bottlenecks
in later steps in the pipeline. We experienced high variance
between entity types in the degree of difficulty of manually
creating the ontology mapping. As seen in Table 1, this resulted in
vastly different levels of recall for the different entity types. Our
mapping contained 100% of the linked Person entities in the dev
set, but only 11% of the Fictional Character entities. In future
work, we would like to explore either automating or
crowdsourcing a more comprehensive ontology mapping.
      </p>
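      <p>The majority-vote fusion of the named entity recognition systems described in this section can be sketched as follows. It assumes each system returns typed (start, end, type) spans and that two systems agree only when span and type match exactly. The paper does not state how agreement or ties were handled, so a strict majority is assumed, and the function name majority_vote is hypothetical.</p>
      <preformat><![CDATA[
```python
from collections import Counter


def majority_vote(system_outputs):
    """Keep the typed spans predicted by more than half of the systems.

    system_outputs: one collection of (start, end, entity_type) tuples per
    NER system. Exact-match agreement is an assumption of this sketch.
    """
    votes = Counter()
    for spans in system_outputs:
        votes.update(set(spans))  # each system votes at most once per span
    needed = len(system_outputs) // 2 + 1
    return {span for span, count in votes.items() if count >= needed}
```
]]></preformat>
      <p>With four systems, a span therefore needs at least three votes to be kept. A looser overlap-based agreement criterion would trade precision for recall.</p>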
    </sec>
    <sec id="sec-9">
      <title>3.6   Entity Clustering</title>
      <p>We use the normalized Damerau–Levenshtein (DL) distance
metric [10] to find the similarity between two unlinked entities.
This metric helps us create clusters that are spelling-error tolerant,
while at the same time capturing the slight local word variations
often observed in microposts.</p>
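      <p>A minimal sketch of distance-based clustering is given here. It uses the optimal string alignment variant of the Damerau-Levenshtein distance, normalizes by the length of the longer string, and greedily adds a mention to the first cluster that contains a sufficiently close member. The variant, the normalization, the greedy strategy, and the threshold of 0.25 are all assumptions, since the paper does not specify them.</p>
      <preformat><![CDATA[
```python
def osa_distance(a, b):
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def normalized_distance(a, b):
    """Scale the distance into [0, 1] by the length of the longer string."""
    longest = max(len(a), len(b))
    return osa_distance(a, b) / longest if longest else 0.0


def cluster_mentions(mentions, threshold=0.25):
    """Greedy clustering: join the first cluster that has a close member."""
    clusters = []
    for mention in mentions:
        for cluster in clusters:
            if any(
                normalized_distance(mention.lower(), member.lower()) <= threshold
                for member in cluster
            ):
                cluster.append(mention)
                break
        else:
            clusters.append([mention])
    return clusters
```
]]></preformat>
      <p>Because a transposition counts as a single edit, a misspelling such as "Obmaa" stays close to "Obama" (normalized distance 0.2) and falls into the same cluster under the assumed threshold.</p>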
      <p>As an alternative method, we used the Brown clusters produced
by Percy Liang's implementation [11] of the Brown clustering
algorithm [12] on 56,345,753 English tweets, as described in [13].
Mentions that belonged to the same Brown cluster were clustered
together.</p>
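      <p>Grouping mentions by Brown cluster can be sketched as a dictionary lookup. This illustration assumes single-token mentions, a lower-cased lookup, and a pre-loaded mapping from word to cluster bit-string. The mapping contents shown in the test, the handling of unseen mentions, and the function name are hypothetical.</p>
      <preformat><![CDATA[
```python
from collections import defaultdict


def cluster_by_brown_path(mentions, word_to_path):
    """Group mentions that map to the same Brown cluster bit-string.

    word_to_path: dict from lower-cased word to its cluster bit-string,
    assumed to have been loaded from the output of a Brown clustering run.
    """
    groups = defaultdict(list)
    for mention in mentions:
        path = word_to_path.get(mention.lower())
        # A mention missing from the clustering is grouped only with
        # identical strings (an assumption of this sketch).
        key = path if path is not None else ("unseen", mention)
        groups[key].append(mention)
    return list(groups.values())
```
]]></preformat>
      <p>Multi-token mentions would need an additional rule, for example looking up each token separately, which this sketch does not attempt.</p>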
      <p>Table 4 gives the results on our NIL entity clustering task. We
report performance scores with gold standard named entity
mentions. Since the NIL entity clustering step is the last step in
our system, we expect propagated errors from the other tasks to
have the biggest impact here. Of note is that the small number of
mentions in the evaluation dev set means that these numbers may
not be representative of algorithm performance on a larger corpus.
In future work, we would like to experiment with word
embedding based methods for clustering. We performed some
early exploration into this line of research, but more work is
needed into how to map between different word embeddings.</p>
      <p>Table 4: NIL entity clustering results with gold standard NER mentions
(NIL and non-NIL). Rows: Damerau-Levenshtein; Brown (.587, .531).</p>
    </sec>
    <sec id="sec-10">
      <title>4.   EXPERIMENTAL RESULTS</title>
      <p>Our top performing systems on the dev data used a random forest
for entity linking and either Brown clustering or
Damerau-Levenshtein clustering for clustering the NIL mentions. While
Brown clustering and Damerau-Levenshtein clustering returned
slightly different clusters when run on the dev set, the
mention_ceaf was the same for both methods. Results are shown
below.</p>
    </sec>
    <sec id="sec-11">
      <title>5.   CONCLUSIONS</title>
      <p>
        In this paper, we described the MIT Lincoln Laboratory
submission to the NEEL 2016 challenge. In this work, we have
expanded upon the linking-first approach to named entity
extraction and linking developed in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. We introduced
methods of candidate name generation which are specifically
tailored to microposts. We also experimented with multiple
approaches to named entity recognition, entity linking, and entity
clustering and presented comparisons of the performance of the
different methods.
      </p>
    </sec>
    <sec id="sec-12">
      <title>6.   ACKNOWLEDGEMENTS</title>
      <p>We would like to thank Bernadette Johnson and Joseph Campbell
for their ongoing support and guidance. We would also like to
thank Michael Yee and Arjun Majumdar for their support with
MITIE.</p>
    </sec>
    <sec id="sec-13">
      <title>7.   REFERENCES</title>
      <p>[7] D. King, "MITLL/MITIE," [Online]. Available:
https://github.com/mit-nlp/MITIE.</p>
      <p>[8] A. Ritter, S. Clark, Mausam and O. Etzioni, "Named Entity
Recognition in Tweets: An Experimental Study," in EMNLP,
2011.</p>
      <p>[12] P. F. Brown, "Class-based n-gram models of natural
language," Computational linguistics, vol. 18, no. 4, pp.
467-479, 1992.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Rizzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>van Erp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Plu</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Troncy</surname>
          </string-name>
          .
          <article-title>Making Sense of Microposts (#Microposts2016) Named Entity rEcognition and Linking (NEEL) Challenge</article-title>
          . in #Microposts2016, pp.
          <fpage>50</fpage>
          -
          <lpage>59</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ritter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Clark</surname>
          </string-name>
          and
          <string-name>
            <given-names>O.</given-names>
            <surname>Etzioni</surname>
          </string-name>
          ,
          <article-title>"Named Entity Recognition in Tweets: An Experimental Study,"</article-title>
          <source>in EMNLP '11</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>I.</given-names>
            <surname>Yamada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Takeda</surname>
          </string-name>
          and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Takefuji</surname>
          </string-name>
          ,
          <article-title>"An End-to-End Entity Linking Approach for Tweets,"</article-title>
          <source>in #Microposts2015</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>E.</given-names>
            <surname>Meij</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Weerkamp</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>de Rijke</surname>
          </string-name>
          ,
          <article-title>"Adding semantics to microblog posts,"</article-title>
          <source>in Proceedings of the fifth ACM international conference on Web search and data mining</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Hoffart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Yosef</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Bordino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Furstenau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pinkal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Spaniol</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Taneva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Thater</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Weikum</surname>
          </string-name>
          ,
          <article-title>"Robust Disambiguation of Named Entities in Text,"</article-title>
          <source>in Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Finkel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Grenager</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <article-title>"Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling,"</article-title>
          <source>in Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005)</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>