<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Microposts</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>E2E: An End-to-End Entity Linking System for Short and Noisy Text</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ming-Wei Chang</string-name>
          <email>minchang@microsoft.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bo-June (Paul) Hsu</string-name>
          <email>paulhsu@microsoft.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hao Ma</string-name>
          <email>haoma@microsoft.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ricky Loynd</string-name>
          <email>riloynd@microsoft.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kuansan Wang</string-name>
          <email>kuansanw@microsoft.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Microsoft Research</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <volume>4</volume>
      <fpage>4</fpage>
      <lpage>5</lpage>
      <abstract>
        <p>We present E2E, an end-to-end entity linking system designed for the short and noisy text found in microblogs and text messages. Mining and extracting entities from short text is an essential step for many content analysis applications. By jointly optimizing entity recognition and disambiguation as a single task, our system can process short and noisy text robustly.</p>
      </abstract>
      <kwd-group>
        <kwd>Information Extraction</kwd>
        <kwd>Social Media</kwd>
        <kwd>Entity Linking</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        In this paper, we describe our entity linking system called
E2E for the #Microposts2014 NEEL Challenge [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Our
system focuses on the task of extracting and linking
entities from short and noisy text given entity databases like
Wikipedia or Freebase. An entity linking system usually
needs to perform two key functions: mention recognition
and entity disambiguation. In mention recognition the
system identifies each mention (surface form) of an entity in
the text. In entity disambiguation, the system maps
mentions to canonical entities. E2E has been carefully designed
to treat entity recognition and disambiguation as a single
task.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. THE ARCHITECTURE OF E2E</title>
      <sec id="sec-2-1">
        <title>Text Normalization.</title>
        <p>In this stage, a short message is normalized and tokenized.
For tweets, the retweet symbols and some other special
symbols are removed. Punctuation symbols are represented as
separate tokens in general.</p>
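        <p>As an illustration, the normalization step above can be sketched in
a few lines (a toy Python sketch; the actual system is implemented in C#,
and the retweet pattern and punctuation list here are assumptions, not
E2E's exact rules):</p>

```python
import re

def normalize_tweet(text):
    """Toy tweet normalizer: strip the retweet marker and split
    punctuation into separate tokens (an illustrative approximation)."""
    # Remove the retweet marker "RT".
    text = re.sub(r"\bRT\b", " ", text)
    # Put spaces around punctuation so each symbol becomes its own token.
    text = re.sub(r"([.,!?:;()\"])", r" \1 ", text)
    return text.split()

print(normalize_tweet('RT @user: giants win!'))
# prints ['@user', ':', 'giants', 'win', '!']
```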
        <p>Copyright 2014 held by author(s)/owner(s); copying permitted
only for private and academic purposes. Published as part of the
#Microposts2014 Workshop proceedings, available online as CEUR
Vol-1141 (http://ceur-ws.org/Vol-1141). #Microposts2014, April 7th, 2014,
Seoul, Korea.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Candidate Generation.</title>
        <p>The next step is to generate a list of surface form
candidates that could potentially link to entities. E2E uses
a lexicon to generate the candidate surface forms. A
lexicon is a dictionary that maps a surface form to its
possible entity set. For example, the word “giants” could
refer to “New York Giants”, or “San Francisco Giants”, etc.
Our lexicon is mainly built by extracting information
from Wikipedia and Freebase. The dictionary is constructed
to support fuzzy mention matching based on edit distance.
Note that we over-generate candidates at this stage, and no
filtering is performed.</p>
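        <p>A minimal sketch of the candidate generation stage (the miniature
lexicon and the edit-distance threshold are illustrative assumptions, not
the actual E2E resources):</p>

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

# Hypothetical miniature lexicon: surface form -> possible entity set.
LEXICON = {
    "giants": {"New_York_Giants", "San_Francisco_Giants"},
    "seattle": {"Seattle"},
}

def candidate_entities(token, max_dist=1):
    """Over-generate candidates by fuzzy lexicon lookup; no filtering."""
    out = set()
    for form, entities in LEXICON.items():
        if max_dist >= edit_distance(token.lower(), form):
            out.update(entities)
    return out
```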
      </sec>
      <sec id="sec-2-3">
        <title>Joint Recognition and Disambiguation.</title>
        <p>This stage is the key component of the E2E framework.
Given a message, the goal here is to figure out the entity
assignment of each candidate mention generated from
previous stages. Note that a candidate mention may be rejected
altogether (mapped to the null entity).</p>
        <p>Our model is based on a supervised learning method.
Given a message m and a candidate mention a, the entity
assignment is generated from the ranking of all possible
entities in the entity set E(a).</p>
        <p>arg max f (Φ(m, a, e)),
e ∈ E(a) ∪ {∅}
(1)
where f is the function of the model, and Φ is a feature
function over the input m, the mention a and the candidate
output e. Note that it is quite likely that E2E rejects a candidate
and does not link it to an entity (i.e., links a to ∅). The joint
approach that recognizes and disambiguates entity mentions
together is crucial for E2E to properly link surface forms to
the corresponding entities.</p>
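        <p>A toy sketch of the ranking in equation (1): None stands in for the
null entity ∅, and a simple word-overlap scorer stands in for the learned
model f (both are illustrative assumptions):</p>

```python
def link(message, mention, entity_set, score):
    """Return the arg max of score over the entity set plus the null
    entity (represented as None), as in equation (1)."""
    best, best_score = None, score(message, mention, None)
    for e in sorted(entity_set):
        s = score(message, mention, e)
        if s > best_score:
            best, best_score = e, s
    return best

def toy_score(message, mention, entity):
    """Stand-in for f(Phi(m, a, e)): count words shared between the
    message and the entity name; give rejection a fixed baseline."""
    if entity is None:
        return 0.5
    words = set(message.lower().split())
    return len(words.intersection(entity.lower().split("_")))
```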
      </sec>
      <sec id="sec-2-4">
        <title>Overlap Resolution.</title>
        <p>At this point, many of the linked mentions will overlap
each other. Dynamic programming resolves these conflicts
by choosing the best-scoring set of non-overlapping
mention-entity mappings. The experimental results show that
resolving overlaps improves the model's performance consistently in
different settings.</p>
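        <p>The overlap resolution step is an instance of weighted interval
scheduling; a minimal sketch (the span tuple format and the scores are
illustrative):</p>

```python
import bisect

def resolve_overlaps(mentions):
    """Pick the best-scoring set of non-overlapping mention-entity
    spans. Each mention is (start, end, score, entity), with a
    half-open [start, end) token span."""
    mentions = sorted(mentions, key=lambda m: m[1])   # sort by end
    ends = [m[1] for m in mentions]
    n = len(mentions)
    best = [0.0] * (n + 1)   # best[i]: best total score among first i spans
    back = [None] * (n + 1)  # backpointer: prefix index if span i-1 is taken
    for i in range(1, n + 1):
        start = mentions[i - 1][0]
        score = mentions[i - 1][2]
        # Number of earlier spans ending at or before this span's start.
        j = bisect.bisect_right(ends, start, 0, i - 1)
        if best[j] + score > best[i - 1]:
            best[i] = best[j] + score
            back[i] = j
        else:
            best[i] = best[i - 1]
    picked, i = [], n
    while i > 0:
        if back[i] is None:
            i -= 1               # span i-1 was skipped
        else:
            picked.append(mentions[i - 1])
            i = back[i]          # jump to the compatible prefix
    return picked[::-1]
```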
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. SYSTEM IMPLEMENTATION</title>
      <p>Our database is constructed from both Wikipedia and
Freebase. The whole system is implemented in C#.</p>
      <p>
        Entity linking systems often require a large amount of
memory due to the size of the structured/unstructured data
for many entities. High memory consumption restricts the
scale of an entity linking system, limiting the number of
entities that can be handled. Long loading times also
reduce the efficiency of conducting experiments. In E2E, we
adopt the completion trie data structure proposed in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]
instead of a hash map dictionary. The completion trie greatly
reduces the memory footprint and loading time of E2E.
      </p>
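      <p>To illustrate the memory intuition, a plain prefix trie stores shared
prefixes only once; the completion trie of [4] is a far more compact,
succinct variant of this idea (the sketch below is not that data
structure):</p>

```python
def build_trie(surface_forms):
    """Plain prefix trie as nested dicts; "$" marks end of word."""
    root = {}
    for form in surface_forms:
        node = root
        for ch in form:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root

def trie_contains(trie, form):
    """Look up a surface form by walking the trie character by character."""
    node = trie
    for ch in form:
        if ch not in node:
            return False
        node = node[ch]
    return "$" in node
```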
      <p>
        We have tested two learning methods when developing
E2E: a structured support vector machine algorithm [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and
a fast implementation of the MART gradient boosting
algorithm [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The structural SVM model is a linear model that
takes into account all of the candidates together in the same
tweet. MART learns an ensemble of decision/regression
trees with scalar values at the leaves, but treats each
candidate separately. The submitted results are generated using
MART due to its superior performance on our development
set.
      </p>
      <sec id="sec-3-1">
        <title>Features.</title>
        <p>Three groups of features were used in our system. The
textual features are the features regarding the textual
properties of the surface form and its context. For example,
one feature indicates if the current surface form and the
surrounding words are capitalized or not. We also use
features generated from the output of the in-house named
entity recognition system that is specially designed to be
robust on non-capitalized words. The entity graph features
capture the semantic cohesiveness between the entity-entity
and entity-mention pairs. This group of features was mainly
calculated using the entity database and its structured data.
Finally, the statistical features indicate the word usage and
entity popularity using the information collected from the
web.</p>
        <p>Among the three feature groups, the statistical feature
group is the most important one. We describe some of the
most important features below. Let a denote the
surface form of a candidate, and e denote an entity.
One important feature is the link probability feature Pl(a),
which indicates the probability that a phrase is used as an
anchor in Wikipedia. For each phrase a, we also collect
statistics about the probability that a phrase is capitalized
in Wikipedia. We refer to this feature as the capitalization
rate feature, Pc(a).</p>
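        <p>A toy sketch of how Pl(a) and Pc(a) could be estimated from
occurrence records (the record format is a hypothetical stand-in for
Wikipedia-scale counts):</p>

```python
from collections import Counter

def anchor_statistics(records):
    """Estimate Pl(a) (link probability) and Pc(a) (capitalization
    rate) from (phrase, is_anchor, is_capitalized) occurrences."""
    total, anchor, capped = Counter(), Counter(), Counter()
    for phrase, is_anchor, is_cap in records:
        key = phrase.lower()
        total[key] += 1
        anchor[key] += is_anchor   # booleans count as 0/1
        capped[key] += is_cap
    return {a: (anchor[a] / total[a], capped[a] / total[a]) for a in total}
```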
        <p>We also compute features that capture the relationships
between an anchor a and an entity e. The probability P (e|a)
captures the likelihood of an anchor linking to a Wikipedia
entity. We have downloaded Wikipedia page view counts,
representing page view information from 2012. Using
this popularity information, we add another probability
feature that captures the relative popularity of the pages
that could be linked from the anchor a. More precisely,
Pv(ei|a) = v(ei)/(Σe∈E(a)∪{∅} v(e)), where v(e) represents
the view count for the page e.</p>
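        <p>A minimal sketch of the popularity feature Pv, assuming v(∅) = 0 so
the denominator effectively sums over E(a) (the view counts below are made
up):</p>

```python
def view_popularity(anchor_entities, view_counts):
    """Pv(e|a): each candidate page's share of the total view count
    over all pages the anchor a could link to."""
    total = sum(view_counts.get(e, 0) for e in anchor_entities)
    if total == 0:
        return {}
    return {e: view_counts.get(e, 0) / total for e in anchor_entities}

pv = view_popularity({"New_York_Giants", "San_Francisco_Giants"},
                     {"New_York_Giants": 300, "San_Francisco_Giants": 100})
# pv["New_York_Giants"] is 0.75, pv["San_Francisco_Giants"] is 0.25
```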
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. RESULTS</title>
      <p>In our experiments, we split the training set into two sets
that contain 1534 and 800 tweets, respectively. The
800-tweet set is used as our development set. Our analysis
shows that robust mention detection is often the source of
errors in current entity linking systems. In order to
achieve a better F1 score, we change the prediction function
to
arg max f (Φ(m, a, e)) − s[e = ∅],
e ∈ E(a) ∪ {∅}
(2)
where [·] is an indicator function. When s increases, the
system produces more entities. From the results in Figure 1,
we found that tuning s does impact results significantly.
After the learning parameters and the desired value of s are chosen,
we retrain E2E using the full training data and
generate final results with s = 0, 2.5, and 3.5, respectively.</p>
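      <p>The role of the penalty s in equation (2) can be sketched directly:
subtracting s from the null entity's score makes the system emit more
non-null links as s grows (the scores here are illustrative):</p>

```python
def link_with_bias(entity_scores, null_score, s):
    """Equation (2): arg max of f minus a penalty s applied only to
    the null entity; returns None when rejection still wins."""
    if not entity_scores:
        return None
    best = max(sorted(entity_scores), key=entity_scores.get)
    # The null entity's effective score is null_score - s.
    if entity_scores[best] >= null_score - s:
        return best
    return None

# With s = 0 the null entity wins below; raising s to 1.0 flips the decision.
```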
      <sec id="sec-4-1">
        <title>Error Analysis.</title>
        <p>We analyze our results on the development set with
s = 3.5. In the development set, there are 1304 mentions,
and E2E generates a total of 18746 candidates in the
candidate generation stage. Our error analysis shows that
E2E misses 340 entity mentions and predicts 284 extra
mentions. Among the errors, E2E has trouble with the
“number” entities (e.g. 1_(number)). Further investigation
reveals that the tokenization choice of E2E plays a big part in
these errors, given that most punctuation marks are treated
as separate tokens. Interestingly, there are only 44 cases
where E2E correctly recognizes the mention but links to the
wrong entity. Most errors occur when E2E fails to recognize
mentions correctly.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. CONCLUSIONS</title>
      <p>In this paper, we presented E2E, a system that performs
joint entity recognition and disambiguation on short and
noisy text. We found that the success of an entity
linking system hinges on combining all of its
components effectively.</p>
      <p>Due to time constraints, the submitted system still has
plenty of room for improvement. For example, one important
direction is to explore the relationships between different
tweets to improve entity linking results. Developing a
robust mention detection algorithm is an important research
direction as well.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A. E. Cano</given-names>
            <surname>Basave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Rizzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Varga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rowe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Stankovic</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.-S.</given-names>
            <surname>Dadzie</surname>
          </string-name>
          .
          <article-title>Making Sense of Microposts (#Microposts2014) Named Entity Extraction &amp; Linking Challenge</article-title>
          .
          <source>In Proc., #Microposts2014</source>
          , pages
          <fpage>54</fpage>
          -
          <lpage>60</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          and W.-T. Yih.
          <article-title>Dual coordinate descent algorithms for efficient large margin structured prediction</article-title>
          .
          <source>TACL</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Friedman</surname>
          </string-name>
          .
          <article-title>Greedy function approximation: A gradient boosting machine</article-title>
          .
          <source>Annals of Statistics</source>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B.-J. P.</given-names>
            <surname>Hsu</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Ottaviano</surname>
          </string-name>
          .
          <article-title>Space-efficient data structures for top-k completion</article-title>
          .
          <source>In WWW</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>