<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Microposts</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1145/1235</article-id>
      <title-group>
        <article-title>Named Entity Linking in #Tweets with KEA</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jörg Waitelonis</string-name>
          <email>joerg.waitelonis@hpi.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Harald Sack</string-name>
          <email>harald.sack@hpi.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Hasso Plattner Institute</institution>
          <addr-line>Prof.-Dr.-Helmert Str. 2-3, 14482 Potsdam</addr-line>,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <volume>6</volume>
      <fpage>61</fpage>
      <lpage>63</lpage>
      <abstract>
        <p>This paper presents the KEA system at the #Microposts 2016 NEEL Challenge. Its task is to recognize and type mentions from English microposts and link them to their corresponding entries in DBpedia. For this task, we have adapted our Named Entity Disambiguation tool originally designed for natural language text to the special requirements of noisy, terse, and poorly worded tweets containing special functional terms and language.</p>
      </abstract>
      <kwd-group>
        <kwd>named entity linking</kwd>
        <kwd>disambiguation</kwd>
        <kwd>microposts</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        Microposts have become a highly popular medium for sharing
facts, opinions, and emotions. They provide an invaluable
real-time resource of data, ready to be mined for training
predictive models. However, the effectiveness of existing
analysis tools is seriously compromised when applied to
microposts, since Twitter messages are often noisy, terse,
poorly worded, and posted in many different languages. They
contain special functional expressions, such as usernames,
hashtags, retweets, abbreviations, and cyber-slang [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Moreover, Twitter, being the most popular micropost
service, follows a streaming paradigm that requires entities
to be recognized in real time.
      </p>
      <p>
        In this paper, we describe our approach to the
#Microposts 2016 NEEL challenge [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]: the adaptation of an existing Named Entity
Disambiguation system – KEA – originally designed for
processing natural language texts, to the special challenges
imposed by microposts.
      </p>
      <p>KEA originally implements a dictionary- and
knowledge-based approach to word sense disambiguation, i. e.
a co-occurrence analysis based on articles of the English
Wikipedia.</p>
    </sec>
    <sec id="sec-2">
      <title>2. THE KEA APPROACH</title>
      <p>To address the tasks of the #Microposts 2016 NEEL
challenge, we have adapted our NEL approach KEA. It was
originally configured to be applied to natural language text
and to combinations of textual metadata from heterogeneous
sources, e. g. metadata generated by automated multimedia
analysis or user-provided metadata such as tags, comments,
and discussions. All this metadata can differ in provenance,
reliability, trustworthiness, as well as level of abstraction.</p>
      <p>KEA uses DBpedia as a reference knowledge base for
entity linking and basically follows the five-stage approach
depicted in Fig. 1.</p>
    </sec>
    <sec id="sec-3">
      <title>2.1 Preprocessing</title>
      <p>
        The incoming text is processed by the following linguistic
pipeline. The Stanford Log-linear Tagger [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] as well as the
Stanford Named Entity Recognizer (NER) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] are applied to
determine part-of-speech tags as well as named entity types. Next,
an ASCII folding filter converts alphabetic, numeric, and
symbolic Unicode characters that are not in the
"Basic Latin" Unicode block into their ASCII equivalents, e. g.
"Ole Rømer" is transformed to "Ole Romer". Tokenization is
performed on non-alphanumeric characters, except for special
characters joining compound words, such as "-".
      </p>
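      <p>The folding and tokenization steps above can be sketched as follows. This is a minimal illustration, not KEA's actual implementation; the supplemental fold table is only a small, illustrative excerpt of what a full ASCII folding filter covers:</p>

```python
import re
import unicodedata

# Letters without a Unicode decomposition are folded via a supplemental
# map (illustrative excerpt; a full filter covers many more characters).
EXTRA_FOLDS = {"ø": "o", "Ø": "O", "ß": "ss", "æ": "ae", "Æ": "AE"}

def ascii_fold(text):
    """Convert characters outside the Basic Latin block to ASCII."""
    text = "".join(EXTRA_FOLDS.get(ch, ch) for ch in text)
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def tokenize(text):
    """Split on non-alphanumeric characters, but keep special characters
    joining compound words, such as the hyphen."""
    return re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", text)
```

      <p>Here ascii_fold("Ole Rømer") yields "Ole Romer", and the hyphen in a compound word survives tokenization.</p>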
      <p>The resulting list of tokens is fed into a shingle filter to
construct token n-grams from the token stream. For
example, the sentence "please divide this sentence into shingles"
might be tokenized into the 2-shingles "please divide", "divide
this", "this sentence", "sentence into", and "into shingles".
Usually, 3-shingles are created as a default. In the case of
a proper noun recognized by the NER, at most 5-shingles
are created with the ±2 surrounding tokens. This extension
also enables mapping longer compound proper names, such as
"John F. Kennedy Airport", which otherwise could not be mapped
correctly with a 3-shingle configuration. The token stream now
contains tokens with sole words, but also tokens with
"shingled" words.</p>
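      <p>The shingle construction above can be sketched as a simple n-gram builder over the token stream (a minimal illustration of the technique, not KEA's actual filter):</p>

```python
def shingles(tokens, n):
    """Build n-gram 'shingles' over a token stream."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "please divide this sentence into shingles".split()
bigrams = shingles(tokens, 2)
# By default 3-shingles would be built; around a proper noun recognized
# by the NER, up to 5-shingles over the +/-2 surrounding tokens.
```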
    </sec>
    <sec id="sec-4">
      <title>2.2 Candidate Mapping</title>
      <p>Every token is matched against a gazetteer, which has been
compiled from DBpedia entities' labels, redirect labels, and
disambiguation labels, each mapped to their appropriate
DBpedia entities. Since the gazetteer originally used in KEA
is based on DBpedia 3.9, entities and labels from the
DBpedia 2015-04 dataset have been added for the NEEL challenge.
Labels are indexed lowercase and finally mapped to the
tokens, resulting in a list of potential entity candidates for each
token. The mapping is obtained by exact matches only; a
normalization of simple plural forms is applied beforehand.
Hence, for each token of the token stream, a set of potential
entity candidates is determined.</p>
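      <p>The candidate mapping can be sketched as a lowercase exact-match lookup with a naive plural normalization. The gazetteer entries below are illustrative placeholders, not actual DBpedia data:</p>

```python
from collections import defaultdict

# Toy gazetteer: lowercased label to DBpedia entities; real entries
# come from entity labels, redirect labels, and disambiguation labels.
gazetteer = defaultdict(set)
for label, entity in [
    ("armstrong", "dbp:Neil_Armstrong"),
    ("armstrong", "dbp:Louis_Armstrong"),
    ("moon", "dbp:Moon"),
]:
    gazetteer[label].add(entity)

def candidates(token):
    """Exact-match lookup on the lowercase index; a simple plural
    normalization is tried if the literal form is unknown."""
    key = token.lower()
    if key not in gazetteer and key.endswith("s"):
        key = key[:-1]          # naive singular form
    return set(gazetteer.get(key, set()))
```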
    </sec>
    <sec id="sec-5">
      <title>2.3 Candidate Merging and Filtering</title>
      <p>To resolve possible overlaps of tokens, longer
tokens that are mapped successfully are preferred over
shorter ones. Since longer tokens contain more descriptive
terms, they are considered to be more specific. This means,
for example, that "new york city" is preferred over "new york"
and "york city". Furthermore, tokens are discarded if they
do not contain nouns or consist solely of stopwords; e. g. the token
"the times" will not be discarded, because it contains the
noun "times".</p>
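      <p>The preference for longer successfully mapped tokens can be sketched as a greedy selection over candidate spans (a simplification of the actual merging step):</p>

```python
def resolve_overlaps(spans):
    """Keep longer mapped spans and drop shorter overlapping ones.
    Each span is a (start, end, surface_form) triple over token indices."""
    chosen = []
    for span in sorted(spans, key=lambda s: s[1] - s[0], reverse=True):
        start, end, _ = span
        # two spans overlap iff each one starts before the other ends
        overlapping = any((end > s) and (e > start) for s, e, _ in chosen)
        if not overlapping:
            chosen.append(span)
    return sorted(chosen)

spans = [(0, 2, "new york"), (1, 3, "york city"), (0, 3, "new york city")]
```

      <p>Here the three-token span "new york city" wins, and both shorter overlapping spans are dropped.</p>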
    </sec>
    <sec id="sec-6">
      <title>2.4 Scoring (Feature Generation)</title>
      <p>For every entity candidate, features are determined via
a pipeline of analysis components (scorers). These
components assess different characteristics of how well a candidate
entity fits the given input text, which is considered as the
context. We distinguish between local and context-related
features. Local features only consider the properties of the
candidate as well as of the token. For example, consider the text
"Armstrong landed on earth's satellite": For a candidate,
w.l.o.g. "dbp:Neil_Armstrong", from the candidate list
of the token "Armstrong", certain features can be determined,
e. g. the string distance between the candidate labels and the
token (respectively the surface form), the candidate's link
graph popularity, its DBpedia type, the provenance of the
label that matches the surface form best (e. g. main label or
redirect label), or the level of ambiguity of the token (e. g.
approximated by the number of candidates).</p>
      <p>Context features assess the relation of a candidate entity
to the other candidates within the given context, e. g.
direct links to other context candidates in the DBpedia link
graph, co-occurrence of the other tokens' surface forms in
the corresponding Wikipedia article of the candidate under
consideration, co-references in Wikipedia articles, as well as
further graph-based features of the link graph induced by
all candidates of the context (context graph). This includes,
for example, graph distance measurements, connected
component analysis, or centrality and density observations.</p>
      <p>Overall, after this processing step, every candidate is
assigned a list of scores determined via several of the
mentioned methods. These lists of scores are considered
as the candidates' feature vectors, expressing how well a
candidate entity fits the given context.</p>
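      <p>A local feature vector for a single candidate might be sketched as follows. The feature names and the precomputed popularity input are illustrative assumptions; KEA's actual scorers are more elaborate:</p>

```python
from difflib import SequenceMatcher

def local_features(surface_form, candidate_label, popularity, n_candidates):
    """Local features consider only the candidate's and the token's
    own properties, not the surrounding context."""
    return {
        # string similarity between surface form and candidate label
        "string_sim": SequenceMatcher(None, surface_form.lower(),
                                      candidate_label.lower()).ratio(),
        # link-graph popularity, assumed to be precomputed elsewhere
        "popularity": popularity,
        # inverse of the token's ambiguity (number of candidates)
        "ambiguity": 1.0 / n_candidates,
    }

feats = local_features("Armstrong", "Neil Armstrong", 0.8, 4)
```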
    </sec>
    <sec id="sec-7">
      <title>2.5 Disambiguation</title>
      <p>Since all scores of the analyzed features have a positive
but unlimited value range, a linear feature scaling is applied
to standardize the ranges between 0.0 and 1.0. Different
approaches, ranging from statistical analysis to machine
learning techniques, can be envisaged to decide which candidate is
chosen as the winner for a token. The most basic approach
considers the weighted sum of the scores as a confidence
score, where the weights are optimized via grid search on
a given development or training dataset. The confidence
score is cut off by an empirically optimized threshold to
decide whether a candidate entity is to be selected as the
assumed correct result.</p>
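      <p>The basic weighted-sum decision can be sketched as follows; the weights and threshold values are placeholders, whereas in KEA they are optimized via grid search:</p>

```python
def min_max_scale(values):
    """Linear feature scaling of one feature column to [0.0, 1.0]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def pick_winner(candidates, weights, threshold):
    """Weighted sum of scaled scores as confidence; below the cut-off
    threshold no candidate is selected (a NIL annotation).
    `candidates` maps an entity to its raw feature dict."""
    names = list(weights)
    cols = {n: min_max_scale([f[n] for f in candidates.values()])
            for n in names}
    best, best_conf = None, threshold
    for i, entity in enumerate(candidates):
        conf = sum(weights[n] * cols[n][i] for n in names)
        if conf > best_conf:
            best, best_conf = entity, conf
    return best
```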
    </sec>
    <sec id="sec-8">
      <title>3. ADAPTATIONS TO THE NEEL CHALLENGE</title>
      <p>To be applicable to microposts as well, as in the NEEL
challenge, the KEA processing has been adapted in two ways.
We distinguish between modifications made especially for
the general domain of microposts/tweets and modifications
resulting from observations on the provided training
dataset.</p>
    </sec>
    <sec id="sec-10">
      <title>3.1 Adaptations to the Domain</title>
      <p>For the NEEL challenge, we have utilized characteristic
tweet information by excluding "@" and "#" from the
tokenization, to later identify Twitter user names and hashtags
properly. With respect to the provided NEEL challenge
annotation guidelines, the filter is extended to restrict
the system to tokens containing singular and plural proper
nouns, user names, and hashtags only. The stopword
list is extended with Twitter-specific functional terms (e. g.
"RT", "MT", etc.) to be ignored in further processing. KEA
is configured to consider a single micropost (tweet) as the
given context for disambiguation. Furthermore, the
threshold on the achieved confidence score is used to cut off
uncertain candidates, resulting in NIL annotations. Tokens
identified as user names or hashtags that cannot successfully be
mapped to candidate entities are also annotated with NIL.</p>
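      <p>The tweet-specific tokenization can be sketched as follows. "RT" and "MT" come from the text above; "via" in the stopword set is an assumed additional entry:</p>

```python
import re

# Twitter-specific functional terms extending the stopword list;
# "RT" and "MT" per the text, "via" is an assumed additional entry.
TWEET_STOPWORDS = {"rt", "mt", "via"}

def tweet_tokens(text):
    """Tokenize a micropost, keeping '@' and '#' attached so user
    names and hashtags can be identified properly later on."""
    raw = re.findall(r"[@#]?\w+(?:-\w+)*", text)
    return [t for t in raw if t.lower() not in TWEET_STOPWORDS]

toks = tweet_tokens("RT @nasa: #Apollo anniversary today")
# user names and hashtags that map to no entity are annotated NIL
```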
    </sec>
    <sec id="sec-11">
      <title>3.2 Adaptations to the Training Set</title>
      <p>From the provided training dataset, all surface forms have
been extracted to extend the gazetteer for candidate
mapping. We have optimized the scorer weights as well as the
overall threshold according to the results achieved on the
training and development datasets. Furthermore, the
stopword list has been extended according to these results,
i. e. with terms that were consistently mapped wrongly because
they are not annotated in the datasets, such as weekdays
and months.</p>
      <p>Since KEA did not support the required annotation with
types out of the box, a simple extension of the original
framework has been implemented. For a disambiguated, mapped
entity, type annotations are determined simply via a lookup in
the DBpedia instance types dataset. For NIL annotations,
where no entity could be determined, the corresponding NER
type, if available, is chosen.</p>
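      <p>The type annotation extension amounts to a lookup with a NER fallback; the instance-types excerpt below is an illustrative stand-in for the actual DBpedia dataset:</p>

```python
# Illustrative excerpt standing in for the DBpedia instance types dataset.
INSTANCE_TYPES = {
    "dbp:Kylo_Ren": "Character",
    "dbp:The_Wall_Street_Journal": "Organization",
}

def annotate_type(entity, ner_type=None):
    """Type via instance-types lookup for a disambiguated entity;
    for NIL annotations (entity is None) fall back to the NER type."""
    if entity is None:
        return ner_type
    return INSTANCE_TYPES.get(entity, ner_type)
```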
    </sec>
    <sec id="sec-12">
      <title>4. EXPERIMENTS AND RESULTS</title>
      <p>For the #Microposts 2016 NEEL challenge, we have first
analyzed the provided development dataset without the
adaptations described above to obtain a baseline (cf. Table 1), and
then again with the NEEL challenge modifications (cf.
Table 2). As expected, the special adaptations for
the NEEL challenge have resulted in significantly better
results compared to the original tool configuration. A closer
inspection of the achieved mappings has shown that KEA
was able to find correct mappings to entities which are not
provided in the NEEL ground truth, e. g.:
#wcyb -&gt; dbp:WCYB-TV
#WSJ -&gt; dbp:The_Wall_Street_Journal
#NSC -&gt; dbp:National_Security_Council
#kyloren -&gt; dbp:Kylo_Ren</p>
      <p>Compared to the training data ground truth, the KEA
system tends to detect mentions overeagerly, i. e. the system
produces more extra annotations than missing annotations,
which results in a loss of precision. Many of KEA’s extra
annotations are common nouns such as affirmative action,
astronaut, petition, signature, mosque, emoji, enemy.</p>
      <p>For the task of NEL on microposts, it is a challenge to
maintain the topicality of the underlying knowledge base.
New hashtags, neologisms, as well as cyber-slang are rather
difficult to resolve correctly in an automated way, because
they are not present in the dictionaries. To cope with this
situation, one possibility would be to include a live analysis
of the Wikipedia update stream to extend or prioritize the
used dictionary of surface forms as well as the underlying
link graph.</p>
      <p>From our observations, a significant part of the achieved
improvements results from the fact that the training and
test sets cover identical domains (i. e. Star Wars and
Donald Trump). Hence, the extension of the dictionary with
surface forms from the training dataset seems to be very
effective. The conclusion is that domain adaptation of a given
general-purpose system might lead to significantly better
results. Even if this sounds trivial, we did not expect an
improvement of ca. 40% in f-measure.</p>
      <p>Unfortunately, many documents of the training dataset
(1951 out of 6024) do not have any annotations at all.
Therefore, we are looking forward to future NEEL challenges with
more complete ground truth datasets.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] <string-name><given-names>J. R.</given-names> <surname>Finkel</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Grenager</surname></string-name>, and
          <string-name><given-names>C. D.</given-names> <surname>Manning</surname></string-name>.
          <article-title>Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling</article-title>.
          In <source>Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005)</source>,
          pages <fpage>363</fpage>-<lpage>370</lpage>, <year>2005</year>.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2] <string-name><given-names>B.</given-names> <surname>Han</surname></string-name> and
          <string-name><given-names>T.</given-names> <surname>Baldwin</surname></string-name>.
          <article-title>Lexical Normalisation of Short Text Messages: Makn Sens a #Twitter</article-title>.
          In <source>Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1</source>,
          HLT '<volume>11</volume>, pages <fpage>368</fpage>-<lpage>378</lpage>, Stroudsburg, PA, USA, <year>2011</year>.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3] <string-name><given-names>G.</given-names> <surname>Rizzo</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>van Erp</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Plu</surname></string-name>, and
          <string-name><given-names>R.</given-names> <surname>Troncy</surname></string-name>.
          <article-title>Making Sense of Microposts (#Microposts2016) Named Entity rEcognition and Linking (NEEL) Challenge</article-title>.
          In D. Preoţiuc-Pietro, D. Radovanović, A. E. Cano-Basave, K. Weller, and A.-S. Dadzie, editors,
          <source>6th Workshop on Making Sense of Microposts (#Microposts2016)</source>,
          pages <fpage>50</fpage>-<lpage>59</lpage>, <year>2016</year>.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4] <string-name><given-names>H.</given-names> <surname>Sack</surname></string-name>.
          <article-title>The Journey is the Reward - Towards New Paradigms in Web Search</article-title>.
          In <source>Business Information Systems Workshops: BIS 2015 International Workshops, Poznań, Poland, June 24-26, 2015, Revised Papers</source>,
          pages <fpage>15</fpage>-<lpage>26</lpage>. Springer International Publishing, Cham, <year>2015</year>.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5] <string-name><given-names>K.</given-names> <surname>Toutanova</surname></string-name> and
          <string-name><given-names>C. D.</given-names> <surname>Manning</surname></string-name>.
          <article-title>Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-speech Tagger</article-title>.
          In <source>Proceedings of the 2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora: Held in Conjunction with the 38th Annual Meeting of the Association for Computational Linguistics - Volume 13</source>,
          EMNLP '<volume>00</volume>, pages <fpage>63</fpage>-<lpage>70</lpage>, Stroudsburg, PA, USA, <year>2000</year>. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6] <string-name><given-names>R.</given-names> <surname>Usbeck</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Röder</surname></string-name>,
          <string-name><given-names>A.-C.</given-names> <surname>Ngonga Ngomo</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Baron</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Both</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Brümmer</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Ceccarelli</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Cornolti</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Cherix</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Eickmann</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Ferragina</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Lemke</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Moro</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Navigli</surname></string-name>,
          <string-name><given-names>F.</given-names> <surname>Piccinno</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Rizzo</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Sack</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Speck</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Troncy</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Waitelonis</surname></string-name>, and
          <string-name><given-names>L.</given-names> <surname>Wesemann</surname></string-name>.
          <article-title>GERBIL - General Entity Annotation Benchmark Framework</article-title>.
          In <source>Proceedings of the 24th International Conference on World Wide Web (WWW '15)</source>,
          pages <fpage>1133</fpage>-<lpage>1143</lpage>. ACM, USA, <year>2015</year>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>