                    Named Entity Linking in #Tweets with KEA

                                         Jörg Waitelonis, Harald Sack
                                            Hasso-Plattner-Institute
                              Prof.-Dr.-Helmert Str. 2-3, 14482 Potsdam, Germany
                                   {joerg.waitelonis|harald.sack}@hpi.de



ABSTRACT
This paper presents the KEA system at the #Microposts 2016 NEEL Challenge. The task is to recognize and type mentions in English microposts and to link them to their corresponding entries in DBpedia. For this task, we have adapted our named entity disambiguation tool, originally designed for natural language text, to the special requirements of noisy, terse, and poorly worded tweets containing special functional terms and language.

Keywords
named entity linking, disambiguation, microposts

1. INTRODUCTION
Microposts have become a highly popular medium to share facts, opinions, or emotions. They provide an invaluable real-time resource of data, ready to be mined for training predictive models. However, the effectiveness of existing analysis tools faces critical challenges when applied to microposts: Twitter (http://twitter.com/) messages are often noisy, terse, poorly worded, and posted in many different languages. They contain special functional expressions such as usernames, hashtags, retweets, abbreviations, and cyber-slang [2]. Moreover, Twitter, being the most popular micropost service, follows a streaming paradigm that requires entities to be recognized in real time.

In this paper, we describe our approach to the #Micropost 2016 NEEL challenge [3]: the adaptation of an existing named entity disambiguation system, KEA, originally designed for processing natural language texts, to the special challenges imposed by microposts.

KEA originally implements a dictionary- and knowledge-based approach to word sense disambiguation: co-occurrence analysis based on articles of the English Wikipedia (http://wikipedia.org/) is combined with a link-graph analysis on the Wikipedia hyperlink graph and the DBpedia (http://dbpedia.org/) knowledge base. The basic principles of KEA's named entity linking are summarized in [4]. A comparison of KEA with other state-of-the-art named entity linking systems is provided in [6].

In the subsequent sections, KEA is introduced in more detail, followed by the adaptations made especially for the NEEL challenge and the achieved results.

Copyright © 2016 held by author(s)/owner(s); copying permitted only for private and academic purposes.
Published as part of the #Microposts2016 Workshop proceedings, available online as CEUR Vol-1691 (http://ceur-ws.org/Vol-1691)
#Microposts2016, Apr 11th, 2016, Montréal, Canada.
ACM ISBN 978-1-4503-2138-9.
DOI: 10.1145/1235

2. THE KEA APPROACH
To address the tasks of the #Micropost 2016 NEEL challenge, we have adapted our NEL approach KEA. It was originally configured to be applied to natural language text and to combinations of textual metadata from heterogeneous sources, such as metadata generated by automated multimedia analysis or user-provided metadata such as tags, comments, and discussions. All this metadata can be of different provenance, reliability, and trustworthiness, as well as level of abstraction.

KEA uses DBpedia as the reference knowledge base for entity linking and basically follows the five-stage approach depicted in Fig. 1.

2.1 Preprocessing
The incoming text is processed by the following linguistic pipeline. The Stanford Log-linear Tagger [5] and the Stanford Named Entity Recognizer [1] (NER) are applied to determine part-of-speech and named entity types. Next, an ASCII folding filter converts alphabetic, numeric, and symbolic Unicode characters that are not in the "Basic Latin" Unicode block into their ASCII equivalents; e. g., "Ole Rømer" is transformed to "Ole Romer". Tokenization is performed on non-word characters, except for special characters joining compound words, such as "-".

The resulting list of tokens is fed into a shingle filter to construct token n-grams from the token stream. For example, the sentence "please divide this sentence into shingles" might be tokenized into the 2-shingles "please divide", "divide this", "this sentence", "sentence into", and "into shingles". Usually, 3-shingles are created by default. In the case of a proper noun recognized by the NER, at most 5-shingles are created with the ±2 surrounding tokens. This extension makes it possible to also map longer compound proper names, such as "John F. Kennedy Airport", which could not be mapped correctly with a 3-shingle configuration. The token stream now contains tokens with sole words, but also tokens with "shingled" words.
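The shingle filter described above can be sketched as follows. This is a minimal illustration of token n-gram construction, not KEA's actual implementation; the function name is ours.

```python
def shingles(tokens, n):
    """Build token n-grams ('shingles') from a token stream."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "please divide this sentence into shingles".split()
print(shingles(tokens, 2))
# ['please divide', 'divide this', 'this sentence', 'sentence into', 'into shingles']
```

With the default configuration, the paper's pipeline would add 2- and 3-shingles to the stream of single-word tokens, and widen this to 5-shingles around NER-recognized proper nouns.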




· #Microposts2016 · 6th Workshop on Making Sense of Microposts · @WWW2016
Figure 1: The overall NEL process: Preprocessing (tokenizing, POS-tagging, …) → Candidate Mapping → Candidate Merging and Filtering → Scoring (Feature Generation) → Disambiguation.


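The five stages of Fig. 1 can be sketched as a toy end-to-end chain. All data and stage logic below are simplified placeholders of our own (POS/NER tagging and feature scoring are omitted), not KEA's actual components.

```python
GAZETTEER = {  # lowercased surface form -> candidate entities (toy data)
    "armstrong": ["dbp:Neil_Armstrong", "dbp:Louis_Armstrong"],
    "new york city": ["dbp:New_York_City"],
    "new york": ["dbp:New_York", "dbp:New_York_City"],
}

def preprocess(text):
    """Tokenize and build 1- to 3-shingles (POS/NER tagging omitted)."""
    words = text.lower().split()
    return [" ".join(words[i:i + n])
            for n in (3, 2, 1)
            for i in range(len(words) - n + 1)]

def map_candidates(tokens):
    """Exact-match gazetteer lookup per token."""
    return {t: GAZETTEER[t] for t in tokens if t in GAZETTEER}

def merge_and_filter(mapped):
    """Prefer longer (more specific) tokens over overlapping shorter ones."""
    kept = {}
    for token in sorted(mapped, key=len, reverse=True):
        if not any(token in longer for longer in kept):
            kept[token] = mapped[token]
    return kept

def disambiguate(merged):
    """Pick the first candidate as a stand-in for feature-based scoring."""
    return {t: cands[0] for t, cands in merged.items()}

print(disambiguate(merge_and_filter(map_candidates(
    preprocess("Armstrong visited New York City")))))
```

In this sketch, "new york" is discarded in favor of the longer match "new york city", mirroring the merging step; real disambiguation would rank candidates by their feature scores instead of taking the first one.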
2.2 Candidate Mapping
Every token is mapped against a gazetteer that has been compiled from DBpedia entities' labels, redirect labels, and disambiguation labels, each mapped to the appropriate DBpedia entity. Since the gazetteer originally used in KEA is based on DBpedia 3.9, entities and labels from the DBpedia 2015-04 dataset were added for the NEEL challenge. Labels are indexed in lowercase and finally mapped to the tokens, resulting in a list of potential entity candidates for each token. The mapping is obtained by exact matches only; a normalization of simple plural forms is applied beforehand. Hence, for each token of the token stream, a set of potential entity candidates is determined.

2.3 Candidate Merging and Filtering
To resolve possible overlaps of tokens, longer tokens that are mapped successfully are preferred over shorter ones. Since longer tokens contain more descriptive terms, they are considered to be more specific. This means, for example, that "new york city" is preferred over "new york" and "york city". Furthermore, tokens are discarded if they do not contain nouns or consist solely of stopwords; the token "the times", for instance, will not be discarded, because it contains the noun "times".

2.4 Scoring (Feature Generation)
For every entity candidate, features are determined via a pipeline of analysis components (scorers). These components assess different characteristics of how well a candidate entity fits the given input text, which is considered the context. We distinguish between local and context-related features. Local features only consider the properties of the candidate and the token. For example, consider the text "Armstrong landed on earth's satellite": for a candidate such as "dbp:Neil_Armstrong" from the possible candidate list of the token "Armstrong", certain features can be determined, e. g., the string distance between the candidate labels and the token (i. e. the surface form), the candidate's link-graph popularity, its DBpedia type, the provenance of the label that matches the surface form best (e. g. main label or redirect label), or the level of ambiguity of the token (e. g. approximated by the number of candidates).

Context features assess the relation of a candidate entity to the other candidates within the given context, e. g., direct links to other context candidates in the DBpedia link graph, co-occurrence of the other tokens' surface forms in the corresponding Wikipedia article of the candidate under consideration, co-references in Wikipedia articles, as well as further graph-based features of the link graph induced by all candidates of the context (the context graph). This includes, for example, graph distance measurements, connected component analysis, or centrality and density observations.

Overall, after this processing step, every candidate is assigned a list of scores determined via several of the mentioned methods. These lists of scores are considered the candidates' feature vectors, expressing how well a candidate entity fits the given context.

2.5 Disambiguation
Since all scores of the analyzed features have a positive but unbounded value range, a linear feature scaling is applied to standardize the ranges to between 0.0 and 1.0. Different approaches, ranging from statistical analysis to machine learning techniques, can be envisaged to decide which candidate is chosen as the winner for a token. The most basic approach considers the weighted sum of the scores as a confidence score, where the weights are optimized via grid search on a given development or training dataset. The confidence score is cut off by an empirically optimized threshold to decide whether a candidate entity is selected as the assumed correct result.

3. ADAPTATIONS TO THE NEEL CHALLENGE
To be applicable to microposts as in the NEEL challenge, the KEA processing has been adapted in two ways. We distinguish between modifications made especially for the general domain of microposts/tweets and modifications resulting from observations of the provided training dataset.

3.1 Adaptations to the Domain
For the NEEL challenge, we have utilized characteristic tweet information by excluding "@" and "#" from the tokenization in order to later identify Twitter user names and hashtags properly. With respect to the provided NEEL challenge annotation guidelines, the filter is extended to restrict the system to tokens containing singular and plural proper nouns, user names, and hashtags only. The stopword list is extended with Twitter-specific functional terms (e. g. "RT", "MT", etc.) to be ignored in further processing. KEA is configured to consider a single micropost (tweet) as the given context for disambiguation. Furthermore, the threshold on the achieved confidence score is used to cut off uncertain candidates, resulting in NIL annotations. Tokens identified as a user name or hashtag that cannot be successfully mapped to candidate entities are also annotated with NIL.

3.2 Adaptations to the Training Set
From the provided training dataset, all surface forms have been extracted to extend the gazetteer for candidate mapping. We have optimized the scorer weights as well as the overall threshold according to the results achieved on the training and development datasets. Furthermore, the stopword list has been extended accordingly, i. e. with terms that were constantly mapped wrongly because they are not annotated in the datasets, such as weekdays and months.
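The weighted-sum confidence score with threshold cutoff described in Section 2.5, together with the NIL cutoff of Section 3.1, can be sketched as follows. The feature values, weights, and threshold are made-up illustrative numbers, not KEA's tuned parameters.

```python
def minmax_scale(values):
    """Linearly scale raw feature scores into the range [0.0, 1.0]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def confidence(features, weights):
    """Weighted sum of (already scaled) feature scores."""
    return sum(w * f for w, f in zip(weights, features))

def decide(candidates, weights, threshold):
    """Return the best-scoring candidate, or 'NIL' if below the cutoff."""
    best = max(candidates, key=lambda c: confidence(c["features"], weights))
    if confidence(best["features"], weights) >= threshold:
        return best["entity"]
    return "NIL"

candidates = [
    {"entity": "dbp:Neil_Armstrong",  "features": [0.9, 0.8, 0.7]},
    {"entity": "dbp:Louis_Armstrong", "features": [0.9, 0.2, 0.1]},
]
weights = [0.2, 0.5, 0.3]  # in KEA, weights are optimized via grid search
print(decide(candidates, weights, threshold=0.5))  # dbp:Neil_Armstrong
```

Raising the threshold trades recall for precision: with a sufficiently high cutoff the same input yields NIL, which is how uncertain mentions are annotated for the challenge.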



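The tweet-aware tokenization of Section 3.1, which keeps "@" and "#" attached so that user names and hashtags survive tokenization and Twitter-specific functional terms are dropped, can be sketched as follows. The regular expression and the stopword subset are our own illustrative choices.

```python
import re

# Twitter-specific functional terms treated as stopwords (illustrative subset).
TWEET_STOPWORDS = {"rt", "mt"}

def tweet_tokens(text):
    """Tokenize, keeping '@' and '#' prefixes and hyphenated compounds intact."""
    tokens = re.findall(r"[@#]?\w+(?:-\w+)*", text)
    return [t for t in tokens if t.lower() not in TWEET_STOPWORDS]

print(tweet_tokens("RT @NASA: #kyloren is trending"))
# ['@NASA', '#kyloren', 'is', 'trending']
```

Tokens beginning with "@" or "#" that cannot be mapped to any candidate entity would then be annotated with NIL, as described above.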
3.3 Types
Since KEA did not support the required annotation with types out of the box, a simple extension of the original framework has been implemented. For a disambiguated mapped entity, type annotations are determined simply via a lookup in the DBpedia instance-types dataset. For NIL annotations, where no entity could be determined, the corresponding NER type, if available, is chosen.

4. EXPERIMENTS AND RESULTS
For the #Microposts 2016 NEEL challenge, we first analyzed the provided development dataset without the above-described adaptations to obtain a baseline (cf. Table 1), and then again with the NEEL challenge modifications (cf. Table 2).

Table 1: Results for the NEEL2016 development data set (baseline, without modifications)
  Measure                      Prec.   Recall   F1 score
  strong link match            0.399   0.490    0.440
  strong typed mention match   0.232   0.213    0.222
  mention ceaf                 0.611   0.562    0.586

Table 2: Results for the NEEL2016 development data set after adaptations and optimization
  Measure                      Prec.   Recall   F1 score
  strong link match            0.667   0.862    0.752
  strong typed mention match   0.572   0.660    0.613
  mention ceaf                 0.744   0.858    0.797

In line with our expectations, the special adaptations for the NEEL challenge resulted in significantly better results compared to the original tool configuration. A closer inspection of the achieved mappings has shown that KEA was able to find correct mappings to entities which are not provided in the NEEL ground truth, e. g.:

#wcyb     -> dbp:WCYB-TV
#WSJ      -> dbp:The_Wall_Street_Journal
#NSC      -> dbp:National_Security_Council
#kyloren  -> dbp:Kylo_Ren

Compared to the training data ground truth, the KEA system tends to detect mentions overeagerly, i. e. the system produces more extra annotations than missing annotations, which results in a loss of precision. Many of KEA's extra annotations are common nouns, such as affirmative action, astronaut, petition, signature, mosque, emoji, and enemy.

5. CONCLUSION & FUTURE WORK
For the task of NEL on microposts, it is a challenge to maintain the topicality of the underlying knowledge base. New hashtags, neologisms, and cyber-slang are rather difficult to resolve correctly in an automated way because they are not present in the dictionaries. To cope with this situation, one possibility would be to include a live analysis of the Wikipedia update stream to extend or prioritize the used dictionary of surface forms as well as the underlying link graph.

From our observations, a significant part of the achieved improvement results from the fact that the training and test sets cover identical domains (i. e. Star Wars and Donald Trump). Hence, the extension of the dictionary with surface forms from the training dataset seems to be very effective. The conclusion is that a domain adaptation of a given general-purpose system can lead to significantly better results. Although this sounds trivial, we did not expect an improvement of c. 40% in F-measure.

Unfortunately, many documents of the training dataset (1,951 out of 6,024) do not have any annotations at all. Therefore, we are looking forward to future NEEL challenges with more complete ground truth datasets.

6. REFERENCES
[1] J. R. Finkel, T. Grenager, and C. D. Manning. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), pages 363–370, 2005.
[2] B. Han and T. Baldwin. Lexical Normalisation of Short Text Messages: Makn Sens a #Twitter. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT '11, pages 368–378, Stroudsburg, PA, USA, 2011.
[3] G. Rizzo, M. van Erp, J. Plu, and R. Troncy. Making Sense of Microposts (#Microposts2016) Named Entity rEcognition and Linking (NEEL) Challenge. In D. Preoţiuc-Pietro, D. Radovanović, A. E. Cano-Basave, K. Weller, and A.-S. Dadzie, editors, 6th Workshop on Making Sense of Microposts (#Microposts2016), pages 50–59, 2016.
[4] H. Sack. The Journey is the Reward - Towards New Paradigms in Web Search. In Business Information Systems Workshops: BIS 2015 International Workshops, Poznań, Poland, June 24-26, 2015, Revised Papers, pages 15–26. Springer International Publishing, Cham, 2015.
[5] K. Toutanova and C. D. Manning. Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-speech Tagger. In Proceedings of the 2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, EMNLP '00, pages 63–70, Stroudsburg, PA, USA, 2000. Association for Computational Linguistics.
[6] R. Usbeck, M. Röder, A.-C. Ngonga Ngomo, C. Baron, A. Both, M. Brümmer, D. Ceccarelli, M. Cornolti, D. Cherix, B. Eickmann, P. Ferragina, C. Lemke, A. Moro, R. Navigli, F. Piccinno, G. Rizzo, H. Sack, R. Speck, R. Troncy, J. Waitelonis, and L. Wesemann. GERBIL – General Entity Annotation Benchmark Framework. In Proceedings of the 24th International Conference on World Wide Web (WWW '15), pages 1133–1143. ACM, USA, 2015.


