MicroNeel: Combining NLP Tools to Perform Named Entity Detection and Linking on Microposts

Francesco Corcoglioniti, Alessio Palmero Aprosio, Yaroslav Nechaev, Claudio Giuliano
Fondazione Bruno Kessler, Trento, Italy
{corcoglio,aprosio,nechaev,giuliano}@fbk.eu

Abstract

In this paper we present the MicroNeel system for Named Entity Recognition and Entity Linking on Italian microposts, which participated in the NEEL-IT task at EVALITA 2016. MicroNeel combines The Wiki Machine and Tint, two standard NLP tools, with comprehensive tweet preprocessing, the Twitter-DBpedia alignments from the Social Media Toolkit resource, and rule-based or supervised merging of the produced annotations.

1 Introduction

Microposts, i.e., brief user-generated texts such as tweets, check-ins, and status messages, are a highly popular form of content on social media and an increasingly relevant source for information extraction. The application of Natural Language Processing (NLP) techniques to microposts presents unique challenges due to their informal nature, noisiness, lack of sufficient textual context (e.g., for disambiguation), and use of specific abbreviations and conventions such as #hashtags, @user mentions, and retweet markers. As a consequence, standard NLP tools designed and trained on more 'traditional' formal domains, such as news articles, perform poorly when applied to microposts and are outperformed by NLP solutions specifically developed for this kind of content (see, e.g., Bontcheva et al. (2013)).

Recognizing these challenges and following similar initiatives for the English language, the NEEL-IT task (Basile et al., 2016a; http://neel-it.github.io/) at EVALITA 2016 (Basile et al., 2016b; http://www.evalita.it/2016) aims at promoting research on NLP for the analysis of microposts in the Italian language. The task is a combination of Named Entity Recognition (NER), Entity Linking (EL), and Coreference Resolution for Twitter tweets, which are short microposts of at most 140 characters that may include hashtags, user mentions, and URLs linking to external Web resources. Participating systems have to recognize mentions of named entities, assign them a NER category (e.g., person), and disambiguate them against a fragment of DBpedia containing the entities common to the Italian and English DBpedia chapters; unlinked (i.e., NIL) mentions finally have to be clustered into coreference sets.

In this paper we present our MicroNeel system that participated in the NEEL-IT task. With MicroNeel, we investigate the use on microposts of two standard NER and EL tools – The Wiki Machine (Palmero Aprosio and Giuliano, 2016) and Tint (Palmero Aprosio and Moretti, 2016) – that were originally developed for more formal texts. To achieve adequate performance, we complement them with: (i) a preprocessing step where tweets are enriched with semantically related text and rewritten to make them less noisy; (ii) a set of alignments from Twitter user mentions to DBpedia entities, provided by the Social Media Toolkit (SMT) resource (Nechaev et al., 2016); and (iii) rule-based and supervised mechanisms for merging the annotations produced by NER, EL, and SMT, resolving possible conflicts.
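Conceptually, these components form a preprocess/annotate/merge pipeline (detailed in Section 3). The following minimal Python sketch illustrates this control flow only; the released system is written in Java, and all names below are hypothetical:

```python
from typing import Callable, List, NamedTuple, Optional

class Annotation(NamedTuple):
    begin: int            # start offset in the rewritten text
    end: int              # end offset (exclusive) in the rewritten text
    category: str         # NER category, e.g. "person"
    link: Optional[str]   # DBpedia URI, or None for NIL mentions

def run_pipeline(tweet: str,
                 preprocess: Callable[[str], str],
                 annotators: List[Callable[[str], List[Annotation]]],
                 merge: Callable[[List[Annotation]], List[Annotation]]) -> List[Annotation]:
    """Rewrite the tweet, run every annotator independently, then merge their outputs."""
    rewritten = preprocess(tweet)
    candidates = [a for annotator in annotators for a in annotator(rewritten)]
    return merge(candidates)
```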
In the remainder of the paper, Section 2 introduces the main tools and resources we used. Section 3 describes MicroNeel, whose results at NEEL-IT and their discussion are reported in Sections 4 and 5. Section 6 presents the open-source release of the system, while Section 7 concludes.

2 Tools and Resources

MicroNeel makes use of a number of resources and tools. In this section, we briefly present the main ones used in the annotation process. The remaining ones (mainly used for preprocessing) are described in Section 3.

2.1 The Wiki Machine

The Wiki Machine (http://thewikimachine.fbk.eu/; Palmero Aprosio and Giuliano, 2016) is an open-source Entity Linking tool that automatically annotates a text with respect to Wikipedia pages. The output is produced in two main steps: entity identification and disambiguation. The Wiki Machine is trained on data extracted from Wikipedia and is enriched with Airpedia (Palmero Aprosio et al., 2013), a dataset built on top of DBpedia (Lehmann et al., 2015) that increases its coverage of Wikipedia pages.

2.2 Tint

Tint (http://tint.fbk.eu/; Palmero Aprosio and Moretti, 2016) is an easy-to-use set of fast, accurate, and extensible Natural Language Processing modules for Italian. It is based on Stanford CoreNLP (http://stanfordnlp.github.io/CoreNLP/) and is distributed open source. Among other modules, the Tint pipeline includes tokenization, sentence splitting, part-of-speech tagging, and NER.

2.3 Social Media Toolkit

The Social Media Toolkit (http://alignments.futuro.media/; Nechaev et al., 2016), or SMT, is an API that is able to align any given knowledge base entry to the corresponding social media profile (if one exists). The reverse alignment is achieved using a large database (∼1 million entries) of precomputed alignments between DBpedia and Twitter. SMT is also able to classify any Twitter profile as a person, organization, or other.

3 Description of the System

MicroNeel accepts a micropost text as input, which may include hashtags, mentions of Twitter users, and URLs. Alternatively, a tweet ID can be supplied as input (as done in NEEL-IT), and the system retrieves the corresponding text and metadata (e.g., author information, date and time, language) from the Twitter API, provided the tweet has not been deleted by the user or by Twitter itself.

Processing in MicroNeel is structured as a pipeline of three main steps, outlined in Figure 1: preprocessing, annotation, and merging. Their execution on an example tweet is shown in Figure 2.

3.1 Preprocessing

During the first step, the original text of the micropost is rewritten, keeping track of the mappings between original and rewritten offsets. The rewritten text is obtained by applying the following transformations:

• Hashtags in the text are replaced with their tokenizations. Given a hashtag, a sample of 100 tweets using it is retrieved from Twitter; if camel-case spellings of the hashtag are found among them, tokenization is performed based on their sequences of uppercase letters.

• User mentions are also replaced with their tokenizations (based on camel case) or the corresponding display names, if available.

• Slang, abbreviations, and some common typos (e.g., e' instead of è) in the text are replaced based on a custom dictionary (for Italian, we extracted it from the Wikipedia page Gergo di Internet, https://it.wikipedia.org/wiki/Gergo_di_Internet).

• URLs, emoticons, and other unprocessable sequences of characters in the text are discarded.

• True-casing is performed to recover the proper word case where this information is lost (e.g., all-uppercase or all-lowercase text). This task employs a dictionary, which for Italian is derived from Morph-It! (Zanchetta and Baroni, 2005).
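The camel-case hashtag tokenization described above can be sketched as follows. This is an illustrative Python sketch (the actual system is written in Java and retrieves the sample of tweets from Twitter itself; here the observed spellings are simply passed in):

```python
import re

def tokenize_camel_case(text):
    """Split a camel-case string into tokens based on its uppercase letters."""
    return re.findall(r'[A-Z]+(?![a-z])|[A-Z][a-z]*|[a-z]+|\d+', text)

def rewrite_hashtag(hashtag, observed_spellings):
    """Replace a hashtag with its tokenization, using a camel-case spelling
    of the same hashtag observed in other tweets, when one is available."""
    body = hashtag.lstrip('#')
    for spelling in observed_spellings:
        if spelling.lower() == body.lower() and spelling != spelling.lower():
            return ' '.join(tokenize_camel_case(spelling))
    return body  # no camel-case evidence: keep the hashtag body as-is
```

For instance, `rewrite_hashtag('#socialmediaroi', ['SocialMediaROI'])` yields `'Social Media ROI'`.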
To help disambiguation, the rewritten text is then augmented with a textual context obtained by aggregating the following contents, if available:

• Hashtag descriptions from tagdef (https://www.tagdef.com/), a collaborative online service;

• Twitter user descriptions for the author and the users mentioned in the original text;

• Titles of web pages linked by URLs in the original text.

[Figure 1: Overview of the system.]

In the example shown in Figure 2, from the original tweet

[Original text] (author: @OscardiMontigny) #LinkedIn: 200 milioni di iscritti, 4 milioni in Italia http://t.co/jK8MRiaS via @vincos

we collect:

• metadata for the author (Twitter user @OscardiMontigny);

• the description of the hashtag #LinkedIn;

• the title of the page at the URL http://t.co/jK8MRiaS;

• metadata for the Twitter user @vincos, mentioned in the tweet.

The resulting (cleaned) tweet is

[Rewritten text] LinkedIn: 200 milioni di iscritti, 4 milioni in Italia via Vincenzo Cosenza

with context

[Context] Speaker; Blogger; Mega-Trends, Marketing and Innovation Divulgator. #linkedin is about all things from Linkedin. LinkedIn: 200 milioni di iscritti, 4 milioni in Italia — Vincos Blog. Strategist at @BlogMeter My books: Social Media ROI — La società dei dati.

[Figure 2: An example of annotation.]

3.2 Annotation

In the second step, annotation is performed by three independent annotator tools run in parallel:

• The rewritten text is parsed with the NER module of Tint (see Section 2.2). This processing annotates named entities of type person, organization, and location.

• The rewritten text, concatenated with the context, is annotated by The Wiki Machine (see Section 2.1) with a list of entities from the full Italian DBpedia. The obtained EL annotations are enriched with the DBpedia class (extended with Airpedia) and mapped to the considered NER categories (person, organization, location, product, event).

• The user mentions in the tweet are assigned a type and linked to the corresponding DBpedia entities using SMT (see Section 2.3); as in the previous case, SMT types and DBpedia classes are mapped to NER categories. A problem here is that many user mentions classified as persons or organizations by SMT are not annotable according to the NEEL-IT guidelines. (Basically, a user mention can be annotated in NEEL-IT if its NER category can be determined by just looking at the username and its surrounding textual context in the tweet. Usernames resembling a person or organization name are thus annotated, while less informative usernames are not marked, as their nature cannot be determined without looking at their Twitter profiles or at the tweets they made, which is what SMT does instead.) Therefore, we implemented two strategies for deciding whether to annotate a user mention: the rule-based SMT annotator always annotates if the SMT type is person or organization, whereas the supervised SMT annotator decides using an SVM classifier trained on the development set of NEEL-IT.
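The mapping from DBpedia classes to the considered NER categories can be sketched as a simple lookup. The class names and the mapping below are illustrative assumptions, not the actual MicroNeel mapping:

```python
# Hypothetical mapping from DBpedia ontology classes (as enriched via
# Airpedia) to the NER categories considered in NEEL-IT.
DBPEDIA_TO_NER = {
    'dbo:Person': 'person',
    'dbo:Organisation': 'organization',
    'dbo:Place': 'location',
    'dbo:Work': 'product',
    'dbo:Event': 'event',
}

def map_to_ner_category(dbpedia_classes):
    """Return the NER category of the first known class, or None if no
    class of the entity maps to a NEEL-IT category."""
    for cls in dbpedia_classes:
        if cls in DBPEDIA_TO_NER:
            return DBPEDIA_TO_NER[cls]
    return None
```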
The middle box in Figure 2 shows the entities extracted by each tool: The Wiki Machine recognizes "LinkedIn" as an organization and "Italia" as a location; SMT identifies "@vincos" as a person; and Tint classifies "LinkedIn" as an organization and "Italia" and "Vincenzo Cosenza" as persons.

3.3 Merging

The last part of the pipeline consists in deciding which annotations have to be kept and which should be discarded. In addition, the system has to choose how to deal with conflicts (for example, an inconsistency between the class produced by Tint and the one extracted by The Wiki Machine). Specifically, the task consists in building a merger that chooses at most one NER class (and possibly a compatible DBpedia link) for each offset of the text for which at least one annotator recognized an entity. For instance, in the example of Figure 2, the merger should ignore the annotation of @vincos, as it is not considered a named entity.

As a baseline, we first developed a rule-based merger that does not discard any annotation and solves conflicts by majority vote or, in the event of a tie, by giving different priorities to the annotations produced by each annotator (Tint first, followed by The Wiki Machine and SMT).

We then trained a supervised merger consisting of a multi-class SVM whose output is either one of the NER categories or a special NONE category, in which case we discard all the annotations for the offset. The classifier is trained on the development tweets provided by the task organizers, using libSVM (Chang and Lin, 2011) with a polynomial kernel and controlling precision/recall via the penalty parameter C for the NONE class. Given an offset and the associated entity annotations, we use the following features:

• whether the entity is linked to DBpedia;

• whether tool x annotated this entity;

• whether tool x annotated the entity with category y (x can be Tint, SMT, or The Wiki Machine; y can be one of the possible categories, such as person, location, and so on);

• the case of the annotated text (uppercase initials, all uppercase, all lowercase, etc.);

• whether the annotation is contained in a Twitter username and/or in a hashtag;

• whether the annotated text is an Italian common word and/or a known proper name; common words were taken from Morph-It! (see Section 3.1), while proper nouns were extracted from Wikipedia biographies;

• whether the annotated text contains more than one word;

• the frequencies of NER categories in the training dataset of tweets.

The result of the merging step is a set of NER and EL annotations as required by the NEEL-IT task. EL annotations whose DBpedia entities are not part of the English DBpedia were discarded when participating in the task, as per the NEEL-IT rules; they were however exploited for placing the involved entities in the same coreference set. The remaining (cross-micropost) coreference annotations for unlinked (NIL) entities were derived with a simple baseline that always puts entities in different coreference sets. (It turned out after the evaluation that an alternative baseline, coreferring entities with the same normalized surface form, performed better on the NEEL-IT test data.)
4 Results

Table 1 reports the performance obtained by MicroNeel at the NEEL-IT task of EVALITA 2016, measured using three sets of Precision (P), Recall (R), and F1 metrics (Basile et al., 2016a):

• mention CEAF tests coreference resolution;

• strong typed mention match tests NER (i.e., the spans and categories of annotated entities);

• strong link match assesses EL (i.e., the spans and DBpedia URIs of annotated entities).

Starting from their F1 scores, an overall F1 score is computed as a weighted sum (weight 0.4 for mention CEAF and 0.3 for each of the other two metrics).

MicroNeel was trained on the development set of 1000 annotated tweets distributed as part of the task, and tested on 300 tweets. We submitted three runs (upper part of Table 1) that differ in the techniques used – rule-based vs. supervised – for the SMT annotator and the merger:

• base uses the rule-based variants of both the SMT annotator and the merger;

• merger uses the rule-based SMT annotator and the supervised merger;

• all uses the supervised variants of both the SMT annotator and the merger.

In addition to the official NEEL-IT scores, the lower part of Table 1 reports the results of an ablation test that starts from the base configuration and investigates the contributions of the different components of MicroNeel: The Wiki Machine (EL), Tint (NER), SMT, the tweet rewriting, and the addition of textual context during preprocessing.

5 Discussion

Contrary to our expectations, the base run using the simpler rule-based SMT annotator and rule-based merger performed better than the other runs employing supervised techniques. Table 1 shows that the contribution of the supervised SMT annotator was null on the test set. The supervised merger, on the other hand, is only capable of changing the precision/recall balance (which was already good for the base run) by keeping only the best annotations. We tuned it for maximum F1 via cross-validation on the development set of NEEL-IT, but the outcome on the test set was a decrease in recall not compensated by a sufficient increase in precision, leading to an overall decrease of F1.

The ablation test in the lower part of Table 1 shows that the largest drop in performance results from removing The Wiki Machine, which is thus the annotator contributing most to overall performance, whereas SMT is the annotator giving the smallest contribution (which still amounts to a valuable +0.0193 F1). The rewriting of tweet texts accounts for +0.0531 F1, whereas the addition of textual context had essentially no impact on the test set, contrary to our expectations.

An error analysis of the produced annotations showed that many EL annotations were not produced due to wrong word capitalization (e.g., lowercase words not recognized as named entities), although the true-casing performed as part of preprocessing mitigated the problem. An alternative and possibly more robust solution may be to retrain the EL tool ignoring letter case.

6 The tool

The MicroNeel extraction pipeline is available as open source (GPL) from the project website (https://github.com/fbk/microneel). It is written in Java, and additional components for preprocessing, annotation, and merging can easily be added by implementing an Annotator interface. The configuration, including the list of components to be used and their parameters, can be set through a specific JSON configuration file. Extensive documentation will soon be available on the project wiki.
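Such a JSON configuration might, purely for illustration, look like the fragment below; the actual schema, component names, and parameter keys are defined by the MicroNeel code base and may differ:

```json
{
  "annotators": [
    { "class": "TintNerAnnotator" },
    { "class": "WikiMachineAnnotator", "useContext": true },
    { "class": "SmtAnnotator", "mode": "rule-based" }
  ],
  "merger": { "type": "rule-based" }
}
```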
Table 1: MicroNeel performance on the NEEL-IT test set for different configurations.

Configuration     | Mention CEAF        | Strong typed mention match | Strong link match   | Overall
                  | P     R     F1      | P     R     F1             | P     R     F1      | F1
base run          | 0.514 0.547 0.530   | 0.457 0.487 0.472          | 0.567 0.412 0.477   | 0.4967
merger run        | 0.576 0.455 0.509   | 0.523 0.415 0.463          | 0.664 0.332 0.442   | 0.4751
all run           | 0.574 0.453 0.506   | 0.521 0.412 0.460          | 0.670 0.332 0.444   | 0.4736
base - NER        | 0.587 0.341 0.431   | 0.524 0.305 0.386          | 0.531 0.420 0.469   | 0.4289
base - SMT        | 0.504 0.525 0.514   | 0.448 0.468 0.458          | 0.564 0.372 0.448   | 0.4774
base - EL         | 0.487 0.430 0.457   | 0.494 0.437 0.464          | 0.579 0.049 0.090   | 0.3490
base - rewriting  | 0.554 0.399 0.464   | 0.492 0.356 0.413          | 0.606 0.354 0.447   | 0.4436
base - context    | 0.513 0.547 0.530   | 0.453 0.485 0.468          | 0.566 0.416 0.480   | 0.4964
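The Overall column in Table 1 is the weighted combination defined in Section 4 (0.4 × mention CEAF F1 + 0.3 × strong typed mention match F1 + 0.3 × strong link match F1). As a worked check against the table:

```python
def overall_f1(mention_ceaf_f1, typed_match_f1, link_match_f1):
    """NEEL-IT overall score: 0.4 * mention CEAF F1 plus 0.3 for each of
    the strong typed mention match and strong link match F1 scores."""
    return 0.4 * mention_ceaf_f1 + 0.3 * typed_match_f1 + 0.3 * link_match_f1

print(round(overall_f1(0.530, 0.472, 0.477), 4))  # base run: 0.4967
print(round(overall_f1(0.509, 0.463, 0.442), 4))  # merger run: 0.4751
```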
As- level when applied on tweets (for instance, our sociazione Italiana di Linguistica Computazionale system got the best performance in the linking task (AILC). thanks to The Wiki Machine). In the future, we Kalina Bontcheva, Leon Derczynski, Adam Funk, plan to adapt MicroNeel to English and other lan- Mark A. Greenwood, Diana Maynard, and Niraj guages, and to integrate some other modules both Aswani. 2013. TwitIE: An open-source information in the preprocessing and annotation steps, such the extraction pipeline for microblog text. In Recent Advances in Natural Language Processing, RANLP, NER system expressly developed for tweets de- pages 83–90. RANLP 2013 Organising Committee / scribed by Minard et al. (2016). ACL. Acknowledgments Chih-Chung Chang and Chih-Jen Lin. 2011. LIB- SVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, The research leading to this paper was partially 2:27:1–27:27. supported by the European Union’s Horizon 2020 Programme via the SIMPATICO Project (H2020- Jens Lehmann, Robert Isele, Max Jakob, Anja EURO-6-2015, n. 692819). Jentzsch, Dimitris Kontokostas, Pablo N. Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick van Kleef, Sören Auer, and Christian Bizer. 2015. DB- pedia - A large-scale, multilingual knowledge base References extracted from Wikipedia. Semantic Web, 6(2):167– 195. Pierpaolo Basile, Annalina Caputo, Anna Lisa Gen- tile, and Giuseppe Rizzo. 2016a. Overview Anne-Lyse Minard, Mohammed R.H. Qwaider, and of the EVALITA 2016 Named Entity rEcognition Bernardo Magnini. 2016. FBK-NLP at NEEL- and Linking in Italian tweets (NEEL-IT) task. In IT: Active Learning for Domain Adaptation. 
In Pierpaolo Basile, Anna Corazza, Franco Cutugno, Pierpaolo Basile, Anna Corazza, Franco Cutugno, Simonetta Montemagni, Malvina Nissim, Viviana Simonetta Montemagni, Malvina Nissim, Viviana Patti, Giovanni Semeraro, and Rachele Sprugnoli, Patti, Giovanni Semeraro, and Rachele Sprugnoli, editors, Proceedings of Third Italian Conference on editors, Proceedings of Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Pro- Evaluation Campaign of Natural Language Pro- cessing and Speech Tools for Italian. Final Work- cessing and Speech Tools for Italian. Final Work- shop (EVALITA 2016). Associazione Italiana di Lin- guistica Computazionale (AILC). Yaroslav Nechaev, Francesco Corcoglioniti, and Clau- dio Giuliano. 2016. Linking knowledge bases to social media profiles. http://alignments. futuro.media/. Alessio Palmero Aprosio and Claudio Giuliano. 2016. The Wiki Machine: an open source software for en- tity linking and enrichment. ArXiv e-prints. Alessio Palmero Aprosio and Giovanni Moretti. 2016. Italy goes to Stanford: a collection of CoreNLP modules for Italian. ArXiv e-prints. Alessio Palmero Aprosio, Claudio Giuliano, and Al- berto Lavelli. 2013. Automatic expansion of DB- pedia exploiting Wikipedia cross-language informa- tion. In Proceedings of the 10th Extended Semantic Web Conference, pages 397–411. Springer. Eros Zanchetta and Marco Baroni. 2005. Morph-it! a free corpus-based morphological resource for the Italian language. Corpus Linguistics 2005, 1(1).