sisinflab: an ensemble of supervised and unsupervised strategies for the NEEL-IT challenge at Evalita 2016

Vittoria Cozza, Wanda La Bruna, Tommaso Di Noia
Polytechnic University of Bari
via Orabona, 4, 70125, Bari, Italy
{vittoria.cozza, wanda.labruna, tommaso.dinoia}@poliba.it

Abstract

This work presents the solution adopted by the sisinflab team for the NEEL-IT (Named Entity rEcognition and Linking in Italian Tweets) task at the Evalita 2016 challenge. The task consists in annotating each named entity mention (characters, events, people, locations, organizations, products and things) in a Twitter message written in Italian, and linking it to the corresponding entity in a knowledge base (e.g., DBpedia) when one exists. We faced the challenge with an approach that combines unsupervised methods, such as DBpedia Spotlight and word embeddings, with supervised techniques, namely a CRF classifier and a Deep Learning classifier.

1 Introduction

In the interconnected world we live in, the information encoded in Twitter streams represents a valuable source of knowledge for understanding events, trends and sentiments, as well as user behaviors. When processing these short text messages, a key role is played by the entities named within the tweet. Indeed, once we have a clear understanding of the entities involved in a context, a further step can be taken by semantically enriching them with side information available, e.g., on the Web. To this aim, pure NER techniques show their limits: they can identify the category an entity belongs to, but they cannot be used to find further information with which to enrich the description of the identified entity and, in turn, of the overall tweet. This is where Entity Linking comes into play. With tweets, i.e., very short messages with little context, Named Entity Linking is even trickier, as there is a lot of noise and the text is often semantically ambiguous. A number of popular challenges on the matter currently exist, such as the SemEval series on the evaluation of computational semantic analysis systems for English (https://en.wikipedia.org/wiki/SemEval), the CLEF initiative (http://www.clef-initiative.eu/), which provides a cross-language evaluation forum, and Evalita (http://www.evalita.it/), which aims to promote the development of language and speech technologies for the Italian language.

Several state-of-the-art solutions have been proposed for entity extraction and linking to a knowledge base (Shen et al., 2015), and many of them make use of the datasets available as Linked (Open) Data, such as DBpedia or Wikidata (Gangemi, 2013). Most of these tools perform best on long texts, and approaches that work well on the newswire domain do not work as well in a microblog scenario. As analyzed in (Derczynski et al., 2015), conventional tools (i.e., those trained on newswire) perform poorly in this genre, so microblog domain adaptation is crucial for good NER. Even then, when compared to results typically achieved on longer news and blog texts, state-of-the-art tools for microblog NER still perform poorly, with a significant proportion of missed entity mentions and false positives. The same study also shows which tools can be extended and adapted to the Twitter domain, for example DBpedia Spotlight, whose advantage is that it allows users to customize the annotation task; the authors report that Spotlight achieves an F1 of 31.20% on a Twitter dataset.

In this paper we present our solution for the NEEL-IT task (Basile et al., 2016a) of Evalita 2016 (Basile et al., 2016b). The task consists in annotating each named entity mention (characters, events, people, locations, organizations, products and things) in an Italian tweet, linking it to a DBpedia node when available or labeling it as a NIL entity otherwise. The task consists of three consecutive steps: (1) extraction and typing of entity mentions within a tweet; (2) linking of each textual mention to the entry in the canonicalized version of DBpedia 2015-10 representing the same "real world" entity, or to NIL in case such an entry does not exist; (3) clustering of all mentions linked to NIL. Results are evaluated with the TAC KBP scorer (https://github.com/wikilinks/neleval/wiki/Evaluation). Our solution faces the above challenges by using an ensemble of state-of-the-art approaches.

The remainder of the paper is structured as follows: in Section 2 we introduce our strategy, which combines a DBpedia Spotlight-based and a machine learning-based solution, detailed in Section 2.1 and Section 2.2 respectively. Section 3 reports and discusses the challenge results.

2 Description of the system

The system proposed for entity boundary and type extraction and linking is an ensemble of two strategies: a DBpedia Spotlight-based solution (https://github.com/dbpedia-spotlight/dbpedia-spotlight) and a machine learning-based solution that exploits the Stanford CRF (http://nlp.stanford.edu/software/CRF-NER.shtml) and DeepNL (https://github.com/attardi/deepnl) classifiers. Before applying either approach, we pre-processed the tweets used in the experiments by: (1) data cleaning, which replaces URLs with the keyword URL and emoticons with EMO, implemented with ad hoc rules; (2) sentence splitting and tokenization, implemented with the well-known linguistic pipeline available for the Italian language, openNLP (https://opennlp.apache.org/index.html), with its corresponding binary models (https://github.com/aciapetti/opennlp-italian-models/tree/master/models/it).
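As an illustration, such cleaning rules might look like the following minimal Python sketch; the regular expressions are hypothetical stand-ins, since the actual ad hoc rules are not published.

```python
import re

# Hypothetical cleaning rules: replace URLs with the keyword URL and
# emoticons with EMO, as described above. The patterns are illustrative,
# not the authors' actual rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMO_RE = re.compile(r"[:;=8][-o*']?[)\](\[dDpP/\\|}{@]")

def clean_tweet(text: str) -> str:
    text = URL_RE.sub("URL", text)
    return EMO_RE.sub("EMO", text)

print(clean_tweet("Che bello :) guarda http://example.com/x"))
# -> "Che bello EMO guarda URL"
```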
2.1 Spotlight-based solution

DBpedia Spotlight is a well-known tool for entity linking: it automatically annotates mentions of DBpedia resources in unstructured textual documents. It works in three steps:
• Spotting: recognizes in a sentence the phrases that may indicate a mention of a DBpedia resource.
• Candidate selection: maps the spotted phrase to resources that are candidate disambiguations for that phrase.
• Disambiguation: uses the context around the spotted phrase to decide on the best choice among the candidates.

In our approach we applied DBpedia Spotlight (Daiber et al., 2013) to identify mention boundaries and link them to DBpedia entities. This process can identify only those entities having an entry in DBpedia, and it does not allow a system to directly identify entity types. According to the challenge guidelines, we are required to identify entities that fall into 7 categories, Thing, Product, Person, Organization, Location, Event, Character, and their subcategories.
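Spotlight exposes this pipeline through a REST interface. A minimal sketch of querying an Italian endpoint from Python follows; the endpoint URL and confidence threshold are illustrative assumptions, as the paper does not report the exact service configuration used.

```python
import requests

# Hypothetical call to a DBpedia Spotlight /annotate endpoint for Italian.
SPOTLIGHT_URL = "http://api.dbpedia-spotlight.org/it/annotate"

def annotate(text, confidence=0.5):
    resp = requests.get(
        SPOTLIGHT_URL,
        params={"text": text, "confidence": confidence},
        headers={"Accept": "application/json"},
    )
    resp.raise_for_status()
    # Each annotated resource carries its surface form, offset and DBpedia URI.
    return [(r["@surfaceForm"], int(r["@offset"]), r["@URI"])
            for r in resp.json().get("Resources", [])]

print(annotate("Matteo Renzi incontra Obama a Roma"))
```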
In order to perform this extra step, we used the "type detection" module, as shown in Figure 1, which makes use of a SPARQL query to extract ontological information from DBpedia. In detail, we match the names of the ontology classes associated to an entity against a list of keywords related to the available taxonomy: Place, Organization (or Organisation), Character, Event, Sport, Disease, Language, Person, Music Group, Software, Service, Film, Television, Album, Newspaper, Electronic Device. There are three possible outcomes: no match, one match, or more than one match. If we find no match we discard the entity, while if we have more than one match we choose the most specific one, according to the NEEL-IT taxonomy provided for the challenge. Once we have a unique match, we return the entity along with the newly identified type.

Figure 1: Spotlight based solution

Since DBpedia classifies entities with reference to around 300 categories, we process the annotated resources through the Type Detection module to discard all entities that do not fall into any of the categories of the NEEL-IT taxonomy. Over the test set, the ontology-based type detection module discarded 16.9% of the returned entities. In this way, as shown in Figure 1, we were able to provide an annotation (span, uri, type) as required by the challenge rules.
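The paper does not include the query itself; the following is a minimal sketch of the kind of lookup the type detection module could perform, using SPARQLWrapper against the Italian DBpedia endpoint (the endpoint, the query shape, and the CamelCase normalization of the keyword list are assumptions):

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Keyword list from the paper, normalized to match DBpedia class local names
# (an assumption made for matching purposes).
KEYWORDS = {"Place", "Organization", "Organisation", "Character", "Event",
            "Sport", "Disease", "Language", "Person", "MusicGroup", "Software",
            "Service", "Film", "Television", "Album", "Newspaper",
            "ElectronicDevice"}

def matching_types(uri):
    """Return the ontology classes of `uri` that match the keyword list."""
    sparql = SPARQLWrapper("http://it.dbpedia.org/sparql")
    sparql.setQuery(f"SELECT ?t WHERE {{ <{uri}> a ?t }}")
    sparql.setReturnFormat(JSON)
    rows = sparql.query().convert()["results"]["bindings"]
    classes = {r["t"]["value"].rsplit("/", 1)[-1] for r in rows}
    return classes & KEYWORDS  # empty set, one match, or more than one

print(matching_types("http://it.dbpedia.org/resource/Roma"))
```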
2.2 Machine learning based solution

As summarized in Figure 2, we propose an ensemble approach that combines unsupervised and supervised techniques by exploiting a large dataset of unannotated tweets, Twita (Basile and Nissim, 2013), and the DBpedia knowledge base. We used a supervised approach, trained on the challenge data, for entity boundary and type identification. The challenge organizers provided a training dataset of 1,000 tweets in Italian, for a total of 1,450 sentences, annotated with 801 gold annotations; 526 of the 801 were entities linked to a unique DBpedia resource, while the others were linked to 255 NIL clusters. We randomly split this training dataset into a new train set (70%) and a validation set (30%). Table 1 shows the number of mentioned entities, classified by category.

Table 1: Dataset statistics
Dataset         #sentences  Character  Event  Location  Organization  Person  Product  Thing
Training set    1,450       16         15     122       197           323     109      20
New train set   1,018       6          10     82        142           244     68       12
Validation set  432         10         5      40        55            79      41       8

We then pre-processed the new train and validation sets with the approach described at the beginning of Section 2, thus obtaining a corpus in IOB2 notation. The annotated corpus was then used for training and evaluating two classifiers, Stanford CRF (Finkel et al., 2005) and DeepNL (Attardi, 2015), as shown in Figure 2, in order to detect the span and the type of the entity mentions in the text.

Figure 2: Machine Learning based solution

The NERs Enabler & Merger module enables the usage of one or both classifiers. When both are enabled, the results may contain overlapping mentions. To remove such overlaps we exploited regular expressions: in particular, we merged two or more mentions when they are consecutive, and we chose the largest-span mention when one is contained in the other.
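In span terms, this overlap-resolution behavior can be sketched as follows (a simplified reconstruction: the authors implemented it with regular expressions and the exact rules are not published; in particular, treating "consecutive" as adjacent character spans is an assumption):

```python
# Mentions as (start, end, type) spans over the tweet text.
def merge_mentions(mentions):
    merged = []
    for m in sorted(mentions):
        if merged:
            last = merged[-1]
            if m[0] <= last[1]:                      # overlap or containment:
                if m[1] - m[0] > last[1] - last[0]:  # keep the largest span
                    merged[-1] = m
                continue
            if m[0] == last[1] + 1:                  # consecutive: merge spans
                # Type disagreements are resolved later by the type
                # detection/validation module described below.
                merged[-1] = (last[0], m[1], last[2])
                continue
        merged.append(m)
    return merged

print(merge_mentions([(0, 5, "PER"), (6, 10, "PER"), (8, 9, "ORG")]))
# -> [(0, 10, 'PER')]
```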
While Spotlight can only find linked entities, with this approach we can detect both entities that match well-known DBpedia resources and entities that were not identified by Spotlight (NIL). In this case, given an entity spot, for entity linking we exploited DBpedia Lookup and string matching between the mention spot and the labels associated to DBpedia entities. In this way we were able to find entities along with their URIs, plus several more NIL entities. At this point, for each retrieved entity we have the span, the type (multiple types if CRF and DeepNL disagree), and the URI (see Figure 2), so we use a type detection/validation module to assign the correct type to each entity. This module uses ad hoc rules to combine the types obtained from the CRF and DeepNL classifiers when they disagree, together with the DBpedia entity type when the entity is not NIL. Finally, as required by the challenge, we cluster all NIL entities, simply by grouping those with the same type and surface form; surface forms that differ only in case (lower and upper) are considered equal.
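This clustering rule is simple enough to state as a short sketch (illustrative; the key construction just mirrors the description above):

```python
from collections import defaultdict

# Cluster NIL mentions by (type, case-normalized surface form).
def cluster_nil(mentions):
    clusters = defaultdict(list)
    for surface, etype in mentions:
        clusters[(etype, surface.lower())].append(surface)
    return clusters

nil = [("Renzi", "Person"), ("RENZI", "Person"), ("Expo", "Event")]
print(dict(cluster_nil(nil)))
# {('Person', 'renzi'): ['Renzi', 'RENZI'], ('Event', 'expo'): ['Expo']}
```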
CRF NER. The Stanford Named Entity Recognizer is based on the Conditional Random Fields (CRF) statistical model and uses Gibbs sampling for inference on sequence models (Finkel et al., 2005). This tagger normally works well enough using just the form of the tokens as a feature; it is a widely used machine learning-based method for detecting named entities and is distributed with CRF models for English newswire text. We trained the CRF classifier for Italian tweets on the new train data annotated in IOB notation, and then evaluated it on the validation data; results are reported in Table 2. The results follow the CoNLL NER evaluation format (Tjong Kim Sang and De Meulder, 2003), which reports Precision (P) and Recall (R); the F-score (F1) corresponds to the strong_typed_mention_match in the TAC scorer.

Table 2: CRF NER over the validation set
Entity  P       R       F1      TP  FP  FN
LOC     0.6154  0.4000  0.4848  16  10  24
ORG     0.5238  0.2000  0.2895  11  10  44
PER     0.4935  0.4810  0.4872  38  39  41
PRO     0.2857  0.0488  0.0833  2   5   39
Totals  0.5115  0.2839  0.3651  67  64  169

A manual error analysis showed that even when mentions are correctly detected, types are often wrongly identified. This is of course due to language ambiguity: for a NER it is often hard to disambiguate, for example, between a person and an organization, or between an event and a product. For this reason we applied a further type detection and validation module, which combines, through ad hoc rules, the results obtained by the classifiers and by the Spotlight-based approach previously described.

DeepNL NER. DeepNL is a Python library for Natural Language Processing tasks based on a Deep Learning neural network architecture. The library currently provides tools for part-of-speech tagging, Named Entity tagging, and Semantic Role Labeling. World knowledge is often incorporated into NER systems using gazetteers, i.e., categorized lists of names or common words: the DeepNL NER exploits suffix and entity dictionaries, and uses word embedding vectors as its main feature. The entity dictionary was created from the entity mentions in the training set, plus the location mentions provided by SENNA (http://ronan.collobert.com/senna/); the suffix dictionary was extracted from the training set as well, with ad hoc scripts. Word embeddings of dimension 300 were created with a window size of 5 using the Continuous Bag-of-Words (CBOW) model of (Mikolov et al., 2013). In detail, we used the word2vec software available from https://code.google.com/archive/p/word2vec/ over a corpus of above 10 million unlabeled tweets in Italian: the corpus consists of the Italian tweets produced in April 2015 extracted from the Twita corpus (Basile and Nissim, 2013), plus the tweets from both the dev and test sets provided by the NEEL-IT challenge, all pre-processed through our data preprocessing module, for a total of 11,403,536 sentences.
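The authors used the original word2vec implementation; for illustration, an equivalent configuration with the gensim library (an assumption: gensim was not the tool actually used, and the corpus file name is hypothetical) would be:

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# CBOW embeddings with the reported configuration: dimension 300, window 5.
# "twita_cleaned.txt" (hypothetical) holds one pre-processed, whitespace-
# tokenized tweet sentence per line.
model = Word2Vec(
    LineSentence("twita_cleaned.txt"),
    vector_size=300,  # embedding dimension, as in the paper (gensim >= 4)
    window=5,         # context window, as in the paper
    sg=0,             # 0 selects the CBOW architecture
    min_count=5,      # assumption: the frequency cutoff is not reported
)
model.wv.save_word2vec_format("tweet_vectors.txt")
```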
As shown in Figure 3, we trained the DeepNL classifier for Italian tweets on the new train data annotated in IOB2 notation, and then evaluated it on the validation data, over which we obtained an accuracy of 94.50%. Results are reported in Table 3.

Figure 3: DeepNL: Training phase

Table 3: DeepNL NER over the validation set
Entity  P       R       F1      Correct
EVE     0       0       0       1
LOC     0.5385  0.1750  0.2642  13
ORG     0.4074  0.2     0.2683  27
PER     0.6458  0.3924  0.4882  48
PRO     0.4375  0.1707  0.2456  16
Totals  0.5333  0.2353  0.3265  104

2.3 Linking

To accomplish the linking subtask, we checked whether a given spot, identified as an entity by the machine learning approach, has a corresponding link in DBpedia. A valid approach to link the names in our datasets to entities in DBpedia is represented by DBpedia Lookup (https://github.com/dbpedia/lookup) (Bizer et al., 2009), whose lookup dictionary is created via a Lucene index built starting from the values of the rdfs:label property associated to each resource. Very interestingly, the dictionary also takes into account Wikipedia redirect links (https://en.wikipedia.org/wiki/Wikipedia:Redirect).

Candidate entity ranking. Results computed via a lookup in the dictionary are then weighted by combining various string similarity metrics and a PageRank-like relevance ranking.

Unlinkable mention prediction. The features offered by DBpedia Lookup to filter out resources from the candidate entities are: (i) selection of entities that are instances of a specific class, via the QueryClass parameter; (ii) selection of the top N entities, via the MaxHits parameter.

As the last step we used the Type Detection module introduced above to select only entities belonging to the classes representative of the domain of interest. We implemented further filters to reduce the number of false positives in the final mapping: for example, we discard the results for Person entities unless the mention exactly matches the entity name. In addition, for linking we also used a dictionary built from the training set which, given a surface form and a type, returns the corresponding URI if already available in the labeled data.

Computing the canonicalized version. The link results obtained through Spotlight and Lookup (or string matching) refer to the Italian version of DBpedia. In order to produce the canonicalized version required by the task, we automatically found the corresponding canonicalized resource link for each Italian resource by means of the owl:sameAs property. As an example, the triple <dbpedia-it:Neoplasia_endocrina_multipla, owl:sameAs, dbpedia:Multiple_endocrine_neoplasia> maps the Italian version of Neoplasia endocrina multipla to its canonicalized version. In a few cases we were not able to perform the match.
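A minimal sketch of this canonicalization step, written as a SPARQL query over the Italian DBpedia endpoint, could look as follows (endpoint choice and query shape are assumptions; the paper does not specify how the owl:sameAs triples were accessed):

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Follow owl:sameAs from an Italian DBpedia resource to its canonical
# dbpedia.org counterpart, as described above.
QUERY = """
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT ?c WHERE {{
  <{uri}> owl:sameAs ?c .
  FILTER(STRSTARTS(STR(?c), "http://dbpedia.org/resource/"))
}}
"""

def canonicalize(it_uri):
    sparql = SPARQLWrapper("http://it.dbpedia.org/sparql")
    sparql.setQuery(QUERY.format(uri=it_uri))
    sparql.setReturnFormat(JSON)
    rows = sparql.query().convert()["results"]["bindings"]
    return rows[0]["c"]["value"] if rows else None  # None: match not found

print(canonicalize("http://it.dbpedia.org/resource/Neoplasia_endocrina_multipla"))
# expected: http://dbpedia.org/resource/Multiple_endocrine_neoplasia
```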
3 Results and Discussion

In this section we report the results over the gold test set distributed to the challenge participants, considering the first 300 tweets only. In order to evaluate the task results, the 2016 NEEL-IT challenge uses the TAC KBP scorer (https://github.com/wikilinks/neleval/wiki/Evaluation), which evaluates the results according to the following metrics: mention_ceaf, strong_typed_mention_match and strong_link_match. The overall score is a weighted average computed as:

score = 0.4 · mention_ceaf + 0.3 · strong_link_match + 0.3 · strong_typed_mention_match

For instance, for our run1 below this gives 0.4 · 0.358 + 0.3 · 0.38 + 0.3 · 0.282 = 0.3418.

Our solution combines the approaches presented in Section 2.1 and Section 2.2. For the 3 runs submitted for the challenge, we used the following configurations: run1, Spotlight combined with the results coming from both the CRF and DeepNL classifiers; run2, the same without CRF; run3, the same without DeepNL. For the CRF and DeepNL classifiers we used models trained on the whole training set provided by the challenge organizers. In order to ensemble the system outputs, we applied again the NERs Enabler & Merger module presented in Section 2.2, which returns the largest number of entity annotations identified by the different systems without overlaps. If a mention has been identified by more than one approach and they disagree on the type, the type returned by the Spotlight approach is chosen. Results for the different runs are shown in Table 4, together with the results of the best performing team of the challenge.

Table 4: Challenge results
System           mention ceaf  strong typed mention match  strong link match  final score
Spotlight-based  0.317         0.276                       0.340              0.3121
run1             0.358         0.282                       0.38               0.3418
run2             0.34          0.28                        0.381              0.3343
run3             0.358         0.286                       0.376              0.3418
Best Team        0.561         0.474                       0.456              0.5034

In order to evaluate the contribution of the Spotlight-based approach to the final result, we evaluated the strong_link_match considering only the portion of link annotations due to this approach over the challenge test set (see Table 5). We had a total of 140 links to the Italian DBpedia; following the approach described in Section 2.3 we obtained 120 links, 88 of which were unique, while 20 links could not be converted to the DBpedia canonicalized version. Final results are summarized in Table 5.

Table 5: strong link match over the challenge gold test set (300 tweets)
System           P      R      F1
Spotlight-based  0.446  0.274  0.340
run1             0.577  0.28   0.380

Comparing the Spotlight-based solution (row 1) with the ensemble solution (row 2), we see a performance improvement. This means that the machine learning-based approach identified and linked entities that were not detected by Spotlight, thus improving precision. Moreover, combining the two approaches allowed the system, at the step of merging overlapping spans, to better identify entities. This behavior sometimes led to the deletion of correct entities, but also to the correct detection of errors produced by the Spotlight-based approach and, more generally, it improved recall.

In the current entity linking literature, mention detection and entity disambiguation are frequently cast as equally important but distinct problems. In this task, however, we find that mention detection often represents a bottleneck. On mention_ceaf, our submission results show that the CRF NER worked slightly better than the DeepNL NER, as already shown in the experiments over the validation set in Section 2.2. Still, according to the experiments in (Derczynski et al., 2015), run on a similar dataset with a smaller set of entity types, we expected better results from the CRF NER. A possible explanation is that errors are also due to the larger number of types to detect, as well as to a wrong recombination of overlapping mentions, which has been addressed using simple heuristics.

References

G. Attardi. 2015. DeepNL: a deep learning NLP pipeline. In Proc. of the Workshop on Vector Space Modeling for NLP, NAACL.

P. Basile and M. Nissim. 2013. Sentiment analysis on Italian tweets. In Proc. of the 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis.

P. Basile, A. Caputo, A. L. Gentile, and G. Rizzo. 2016a. Overview of the EVALITA 2016 Named Entity rEcognition and Linking in Italian Tweets (NEEL-IT) Task. In Proc. of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016). Associazione Italiana di Linguistica Computazionale (AILC).

P. Basile, F. Cutugno, M. Nissim, V. Patti, and R. Sprugnoli. 2016b. EVALITA 2016: Overview of the 5th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. In Proc. of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016). Associazione Italiana di Linguistica Computazionale (AILC).

C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. 2009. DBpedia - a crystallization point for the web of data. Web Semantics: Science, Services and Agents on the World Wide Web, 7(3):154–165.

J. Daiber, M. Jakob, C. Hokamp, and P. N. Mendes. 2013. Improving efficiency and accuracy in multilingual entity extraction. In Proc. of the 9th I-Semantics.

L. Derczynski, D. Maynard, G. Rizzo, M. van Erp, G. Gorrell, R. Troncy, J. Petrak, and K. Bontcheva. 2015. Analysis of named entity recognition and linking for tweets. Information Processing & Management, 51(2):32–49.

J. R. Finkel, T. Grenager, and C. Manning. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proc. of the 43rd ACL.

A. Gangemi. 2013. A comparison of knowledge extraction tools for the semantic web. In Proc. of ESWC.

T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

W. Shen, J. Wang, and J. Han. 2015. Entity linking with a knowledge base: Issues, techniques, and solutions. IEEE Transactions on Knowledge and Data Engineering, 27(2):443–460.

E. F. Tjong Kim Sang and F. De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proc. of the 7th CoNLL, pages 142–147.