sisinflab: an ensemble of supervised and unsupervised strategies for the NEEL-IT challenge at Evalita 2016

Vittoria Cozza, Wanda La Bruna, Tommaso Di Noia
Polytechnic University of Bari
via Orabona, 4, 70125, Bari, Italy
{vittoria.cozza, wanda.labruna, tommaso.dinoia}@poliba.it

Abstract

This work presents the solution adopted by the sisinflab team for the NEEL-IT (Named Entity rEcognition and Linking in Italian Tweets) task at the Evalita 2016 challenge. The task consists in annotating each named entity mention (characters, events, people, locations, organizations, products and things) in a Twitter message written in Italian, and linking it to the corresponding entity in a knowledge base (e.g., DBpedia) when one exists. We faced the challenge with an approach that combines unsupervised methods, such as DBpedia Spotlight and word embeddings, with supervised techniques, namely a CRF classifier and a Deep Learning classifier.

1 Introduction

In the interconnected world we live in, the information encoded in Twitter streams represents a valuable source of knowledge for understanding events, trends and sentiments, as well as user behaviors. When processing these short text messages, a key role is played by the entities named within the tweet. Indeed, once we have a clear understanding of the entities involved in a context, a further step can be taken by semantically enriching them with side information available, e.g., on the Web. To this aim, pure NER techniques show their limits: they can identify the category an entity belongs to, but they cannot be used to find further information with which to enrich the description of the identified entity and, in turn, of the overall tweet. This is where Entity Linking comes into play. With tweets, i.e., very short messages with little context, Named Entity Linking is even trickier, as there is a lot of noise and the text is often semantically ambiguous. A number of popular challenges on the matter currently exist, such as the SemEval series on the evaluation of computational semantic analysis systems for English (https://en.wikipedia.org/wiki/SemEval), the CLEF initiative (http://www.clef-initiative.eu/), which provides a cross-language evaluation forum, and Evalita (http://www.evalita.it/), which aims to promote the development of language and speech technologies for the Italian language.

Several state-of-the-art solutions have been proposed for entity extraction and linking to a knowledge base (Shen et al., 2015), and many of them make use of the datasets available as Linked (Open) Data, such as DBpedia or Wikidata (Gangemi, 2013). Most of these tools perform best on long texts, and approaches that work well on the newswire domain do not work as well in a microblog scenario. As analyzed in (Derczynski et al., 2015), conventional tools (i.e., those trained on newswire) perform poorly in this genre, so microblog domain adaptation is crucial for good NER. Even then, when compared to results typically achieved on longer news and blog texts, state-of-the-art tools for microblog NER still perform poorly, with a significant proportion of missed entity mentions and false positives. The same study also shows which tools can be extended and adapted to the Twitter domain, for example DBpedia Spotlight, whose advantage is that it allows users to customize the annotation task; the authors report that Spotlight achieves an F1 of 31.20% on a Twitter dataset.

In this paper we present our solution for the NEEL-IT task (Basile et al., 2016a) of Evalita 2016 (Basile et al., 2016b). The task consists in annotating each named entity mention (characters, events, people, locations, organizations, products and things) in an Italian tweet, linking it to a DBpedia node when available or labeling it as a NIL entity otherwise. The task consists of three consecutive steps: (1) extraction and typing of entity mentions within a tweet; (2) linking of each textual mention to the entry in the canonicalized version of DBpedia 2015-10 representing the same "real world" entity, or to NIL in case such an entry does not exist; (3) clustering of all mentions linked to NIL. Results are evaluated with the TAC KBP scorer (https://github.com/wikilinks/neleval/wiki/Evaluation). Our solution faces the above challenges by using an ensemble of state-of-the-art approaches.

The remainder of the paper is structured as follows: in Section 2 we introduce our strategy, which combines a DBpedia Spotlight-based and a machine learning-based solution, detailed in Section 2.1 and Section 2.2 respectively. Section 3 reports and discusses the challenge results.

2 Description of the system

The system proposed for entity boundary and type extraction and linking is an ensemble of two strategies: a DBpedia Spotlight-based solution (https://github.com/dbpedia-spotlight/dbpedia-spotlight) and a machine learning-based solution that exploits the Stanford CRF (http://nlp.stanford.edu/software/CRF-NER.shtml) and DeepNL (https://github.com/attardi/deepnl) classifiers. Before applying either approach, we pre-processed the tweets used in the experiments by: (1) data cleaning, which replaces URLs with the keyword URL and emoticons with EMO, implemented with ad hoc rules; (2) sentence splitting and tokenization, implemented with the well-known linguistic pipeline available for the Italian language, openNLP (https://opennlp.apache.org/index.html), with its corresponding binary models (https://github.com/aciapetti/opennlp-italian-models/tree/master/models/it).
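As an illustration, such cleaning rules might look like the following minimal Python sketch; the regular expressions are hypothetical stand-ins, since the actual ad hoc rules are not published.

```python
import re

# Hypothetical cleaning rules: replace URLs with the keyword URL and
# emoticons with EMO, as described above. The patterns are illustrative,
# not the authors' actual rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMO_RE = re.compile(r"[:;=8][-o*']?[)\](\[dDpP/\\|}{@]")

def clean_tweet(text: str) -> str:
    text = URL_RE.sub("URL", text)
    return EMO_RE.sub("EMO", text)

print(clean_tweet("Che bello :) guarda http://example.com/x"))
# -> "Che bello EMO guarda URL"
```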
2.1 Spotlight-based solution

DBpedia Spotlight is a well-known tool for entity linking: it automatically annotates mentions of DBpedia resources in unstructured textual documents. It works in three steps:
• Spotting: recognizes in a sentence the phrases that may indicate a mention of a DBpedia resource.
• Candidate selection: maps the spotted phrase to resources that are candidate disambiguations for that phrase.
• Disambiguation: uses the context around the spotted phrase to decide on the best choice among the candidates.

In our approach we applied DBpedia Spotlight (Daiber et al., 2013) to identify mention boundaries and link them to DBpedia entities. This process can identify only those entities having an entry in DBpedia, and it does not allow a system to directly identify entity types. According to the challenge guidelines, we are required to identify entities that fall into 7 categories, Thing, Product, Person, Organization, Location, Event, Character, and their subcategories.
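Spotlight exposes this pipeline through a REST interface. A minimal sketch of querying an Italian endpoint from Python follows; the endpoint URL and confidence threshold are illustrative assumptions, as the paper does not report the exact service configuration used.

```python
import requests

# Hypothetical call to a DBpedia Spotlight /annotate endpoint for Italian.
SPOTLIGHT_URL = "http://api.dbpedia-spotlight.org/it/annotate"

def annotate(text, confidence=0.5):
    resp = requests.get(
        SPOTLIGHT_URL,
        params={"text": text, "confidence": confidence},
        headers={"Accept": "application/json"},
    )
    resp.raise_for_status()
    # Each annotated resource carries its surface form, offset and DBpedia URI.
    return [(r["@surfaceForm"], int(r["@offset"]), r["@URI"])
            for r in resp.json().get("Resources", [])]

print(annotate("Matteo Renzi incontra Obama a Roma"))
```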
In order to perform this extra step, we used the "type detection" module, as shown in Figure 1, which makes use of a SPARQL query to extract ontological information from DBpedia. In detail, we match the names of the ontology classes associated to an entity against a list of keywords related to the available taxonomy: Place, Organization (or Organisation), Character, Event, Sport, Disease, Language, Person, Music Group, Software, Service, Film, Television, Album, Newspaper, Electronic Device. There are three possible outcomes: no match, one match, or more than one match. If we find no match we discard the entity, while if we have more than one match we choose the most specific one, according to the NEEL-IT taxonomy provided for the challenge. Once we have a unique match, we return the entity along with the newly identified type.

Figure 1: Spotlight based solution

Since DBpedia classifies entities with reference to around 300 categories, we process the annotated resources through the Type Detection module to discard all entities that do not fall into any of the categories of the NEEL-IT taxonomy. Over the test set, the ontology-based type detection module discarded 16.9% of the returned entities. In this way, as shown in Figure 1, we were able to provide an annotation (span, uri, type) as required by the challenge rules.
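The paper does not include the query itself; the following is a minimal sketch of the kind of lookup the type detection module could perform, using SPARQLWrapper against the Italian DBpedia endpoint (the endpoint, the query shape, and the CamelCase normalization of the keyword list are assumptions):

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Keyword list from the paper, normalized to match DBpedia class local names
# (an assumption made for matching purposes).
KEYWORDS = {"Place", "Organization", "Organisation", "Character", "Event",
            "Sport", "Disease", "Language", "Person", "MusicGroup", "Software",
            "Service", "Film", "Television", "Album", "Newspaper",
            "ElectronicDevice"}

def matching_types(uri):
    """Return the ontology classes of `uri` that match the keyword list."""
    sparql = SPARQLWrapper("http://it.dbpedia.org/sparql")
    sparql.setQuery(f"SELECT ?t WHERE {{ <{uri}> a ?t }}")
    sparql.setReturnFormat(JSON)
    rows = sparql.query().convert()["results"]["bindings"]
    classes = {r["t"]["value"].rsplit("/", 1)[-1] for r in rows}
    return classes & KEYWORDS  # empty set, one match, or more than one

print(matching_types("http://it.dbpedia.org/resource/Roma"))
```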
2.2 Machine learning based solution

As summarized in Figure 2, we propose an ensemble approach that combines unsupervised and supervised techniques by exploiting a large dataset of unannotated tweets, Twita (Basile and Nissim, 2013), and the DBpedia knowledge base. We used a supervised approach, trained on the challenge data, for entity boundary and type identification. The challenge organizers provided a training dataset of 1,000 tweets in Italian, for a total of 1,450 sentences, annotated with 801 gold annotations; 526 of the 801 were entities linked to a unique DBpedia resource, while the others were linked to 255 NIL clusters. We randomly split this training dataset into a new train set (70%) and a validation set (30%). Table 1 shows the number of mentioned entities, classified by category.

Table 1: Dataset statistics
Dataset         #sentences  Character  Event  Location  Organization  Person  Product  Thing
Training set    1,450       16         15     122       197           323     109      20
New train set   1,018       6          10     82        142           244     68       12
Validation set  432         10         5      40        55            79      41       8

We then pre-processed the new train and validation sets with the approach described at the beginning of Section 2, thus obtaining a corpus in IOB2 notation. The annotated corpus was then used for training and evaluating two classifiers, Stanford CRF (Finkel et al., 2005) and DeepNL (Attardi, 2015), as shown in Figure 2, in order to detect the span and the type of the entity mentions in the text.

Figure 2: Machine Learning based solution

The NERs Enabler & Merger module enables the usage of one or both classifiers. When both are enabled, the results may contain overlapping mentions. To remove such overlaps we exploited regular expressions: in particular, we merged two or more mentions when they are consecutive, and we chose the largest-span mention when one is contained in the other.
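In span terms, this overlap-resolution behavior can be sketched as follows (a simplified reconstruction: the authors implemented it with regular expressions and the exact rules are not published; in particular, treating "consecutive" as adjacent character spans is an assumption):

```python
# Mentions as (start, end, type) spans over the tweet text.
def merge_mentions(mentions):
    merged = []
    for m in sorted(mentions):
        if merged:
            last = merged[-1]
            if m[0] <= last[1]:                      # overlap or containment:
                if m[1] - m[0] > last[1] - last[0]:  # keep the largest span
                    merged[-1] = m
                continue
            if m[0] == last[1] + 1:                  # consecutive: merge spans
                # Type disagreements are resolved later by the type
                # detection/validation module described below.
                merged[-1] = (last[0], m[1], last[2])
                continue
        merged.append(m)
    return merged

print(merge_mentions([(0, 5, "PER"), (6, 10, "PER"), (8, 9, "ORG")]))
# -> [(0, 10, 'PER')]
```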
While Spotlight can only find linked entities, with this approach we can detect both entities that match well-known DBpedia resources and entities that were not identified by Spotlight (NIL). In this case, given an entity spot, for entity linking we exploited DBpedia Lookup and string matching between the mention spot and the labels associated to DBpedia entities. In this way we were able to find entities along with their URIs, plus several more NIL entities. At this point, for each retrieved entity we have the span, the type (multiple types if CRF and DeepNL disagree), and the URI (see Figure 2), so we use a type detection/validation module to assign the correct type to each entity. This module uses ad hoc rules to combine the types obtained from the CRF and DeepNL classifiers when they disagree, together with the DBpedia entity type when the entity is not NIL. Finally, as required by the challenge, we cluster all NIL entities, simply by grouping those with the same type and surface form; surface forms that differ only in case (lower and upper) are considered equal.
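This clustering rule is simple enough to state as a short sketch (illustrative; the key construction just mirrors the description above):

```python
from collections import defaultdict

# Cluster NIL mentions by (type, case-normalized surface form).
def cluster_nil(mentions):
    clusters = defaultdict(list)
    for surface, etype in mentions:
        clusters[(etype, surface.lower())].append(surface)
    return clusters

nil = [("Renzi", "Person"), ("RENZI", "Person"), ("Expo", "Event")]
print(dict(cluster_nil(nil)))
# {('Person', 'renzi'): ['Renzi', 'RENZI'], ('Event', 'expo'): ['Expo']}
```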
CRF NER. The Stanford Named Entity Recognizer is based on the Conditional Random Fields (CRF) statistical model and uses Gibbs sampling for inference on sequence models (Finkel et al., 2005). This tagger normally works well enough using just the form of the tokens as a feature; it is a widely used machine learning-based method for detecting named entities and is distributed with CRF models for English newswire text. We trained the CRF classifier for Italian tweets on the new train data annotated in IOB notation, and then evaluated it on the validation data; results are reported in Table 2. The results follow the CoNLL NER evaluation format (Tjong Kim Sang and De Meulder, 2003), which reports Precision (P) and Recall (R); the F-score (F1) corresponds to the strong_typed_mention_match in the TAC scorer.

Table 2: CRF NER over the validation set
Entity  P       R       F1      TP  FP  FN
LOC     0.6154  0.4000  0.4848  16  10  24
ORG     0.5238  0.2000  0.2895  11  10  44
PER     0.4935  0.4810  0.4872  38  39  41
PRO     0.2857  0.0488  0.0833  2   5   39
Totals  0.5115  0.2839  0.3651  67  64  169

A manual error analysis showed that even when mentions are correctly detected, types are often wrongly identified. This is of course due to language ambiguity: for a NER it is often hard to disambiguate, for example, between a person and an organization, or between an event and a product. For this reason we applied a further type detection and validation module, which combines, through ad hoc rules, the results obtained by the classifiers and by the Spotlight-based approach previously described.

DeepNL NER. DeepNL is a Python library for Natural Language Processing tasks based on a Deep Learning neural network architecture. The library currently provides tools for part-of-speech tagging, Named Entity tagging, and Semantic Role Labeling. World knowledge is often incorporated into NER systems using gazetteers, i.e., categorized lists of names or common words: the DeepNL NER exploits suffix and entity dictionaries, and uses word embedding vectors as its main feature. The entity dictionary was created from the entity mentions in the training set, plus the location mentions provided by SENNA (http://ronan.collobert.com/senna/); the suffix dictionary was extracted from the training set as well, with ad hoc scripts. Word embeddings of dimension 300 were created with a window size of 5 using the Continuous Bag-of-Words (CBOW) model of (Mikolov et al., 2013). In detail, we used the word2vec software available from https://code.google.com/archive/p/word2vec/ over a corpus of above 10 million unlabeled tweets in Italian: the corpus consists of the Italian tweets produced in April 2015 extracted from the Twita corpus (Basile and Nissim, 2013), plus the tweets from both the dev and test sets provided by the NEEL-IT challenge, all pre-processed through our data preprocessing module, for a total of 11,403,536 sentences.
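The authors used the original word2vec implementation; for illustration, an equivalent configuration with the gensim library (an assumption: gensim was not the tool actually used, and the corpus file name is hypothetical) would be:

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# CBOW embeddings with the reported configuration: dimension 300, window 5.
# "twita_cleaned.txt" (hypothetical) holds one pre-processed, whitespace-
# tokenized tweet sentence per line.
model = Word2Vec(
    LineSentence("twita_cleaned.txt"),
    vector_size=300,  # embedding dimension, as in the paper (gensim >= 4)
    window=5,         # context window, as in the paper
    sg=0,             # 0 selects the CBOW architecture
    min_count=5,      # assumption: the frequency cutoff is not reported
)
model.wv.save_word2vec_format("tweet_vectors.txt")
```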
As shown in Figure 3, we trained the DeepNL classifier for Italian tweets on the new train data annotated in IOB2 notation, and then evaluated it on the validation data, over which we obtained an accuracy of 94.50%. Results are reported in Table 3.

Figure 3: DeepNL: Training phase

Table 3: DeepNL NER over the validation set
Entity  P       R       F1      Correct
EVE     0       0       0       1
LOC     0.5385  0.1750  0.2642  13
ORG     0.4074  0.2     0.2683  27
PER     0.6458  0.3924  0.4882  48
PRO     0.4375  0.1707  0.2456  16
Totals  0.5333  0.2353  0.3265  104

2.3 Linking

To accomplish the linking subtask, we checked whether a given spot, identified as an entity by the machine learning approach, has a corresponding link in DBpedia. A valid approach to link the names in our datasets to entities in DBpedia is represented by DBpedia Lookup (https://github.com/dbpedia/lookup) (Bizer et al., 2009), whose lookup dictionary is created via a Lucene index built starting from the values of the rdfs:label property associated to each resource. Very interestingly, the dictionary also takes into account Wikipedia redirect links (https://en.wikipedia.org/wiki/Wikipedia:Redirect).

Candidate entity ranking. Results computed via a lookup in the dictionary are then weighted by combining various string similarity metrics and a PageRank-like relevance ranking.

Unlinkable mention prediction. The features offered by DBpedia Lookup to filter out resources from the candidate entities are: (i) selection of entities that are instances of a specific class, via the QueryClass parameter; (ii) selection of the top N entities, via the MaxHits parameter.

As the last step we used the Type Detection module introduced above to select only entities belonging to the classes representative of the domain of interest. We implemented further filters to reduce the number of false positives in the final mapping: for example, we discard the results for Person entities unless the mention exactly matches the entity name. In addition, for linking we also used a dictionary built from the training set which, given a surface form and a type, returns the corresponding URI if already available in the labeled data.

Computing the canonicalized version. The link results obtained through Spotlight and Lookup (or string matching) refer to the Italian version of DBpedia. In order to produce the canonicalized version required by the task, we automatically found the corresponding canonicalized resource link for each Italian resource by means of the owl:sameAs property. As an example, the triple <dbpedia-it:Neoplasia_endocrina_multipla, owl:sameAs, dbpedia:Multiple_endocrine_neoplasia> maps the Italian version of Neoplasia endocrina multipla to its canonicalized version. In a few cases we were not able to perform the match.
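A minimal sketch of this canonicalization step, written as a SPARQL query over the Italian DBpedia endpoint, could look as follows (endpoint choice and query shape are assumptions; the paper does not specify how the owl:sameAs triples were accessed):

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Follow owl:sameAs from an Italian DBpedia resource to its canonical
# dbpedia.org counterpart, as described above.
QUERY = """
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT ?c WHERE {{
  <{uri}> owl:sameAs ?c .
  FILTER(STRSTARTS(STR(?c), "http://dbpedia.org/resource/"))
}}
"""

def canonicalize(it_uri):
    sparql = SPARQLWrapper("http://it.dbpedia.org/sparql")
    sparql.setQuery(QUERY.format(uri=it_uri))
    sparql.setReturnFormat(JSON)
    rows = sparql.query().convert()["results"]["bindings"]
    return rows[0]["c"]["value"] if rows else None  # None: match not found

print(canonicalize("http://it.dbpedia.org/resource/Neoplasia_endocrina_multipla"))
# expected: http://dbpedia.org/resource/Multiple_endocrine_neoplasia
```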
3 Results and Discussion

In this section we report the results over the gold test set distributed to the challenge participants, considering the first 300 tweets only. In order to evaluate the task results, the 2016 NEEL-IT challenge uses the TAC KBP scorer (https://github.com/wikilinks/neleval/wiki/Evaluation), which evaluates the results according to the following metrics: mention_ceaf, strong_typed_mention_match and strong_link_match. The overall score is a weighted average computed as:

score = 0.4 · mention_ceaf + 0.3 · strong_link_match + 0.3 · strong_typed_mention_match

For instance, for our run1 below this gives 0.4 · 0.358 + 0.3 · 0.38 + 0.3 · 0.282 = 0.3418.

Our solution combines the approaches presented in Section 2.1 and Section 2.2. For the 3 runs submitted for the challenge, we used the following configurations: run1, Spotlight combined with the results coming from both the CRF and DeepNL classifiers; run2, the same without CRF; run3, the same without DeepNL. For the CRF and DeepNL classifiers we used models trained on the whole training set provided by the challenge organizers. In order to ensemble the system outputs, we applied again the NERs Enabler & Merger module presented in Section 2.2, which returns the largest number of entity annotations identified by the different systems without overlaps. If a mention has been identified by more than one approach and they disagree on the type, the type returned by the Spotlight approach is chosen. Results for the different runs are shown in Table 4, together with the results of the best performing team of the challenge.

Table 4: Challenge results
System           mention ceaf  strong typed mention match  strong link match  final score
Spotlight-based  0.317         0.276                       0.340              0.3121
run1             0.358         0.282                       0.38               0.3418
run2             0.34          0.28                        0.381              0.3343
run3             0.358         0.286                       0.376              0.3418
Best Team        0.561         0.474                       0.456              0.5034

In order to evaluate the contribution of the Spotlight-based approach to the final result, we evaluated the strong_link_match considering only the portion of link annotations due to this approach over the challenge test set (see Table 5). We had a total of 140 links to the Italian DBpedia; following the approach described in Section 2.3 we obtained 120 links, 88 of which were unique, while 20 links could not be converted to the DBpedia canonicalized version. Final results are summarized in Table 5.

Table 5: strong link match over the challenge gold test set (300 tweets)
System           P      R      F1
Spotlight-based  0.446  0.274  0.340
run1             0.577  0.28   0.380

Comparing the Spotlight-based solution (row 1) with the ensemble solution (row 2), we see a performance improvement. This means that the machine learning-based approach identified and linked entities that were not detected by Spotlight, thus improving precision. Moreover, combining the two approaches allowed the system, at the step of merging overlapping spans, to better identify entities. This behavior sometimes led to the deletion of correct entities, but also to the correct detection of errors produced by the Spotlight-based approach and, more generally, it improved recall.

In the current entity linking literature, mention detection and entity disambiguation are frequently cast as equally important but distinct problems. In this task, however, we find that mention detection often represents a bottleneck. On mention_ceaf, our submission results show that the CRF NER worked slightly better than the DeepNL NER, as already shown in the experiments over the validation set in Section 2.2. Still, according to the experiments in (Derczynski et al., 2015), run on a similar dataset with a smaller set of entity types, we expected better results from the CRF NER. A possible explanation is that errors are also due to the larger number of types to detect, as well as to a wrong recombination of overlapping mentions, which has been addressed using simple heuristics.

References

G. Attardi. 2015. DeepNL: a deep learning NLP pipeline. In Proc. of the Workshop on Vector Space Modeling for NLP, NAACL.

P. Basile and M. Nissim. 2013. Sentiment analysis on Italian tweets. In Proc. of the 4th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis.

P. Basile, A. Caputo, A. L. Gentile, and G. Rizzo. 2016a. Overview of the EVALITA 2016 Named Entity rEcognition and Linking in Italian Tweets (NEEL-IT) Task. In Proc. of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016). Associazione Italiana di Linguistica Computazionale (AILC).

P. Basile, F. Cutugno, M. Nissim, V. Patti, and R. Sprugnoli. 2016b. EVALITA 2016: Overview of the 5th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. In Proc. of the Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2016). Associazione Italiana di Linguistica Computazionale (AILC).

C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. 2009. DBpedia - a crystallization point for the web of data. Web Semantics: Science, Services and Agents on the World Wide Web, 7(3):154–165.

J. Daiber, M. Jakob, C. Hokamp, and P. N. Mendes. 2013. Improving efficiency and accuracy in multilingual entity extraction. In Proc. of the 9th I-Semantics.

L. Derczynski, D. Maynard, G. Rizzo, M. van Erp, G. Gorrell, R. Troncy, J. Petrak, and K. Bontcheva. 2015. Analysis of named entity recognition and linking for tweets. Information Processing & Management, 51(2):32–49.

J. R. Finkel, T. Grenager, and C. Manning. 2005. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proc. of the 43rd ACL.

A. Gangemi. 2013. A comparison of knowledge extraction tools for the semantic web. In Proc. of ESWC.

T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

W. Shen, J. Wang, and J. Han. 2015. Entity linking with a knowledge base: Issues, techniques, and solutions. IEEE Transactions on Knowledge and Data Engineering, 27(2):443–460.

E. F. Tjong Kim Sang and F. De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proc. of the 7th CoNLL, pages 142–147.