MicroNeel: Combining NLP Tools to Perform Named Entity Detection and Linking on Microposts

Francesco Corcoglioniti, Alessio Palmero Aprosio, Yaroslav Nechaev, Claudio Giuliano
Fondazione Bruno Kessler, Trento, Italy
{corcoglio,aprosio,nechaev,giuliano}@fbk.eu

Abstract

In this paper we present the MicroNeel system for Named Entity Recognition and Entity Linking on Italian microposts, which participated in the NEEL-IT task at EVALITA 2016. MicroNeel combines The Wiki Machine and Tint, two standard NLP tools, with comprehensive tweet preprocessing, the Twitter-DBpedia alignments from the Social Media Toolkit resource, and rule-based or supervised merging of the produced annotations.

1 Introduction

Microposts, i.e., brief user-generated texts such as tweets, check-ins, and status messages, are a highly popular form of content on social media and an increasingly relevant source for information extraction. The application of Natural Language Processing (NLP) techniques to microposts presents unique challenges due to their informal nature, noisiness, lack of sufficient textual context (e.g., for disambiguation), and use of specific abbreviations and conventions such as #hashtags, @user mentions, and retweet markers. As a consequence, standard NLP tools designed and trained on more 'traditional' formal domains, such as news articles, perform poorly when applied to microposts and are outperformed by NLP solutions specifically developed for this kind of content (see, e.g., Bontcheva et al. (2013)).

Recognizing these challenges and following similar initiatives for the English language, the NEEL-IT task (Basile et al., 2016a; http://neel-it.github.io/) at EVALITA 2016 (Basile et al., 2016b; http://www.evalita.it/2016) aims at promoting research on NLP for the analysis of microposts in the Italian language. The task is a combination of Named Entity Recognition (NER), Entity Linking (EL), and Coreference Resolution for Twitter tweets, which are short microposts of at most 140 characters that may include hashtags, user mentions, and URLs linking to external Web resources. Participating systems have to recognize mentions of named entities, assign them a NER category (e.g., person), and disambiguate them against a fragment of DBpedia containing the entities common to the Italian and English DBpedia chapters; unlinked (i.e., NIL) mentions finally have to be clustered into coreference sets.

In this paper we present our MicroNeel system that participated in the NEEL-IT task. With MicroNeel, we investigate the use on microposts of two standard NER and EL tools – The Wiki Machine (Palmero Aprosio and Giuliano, 2016) and Tint (Palmero Aprosio and Moretti, 2016) – that were originally developed for more formal texts. To achieve adequate performance, we complement them with: (i) a preprocessing step where tweets are enriched with semantically related text and rewritten to make them less noisy; (ii) a set of alignments from Twitter user mentions to DBpedia entities, provided by the Social Media Toolkit (SMT) resource (Nechaev et al., 2016); and (iii) rule-based and supervised mechanisms for merging the annotations produced by NER, EL, and SMT, resolving possible conflicts.
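Conceptually, these components form a preprocess/annotate/merge pipeline (detailed in Section 3). The following minimal Python sketch illustrates this control flow only; the released system is written in Java, and all names below are hypothetical:

```python
from typing import Callable, List, NamedTuple, Optional

class Annotation(NamedTuple):
    begin: int            # start offset in the rewritten text
    end: int              # end offset (exclusive) in the rewritten text
    category: str         # NER category, e.g. "person"
    link: Optional[str]   # DBpedia URI, or None for NIL mentions

def run_pipeline(tweet: str,
                 preprocess: Callable[[str], str],
                 annotators: List[Callable[[str], List[Annotation]]],
                 merge: Callable[[List[Annotation]], List[Annotation]]) -> List[Annotation]:
    """Rewrite the tweet, run every annotator independently, then merge their outputs."""
    rewritten = preprocess(tweet)
    candidates = [a for annotator in annotators for a in annotator(rewritten)]
    return merge(candidates)
```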
In the remainder of the paper, Section 2 introduces the main tools and resources we used. Section 3 describes MicroNeel, whose results at NEEL-IT and their discussion are reported in Sections 4 and 5. Section 6 presents the open-source release of the system, while Section 7 concludes.

2 Tools and Resources

MicroNeel makes use of a number of resources and tools. In this section, we briefly present the main ones used in the annotation process. The remaining ones (mainly used for preprocessing) are described in Section 3.

2.1 The Wiki Machine

The Wiki Machine (http://thewikimachine.fbk.eu/; Palmero Aprosio and Giuliano, 2016) is an open-source Entity Linking tool that automatically annotates a text with respect to Wikipedia pages. The output is produced in two main steps: entity identification and disambiguation. The Wiki Machine is trained on data extracted from Wikipedia and is enriched with Airpedia (Palmero Aprosio et al., 2013), a dataset built on top of DBpedia (Lehmann et al., 2015) that increases its coverage of Wikipedia pages.

2.2 Tint

Tint (http://tint.fbk.eu/; Palmero Aprosio and Moretti, 2016) is an easy-to-use set of fast, accurate, and extensible Natural Language Processing modules for Italian. It is based on Stanford CoreNLP (http://stanfordnlp.github.io/CoreNLP/) and is distributed open source. Among other modules, the Tint pipeline includes tokenization, sentence splitting, part-of-speech tagging, and NER.

2.3 Social Media Toolkit

The Social Media Toolkit (http://alignments.futuro.media/; Nechaev et al., 2016), or SMT, is an API that is able to align any given knowledge base entry to the corresponding social media profile (if one exists). The reverse alignment is achieved using a large database (∼1 million entries) of precomputed alignments between DBpedia and Twitter. SMT is also able to classify any Twitter profile as a person, organization, or other.

3 Description of the System

MicroNeel accepts a micropost text as input, which may include hashtags, mentions of Twitter users, and URLs. Alternatively, a tweet ID can be supplied as input (as done in NEEL-IT), and the system retrieves the corresponding text and metadata (e.g., author information, date and time, language) from the Twitter API, provided the tweet has not been deleted by the user or by Twitter itself.

Processing in MicroNeel is structured as a pipeline of three main steps, outlined in Figure 1: preprocessing, annotation, and merging. Their execution on an example tweet is shown in Figure 2.

3.1 Preprocessing

During the first step, the original text of the micropost is rewritten, keeping track of the mappings between original and rewritten offsets. The rewritten text is obtained by applying the following transformations:

• Hashtags in the text are replaced with their tokenizations. Given a hashtag, a sample of 100 tweets using it is retrieved from Twitter; if camel-case spellings of the hashtag are found among them, tokenization is performed based on their sequences of uppercase letters.

• User mentions are also replaced with their tokenizations (based on camel case) or the corresponding display names, if available.

• Slang, abbreviations, and some common typos (e.g., e' instead of è) in the text are replaced based on a custom dictionary (for Italian, we extracted it from the Wikipedia page Gergo di Internet, https://it.wikipedia.org/wiki/Gergo_di_Internet).

• URLs, emoticons, and other unprocessable sequences of characters in the text are discarded.

• True-casing is performed to recover the proper word case where this information is lost (e.g., all-uppercase or all-lowercase text). This task employs a dictionary, which for Italian is derived from Morph-It! (Zanchetta and Baroni, 2005).
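The camel-case hashtag tokenization described above can be sketched as follows. This is an illustrative Python sketch (the actual system is written in Java and retrieves the sample of tweets from Twitter itself; here the observed spellings are simply passed in):

```python
import re

def tokenize_camel_case(text):
    """Split a camel-case string into tokens based on its uppercase letters."""
    return re.findall(r'[A-Z]+(?![a-z])|[A-Z][a-z]*|[a-z]+|\d+', text)

def rewrite_hashtag(hashtag, observed_spellings):
    """Replace a hashtag with its tokenization, using a camel-case spelling
    of the same hashtag observed in other tweets, when one is available."""
    body = hashtag.lstrip('#')
    for spelling in observed_spellings:
        if spelling.lower() == body.lower() and spelling != spelling.lower():
            return ' '.join(tokenize_camel_case(spelling))
    return body  # no camel-case evidence: keep the hashtag body as-is
```

For instance, `rewrite_hashtag('#socialmediaroi', ['SocialMediaROI'])` yields `'Social Media ROI'`.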
To help disambiguation, the rewritten text is then augmented with a textual context obtained by aggregating the following contents, if available:

• Hashtag descriptions from tagdef (https://www.tagdef.com/), a collaborative online service;

• Twitter user descriptions for the author and the users mentioned in the original text;

• Titles of web pages linked by URLs in the original text.

[Figure 1: Overview of the system.]

In the example shown in Figure 2, from the original tweet

[Original text] (author: @OscardiMontigny) #LinkedIn: 200 milioni di iscritti, 4 milioni in Italia http://t.co/jK8MRiaS via @vincos

we collect:

• metadata for the author (Twitter user @OscardiMontigny);

• the description of the hashtag #LinkedIn;

• the title of the page at the URL http://t.co/jK8MRiaS;

• metadata for the Twitter user @vincos, mentioned in the tweet.

The resulting (cleaned) tweet is

[Rewritten text] LinkedIn: 200 milioni di iscritti, 4 milioni in Italia via Vincenzo Cosenza

with context

[Context] Speaker; Blogger; Mega-Trends, Marketing and Innovation Divulgator. #linkedin is about all things from Linkedin. LinkedIn: 200 milioni di iscritti, 4 milioni in Italia — Vincos Blog. Strategist at @BlogMeter My books: Social Media ROI — La società dei dati.

[Figure 2: An example of annotation.]

3.2 Annotation

In the second step, annotation is performed by three independent annotator tools run in parallel:

• The rewritten text is parsed with the NER module of Tint (see Section 2.2). This processing annotates named entities of type person, organization, and location.

• The rewritten text, concatenated with the context, is annotated by The Wiki Machine (see Section 2.1) with a list of entities from the full Italian DBpedia. The obtained EL annotations are enriched with the DBpedia class (extended with Airpedia) and mapped to the considered NER categories (person, organization, location, product, event).

• The user mentions in the tweet are assigned a type and linked to the corresponding DBpedia entities using SMT (see Section 2.3); as in the previous case, SMT types and DBpedia classes are mapped to NER categories. A problem here is that many user mentions classified as persons or organizations by SMT are not annotable according to the NEEL-IT guidelines. (Basically, a user mention can be annotated in NEEL-IT if its NER category can be determined by just looking at the username and its surrounding textual context in the tweet. Usernames resembling a person or organization name are thus annotated, while less informative usernames are not marked, as their nature cannot be determined without looking at their Twitter profiles or at the tweets they made, which is what SMT does instead.) Therefore, we implemented two strategies for deciding whether to annotate a user mention: the rule-based SMT annotator always annotates if the SMT type is person or organization, whereas the supervised SMT annotator decides using an SVM classifier trained on the development set of NEEL-IT.
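The mapping from DBpedia classes to the considered NER categories can be sketched as a simple lookup. The class names and the mapping below are illustrative assumptions, not the actual MicroNeel mapping:

```python
# Hypothetical mapping from DBpedia ontology classes (as enriched via
# Airpedia) to the NER categories considered in NEEL-IT.
DBPEDIA_TO_NER = {
    'dbo:Person': 'person',
    'dbo:Organisation': 'organization',
    'dbo:Place': 'location',
    'dbo:Work': 'product',
    'dbo:Event': 'event',
}

def map_to_ner_category(dbpedia_classes):
    """Return the NER category of the first known class, or None if no
    class of the entity maps to a NEEL-IT category."""
    for cls in dbpedia_classes:
        if cls in DBPEDIA_TO_NER:
            return DBPEDIA_TO_NER[cls]
    return None
```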
The middle box in Figure 2 shows the entities extracted by each tool: The Wiki Machine recognizes "LinkedIn" as an organization and "Italia" as a location; SMT identifies "@vincos" as a person; and Tint classifies "LinkedIn" as an organization and "Italia" and "Vincenzo Cosenza" as persons.

3.3 Merging

The last part of the pipeline consists in deciding which annotations have to be kept and which should be discarded. In addition, the system has to choose how to deal with conflicts (for example, an inconsistency between the class produced by Tint and the one extracted by The Wiki Machine). Specifically, the task consists in building a merger that chooses at most one NER class (and possibly a compatible DBpedia link) for each offset of the text for which at least one annotator recognized an entity. For instance, in the example of Figure 2, the merger should ignore the annotation of @vincos, as it is not considered a named entity.

As a baseline, we first developed a rule-based merger that does not discard any annotation and solves conflicts by majority vote or, in the event of a tie, by giving different priorities to the annotations produced by each annotator (Tint first, followed by The Wiki Machine and SMT).

We then trained a supervised merger consisting of a multi-class SVM whose output is either one of the NER categories or a special NONE category, in which case we discard all the annotations for the offset. The classifier is trained on the development tweets provided by the task organizers, using libSVM (Chang and Lin, 2011) with a polynomial kernel and controlling precision/recall via the penalty parameter C for the NONE class. Given an offset and the associated entity annotations, we use the following features:

• whether the entity is linked to DBpedia;

• whether tool x annotated this entity;

• whether tool x annotated the entity with category y (x can be Tint, SMT, or The Wiki Machine; y can be one of the possible categories, such as person, location, and so on);

• the case of the annotated text (uppercase initials, all uppercase, all lowercase, etc.);

• whether the annotation is contained in a Twitter username and/or in a hashtag;

• whether the annotated text is an Italian common word and/or a known proper name; common words were taken from Morph-It! (see Section 3.1), while proper nouns were extracted from Wikipedia biographies;

• whether the annotated text contains more than one word;

• the frequencies of NER categories in the training dataset of tweets.

The result of the merging step is a set of NER and EL annotations as required by the NEEL-IT task. EL annotations whose DBpedia entities are not part of the English DBpedia were discarded when participating in the task, as per the NEEL-IT rules; they were however exploited for placing the involved entities in the same coreference set. The remaining (cross-micropost) coreference annotations for unlinked (NIL) entities were derived with a simple baseline that always puts entities in different coreference sets. (It turned out after the evaluation that an alternative baseline, coreferring entities with the same normalized surface form, performed better on the NEEL-IT test data.)
4 Results

Table 1 reports the performance obtained by MicroNeel at the NEEL-IT task of EVALITA 2016, measured using three sets of Precision (P), Recall (R), and F1 metrics (Basile et al., 2016a):

• mention CEAF tests coreference resolution;

• strong typed mention match tests NER (i.e., the spans and categories of annotated entities);

• strong link match assesses EL (i.e., the spans and DBpedia URIs of annotated entities).

Starting from their F1 scores, an overall F1 score is computed as a weighted sum (weight 0.4 for mention CEAF and 0.3 for each of the other two metrics).

MicroNeel was trained on the development set of 1000 annotated tweets distributed as part of the task, and tested on 300 tweets. We submitted three runs (upper part of Table 1) that differ in the techniques used – rule-based vs. supervised – for the SMT annotator and the merger:

• base uses the rule-based variants of both the SMT annotator and the merger;

• merger uses the rule-based SMT annotator and the supervised merger;

• all uses the supervised variants of both the SMT annotator and the merger.

In addition to the official NEEL-IT scores, the lower part of Table 1 reports the results of an ablation test that starts from the base configuration and investigates the contributions of the different components of MicroNeel: The Wiki Machine (EL), Tint (NER), SMT, the tweet rewriting, and the addition of textual context during preprocessing.

5 Discussion

Contrary to our expectations, the base run using the simpler rule-based SMT annotator and rule-based merger performed better than the other runs employing supervised techniques. Table 1 shows that the contribution of the supervised SMT annotator was null on the test set. The supervised merger, on the other hand, is only capable of changing the precision/recall balance (which was already good for the base run) by keeping only the best annotations. We tuned it for maximum F1 via cross-validation on the development set of NEEL-IT, but the outcome on the test set was a decrease in recall not compensated by a sufficient increase in precision, leading to an overall decrease of F1.

The ablation test in the lower part of Table 1 shows that the largest drop in performance results from removing The Wiki Machine, which is thus the annotator contributing most to overall performance, whereas SMT is the annotator giving the smallest contribution (which still amounts to a valuable +0.0193 F1). The rewriting of tweet texts accounts for +0.0531 F1, whereas the addition of textual context had essentially no impact on the test set, contrary to our expectations.

An error analysis of the produced annotations showed that many EL annotations were not produced due to wrong word capitalization (e.g., lowercase words not recognized as named entities), although the true-casing performed as part of preprocessing mitigated the problem. An alternative and possibly more robust solution may be to retrain the EL tool ignoring letter case.

6 The tool

The MicroNeel extraction pipeline is available as open source (GPL) from the project website (https://github.com/fbk/microneel). It is written in Java, and additional components for preprocessing, annotation, and merging can easily be added by implementing an Annotator interface. The configuration, including the list of components to be used and their parameters, can be set through a specific JSON configuration file. Extensive documentation will soon be available on the project wiki.
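Such a JSON configuration might, purely for illustration, look like the fragment below; the actual schema, component names, and parameter keys are defined by the MicroNeel code base and may differ:

```json
{
  "annotators": [
    { "class": "TintNerAnnotator" },
    { "class": "WikiMachineAnnotator", "useContext": true },
    { "class": "SmtAnnotator", "mode": "rule-based" }
  ],
  "merger": { "type": "rule-based" }
}
```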
Table 1: MicroNeel performance on the NEEL-IT test set for different configurations.

Configuration     | Mention CEAF        | Strong typed mention match | Strong link match   | Overall
                  | P     R     F1      | P     R     F1             | P     R     F1      | F1
base run          | 0.514 0.547 0.530   | 0.457 0.487 0.472          | 0.567 0.412 0.477   | 0.4967
merger run        | 0.576 0.455 0.509   | 0.523 0.415 0.463          | 0.664 0.332 0.442   | 0.4751
all run           | 0.574 0.453 0.506   | 0.521 0.412 0.460          | 0.670 0.332 0.444   | 0.4736
base - NER        | 0.587 0.341 0.431   | 0.524 0.305 0.386          | 0.531 0.420 0.469   | 0.4289
base - SMT        | 0.504 0.525 0.514   | 0.448 0.468 0.458          | 0.564 0.372 0.448   | 0.4774
base - EL         | 0.487 0.430 0.457   | 0.494 0.437 0.464          | 0.579 0.049 0.090   | 0.3490
base - rewriting  | 0.554 0.399 0.464   | 0.492 0.356 0.413          | 0.606 0.354 0.447   | 0.4436
base - context    | 0.513 0.547 0.530   | 0.453 0.485 0.468          | 0.566 0.416 0.480   | 0.4964
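The Overall column in Table 1 is the weighted combination defined in Section 4 (0.4 × mention CEAF F1 + 0.3 × strong typed mention match F1 + 0.3 × strong link match F1). As a worked check against the table:

```python
def overall_f1(mention_ceaf_f1, typed_match_f1, link_match_f1):
    """NEEL-IT overall score: 0.4 * mention CEAF F1 plus 0.3 for each of
    the strong typed mention match and strong link match F1 scores."""
    return 0.4 * mention_ceaf_f1 + 0.3 * typed_match_f1 + 0.3 * link_match_f1

print(round(overall_f1(0.530, 0.472, 0.477), 4))  # base run: 0.4967
print(round(overall_f1(0.509, 0.463, 0.442), 4))  # merger run: 0.4751
```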
As- level when applied on tweets (for instance, our sociazione Italiana di Linguistica Computazionale system got the best performance in the linking task (AILC). thanks to The Wiki Machine). In the future, we Kalina Bontcheva, Leon Derczynski, Adam Funk, plan to adapt MicroNeel to English and other lan- Mark A. Greenwood, Diana Maynard, and Niraj guages, and to integrate some other modules both Aswani. 2013. TwitIE: An open-source information in the preprocessing and annotation steps, such the extraction pipeline for microblog text. In Recent Advances in Natural Language Processing, RANLP, NER system expressly developed for tweets de- pages 83–90. RANLP 2013 Organising Committee / scribed by Minard et al. (2016). ACL. Acknowledgments Chih-Chung Chang and Chih-Jen Lin. 2011. LIB- SVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, The research leading to this paper was partially 2:27:1–27:27. supported by the European Union’s Horizon 2020 Programme via the SIMPATICO Project (H2020- Jens Lehmann, Robert Isele, Max Jakob, Anja EURO-6-2015, n. 692819). Jentzsch, Dimitris Kontokostas, Pablo N. Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick van Kleef, Sören Auer, and Christian Bizer. 2015. DB- pedia - A large-scale, multilingual knowledge base References extracted from Wikipedia. Semantic Web, 6(2):167– 195. Pierpaolo Basile, Annalina Caputo, Anna Lisa Gen- tile, and Giuseppe Rizzo. 2016a. Overview Anne-Lyse Minard, Mohammed R.H. Qwaider, and of the EVALITA 2016 Named Entity rEcognition Bernardo Magnini. 2016. FBK-NLP at NEEL- and Linking in Italian tweets (NEEL-IT) task. In IT: Active Learning for Domain Adaptation. 
In Pierpaolo Basile, Anna Corazza, Franco Cutugno, Pierpaolo Basile, Anna Corazza, Franco Cutugno, Simonetta Montemagni, Malvina Nissim, Viviana Simonetta Montemagni, Malvina Nissim, Viviana Patti, Giovanni Semeraro, and Rachele Sprugnoli, Patti, Giovanni Semeraro, and Rachele Sprugnoli, editors, Proceedings of Third Italian Conference on editors, Proceedings of Third Italian Conference on Computational Linguistics (CLiC-it 2016) & Fifth Computational Linguistics (CLiC-it 2016) & Fifth Evaluation Campaign of Natural Language Pro- Evaluation Campaign of Natural Language Pro- cessing and Speech Tools for Italian. Final Work- cessing and Speech Tools for Italian. Final Work- shop (EVALITA 2016). Associazione Italiana di Lin- guistica Computazionale (AILC). Yaroslav Nechaev, Francesco Corcoglioniti, and Clau- dio Giuliano. 2016. Linking knowledge bases to social media profiles. http://alignments. futuro.media/. Alessio Palmero Aprosio and Claudio Giuliano. 2016. The Wiki Machine: an open source software for en- tity linking and enrichment. ArXiv e-prints. Alessio Palmero Aprosio and Giovanni Moretti. 2016. Italy goes to Stanford: a collection of CoreNLP modules for Italian. ArXiv e-prints. Alessio Palmero Aprosio, Claudio Giuliano, and Al- berto Lavelli. 2013. Automatic expansion of DB- pedia exploiting Wikipedia cross-language informa- tion. In Proceedings of the 10th Extended Semantic Web Conference, pages 397–411. Springer. Eros Zanchetta and Marco Baroni. 2005. Morph-it! a free corpus-based morphological resource for the Italian language. Corpus Linguistics 2005, 1(1).