Learning with the Web: Spotting Named Entities on the Intersection of NERD and Machine Learning

Marieke van Erp (VU University Amsterdam, The Netherlands, marieke.van.erp@vu.nl)
Giuseppe Rizzo (EURECOM, Sophia Antipolis, France, giuseppe.rizzo@eurecom.fr)
Raphaël Troncy (EURECOM, Sophia Antipolis, France, raphael.troncy@eurecom.fr)

Abstract. Microposts shared on social platforms instantaneously report facts, opinions or emotions. Entities are frequently mentioned in these posts, but they change continuously depending on what is currently trending. In such a scenario, recognising these named entities is a challenging task for which off-the-shelf approaches are not well equipped. We propose NERD-ML, an approach that unifies the benefits of a crowd of Web entity extractors with the linguistic strengths of a machine learning classifier.

Keywords: Named Entity Recognition, NERD, Machine Learning

1 Introduction

Microposts are a highly popular medium to share facts, opinions or emotions. They hold great potential for researchers and companies alike to tap into a vast, heterogeneous and instantaneous barometer of what is currently trending in the world. However, due to their brief and fleeting nature, microposts provide a challenging playground for text analysis tools, which are often tuned to longer and more stable texts. We present a hybrid approach that attempts to alleviate this problem by unifying the benefits of a crowd of Web entity extractors with the linguistic strengths of a machine learning classifier.

2 The NERD-ML System

In our approach, we combine a mix of NER systems in order to deal with the brief and fleeting nature of microposts. The three main modules of our approach are NERD, Ritter et al.'s system, and Stanford NER. NERD [4] is used to spot entities using a variety of Web extractors.
The strength of this approach lies in the fact that these systems have access to large knowledge bases of entities such as DBpedia^3 and Freebase^4. Ritter et al. [3] propose a tailored approach for entity recognition based on a previously annotated Twitter stream, while Stanford NER [1] represents the state of the art in entity recognition, providing off-the-shelf or customisable NER using a machine learning algorithm. While NERD and Ritter et al.'s approach are used as off-the-shelf extractors, Stanford NER is trained on the MSM training dataset. The outputs of these systems are used as features for NERD-ML's final machine learning module. We have also added extra features based on the token and the micropost format to further aid the system. The generated feature sets can be fed into any machine learning algorithm in order to learn the optimal extractor/feature combination. An overview of our system is shown in Figure 1. In the remainder of this section we explain the components.

Copyright © 2013 held by author(s)/owner(s). Published as part of the #MSM2013 Workshop Concept Extraction Challenge Proceedings, available online as CEUR Vol-1019 at http://ceur-ws.org/Vol-1019. Making Sense of Microposts Workshop @ WWW'13, May 13th 2013, Rio de Janeiro, Brazil.

[Fig. 1: Overview of the NERD-ML System. Components: Preprocessing, NERD Extractors, Ritter et al. (2011), Stanford NER, Feature generation, Machine Learner.]

Preprocessing: In the preprocessing phase, the data is formatted to comply with the input format of our extractors. For ease of use, the dataset is converted to the CoNLL IOB format [5]. Furthermore, posts from the MSM2013 training data are divided randomly over 10 parts in order to a) be able to perform a 10-fold cross-validation experiment and b) comply with NERD file size limitations.
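As a concrete illustration, this preprocessing step can be sketched in Python as follows. The post representation (token list plus entity-span dictionary), the function names and the fixed random seed are our own assumptions for the sketch, not part of the challenge scripts:

```python
import random

def to_conll_iob(posts):
    """Convert microposts to CoNLL-style IOB lines.

    Each post is a (tokens, entities) pair, where entities maps a
    (start, end) token span to one of the four MSM2013 classes.
    """
    lines = []
    for tokens, entities in posts:
        labels = ["O"] * len(tokens)
        for (start, end), etype in entities.items():
            labels[start] = "B-" + etype          # first token of the entity
            for i in range(start + 1, end):
                labels[i] = "I-" + etype          # continuation tokens
        lines.extend(f"{tok}\t{lab}" for tok, lab in zip(tokens, labels))
        lines.append("")  # blank line separates posts, as in CoNLL files
    return lines

def split_folds(posts, n_folds=10, seed=42):
    """Shuffle the posts and divide them over n_folds roughly equal parts."""
    posts = list(posts)
    random.Random(seed).shuffle(posts)
    return [posts[i::n_folds] for i in range(n_folds)]
```

Each of the ten parts can then serve once as the held-out fold in the cross-validation experiment while also staying within the NERD file size limit.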
NERD Extractors: Each of the data parts is sent to the NERD API to retrieve named entities from the following extractors: AlchemyAPI, DBpedia Spotlight (settings: confidence=0, support=0, spotter=CoOccurrenceBasedSelector), Extractiv, Lupedia, OpenCalais, Saplo, TextRazor, Wikimeta, Yahoo and Zemanta (setting: markup limit=10). The NERD ontology consists of 75 classes, which are mapped to the four classes of the MSM2013 challenge.

Ritter et al. 2011: The off-the-shelf approach described in [3] is taken both as a baseline and as input for the hybrid classifier. Its 10 entity classes are mapped to the four classes of the MSM2013 challenge.

3 http://www.dbpedia.org
4 http://www.freebase.com

[Fig. 2: Results of individual and combined extractors in 10-fold cross-validation experiments. Three panels show precision, recall and F-measure per class (LOC, MISC, ORG, PER) for AlchemyAPI, DBpedia Spotlight, Extractiv, Lupedia, OpenCalais, Saplo, TextRazor, Wikimeta, Yahoo, Zemanta, Ritter et al., Stanford NER, and NERD-ML Runs 01-03.]

Stanford NER: The Stanford NER system (version 1.2.7) is retrained on the MSM2013 challenge dataset, using parameters based on the properties file english.conll.4class.distsim.crf.ser.gz provided with the Stanford distribution. The Stanford results serve as a baseline, as well as input for the hybrid classifier.

Feature Generation: To aid the classifier in making sense of the structure of the microposts, we added 8 additional features to the dataset, inspired by the features described in [3].
We implemented the following features: capitalisation information (initial capital, all caps, proportion of capitalised tokens in the micropost), prefix (first three letters of the token), suffix (last three letters of the token), whether the token is at the beginning or end of the micropost, and the part-of-speech tag of the token, obtained using the TwitterNLP tool and the POS tagset from [2].

NERD-ML: The output generated by the NERD extractors, Ritter et al.'s system and the Stanford NER system, together with the added features, is used to create feature vectors. The feature vectors serve as input to a machine learning algorithm in order to find combinations of features and extractor outputs that improve the scores of the individual extractors. We experimented with several different algorithms and machine learning settings using WEKA 3.6.9 (http://www.cs.waikato.ac.nz/ml/weka).

3 Results

In Figure 2, the results of the individual NER components and the hybrid NERD-ML system are presented. The first run is a baseline run that includes the full feature set. The second run only includes the extractors and no extra features. The third run uses a smaller feature set that was compiled through automatic feature selection. The settings of the three runs of the hybrid NERD-ML system are:

Run 1: All features, k-NN, k=1, Euclidean distance, 10-fold cross-validation
Run 2: AlchemyAPI, DBpedia Spotlight, Extractiv, Lupedia, OpenCalais, Saplo, Yahoo, TextRazor, Wikimeta, Zemanta, Stanford NER, Ritter et al., SMO, standard parameters, 10-fold cross-validation
Run 3: POS, Initial Capital, Suffix, Proportion of Capitals, AlchemyAPI, DBpedia Spotlight, Extractiv, OpenCalais, TextRazor, Wikimeta, Stanford NER, Ritter et al., SMO, standard parameters, 10-fold cross-validation

Results are computed using the conlleval script and plotted using R.
All settings and scripts are publicly available^6.

4 Conclusions

Extracting named entities from microposts is a difficult task due to the ever-changing nature of the data, the breadth of topics discussed and the linguistic inconsistencies microposts contain. Our experiments with NERD-ML show that the combination of different NER systems outperforms the off-the-shelf approaches, as well as the customised Stanford approach. Our results indicate that a hybrid system may be better equipped to deal with the task of identifying entities in microposts, but care must be taken in combining features and extractor outputs.

Acknowledgments

This work was partially supported by the European Union's 7th Framework Programme via the project LinkedTV (GA 287911).

References

1. Finkel, J.R., Grenager, T., Manning, C.: Incorporating non-local information into information extraction systems by Gibbs sampling. In: 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05). Ann Arbor, MI, USA (June 2005)
2. Owoputi, O., O'Connor, B., Dyer, C., Gimpel, K., Schneider, N., Smith, N.A.: Improved part-of-speech tagging for online conversational text with word clusters. In: Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2013). Atlanta, GA, USA (June 2013)
3. Ritter, A., Clark, S., Mausam, Etzioni, O.: Named entity recognition in tweets: An experimental study. In: Empirical Methods in Natural Language Processing (EMNLP'11). Edinburgh, UK (July 2011)
4. Rizzo, G., Troncy, R.: NERD: A framework for unifying named entity recognition and disambiguation extraction tools. In: 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL'12). Avignon, France (April 2012)
5. Tjong Kim Sang, E.F.: Introduction to the CoNLL-2002 shared task: Language-independent named entity recognition. In: Conference on Computational Natural Language Learning (CoNLL'02).
Taipei, Taiwan (Aug-Sept 2002)

6 https://github.com/giusepperizzo/nerdml