Learning with the Web: Spotting Named Entities on the Intersection of NERD and Machine Learning

Marieke van Erp (VU University Amsterdam, The Netherlands, marieke.van.erp@vu.nl)
Giuseppe Rizzo (EURECOM, Sophia Antipolis, France, giuseppe.rizzo@eurecom.fr)
Raphaël Troncy (EURECOM, Sophia Antipolis, France, raphael.troncy@eurecom.fr)

Abstract. Microposts shared on social platforms instantaneously report facts, opinions or emotions. Entities are frequently mentioned in these posts, but they change continuously depending on what is currently trending. In such a scenario, recognising these named entities is a challenging task for which off-the-shelf approaches are not well equipped. We propose NERD-ML, an approach that unifies the benefits of a crowd of Web entity extractors with the linguistic strengths of a machine learning classifier.

Keywords: Named Entity Recognition, NERD, Machine Learning

1 Introduction

Microposts are a highly popular medium to share facts, opinions or emotions. They hold great potential for researchers and companies alike to tap into a vast, heterogeneous and instantaneous barometer of what is currently trending in the world. However, due to their brief and fleeting nature, microposts provide a challenging playground for text analysis tools, which are often tuned to longer and more stable texts. We present a hybrid approach that attempts to alleviate this problem by unifying the benefits of a crowd of Web entity extractors with the linguistic strengths of a machine learning classifier.

2 The NERD-ML System

In our approach, we combine a mix of NER systems in order to deal with the brief and fleeting nature of microposts. The three main modules of our approach are NERD, Ritter et al.'s system, and Stanford NER. NERD [4] is used to spot entities using a variety of Web extractors.
The strength of this approach lies in the fact that these systems have access to large knowledge bases of entities such as DBpedia^3 and Freebase^4. Ritter et al. [3] propose a tailored approach for entity recognition based on a previously annotated Twitter stream, while Stanford NER [1] represents the state of the art in entity recognition, providing off-the-shelf or customisable NER using a machine learning algorithm. While NERD and Ritter et al.'s approach are used as off-the-shelf extractors, Stanford NER is trained on the MSM training dataset. The outputs of these systems are used as features for NERD-ML's final machine learning module. We have also added extra features based on the token and the micropost format to further aid the system. The generated feature sets can be fed into any machine learning algorithm in order to learn the optimal extractor/feature combination. An overview of our system is shown in Figure 1. In the remainder of this section we explain the components.

Copyright © 2013 held by author(s)/owner(s). Published as part of the #MSM2013 Workshop Concept Extraction Challenge Proceedings, available online as CEUR Vol-1019 at http://ceur-ws.org/Vol-1019. Making Sense of Microposts Workshop @ WWW'13, May 13th 2013, Rio de Janeiro, Brazil.

[Fig. 1: Overview of the NERD-ML System. Components: Preprocessing, NERD Extractors, Ritter et al. (2011), Stanford NER, Feature generation, Machine Learner.]

Preprocessing: In the preprocessing phase, the data is formatted to comply with the input format of our extractors. For ease of use, the dataset is converted to the CoNLL IOB format [5]. Furthermore, posts from the MSM2013 training data are divided randomly over 10 parts in order to a) be able to perform a 10-fold cross-validation experiment and b) comply with NERD file size limitations.
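As a concrete illustration, this preprocessing step can be sketched in Python as follows. The post representation (token list plus entity-span dictionary), the function names and the fixed random seed are our own assumptions for the sketch, not part of the challenge scripts:

```python
import random

def to_conll_iob(posts):
    """Convert microposts to CoNLL-style IOB lines.

    Each post is a (tokens, entities) pair, where entities maps a
    (start, end) token span to one of the four MSM2013 classes.
    """
    lines = []
    for tokens, entities in posts:
        labels = ["O"] * len(tokens)
        for (start, end), etype in entities.items():
            labels[start] = "B-" + etype          # first token of the entity
            for i in range(start + 1, end):
                labels[i] = "I-" + etype          # continuation tokens
        lines.extend(f"{tok}\t{lab}" for tok, lab in zip(tokens, labels))
        lines.append("")  # blank line separates posts, as in CoNLL files
    return lines

def split_folds(posts, n_folds=10, seed=42):
    """Shuffle the posts and divide them over n_folds roughly equal parts."""
    posts = list(posts)
    random.Random(seed).shuffle(posts)
    return [posts[i::n_folds] for i in range(n_folds)]
```

Each of the ten parts can then serve once as the held-out fold in the cross-validation experiment while also staying within the NERD file size limit.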
NERD Extractors: Each of the data parts is sent to the NERD API to retrieve named entities from the following extractors: AlchemyAPI, DBpedia Spotlight (settings: confidence=0, support=0, spotter=CoOccurrenceBasedSelector), Extractiv, Lupedia, OpenCalais, Saplo, TextRazor, Wikimeta, Yahoo and Zemanta (setting: markup limit=10). The NERD ontology consists of 75 classes, which are mapped to the four classes of the MSM2013 challenge.

Ritter et al. 2011: The off-the-shelf approach described in [3] is taken both as a baseline and as input for the hybrid classifier. Its 10 entity classes are mapped to the four classes of the MSM2013 challenge.

3 http://www.dbpedia.org
4 http://www.freebase.com

[Fig. 2: Results of individual and combined extractors in 10-fold cross-validation experiments. Three panels show precision, recall and F-measure per class (LOC, MISC, ORG, PER) for AlchemyAPI, DBpedia Spotlight, Extractiv, Lupedia, OpenCalais, Saplo, TextRazor, Wikimeta, Yahoo, Zemanta, Ritter et al., Stanford NER, and NERD-ML Runs 01-03.]

Stanford NER: The Stanford NER system (version 1.2.7) is retrained on the MSM2013 challenge dataset, using parameters based on the properties file english.conll.4class.distsim.crf.ser.gz provided with the Stanford distribution. The Stanford results serve as a baseline, as well as input for the hybrid classifier.

Feature Generation: To aid the classifier in making sense of the structure of the microposts, we added 8 additional features to the dataset, inspired by the features described in [3].
We implemented the following features: capitalisation information (initial capital, all caps, proportion of capitalised tokens in the micropost), prefix (first three letters of the token), suffix (last three letters of the token), whether the token is at the beginning or end of the micropost, and the part-of-speech tag of the token, obtained using the TwitterNLP tool and the POS tagset from [2].

NERD-ML: The output generated by the NERD extractors, Ritter et al.'s system and the Stanford NER system, together with the added features, is used to create feature vectors. The feature vectors serve as input to a machine learning algorithm in order to find combinations of features and extractor outputs that improve the scores of the individual extractors. We experimented with several different algorithms and machine learning settings using WEKA 3.6.9 (http://www.cs.waikato.ac.nz/ml/weka).

3 Results

In Figure 2, the results of the individual NER components and the hybrid NERD-ML system are presented. The first run is a baseline run that includes the full feature set. The second run only includes the extractors and no extra features. The third run uses a smaller feature set that was compiled through automatic feature selection. The settings of the three runs of the hybrid NERD-ML system are:

Run 1: All features, k-NN, k=1, Euclidean distance, 10-fold cross-validation
Run 2: AlchemyAPI, DBpedia Spotlight, Extractiv, Lupedia, OpenCalais, Saplo, Yahoo, TextRazor, Wikimeta, Zemanta, Stanford NER, Ritter et al., SMO, standard parameters, 10-fold cross-validation
Run 3: POS, Initial Capital, Suffix, Proportion of Capitals, AlchemyAPI, DBpedia Spotlight, Extractiv, OpenCalais, TextRazor, Wikimeta, Stanford NER, Ritter et al., SMO, standard parameters, 10-fold cross-validation

Results are computed using the conlleval script and plotted using R.
All settings and scripts are publicly available^6.

4 Conclusions

Extracting named entities from microposts is a difficult task due to the ever-changing nature of the data, the breadth of topics discussed and the linguistic inconsistencies microposts contain. Our experiments with NERD-ML show that the combination of different NER systems outperforms the off-the-shelf approaches, as well as the customised Stanford approach. Our results indicate that a hybrid system may be better equipped to deal with the task of identifying entities in microposts, but care must be taken in combining features and extractor outputs.

Acknowledgments

This work was partially supported by the European Union's 7th Framework Programme via the project LinkedTV (GA 287911).

References

1. Finkel, J.R., Grenager, T., Manning, C.: Incorporating non-local information into information extraction systems by Gibbs sampling. In: 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05). Ann Arbor, MI, USA (June 2005)
2. Owoputi, O., O'Connor, B., Dyer, C., Gimpel, K., Schneider, N., Smith, N.A.: Improved part-of-speech tagging for online conversational text with word clusters. In: Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2013). Atlanta, GA, USA (June 2013)
3. Ritter, A., Clark, S., Mausam, Etzioni, O.: Named entity recognition in tweets: An experimental study. In: Empirical Methods in Natural Language Processing (EMNLP'11). Edinburgh, UK (July 2011)
4. Rizzo, G., Troncy, R.: NERD: A framework for unifying named entity recognition and disambiguation extraction tools. In: 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL'12). Avignon, France (April 2012)
5. Tjong Kim Sang, E.F.: Introduction to the CoNLL-2002 shared task: Language-independent named entity recognition. In: Conference on Computational Natural Language Learning (CoNLL'02).
Taipei, Taiwan (Aug-Sept 2002)

6 https://github.com/giusepperizzo/nerdml