=Paper=
{{Paper
|id=None
|storemode=property
|title=NERTUW: Named Entity Recognition on Tweets using Wikipedia
|pdfUrl=https://ceur-ws.org/Vol-1019/paper_35.pdf
|volume=Vol-1019
|dblpUrl=https://dblp.org/rec/conf/msm/SachidanandanSK13
}}
==NERTUW: Named Entity Recognition on Tweets using Wikipedia==
MSM2013 IE Challenge

Sandhya Sachidanandan, Prathyush Sambaturu, and Kamalakar Karlapalem

IIIT-Hyderabad
{sandhya.s,prathyush.sambaturu}@research.iiit.ac.in, kamal@iiit.ac.in

Abstract. We propose an approach that recognizes named entities in tweets, disambiguates them, and classifies them into four categories, namely person, organization, location, and miscellaneous, using Wikipedia. Our approach annotates tweets on the fly, i.e., it does not require any training data.

Keywords: named entity recognition, entity disambiguation, entity classification

===1 Introduction===

A significant share of the tweets generated each day discuss popular entities such as persons, locations, and organizations. Most popular entities have a page in Wikipedia, so Wikipedia is a useful source of information for recognizing popular named entities in tweets. Moreover, Wikipedia contains a huge number of names of different types of entities, which helps us recognize even entities that do not have an explicit Wikipedia page of their own.

Tweets are very short, so a tweet may or may not carry enough context to disambiguate the named entities it contains. Only a small number of words in a tweet support the disambiguation of its named entities, and they need to be used efficiently. If the tweet does not have enough context to disambiguate its named entities, the popularity of each entity has to be leveraged instead. Disambiguating an entity is essential for classifying it correctly as a location, person, organization, or miscellaneous entity.

Our contributions are: 1) an approach that uses the titles, anchors, and infoboxes in Wikipedia, a little information from WordNet, and the context information in tweets to recognize, disambiguate, and classify named entities in tweets; 2) an approach that requires no training data, and hence no human labelling effort; 3) along with the global information from Wikipedia, our approach exploits the context of the tweet by mapping its words to their correct senses using a word sense disambiguation approach, which is then used to disambiguate the named entities in the tweet. This also helps disambiguate any words in the tweet other than the named entities.

===2 Approach===

– The input tweet is split into ngrams. The link probability of each ngram is calculated as in [1], and ngrams with link probability below a threshold τ (experimentally set to 0.01) are discarded. The link probability of a phrase p is calculated as shown in Equation 1:

$$ LProb(p) = \frac{n_a(p, W)}{n(p, T)} \qquad (1) $$

where $n_a(p, W)$ is the number of times the phrase p is used as anchor text in Wikipedia W, and $n(p, T)$ is the number of times the phrase occurs as text in a corpus T of around one million tweets. Each concept associated with a phrase gets the same link probability LProb(p).

– For each ngram, a set of Wikipedia article titles is obtained based on lexical match. The Wikipedia article titles mapped to the longest matching ngrams are then treated as candidate entities for disambiguation. For each ngram that matches the title of a Wikipedia disambiguation page, all the articles related to that ngram are added as well. A sketch of these first two steps is shown below.
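The following minimal sketch illustrates these two steps. It is an illustration, not the authors' code: the corpus statistics (anchor_counts, tweet_counts), the title lookup title_index, the maximum ngram length, and all helper names are hypothetical stand-ins for whatever Wikipedia and tweet-corpus resources are available.

```python
TAU = 0.01  # link-probability threshold, experimentally set in the paper

def ngrams(tokens, max_n=4):
    """Yield all contiguous ngrams of the tweet, up to max_n tokens."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def link_prob(phrase, anchor_counts, tweet_counts):
    """Equation 1: LProb(p) = n_a(p, W) / n(p, T).

    anchor_counts[p]: times p occurs as anchor text in Wikipedia (hypothetical stats).
    tweet_counts[p]:  times p occurs as text in a ~1M tweet corpus (hypothetical stats).
    """
    occurrences = tweet_counts.get(phrase, 0)
    if occurrences == 0:
        return 0.0
    return anchor_counts.get(phrase, 0) / occurrences

def candidate_entities(tweet, anchor_counts, tweet_counts, title_index):
    """Keep ngrams with LProb >= TAU, then map the longest matches to titles.

    title_index maps a lowercased phrase to the Wikipedia article titles
    (including those listed on a disambiguation page) that lexically match it.
    """
    tokens = tweet.split()
    kept = [g for g in ngrams(tokens)
            if link_prob(g.lower(), anchor_counts, tweet_counts) >= TAU]
    # Prefer the longest matching ngrams: skip an ngram if it is contained
    # in a longer ngram that was already mapped to some title.
    kept.sort(key=len, reverse=True)
    candidates, covered = {}, []
    for g in kept:
        if g.lower() in title_index and not any(g in longer for longer in covered):
            candidates[g] = title_index[g.lower()]
            covered.append(g)
    return candidates  # ngram -> list of candidate Wikipedia titles
```

On the example tweet of Figure 2, given suitable counts and a title index, candidate_entities("The Artist and Meryl Streep won oscar award", ...) would keep "The Artist" and "Meryl Streep" as the longest matching ngrams and return their candidate titles.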
– The candidate entities are then passed to a syntax analyser, which uses YAGO's type relation to extract the WordNet synsets mapped to the candidate entities. With the synsets mapped to the candidate entities, together with all the synsets of the verbs and common nouns in the tweet, as vertices, a syntax graph is generated using WordNet. The idea behind the syntax graph is to identify the candidate entities that are supported by the syntax of the text. Since this must go hand in hand with disambiguating the words in the text, we found the approach proposed in [3] appropriate. To identify the candidate entities supported by the syntax of the tweet, we modify [3] by adding the WordNet words mapped to the candidate entities to the syntax graph being generated. If a candidate entity is supported by the syntax of the tweet, the WordNet words mapped to it get connected to the correct senses of the words added from the tweet. A portion of the syntax graph generated for a tweet is shown in Figure 1.

Fig. 1. A portion of the syntax graph created from the tweet "How can #SMRT take in 161 million of profit and yet deliver sich a crapy service? why do we a company that puts....". The vertices include the words mapped to candidate entities by YAGO along with all the senses of the common nouns and verbs obtained from the tweet; the edges represent relations between the vertices in WordNet. (Node labels visible in the figure: organization, establishment, social unit, company, activity, sitcom, service, drama, assistance, comedy.)

– The PageRank algorithm [2] is then applied to the syntax graph, setting high prior probabilities for the synsets of the common nouns and verbs added from the tweet. The average score of all the synsets mapped to a candidate entity is treated as its syntax score.

– With the candidate entities as vertices, a semantic graph is created. The similarity between each pair of candidate entities is calculated, and an edge with the similarity score as its weight is added if the score exceeds an experimentally set threshold. This connects the most related candidate entities in the resulting semantic graph, which may leave many connected components. An example of such a semantic graph is shown in Figure 2.

Fig. 2. A portion of the semantic graph obtained from the tweet "The Artist and Meryl Streep won oscar award". The vertices represent the candidate entities (Meryl Streep, OSCAR, The Artist (film), The Artist (magazine), Academy Award, The Oscar (film)), and the edges represent their semantic relatedness.

– The weighted PageRank algorithm [4] is then applied to the semantic graph, and the resulting score of each candidate entity is treated as its final ranking score. The prior for each candidate entity is set to a linear combination of the following scores:

• the syntax score of the entity, as calculated by the syntax analyser, which represents the context information in the tweet;
• the link probability of the ngram from which the candidate entity was generated;
• the anchor probability of the candidate entity, i.e. the number of times the entity is used as an anchor in Wikipedia.

Both the link probability and the anchor probability represent the popularity of the candidate entity, which plays a significant role in disambiguation when little or no context information is available in the tweet.

– Entity classification: Each ngram that has a candidate entity in the semantic graph is considered a named entity. For each such ngram, the candidate entity with the highest PageRank score in the semantic graph is given to a named entity classifier, which uses the keywords in the infobox of the candidate entity's Wikipedia page to classify it as person, location, organization, or miscellaneous. We extracted the unique keywords with the maximum occurrence for each entity type in the training data to classify the named entities. A sketch of these ranking and classification steps follows.
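The sketch below is a minimal reconstruction of the ranking and classification steps, not the authors' implementation. It assumes networkx is available and uses its personalized pagerank as a stand-in for the weighted PageRank of [4]; the similarity function, the similarity threshold, the mixing weights, and the infobox keyword sets are all hypothetical placeholders.

```python
import networkx as nx  # assumed dependency; its PageRank stands in for [4]

SIM_THRESHOLD = 0.3  # hypothetical stand-in for the experimentally set threshold

def rank_candidates(candidates, similarity, syntax_score, ngram_link_prob,
                    anchor_prob, weights=(0.4, 0.3, 0.3)):  # hypothetical weights
    """Build the semantic graph over candidate entities and rank them.

    candidates: ngram -> list of candidate Wikipedia titles.
    similarity(a, b): semantic relatedness between two candidate entities.
    syntax_score, ngram_link_prob, anchor_prob: callables giving the three
    prior components described above.
    """
    graph = nx.Graph()
    pairs = [(ng, e) for ng, ents in candidates.items() for e in ents]
    graph.add_nodes_from(e for _, e in pairs)
    # Add an edge, weighted by similarity, for each sufficiently related pair.
    for i, (_, a) in enumerate(pairs):
        for _, b in pairs[i + 1:]:
            s = similarity(a, b)
            if s > SIM_THRESHOLD:
                graph.add_edge(a, b, weight=s)
    # Prior: linear combination of syntax score, link and anchor probability.
    w1, w2, w3 = weights
    priors = {e: w1 * syntax_score(e) + w2 * ngram_link_prob(ng) + w3 * anchor_prob(e)
              for ng, e in pairs}
    scores = nx.pagerank(graph, personalization=priors, weight="weight")
    # Keep the highest-scoring candidate entity for each ngram.
    best = {}
    for ng, e in pairs:
        if ng not in best or scores[e] > scores[best[ng]]:
            best[ng] = e
    return best  # ngram -> top-ranked candidate entity

# Hypothetical keyword sets mined from infoboxes of each entity type.
TYPE_KEYWORDS = {
    "person": {"birth_date", "occupation", "spouse"},
    "location": {"coordinates", "population", "country"},
    "organization": {"founded", "headquarters", "industry"},
}

def classify(infobox_keys):
    """Assign the type whose keyword set overlaps the entity's infobox most."""
    overlap = {t: len(set(infobox_keys) & kws) for t, kws in TYPE_KEYWORDS.items()}
    best_type = max(overlap, key=overlap.get)
    return best_type if overlap[best_type] > 0 else "miscellaneous"
```

In this sketch, an entity whose article has no infobox falls through to miscellaneous, which mirrors the failure mode discussed in the next section.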
===3 Error analysis and Discussion===

– We use an automated and scalable approach that collects keywords from the infoboxes of Wikipedia pages to identify the different entity types. Although it classifies a significant number of entities correctly, it fails when an article does not contain an infobox.

– Since not all entities are present in Wikipedia, we use a post-processing step that merges adjacent entities of the same type occurring in the tweet. Further post-processing could merge adjacent entities of different types and assign the most generic type to the merged entity, but this is not done.

===References===

1. E. Meij, W. Weerkamp, and M. de Rijke. Adding semantics to microblog posts. In WSDM, 2012.
2. R. Mihalcea, P. Tarau, and E. Figa. PageRank on semantic networks, with application to word sense disambiguation. In COLING, 2004.
3. R. Navigli and M. Lapata. Graph connectivity measures for unsupervised word sense disambiguation. In IJCAI, pages 1683–1688, 2007.
4. W. Xing and A. A. Ghorbani. Weighted PageRank algorithm. In CNSR, pages 305–314, 2004.