=Paper=
{{Paper
|id=Vol-1141/paper_16
|storemode=property
|title=Linking Entities in #Microposts
|pdfUrl=https://ceur-ws.org/Vol-1141/paper_16.pdf
|volume=Vol-1141
|dblpUrl=https://dblp.org/rec/conf/msm/BansalPRGV14
}}
==Linking Entities in #Microposts==
Romil Bansal, Sandeep Panem, Priya Radhakrishnan, Manish Gupta, Vasudeva Varma
International Institute of Information Technology, Hyderabad
ABSTRACT

Social media has emerged as an important source of information. Entity linking in social media provides an effective way to extract useful information from microposts shared by users. Entity linking in microposts is a difficult task, as microposts lack sufficient context to disambiguate the entity mentions. In this paper, we do entity linking by first identifying entity mentions and then disambiguating the mentions based on three different features: (1) similarity between the mention and the corresponding Wikipedia entity pages; (2) similarity between the mention and the tweet text with the anchor text strings across multiple webpages; and (3) popularity of the entity on Twitter at the time of disambiguation. The system is tested on the manually annotated dataset provided by the Named Entity Extraction and Linking (NEEL) Challenge 2014, and the obtained results are on par with state-of-the-art methods.

Categories and Subject Descriptors

H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms

Algorithms, Experimentation

Keywords

Named Entity Extraction and Linking (NEEL) Challenge, Entity Linking, Entity Disambiguation, Social Media

1. INTRODUCTION

Social media networks like Twitter have emerged as major platforms for sharing information in the form of short messages (tweets). Analysis of tweets can be useful for various applications like e-commerce, entertainment, recommendations, etc. Entity linking is one such analysis task, which deals with finding the correct referent entities in a knowledge base for the various mentions in a tweet. Entity linking in social media is important, as it helps in detecting, understanding and tracking information about an entity shared across social media.

Entity linking consists of two different tasks: mention detection and entity disambiguation. Entity linking from general text is a well-explored problem, but existing entity linking tools are intended for use over news corpora and similar document-based corpora of relatively long length. As microposts lack sufficient context, these context-based approaches fail to perform well on them.

In this paper we describe our system proposed for the NEEL Challenge 2014 [1]. The proposed system disambiguates the entity mentions in tweets based on three different measures: (1) Wikipedia's context based measure (§2.2.1); (2) anchor text based measure (§2.2.2); and (3) Twitter popularity based measure (§2.2.3). Mention detection is done using existing Twitter part-of-speech (POS) taggers [2, 5].

2. OUR APPROACH

2.1 Mention Detection

Mention detection is the task of finding entity mentions in the given text. We take mentions to be the named entities present inside the tweets. Various approaches for named entity recognition in tweets have been proposed recently [3, 5]. These include spotting continuous sequences of proper nouns as named entities in the tweet. But named entities like 'Statue of Liberty', 'Game of Thrones', etc. also include tokens other than nouns. To detect such mentions, Ritter et al. [5] proposed a machine learning based system for named entity detection in tweets, and Gimpel et al. [2] present yet another approach for POS tagging of tweets. We tried both of these POS taggers to extract proper noun sequences. In our experiments, Ritter et al.'s tagger gave an accuracy of 77% while Gimpel et al.'s tagger gave an accuracy of 92%, so we merged the results from both, as shown in Fig. 1. The tweet text is fed to the system, and the longest continuous sequences of proper noun tokens detected using the above approach are extracted as the entity mentions from the given tweet. The merged system provided an accuracy of 98% in predicting mentions.

Figure 1: System Architecture. (The tweet text is fed to the ARK POS tagger of Gimpel et al. [2] and the T-NER POS tagger of Ritter et al. [5], whose outputs are merged for mention detection; the Wikipedia based, anchor text based and Twitter popularity based measures are then combined with LambdaMART for entity disambiguation, yielding the entity mentions.)

Copyright © 2014 held by author(s)/owner(s); copying permitted only for private and academic purposes. Published as part of the #Microposts2014 Workshop proceedings, available online as CEUR Vol-1141 (http://ceur-ws.org/Vol-1141). WWW'14 Companion, #Microposts2014, April 7–11, 2014, Seoul, Korea. ACM 978-1-4503-2745-9/14/04. http://dx.doi.org/XYZW.

· #Microposts2014 · 4th Workshop on Making Sense of Microposts · @WWW2014
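The extraction of longest proper-noun sequences and the merging of the two taggers' outputs in §2.1 can be sketched as follows. This is a minimal illustration, not the authors' exact implementation: the `(token, tag)` input format, the NNP/NNPS tag set, and the containment-based merge rule are all assumptions.

```python
def extract_mentions(tagged_tokens, proper_noun_tags=frozenset({"NNP", "NNPS"})):
    """Collect maximal runs of proper-noun tokens as candidate mentions."""
    mentions, run = [], []
    for token, tag in tagged_tokens:
        if tag in proper_noun_tags:
            run.append(token)
        else:
            if run:
                mentions.append(" ".join(run))
            run = []
    if run:  # flush a run that ends at the last token
        mentions.append(" ".join(run))
    return mentions

def merge_mentions(mentions_a, mentions_b):
    """Union of two taggers' outputs, dropping any mention that is
    strictly contained in a longer mention from either tagger."""
    merged = set(mentions_a) | set(mentions_b)
    return [m for m in merged
            if not any(m != other and m in other for other in merged)]
```

For example, tagging "Barack Obama visited New York" yields the runs `["Barack Obama", "New York"]`, and merging `["Barack Obama"]` with `["Obama", "New York"]` keeps only the two longer mentions, mirroring the preference for the longest continuous sequence described above.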
2.2 Entity Disambiguation

Entity disambiguation is the task of assigning the correct referent entity from the knowledge base to the given mention. We disambiguate the entity mention using three measures, described below. The scores from these three measures are combined using the LambdaMART [7] model to arrive at the final disambiguated entity.

2.2.1 Wikipedia's Context based Measure (M1)

This measure disambiguates a mention by calculating the frequency of occurrence of the mention in the Wikipedia corpus. Wikipedia's context based measure has been used in various approaches for disambiguating mentions in tweets [4]. We query the MediaWiki API¹ with the entity mention. The MediaWiki API returns the candidate entities in ranked order, and each candidate entity is assigned its reciprocal rank as its score. Thus, a ranked list of candidate entities with their scores is created using M1.

2.2.2 Anchor Text based Measure (M2)

The Google Cross-Wiki Dictionary (GCD) [6] is a string-to-concept mapping created using anchor text from various web pages. A concept is an individual Wikipedia article, identified by its URL; the text strings are the anchor hypertexts that refer to these concepts, so anchor text strings represent a concept. We query the GCD with a mention along with the tweet text. Based on the similarity to the query string, a ranked list of probable candidate entities is created (the ranked list using M2). The ranking criterion is the Jaccard similarity between the anchor text and the query, so if the mention is highly similar to the anchor text, the corresponding concept receives a high score.

2.2.3 Twitter Popularity based Measure (M3)

Tweets about entities follow a bursty pattern: bursts of tweets appear after an event relating to an entity happens. We exploit this fact and measure the number of times the given mention has recently referred to a particular entity on Twitter. The mention is queried on the Twitter API² and the resultant tweets are analyzed. All the tweets along with the mention are then queried on the GCD and the candidate entities are taken. Based on the scores returned by the GCD, all the candidate entities are ranked (the ranked list using M3). As the Twitter popularity based measure captures people's interests at a particular time, it works well for entity disambiguation on recent tweets. In essence, methods M2 and M3 are similar but take different inputs: both use the GCD and produce candidate entities and scores as output, but M2 takes the mention and a single tweet text as input, whereas M3 takes the mention and multiple tweets as input.

We thus have three rankings, from M1, M2 and M3. The task now is to arrive at the final ranking of the candidate entities by combining the rankings of the three different models such that the overall F1 score is maximized. For this, we use LambdaMART, which combines the LambdaRank and MART models and creates boosted regression trees for combining the rankings of the three different systems.

3. RESULTS AND EVALUATION

The dataset comprises 2.3K tweets, each annotated with its entity mentions and their corresponding DBpedia URLs. We divided the dataset in a 7:3 (train:test) ratio. Table 1 shows the results obtained using the NEEL Challenge evaluation framework. The best results are obtained when a combination of all the measures is used for disambiguation³. A 5-fold cross validation on the dataset gave an average F1 of 0.52 for M1+M2+M3.

Table 1: Results. M1 represents Wikipedia's Context based Measure (§2.2.1), M2 represents Anchor Text based Measure (§2.2.2) and M3 represents Twitter Popularity based Measure (§2.2.3).

Measure    | F1-measure
M1         | 0.355
M2         | 0.100
M3         | 0.194
M1+M2      | 0.355
M2+M3      | 0.244
M1+M3      | 0.405
M1+M2+M3   | 0.512

4. CONCLUSION

For effective entity linking, mention detection in tweets is important. We improve the accuracy of detecting mentions by combining various Twitter POS taggers. We resolve multiple mentions, abbreviations and spelling variations of a named entity using the Google Cross-Wiki Dictionary, and we also use the popularity of an entity on Twitter to improve the disambiguation. Our system performed well, with an F1 score of 0.512 on the given dataset.

5. REFERENCES

[1] A. E. Cano Basave, G. Rizzo, A. Varga, M. Rowe, M. Stankovic, and A.-S. Dadzie. Making Sense of Microposts (#Microposts2014) Named Entity Extraction & Linking Challenge. In Proc. of the 4th Workshop on Making Sense of Microposts (#Microposts2014), pages 54–60, 2014.
[2] K. Gimpel, N. Schneider, B. O'Connor, D. Das, D. Mills, J. Eisenstein, M. Heilman, D. Yogatama, J. Flanigan, and N. A. Smith. Part-of-speech Tagging for Twitter: Annotation, Features, and Experiments. In Proc. of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2 (ACL-HLT), pages 42–47, 2011.
[3] S. Guo, M.-W. Chang, and E. Kıcıman. To Link or Not to Link? A Study on End-to-End Tweet Entity Linking. In Proc. of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pages 1020–1030, 2013.
[4] X. Liu, Y. Li, H. Wu, M. Zhou, F. Wei, and Y. Lu. Entity Linking for Tweets. In Proc. of the 51st Annual Meeting of the Association for Computational Linguistics (ACL), pages 1304–1311, 2013.
[5] A. Ritter, S. Clark, Mausam, and O. Etzioni. Named Entity Recognition in Tweets: An Experimental Study. In Proc. of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2011.
[6] V. I. Spitkovsky and A. X. Chang. A Cross-Lingual Dictionary for English Wikipedia Concepts. In Proc. of the 8th Intl. Conf. on Language Resources and Evaluation (LREC), 2012.
[7] Q. Wu, C. J. Burges, K. M. Svore, and J. Gao. Adapting Boosting for Information Retrieval Measures. Information Retrieval, 13(3):254–270, June 2010.

¹ https://www.mediawiki.org/wiki/API:Search
² https://dev.twitter.com/docs/api/1.1/get/search/tweets
³ Submitted as Agglutweet_1.tsv.
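As an end-to-end illustration of the per-measure scoring in §2.2, the sketch below implements M1-style reciprocal-rank scores, the token-set Jaccard similarity that M2 uses as its ranking criterion, and a plain weighted sum standing in for the LambdaMART [7] combination. All function names, weights, and the example candidates are hypothetical; this is not the system's actual code.

```python
def reciprocal_rank_scores(ranked_candidates):
    """M1-style scoring: the i-th candidate (1-based) gets score 1/i."""
    return {cand: 1.0 / rank
            for rank, cand in enumerate(ranked_candidates, start=1)}

def jaccard(a, b):
    """M2-style ranking criterion: Jaccard similarity between token sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def combine(score_maps, weights):
    """Stand-in for the learned LambdaMART combination: a weighted sum
    of each measure's scores, returning the top-scoring candidate."""
    combined = {}
    for scores, weight in zip(score_maps, weights):
        for cand, score in scores.items():
            combined[cand] = combined.get(cand, 0.0) + weight * score
    return max(combined, key=combined.get) if combined else None
```

For instance, if M1 ranks the candidates `["Apple_Inc.", "Apple"]` they receive scores 1.0 and 0.5; combining that with a second score map under weights 0.6 and 0.4 picks the candidate with the highest weighted total. In the real system this combination is learned from the training split rather than fixed by hand.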