=Paper=
{{Paper
|id=None
|storemode=property
|title=What Makes a Tweet Relevant for a Topic?
|pdfUrl=https://ceur-ws.org/Vol-838/paper_08.pdf
|volume=Vol-838
|dblpUrl=https://dblp.org/rec/conf/msm/TaoAHH12
}}
==What Makes a Tweet Relevant for a Topic?==
Ke Tao, Fabian Abel, Claudia Hauff, Geert-Jan Houben
Web Information Systems, TU Delft
PO Box 5031, 2600 GA Delft, the Netherlands
{k.tao, f.abel, c.hauff, g.j.p.m.houben}@tudelft.nl
ABSTRACT
Users who rely on microblogging search (MS) engines to find relevant microposts for their queries usually follow their interests and rationale when deciding whether a retrieved post is of interest to them or not. While today's MS engines commonly rely on keyword-based retrieval strategies, we investigate if there exist additional micropost characteristics that are more predictive of a post's relevance and interestingness than its keyword-based similarity with the query. In this paper, we experiment with a corpus of Twitter messages and investigate sixteen features along two dimensions: topic-dependent and topic-independent features. Our in-depth analysis compares the importance of the different types of features and reveals that semantic features, and therefore an understanding of the semantic meaning of the tweets, play a major role in determining the relevance of a tweet with respect to a query. We evaluate our findings in a relevance classification experiment and show that by combining different features, we can achieve a precision and recall of more than 35% and 45% respectively.

1. INTRODUCTION
Microblogging services such as Twitter (http://twitter.com/) or Sina Weibo (http://www.weibo.com/) have become a valuable source of information, particularly for exploring, monitoring and discussing news-related information [7]. Searching for relevant information in such services is challenging, as the number of posts published per day can exceed several hundred million (http://blog.twitter.com/2011/06/200-million-tweets-per-day.html).
Moreover, users who search for microposts about a certain topic typically perform a keyword search. Teevan et al. [11] found that keyword queries on Twitter are significantly shorter than those issued for Web search: on Twitter people typically use 1.64 words (or 12.0 characters) to search, while on the Web they use, on average, 3.08 words (or 18.8 characters). This can be explained by the length of Twitter messages, which is limited to 140 characters, so that long queries easily become too restrictive. Short queries, on the other hand, may result in a large (or too large) number of matching microposts.
For these reasons, building search algorithms that are capable of identifying interesting and relevant microposts for a given topic is a non-trivial and crucial research challenge. In order to take a first step towards solving this challenge, in this paper we present an analysis of the following question: is a keyword-based retrieval strategy sufficient, or can we identify features that are more predictive of a tweet's relevance and interestingness? To investigate this question, we took advantage of last year's TREC (http://trec.nist.gov/) 2011 Microblog Track (http://sites.google.com/site/trecmicroblogtrack/), where for the first time an openly accessible search & retrieval Twitter data set with about 16 million tweets was published.
In the context of TREC, the ad-hoc search task on Twitter is defined as follows: given a topic (identified by a title) and a point in time pt, retrieve all interesting and relevant microposts from the corpus that were posted no later than pt. A subset of the tweets that were retrieved by the research groups participating in the benchmark were then judged by human assessors as either relevant to the topic or as non-relevant. For example, "Obama birth certificate" is one of the topics that is part of the TREC corpus. Given the temporal context, one can infer that this topic title refers to the discussions about Barack Obama's birth certificate: people were questioning whether Barack Obama was truly born in the United States.
We rely on the judged tweets for our analysis and investigate topic-dependent as well as topic-independent features. Examples of topic-dependent features are the retrieval score derived from retrieval strategies that are based on document and corpus statistics, as well as the semantic overlap score, which determines the extent of overlap between the semantic meaning of a search topic and a tweet. In addition to these topic-dependent features, we also studied a number of topic-independent features: syntactical features (such as the presence of URLs or hashtags in a tweet), semantic features (such as the diversity of the semantic concepts mentioned in a tweet) and social context features (such as the authority of the user who published the tweet).
The main contributions of our work can be summarized as follows:
• We present a set of strategies for the extraction of features from Twitter messages that allow us to predict the relevance of a post for a given topic.
• Given a set of more than 38,000 tweets that were manually labeled as relevant or not relevant for a set of 49 topics, we analyze the features and characteristics of relevant and interesting tweets.
• We evaluate the effectiveness of the different features for predicting the relevance of tweets for a topic and investigate the impact of the different features on the quality of the relevance classification. We also study to what extent the success of the classification depends on the type of topics (e.g. topics of short-term vs. topics of long-term interest) for which relevant tweets should be identified.
2. RELATED WORK
Since its launch in 2006, Twitter has attracted a lot of attention, both in the general public as well as in the research community. Researchers started studying microblogging phenomena to find out what kind of information is discussed on Twitter [7], how trends evolve on Twitter [8], or how one detects influential users on Twitter [12]. Applications have been researched that utilize microblogging data to enrich traditional news media with information from Twitter [6], to detect and manage emergency situations such as earthquakes [10], or to enhance search and ranking of Web sites which possibly have not been indexed yet by Web search engines.
So far, search on Twitter or other microblogging platforms such as Sina Weibo has not been studied extensively. Teevan et al. [11] compared the search behavior on Twitter with traditional Web search behavior. It was found that keyword queries that people issue to retrieve information from Twitter are, on average, significantly shorter than queries submitted to traditional Web search engines (1.64 words vs. 3.08 words). This finding indicates that there is a demand to investigate new algorithms and strategies for retrieving relevant information from microblogging streams.
Bernstein et al. [2] proposed an interface that allows for exploring tweets by means of tag clouds. However, their interface is targeted towards browsing the tweets that have been published by the people whom a user is following, not towards searching the entire Twitter corpus. Jadhav et al. [6] developed an engine that enriches the semantics of Twitter messages and allows for issuing SPARQL queries on Twitter streams. In previous work, we followed such a semantic enrichment strategy to provide faceted search capabilities on Twitter [1]. Duan et al. [5] investigated features such as Okapi BM25 relevance scores or Twitter-specific features (length of a tweet, presence or absence of a URL or hashtag, etc.) in combination with RankSVM to learn a ranking model for tweets (learning to rank). In an empirical study, they found that the length of a tweet and information about the presence of a URL in a tweet are important features to rank relevant tweets. In this paper, we re-visit some of the features proposed by Duan et al. [5] and introduce novel semantic measures that allow us to estimate whether a micropost is relevant to a given topic or not.

3. FEATURES OF MICROPOSTS
In this section, we provide an overview of the different features that we analyze to estimate the relevance of a Twitter message to a given topic. We present topic-sensitive features that measure the relevance with respect to the topic (keyword-based and semantic-based relevance) and topic-insensitive measures that do not consider the actual topic but solely exploit syntactical or semantic tweet characteristics. Finally, we also consider contextual features that, for example, characterize the creator of a tweet.

3.1 Keyword-based Relevance Features
keyword-based relevance score (Indri-based query relevance): To calculate the retrieval score for a pair of (topic, tweet), we employ the language modeling approach to information retrieval [13]. A language model θ_t is derived for each document (tweet). Given a query Q with terms Q = {q_1, ..., q_n}, the document language models are ranked with respect to the probability P(θ_t|Q), which according to Bayes' theorem can be expressed as:

  P(θ_t|Q) = P(Q|θ_t) P(θ_t) / P(Q)                        (1)
           ∝ P(θ_t) ∏_{q_i ∈ Q} P(q_i|θ_t).                (2)

This is the standard query-likelihood language modeling setup, which assumes term independence. Usually, the prior probability of a tweet P(θ_t) is considered to be uniform, that is, each tweet in the corpus is equally likely. The language models are multinomial probability distributions over the terms occurring in the tweets. Since a maximum likelihood estimate of P(q_i|θ_t) would result in a zero probability for any tweet that misses one or more of the query terms in Q, the estimate is usually smoothed with a background language model generated over all tweets in the corpus. We employed Dirichlet smoothing [13]:

  P(q_i|θ_t) = (c(q_i, t) + µ P(q_i|θ_C)) / (|t| + µ).     (3)

Here, µ is the smoothing parameter, c(q_i, t) is the count of term q_i in t, and |t| is the length of the tweet. The probability P(q_i|θ_C) is the maximum likelihood probability of term q_i occurring in the collection language model θ_C (derived by concatenating all tweets in the corpus).
Due to the very small probabilities of P(Q|θ_t), we utilize log(P(Q|θ_t)) as feature scores. Note that this score is always negative. The greater the score (that is, the less negative), the more relevant the tweet is to the query.
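To make the scoring concrete, here is a minimal Python sketch of the Dirichlet-smoothed log query likelihood of Equations (1)-(3). The helper names and the default µ = 2500 (a common choice in the smoothing literature) are ours; the paper itself relies on an Indri-based implementation.

```python
import math
from collections import Counter

def dirichlet_score(query_terms, tweet_terms, collection_tf, collection_len, mu=2500):
    """Log query likelihood log P(Q|theta_t) with Dirichlet smoothing (Eq. 3).

    query_terms / tweet_terms: lists of tokens; collection_tf: Counter of
    term frequencies over the whole corpus; collection_len: total number
    of tokens in the corpus; mu: Dirichlet smoothing parameter.
    """
    tweet_tf = Counter(tweet_terms)
    score = 0.0
    for q in query_terms:
        p_q_collection = collection_tf[q] / collection_len  # P(q|theta_C)
        if p_q_collection == 0:
            return float("-inf")  # term unseen in the corpus: zero probability
        p_q_tweet = (tweet_tf[q] + mu * p_q_collection) / (len(tweet_terms) + mu)
        score += math.log(p_q_tweet)  # sum of logs = log of the product in Eq. 2
    return score  # always negative; closer to 0 means more relevant

```

Tweets are then ranked for a query by descending score.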
3.2 Semantic-based Relevance Features
semantic-based relevance score: This feature is also a retrieval score, calculated according to Section 3.1, though with a different set of queries. Since the average length of search queries submitted to microblog search engines is lower than in traditional Web search, it is necessary to understand the information need behind the query. The search topics provided as part of the TREC data set contain abbreviations, parts of names, and nicknames. One example (cf. Table 1) is the first name "Jintao" (in the query: "Jintao visit US"), which refers to the President of the People's Republic of China. However, in tweets he is also referred to as "President Hu", "Chinese President", etc. If these semantic variants of a person's name and titles were considered when deriving an expanded query, a wider variety of potentially relevant tweets could be found. We utilize the well-known Named Entity Recognition (NER) service DBpedia Spotlight (http://spotlight.dbpedia.org/) to identify names and their synonyms in the original query.
Query: "Jintao visits US"
Entity | Annotated Text | Possible Concepts
Hu Jintao | Jintao | Hu, Jintao, Hu Jintao
Table 1: Example of entity recognition and possible concepts in the query

We merge the found concepts into an expanded query, which is then used as input to the retrieval approach described earlier.

isSemanticallyRelated: This is a boolean value that indicates whether there is a semantic overlap between the topic and the tweet. This requires us to employ DBpedia Spotlight on the topic as well as on the tweets. If there is an overlap in the identified DBpedia concepts, the value of this feature is true, otherwise it is false.
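Both semantic features can be derived from the same annotation call. The sketch below targets the public DBpedia Spotlight REST endpoint; the endpoint URL, the confidence threshold, and the way surface variants are derived from resource URIs are assumptions on our part, while the JSON keys follow Spotlight's documented response format.

```python
import requests

# Public endpoint; the paper used the service hosted at spotlight.dbpedia.org.
SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/en/annotate"

def annotate_entities(text, confidence=0.2):
    """Return the set of DBpedia resource URIs Spotlight finds in `text`."""
    resp = requests.get(
        SPOTLIGHT_URL,
        params={"text": text, "confidence": confidence},
        headers={"Accept": "application/json"},
    )
    resp.raise_for_status()
    resources = resp.json().get("Resources", [])  # absent when nothing is found
    return {r["@URI"] for r in resources}

def expand_query(topic_title):
    """Merge the topic title with name variants of the detected concepts,
    e.g. 'Jintao visit US' additionally yields 'Hu Jintao' via db:Hu_Jintao."""
    variants = {uri.rsplit("/", 1)[-1].replace("_", " ")
                for uri in annotate_entities(topic_title)}
    return topic_title + " " + " ".join(variants)

def is_semantically_related(topic_title, tweet_text):
    """Boolean overlap between the topic's and the tweet's DBpedia concepts."""
    return len(annotate_entities(topic_title) & annotate_entities(tweet_text)) > 0
```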
3.3 Syntactical Features
Syntactical features describe elements that are mentioned in a Twitter message. We analyze the following properties:

hasHashtag: This is a boolean property which indicates whether a given tweet contains at least one hashtag or not. Twitter users typically apply hashtags in order to facilitate the retrieval of the tweet. For example, by using a hashtag people can join a discussion on a topic that is represented via that hashtag. Users who monitor the hashtag will retrieve all tweets that contain it. Teevan et al. [11] showed that such monitoring behavior is a common practice on Twitter to retrieve relevant Twitter messages. Therefore, we investigate whether the occurrence of hashtags (possibly without any obvious relevance to the topic) is an indicator for the relevance and interestingness of a tweet.
Hypothesis H1: tweets that contain hashtags are more likely to be relevant than tweets that do not contain hashtags.

hasURL: Dong et al. [4] showed that people often exchange URLs via Twitter, so that information about trending URLs can be exploited to improve Web search and particularly the ranking of recently discussed URLs. Therefore, the presence of a URL (boolean property) can be an indicator for the relevance of a tweet.
Hypothesis H2: tweets that contain a URL are more likely to be relevant than tweets that do not contain a URL.

isReply: On Twitter, users can reply to the tweets of other people. This type of communication can, for example, be used to comment on a certain message, to answer a question or to chat with other people. Chen et al. [3] studied the characteristics of reply chains and discovered that one can distinguish between users who are merely interested in news-related information and users who are also interested in social chatter. For deciding whether a tweet is relevant for a news-related topic, we therefore assume that the boolean isReply feature, which indicates whether a tweet is a reply to another tweet, can be a valuable signal.
Hypothesis H3: tweets that are formulated as a reply to another tweet are less likely to be relevant than other tweets.

length: The length of a tweet, measured in the number of characters, may also be an indicator for the relevance or interestingness. We hypothesize that the length of a Twitter message correlates with the amount of information that is conveyed in the message.
Hypothesis H4: the longer a tweet, the more likely it is to be relevant and interesting.

The values of boolean properties are set to 0 (false) and 1 (true), while the length of a Twitter message is measured by the number of characters divided by 140, which is the maximum length of a Twitter message.
There are further syntactical features that could be explored, such as the mentioning of certain character sequences including emoticons, question marks, exclamation marks, etc. In line with the isReply feature, one could also utilize knowledge about the re-tweet history of a tweet, e.g. a boolean property that indicates whether the tweet is a copy of another tweet or a numeric property that counts the number of users who re-tweeted the message. However, in this paper we are merely interested in original messages that have not been re-tweeted yet (this is in line with the relevance judgments provided by TREC, which did not consider re-tweeted messages) and therefore also merely in features which do not require any knowledge about the history of a tweet. This allows us to estimate the relevance of a message as soon as it is published.
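The syntactical features reduce to simple predicates over the raw message. A possible extraction routine (the tweet field names are illustrative, loosely following the Twitter API):

```python
import re

def syntactical_features(tweet):
    """Boolean features coded as 0/1 and the length normalized by the
    140-character limit, as described above."""
    text = tweet["text"]
    return {
        "hasHashtag": 1 if re.search(r"#\w+", text) else 0,
        "hasURL": 1 if re.search(r"https?://\S+", text) else 0,
        "isReply": 1 if tweet.get("in_reply_to_status_id") else 0,
        "length": min(len(text), 140) / 140.0,
    }
```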
3.4 Semantic Features
In addition to the semantic relevance scores described in Section 3.2, one can also analyze the semantics of a Twitter message independently from the topic of interest. We therefore utilize again the DBpedia entity extraction provided by DBpedia Spotlight to extract the following features:

#entities: The number of DBpedia entities that are mentioned in a Twitter message may give further evidence about the potential relevance and interestingness of a tweet. We assume that the more entities can be extracted from a tweet, the more information it contains and the more valuable it is. For example, in the context of the discussion about birth certificates we find the following two tweets in our dataset:
t1: "Despite what her birth certificate says, my lady is actually only 27"
t2: "Hawaii (Democratic) lawmakers want release of Obama's birth certificate"
When reading the two tweets without having a particular topic or information need in mind, it seems that t2 has a higher likelihood to be relevant for some topic for the majority of the Twitter users than t1, as it conveys more entities that are known to the public and available on Wikipedia and DBpedia respectively. In fact, the entity extractor is able to detect one entity, db:Birth_certificate, for tweet t1, while it detects three additional entities for t2: db:Hawaii, db:Legislator and db:Barack_Obama.
Hypothesis H5: the more entities a tweet mentions, the more likely it is to be relevant and interesting.

#entities(type): Similarly to counting the number of entities that occur in a Twitter message, we also count the number of entities of specific types. The rationale behind this feature is that some types of entities might be a stronger indicator for relevance than others. The importance of a specific entity type may also depend on the topic.
For example, when searching for Twitter messages that report about wild fires in a specific area, location-related entities may be more interesting than product-related entities. In this paper, we count the number of entity occurrences in a Twitter message for five different types: locations, persons, organizations, artifacts and species (plants and animals).
Hypothesis H6: different types of entities are of different importance for estimating the relevance of a tweet.

diversity: The diversity of semantic concepts mentioned in a Twitter message can also be exploited as an indicator for the potential relevance and interestingness of a tweet. We therefore count the number of distinct types of entities that are mentioned in a Twitter message. For example, for the two tweets t1 and t2 mentioned earlier, the diversity score would be 1 and 4 respectively, as for t1 only one type of entity is detected (yago:PersonalDocuments), while for t2 also instances of db:Person (person), db:Place (location) and owl:Thing (the role db:Legislator is not further classified) are detected.
Hypothesis H7: the greater the diversity of concepts mentioned in a tweet, the more likely it is to be interesting and relevant.

sentiment: Naveed et al. [9] showed that tweets which contain negative emoticons are more likely to be re-tweeted than tweets which feature positive emoticons. The sentiment of a tweet may thus impact the perceived relevance of a tweet. Therefore, we classify the sentiment polarity of a tweet into positive, negative or neutral using Twitter Sentiment (http://twittersentiment.appspot.com/).
Hypothesis H8: the likelihood of a tweet's relevance is influenced by its sentiment polarity.
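A sketch of the entity-based features, assuming the detected Spotlight entities have already been mapped to the five type labels used in this paper (that mapping, and the helper's input format, are our assumptions):

```python
TYPES = ("person", "organization", "location", "artifact", "species")

def semantic_features(entity_types):
    """Entity-based features for one tweet.

    entity_types: list with the (already mapped) type label of each detected
    DBpedia entity, e.g. ['person', 'location'] for tweet t2's
    db:Barack_Obama and db:Hawaii.
    """
    features = {"#entities": len(entity_types)}
    for t in TYPES:
        features[f"#entities({t})"] = entity_types.count(t)
    # diversity = number of distinct types among the five considered ones
    features["diversity"] = len(set(entity_types) & set(TYPES))
    return features
```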
3.5 Contextual Features
In addition to the aforementioned features, which describe characteristics of the Twitter messages themselves, we also investigate features that describe the context in which a tweet was published. In our analysis, we investigate the social and temporal context:

social context: The social context describes the creator of a Twitter message. Different characteristics of the message creator may increase or decrease the likelihood of her tweets being relevant and interesting, such as the number of followers or the number of tweets from this user that have been re-tweeted. In this paper, we apply a light-weight measure to characterize the creator of a message: we count the number of tweets which the user has published.
Hypothesis H9: the higher the number of tweets that have been published by the creator of a tweet, the more likely it is that the tweet is relevant.

temporal context: The temporal context describes when a tweet was published. The creation time can be specified with respect to the time when a user is requesting tweets about a certain topic (query time), or it can be independent of the query time. For example, one could specify at which hour during the day the tweet was published or whether it was created during the weekend. In our analysis, we utilize the temporal distance (in seconds) between the query time and the creation time of the tweet.
Hypothesis H10: the lower the temporal distance between the query time and the creation time of a tweet, the more likely the tweet is relevant to the topic.

Contextual features may also refer to characteristics of Web pages that are linked from a Twitter message. For example, one could exploit the PageRank scores of the referenced Web sites to estimate the relevance of a tweet, or one could categorize the linked Web pages to discover the types of Web sites that usually attract attention on Twitter. We leave the investigation of such additional contextual features for future work.
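Both contextual features are direct lookups on tweet metadata; a minimal sketch with illustrative field names:

```python
def contextual_features(tweet, query_time):
    """Social and temporal context of a tweet.

    tweet carries the author's total tweet count and the creation timestamp
    (a datetime object here); field names are illustrative.
    """
    return {
        "social_context": tweet["user_statuses_count"],  # #tweets by creator
        "temporal_context": (query_time - tweet["created_at"]).total_seconds(),
    }
```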
4. FEATURE ANALYSIS
In this section, we describe and characterize the Twitter corpus with respect to the features that we presented in the previous section.

4.1 Dataset Characteristics
We use the Twitter corpus which was used in the microblog track of TREC 2011 (http://trec.nist.gov/data/tweets/). The original corpus consists of approximately 16 million tweets, posted over a period of two weeks (January 24 until February 8, inclusive). We utilized an existing language detection library (http://code.google.com/p/language-detection/) to identify English tweets and found that 4,766,901 tweets were classified as English. Employing NER on the English tweets resulted in a total of over six million named entities, among which we found approximately 0.14 million distinct entities. Besides the tweets, 49 topics were given as the targets of retrieval. TREC assessors judged the relevance of 40,855 topic-tweet pairs, which we use as ground truth in our experiments. 2,825 tweets were judged as relevant for a given topic, while the majority of the tweet-topic pairs (37,349) were marked as non-relevant.
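The filtering step above used the Java language-detection library; an equivalent sketch with `langdetect`, a Python port of that library:

```python
from langdetect import detect  # pip install langdetect
from langdetect.lang_detect_exception import LangDetectException

def english_tweets(tweets):
    """Yield only tweets classified as English, mirroring the paper's filtering."""
    for tweet in tweets:
        try:
            if detect(tweet["text"]) == "en":
                yield tweet
        except LangDetectException:
            pass  # empty or undecidable text is skipped
```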
Category | Feature | Relevant | Std. dev. | Non-relevant | Std. dev.
keyword relevance | keyword-based | -10.709 | 3.5860 | -14.408 | 2.6442
semantic relevance | semantic-based | -10.308 | 3.7363 | -14.264 | 3.1872
semantic relevance | isSemanticallyRelated | 25.3% | 43.5% | 4.6% | 22.6%
syntactical | hasHashtag | 19.1% | 39.2% | 19.3% | 39.9%
syntactical | hasURL | 81.9% | 38.5% | 54.1% | 49.5%
syntactical | isReply | 3.4% | 18.0% | 14.2% | 34.5%
syntactical | length (in characters) | 90.323 | 30.81 | 87.797 | 36.17
semantics | #entities | 2.367 | 1.605 | 1.880 | 1.777
semantics | #entities(person) | 0.276 | 0.566 | 0.188 | 0.491
semantics | #entities(organization) | 0.316 | 0.589 | 0.181 | 0.573
semantics | #entities(location) | 0.177 | 0.484 | 0.116 | 0.444
semantics | #entities(artifact) | 0.188 | 0.471 | 0.245 | 0.609
semantics | #entities(species) | 0.005 | 0.094 | 0.012 | 0.070
semantics | diversity | 0.795 | 0.788 | 0.597 | 0.802
semantics | sentiment (-1=neg, 1=pos) | -0.025 | 0.269 | 0.042 | 0.395
contextual | social context (#tweets by creator) | 12.287 | 19.069 | 12.226 | 20.027
contextual | temporal context (time distance in days) | 4.85 | 4.48 | 3.98 | 5.09
Table 2: The comparison of features between relevant tweets and non-relevant tweets (mean values, or percentages of true instances for boolean features, with standard deviations)
4.2 Feature Characteristics
In Table 2 we list the average values and standard deviations of the features, and the percentages of true instances for the boolean features respectively. It shows that relevant and non-relevant tweets exhibit, on average, different characteristics for several features.
As expected, the average keyword-based relevance score of tweets which are judged as relevant for a given topic is much higher than the one for non-relevant tweets: -10.709 in comparison to -14.408 (the higher the value the better, see Section 3.1). Similarly, the semantic-based relevance score, which exploits the semantic concepts mentioned in the tweets (see Section 3.2) while calculating the retrieval rankings, shows the same characteristic. The isSemanticallyRelated feature, which is a binary measure of the overlap between the semantic concepts mentioned in the query and the respective tweets, is also higher for relevant tweets than for non-relevant tweets. Hence, when we consider the topic-dependent features (keyword-based and semantic-based), we find first indicators that the hypotheses behind these features hold.
For the syntactical features we observe that, regardless of whether the tweets are relevant to a topic or not, the ratio of tweets that contain hashtags is almost the same (about 19%). Hence, it seems that the presence of a hashtag is not necessarily an indicator for relevance. However, the presence of a URL is potentially a very good indicator: 81.9% of the relevant tweets feature a URL, whereas only 54.1% of the non-relevant tweets contain a URL. A possible explanation for this difference is that tweets containing URLs tend to also feature an attractive short title, especially for breaking news, in order to attract people to follow the link. Moreover, the actual content of the linked Web site may also influence users when assessing the relevance of a tweet.
In Hypothesis H3 (see Section 3.3), we speculate that messages which are replies to other tweets are less likely to be relevant than other tweets. The results listed in Table 2 support this hypothesis: only 3.4% of the relevant tweets are replies, in contrast to 14.2% of the non-relevant tweets. The length of the tweets that are judged as relevant is, on average, 90.3 characters, which is slightly longer than for the non-relevant ones (87.8 characters).
The comparison of the topic-independent semantic features also reveals some differences between relevant and non-relevant tweets. Overall, relevant tweets contain more entities (2.4) than non-relevant tweets (1.9). Among the five most frequently mentioned types of entities, persons, organizations, and locations occur more often in relevant tweets than in non-relevant ones. On average, messages are therefore considered as more likely to be relevant or interesting for users if they contain information about people, involved organizations, or places. Artifacts (e.g. tangible things, software) and species (e.g. plants, animals) are more frequent in non-relevant tweets. However, counting the number of entities of type species seems to be a less promising feature, since the fraction of tweets which mention a species is fairly low.
The diversity of content mentioned in a Twitter message, i.e. the number of distinct types (only person, organization, location, artifact, and species are considered), is potentially a good feature: the semantic diversity is higher for the relevant tweets (0.8) than for the non-relevant ones (0.6). In addition to the entities that are mentioned in the tweets, we also conducted a sentiment analysis of the tweets (see Section 3.4). Although most of the tweets are neutral (sentiment score = 0), the average sentiment score for relevant tweets is negative (-0.025). This observation is in line with the finding made by Naveed et al. [9], who found that negative tweets are more likely to be re-tweeted.
Finally, we also attempted to determine the relationship between a tweet's likelihood of relevance and its context. With respect to the social context, however, we do not observe a significant difference between relevant and non-relevant tweets: users who publish relevant tweets are, on average, not more active than publishers of non-relevant tweets (12.3 vs. 12.2). For the temporal context, the average distance between the time when a user requests tweets about a topic and the creation time of tweets is 4.85 days for relevant tweets and 3.98 days for non-relevant tweets. However, the standard deviations of these scores are, at 4.53 days (relevant) and 4.39 days (non-relevant), fairly high. This indicates that the temporal context is not a reliable feature for our dataset. Preliminary experiments indeed confirmed the low utility of the temporal feature. However, this observation seems to be strongly influenced by the TREC dataset itself, which was collected within a short period of time (two weeks). In our evaluations, we therefore do not consider the temporal context and leave an analysis of the temporal features for future work.
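The comparison in Table 2 is a plain grouped aggregation. Assuming the judged (topic, tweet) pairs sit in a pandas DataFrame with one numeric column per feature (boolean features coded as 0/1) and a 0/1 `relevant` column (our assumed layout), it could be reproduced as follows:

```python
import pandas as pd

def feature_comparison(df: pd.DataFrame) -> pd.DataFrame:
    """Mean and standard deviation of every feature column, split by the
    relevance judgment, i.e. the aggregation behind Table 2."""
    numeric = df.drop(columns=["topic_id", "tweet_id"], errors="ignore")
    return numeric.groupby("relevant").agg(["mean", "std"]).T
```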
5. EVALUATION OF FEATURES FOR RELEVANCE PREDICTION
Having analyzed the dataset and the proposed features, we now evaluate the quality of the features for predicting the relevance of tweets for a given topic. We first outline the experimental setup before we present our results and analyze the influence of the different features on the performance for the different types of topics.

5.1 Experimental Setup
We employ logistic regression to classify tweets as relevant or non-relevant to a given topic. Due to the small size of the topic set (49 topics), we use 5-fold cross-validation to evaluate the learned classification models. For the final setup, 16 features were used as predictor variables (all features listed in Table 2 except for the temporal context). To conduct our experiments, we rely on the machine learning toolkit Weka (http://www.cs.waikato.ac.nz/ml/weka/). As the number of relevant tweets is considerably smaller than the number of non-relevant tweets, we employed a cost-sensitive classification setup to prevent the classifier from simply marking all tweets as non-relevant (the majority class). As the estimation for the negative class achieves a precision and recall both over 90%, we focus on the precision and recall of the relevance classification (the positive class) in our evaluation, as we aim to investigate the characteristics that make tweets relevant to a given topic.
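The paper runs this setup in Weka. An analogous sketch with scikit-learn is shown below, where class_weight='balanced' stands in for Weka's cost-sensitive wrapper; note that the folds here are drawn over instances, whereas assigning folds per topic would mirror the paper more closely.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# X: feature matrix (the 16 predictor variables), y: 1 = relevant, 0 = non-relevant.
def evaluate(X, y):
    # 'balanced' reweights errors on the rare positive class so the model
    # cannot win by predicting non-relevant for everything
    clf = LogisticRegression(class_weight="balanced", max_iter=1000)
    scores = cross_validate(clf, X, y, cv=5,
                            scoring=["precision", "recall", "f1"])
    return {m: scores[f"test_{m}"].mean()
            for m in ("precision", "recall", "f1")}
```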
Features | Precision | Recall | F-Measure
keyword relevance | 0.3040 | 0.2924 | 0.2981
semantic relevance | 0.3053 | 0.2931 | 0.2991
topic-sensitive | 0.3017 | 0.3419 | 0.3206
topic-insensitive | 0.1294 | 0.0170 | 0.0300
without semantics | 0.3363 | 0.4828 | 0.3965
all features | 0.3674 | 0.4736 | 0.4138
Table 3: Performance results of relevance predictions for different sets of features.

Feature Category | Feature | Coefficient
keyword-based | keyword-based | 0.1701
semantic-based | semantic-based | 0.1046
semantic-based | isSemanticallyRelated | 0.9177
syntactical | hasHashtag | 0.0946
syntactical | hasURL | 1.2431
syntactical | isReply | -0.5662
syntactical | length | 0.0004
semantics | #entities | 0.0339
semantics | #entities(person) | -0.0725
semantics | #entities(organization) | -0.0890
semantics | #entities(location) | -0.0927
semantics | #entities(artifact) | -0.3404
semantics | #entities(species) | -0.5914
semantics | diversity | 0.2006
semantics | sentiment | -0.5220
contextual | social context | -0.0042
Table 4: The feature coefficients were determined across all topics. The total number of topics is 49. The three features with the highest absolute coefficient are underlined.
5.2 Influence of Features on Relevance Prediction
Table 3 shows the performance of estimating the relevance of tweets based on different sets of features. Learning the classification model solely based on the keyword-based or semantic-based relevance scoring features leads to an F-Measure of 0.2981 and 0.2991 respectively. There is thus no notable difference between the two topic-sensitive features. However, by combining both features (see topic-sensitive in Table 3), the F-Measure increases, which is caused by a higher recall, increasing from 0.29 to 0.34. It appears that the keyword-based and semantic-based relevance scores complement each other.
As expected, when solely learning the classification model based on the topic-independent features, i.e. without measuring the relevance to the given topic, the quality of the relevance prediction is poor. The best performance is achieved when all features are combined. A precision of 36.74% means that more than a third of all tweets that our approach classifies as relevant are indeed relevant, while the recall level (47.36%) implies that our approach discovers nearly half of all relevant tweets. Since microblog messages are very short, a significant number of tweets can be read quickly by a user when presented in response to her search request. In such a setting, we believe such a classification accuracy to be sufficient. Overall, the semantic features seem to play an important role, as they lead to a performance improvement with respect to the F-Measure from 0.3965 to 0.4138. We will now analyze the impact of the different features in detail.
One of the advantages of the logistic regression model is that it is easy to determine the most important features of the model by considering the absolute weights assigned to them. For this reason, we have listed the relevant-tweet prediction model coefficients for all employed features in Table 4. The features influencing the model the most are:
• hasURL: Since the feature coefficient is positive, the presence of a URL in a tweet is more indicative of relevance than non-relevance. This means that hypothesis H2 (Section 3.3) holds.
• isSemanticallyRelated: The overlap between the identified DBpedia concepts in the topics and the identified DBpedia concepts in the tweets is the second most important feature in this model. This is an interesting observation, especially in comparison to the keyword-based relevance score, which is only the ninth most important feature among the evaluated ones. It implies that a standard keyword-based retrieval approach, which performs well for longer documents, is less suitable for microposts.
• isReply: This feature, which is true (= 1) if a tweet is written in reply to a previously published tweet, has a negative coefficient, which means that tweets which are replies are less likely to be in the relevant class than tweets which are not replies, confirming hypothesis H3 (Section 3.3).
• sentiment: The coefficient of the sentiment feature is similarly negative, which suggests that a negative sentiment is more predictive of relevance than a positive sentiment, in line with our hypothesis H8 (Section 3.4).
We note that the keyword-based similarity, while being positively aligned with relevance, does not belong to the most important features in this model. It is superseded by syntactic as well as semantic-based features. When we consider the non-topical features only, we observe that interestingness (independent of a topic) is related to the potential amount of additional information (i.e. the presence of a URL), the clarity of the tweet overall (a tweet in reply may only be understandable in the context of the surrounding tweets) and the different aspects covered in the tweet (as evident in the diversity feature). It should also be pointed out that the negative coefficients assigned to most topic-insensitive entity count features (#entities(X)) are in line with the results in Table 2.
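Since logistic regression is a log-linear model, the coefficients in Table 4 translate directly into odds ratios, which makes the ranking above concrete:

```python
import math

# Coefficients taken from Table 4; e^coef is the factor by which a one-unit
# increase of the feature multiplies the odds of the tweet being relevant.
for feature, coef in [("hasURL", 1.2431), ("isSemanticallyRelated", 0.9177),
                      ("isReply", -0.5662), ("sentiment", -0.5220)]:
    print(f"{feature:>22s}: odds ratio = {math.exp(coef):.2f}")
# hasURL:                odds ratio = 3.47  (a URL more than triples the odds)
# isSemanticallyRelated: odds ratio = 2.50
# isReply:               odds ratio = 0.57  (replies nearly halve the odds)
# sentiment:             odds ratio = 0.59
```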
5.3 Influence of Topic Characteristics on Relevance Prediction
In all reported experiments so far, we have considered the entire set of topics available to us. In this section, we investigate to what extent certain topic characteristics play a role for relevance prediction and to what extent those differences lead to a change in the logistic regression models.
Consider the following two topics: Taco Bell filling lawsuit (MB020; topic identifiers correspond to the ones used in the official TREC dataset) and Egyptian protesters attack museum (MB010). While the former has a business theme and is likely to be mostly of interest to American users, the latter topic belongs to the politics category and can be considered as being of global interest, as the entire world was watching the events in Egypt unfold. Due to these differences we defined a number of topic splits. A manual annotator then decided for each split dimension into which category the topic should fall.
Performance Measure | popular | unpopular | global | local | persistent | occasional
#topics | 24 | 25 | 18 | 31 | 28 | 21
#samples | 19803 | 21052 | 16209 | 25646 | 22604 | 18251
precision | 0.3596 | 0.3579 | 0.3442 | 0.3726 | 0.3439 | 0.4072
recall | 0.4308 | 0.5344 | 0.4510 | 0.4884 | 0.4311 | 0.5330
F-measure | 0.3920 | 0.4287 | 0.3904 | 0.4227 | 0.3826 | 0.4617

Feature Category | Feature | popular | unpopular | global | local | persistent | occasional
keyword-based | keyword-based | 0.1018 | 0.2475 | 0.1873 | 0.1624 | 0.1531 | 0.1958
semantic-based | semantic-based | 0.1061 | 0.1312 | 0.1026 | 0.1028 | 0.0820 | 0.1560
semantic-based | isSemanticallyRelated | 1.1026 | 0.5546 | 0.9563 | 0.8617 | 0.8685 | 1.0908
syntactical | hasHashtag | 0.1111 | 0.0917 | 0.1166 | 0.0843 | 0.0801 | 0.1274
syntactical | hasURL | 1.3509 | 1.1706 | 1.2355 | 1.2676 | 1.3503 | 1.0556
syntactical | isReply | -0.5603 | -0.5958 | -0.6466 | -0.5162 | -0.4443 | -0.7643
syntactical | length | 0.0013 | -0.0007 | 0.0003 | 0.0004 | 0.0016 | -0.0020
semantics | #entities | 0.0572 | 0.0117 | 0.0620 | 0.0208 | 0.0478 | -0.0115
semantics | #entities(person) | -0.2613 | 0.0552 | -0.5400 | 0.0454 | 0.1088 | -0.3932
semantics | #entities(organization) | -0.0952 | -0.1767 | -0.2257 | -0.0409 | -0.1636 | -0.0297
semantics | #entities(location) | -0.1446 | 0.0136 | -0.1368 | -0.1056 | -0.0583 | -0.1305
semantics | #entities(artifact) | -0.3442 | -0.3725 | -0.4834 | -0.3086 | -0.2260 | -0.4835
semantics | #entities(species) | -0.2567 | -0.9599 | -0.8893 | -0.4792 | -0.1634 | -18.8129
semantics | diversity | 0.1940 | 0.2695 | 0.2776 | 0.1943 | 0.1071 | 0.3867
semantics | sentiment | -0.7968 | -0.1761 | -0.6297 | -0.4727 | -0.3227 | -0.7411
contextual | social context | -0.002 | -0.0068 | -0.0020 | -0.0057 | -0.0034 | -0.0055
Table 5: Influence comparison of different features among different topic partitions. Three splits are shown here: popular vs. unpopular topics, global vs. local topics, and persistent vs. occasional topics. While the performance measures are based on 5-fold cross-validation, the derived feature weights for the logistic regression model were determined across all topics of a split. The total number of topics is 49. For each topic split, the three features with the highest absolute coefficient are underlined. The extreme negative coefficient for #entities(species) in the occasional topic split is an artifact of the small training size: in none of the relevant tweets did this concept type occur.
We investigated four topic splits, three splits with two partitions each and one split with five partitions:
• Popular/unpopular: The topics were split into popular (interesting to many users) and unpopular (interesting to few users) topics. An example of a popular topic is 2022 FIFA soccer (MB002); in total we found 24 popular topics. In contrast, topic NIST computer security (MB005) was classified as unpopular (as one of 25 such topics).
• Global/local: In this split, we considered the interest for the topic across the globe. The already mentioned topic MB002 is of global interest, since soccer is a highly popular sport in many countries, whereas topic Cuomo budget cuts (MB019) is mostly of local interest to users living or working in New York, where Andrew Cuomo is the current governor. We found 18 topics to be of global and 31 topics to be of local interest.
• Persistent/occasional: This split is concerned with the interestingness of the topic over time. Some topics persist for a long time, such as MB002 (the FIFA World Cup will be played in 2022), whereas other topics are only of short-term interest, e.g. Keith Olbermann new job (MB030). We assigned 28 topics to the persistent and 21 topics to the occasional topic partition.
• Topic themes: The topics were classified as belonging to one of five themes: business, entertainment, sports, politics or technology. While MB002 is a sports topic, MB019 for instance is considered to be a political topic.
Our discussion of the results focuses on two aspects: (i) the difference between the models derived for each of the two partitions, and (ii) the difference between these models (denoted M_splitName) and the model derived over all topics (M_allTopics) in Table 4. The results for the three binary topic splits are shown in Table 5.
Popularity: A comparison of the most important features of M_popular and M_unpopular shows few differences, with the exception of a single feature: sentiment. While sentiment, and in particular a negative sentiment, is the third most important feature in M_popular, it is ranked eighth in M_unpopular. We hypothesize that unpopular topics are also partially unpopular because they do not evoke strong emotions in the users. A similar reasoning can be applied when considering the amount of relevant tweets discovered for both topic splits: while on average 67.3 tweets were found to be relevant for popular topics, only 49.9 tweets were found to be relevant for unpopular topics (the average number of relevant tweets across the entire topic set is 58.44).
Global vs. local: This split did not result in models that are significantly different from each other or from M_allTopics, indicating that, at least for our currently investigated features, a distinction between global and local topics is not useful.
Temporal persistence: The same conclusion can be drawn about the temporal persistence topic split; for both models the same features are of importance, which in turn are similar to those of M_allTopics. However, it is interesting to see that the performance (regarding all metrics) is clearly higher for the occasional (short-term) topics in comparison to the persistent (long-term) topics. For topics that have a short lifespan, recall and precision are notably higher than for the other types of topics.
Topic themes: The results of the topic split according to the theme of the topic are shown in Table 6. Three topics did not fit in one of the five categories. Since the topic set is split into five partitions, the size of some partitions is extremely small, making it difficult to reach conclusive results. We can, though, detect trends, such as the fact that relevant tweets for business topics are less likely to contain hashtags (negative coefficient), while the opposite holds for entertainment topics (positive coefficient).
Performance Measure | business | entertainment | sports | politics | technology
#topics | 6 | 12 | 5 | 21 | 2
#samples | 4503 | 9724 | 4669 | 17162 | 1811
precision | 0.4659 | 0.3691 | 0.1918 | 0.3433 | 0.5109
recall | 0.7904 | 0.5791 | 0.1045 | 0.4456 | 0.4653
F-measure | 0.5862 | 0.4508 | 0.1353 | 0.3878 | 0.4870

Feature Category | Feature | business | entertainment | sports | politics | technology
keyword-based | keyword-based | 0.2143 | 0.2069 | 0.1021 | 0.1728 | 0.2075
semantic-based | semantic-based | 0.2287 | 0.2246 | 0.0858 | 0.0456 | 0.0180
semantic-based | isSemanticallyRelated | 1.3821 | 0.4088 | 1.0253 | 1.0689 | 2.1150
syntactical | hasHashtag | -0.8488 | 0.5234 | 0.3752 | -0.0403 | -0.1503
syntactical | hasURL | 2.0960 | 1.1429 | 1.2785 | 1.2085 | 0.4452
syntactical | isReply | -0.2738 | -0.4784 | -0.6747 | -0.9130 | -0.3912
syntactical | length | 0.0044 | 0.0011 | 0.0050 | -0.0009 | 0.0013
semantics | #entities | -0.2473 | -0.1470 | 0.0853 | 0.0537 | 0.1011
semantics | #entities(person) | -1.2929 | -0.1161 | -0.4852 | 0.0177 | 0.1307
semantics | #entities(organization) | -0.0976 | 0.0865 | -0.4259 | -0.0673 | -0.7318
semantics | #entities(location) | -1.3932 | -0.9327 | 0.3655 | -0.1169 | 0.0875
semantics | #entities(artifact) | -0.4003 | -0.1235 | -1.0891 | -0.2663 | -0.3943
semantics | #entities(species) | 0.0241 | -19.1819 | -31.0063 | -0.5570 | -0.6187
semantics | diversity | 0.5277 | 0.4540 | 0.3209 | 0.2037 | 0.1431
semantics | sentiment | -1.0070 | -0.3477 | -1.0766 | -0.5663 | -0.2180
contextual | social context | -0.0067 | -0.0086 | -0.0047 | -0.0041 | -0.0155
Table 6: In line with Table 5, this table shows the influence comparison of different features when partitioning the topic set according to five broad topic themes.
The semantic similarity has a large impact on all themes but entertainment. Another interesting observation is that sentiment, and in particular negative sentiment, is a prominent feature in M_business and in M_politics, but less so in the other models.
Finally, we note that there are also some features which have no impact at all, independent of the topic split employed: the length of the tweet and the social context of the user posting the message. The observation that certain topic splits lead to models that emphasize certain features also offers a natural way forward: if we are able to determine for each topic in advance to which theme or topic characteristic it belongs, we can select the model that fits the topic best.
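That model-selection idea amounts to training one model per partition and routing each incoming topic to the model of its partition; a minimal sketch under the assumption that per-topic features and judgments are available through (hypothetical) helpers:

```python
from sklearn.linear_model import LogisticRegression

def train_split_models(topics, theme_of, X_of, y_of):
    """One logistic regression model per topic theme (M_business, M_sports, ...).

    theme_of: topic -> theme label assigned by the manual annotator;
    X_of / y_of: topic -> feature rows and relevance labels (illustrative).
    At query time, an incoming topic is routed to the model of its theme.
    """
    models = {}
    for theme in set(theme_of[t] for t in topics):
        members = [t for t in topics if theme_of[t] == theme]
        X = [row for t in members for row in X_of(t)]
        y = [label for t in members for label in y_of(t)]
        models[theme] = LogisticRegression(class_weight="balanced",
                                           max_iter=1000).fit(X, y)
    return models
```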
6. CONCLUSIONS
In this paper, we have analyzed features that can be used as indicators of a tweet's relevance and interestingness to a given topic. To achieve this, we investigated features along two dimensions: topic-dependent features and topic-independent features. We evaluated the utility of these features with a machine learning approach that allowed us to gain insights into the importance of the different features for the relevance classification.
Our main discoveries about the factors that lead to relevant tweets are the following: (i) The learned models which take advantage of semantics and topic-sensitive features outperform those which do not take the semantics and topic-sensitive features into account. (ii) The length of tweets and the social context of the user posting the message have little impact on the prediction. (iii) The importance of a feature differs depending on the characteristics of the topics. For example, the sentiment-based feature is more important for popular than for unpopular topics, and the semantic similarity does not have a significant impact on entertainment topics.
The work presented here is beneficial for search & retrieval of microblogging data and contributes to the foundations of engineering search engines for microposts. In the future, we plan to investigate the social and the contextual features in depth. Moreover, we would like to investigate to what extent personal interests of the users (possibly aggregated from different Social Web platforms) can be utilized as features for personalized retrieval of microposts.

7. REFERENCES
[1] F. Abel, I. Celik, and P. Siehndel. Leveraging the Semantics of Tweets for Adaptive Faceted Search on Twitter. In ISWC '11, Springer, 2011.
[2] M. S. Bernstein, B. Suh, L. Hong, J. Chen, S. Kairam, and E. H. Chi. Eddi: Interactive Topic-based Browsing of Social Status Streams. In UIST '10, ACM, 2010.
[3] J. Chen, R. Nairn, and E. H. Chi. Speak Little and Well: Recommending Conversations in Online Social Streams. In CHI '11, ACM, 2011.
[4] A. Dong, R. Zhang, P. Kolari, J. Bai, F. Diaz, Y. Chang, Z. Zheng, and H. Zha. Time is of the Essence: Improving Recency Ranking Using Twitter Data. In WWW '10, ACM, 2010.
[5] Y. Duan, L. Jiang, T. Qin, M. Zhou, and H.-Y. Shum. An Empirical Study on Learning to Rank of Tweets. In COLING '10, Association for Computational Linguistics, 2010.
[6] A. Jadhav, H. Purohit, P. Kapanipathi, P. Ananthram, A. Ranabahu, V. Nguyen, P. N. Mendes, A. G. Smith, M. Cooney, and A. Sheth. Twitris 2.0: Semantically Empowered System for Understanding Perceptions From Social Data. In Semantic Web Challenge, 2010.
[7] H. Kwak, C. Lee, H. Park, and S. Moon. What is Twitter, a Social Network or a News Media? In WWW '10, ACM, 2010.
[8] M. Mathioudakis and N. Koudas. TwitterMonitor: Trend Detection over the Twitter Stream. In SIGMOD '10, ACM, 2010.
[9] N. Naveed, T. Gottron, J. Kunegis, and A. C. Alhadi. Bad News Travel Fast: A Content-based Analysis of Interestingness on Twitter. In WebSci '11, 2011.
[10] T. Sakaki, M. Okazaki, and Y. Matsuo. Earthquake Shakes Twitter Users: Real-time Event Detection by Social Sensors. In WWW '10, ACM, 2010.
[11] J. Teevan, D. Ramage, and M. R. Morris. #TwitterSearch: A Comparison of Microblog Search and Web Search. In WSDM '11, ACM, 2011.
[12] J. Weng, E.-P. Lim, J. Jiang, and Q. He. TwitterRank: Finding Topic-sensitive Influential Twitterers. In WSDM '10, ACM, 2010.
[13] C. Zhai and J. Lafferty. A Study of Smoothing Methods for Language Models Applied to Ad Hoc Information Retrieval. In SIGIR '01, ACM, 2001.