=Paper=
{{Paper
|id=None
|storemode=property
|title=What Makes a Tweet Relevant for a Topic?
|pdfUrl=https://ceur-ws.org/Vol-838/paper_08.pdf
|volume=Vol-838
|dblpUrl=https://dblp.org/rec/conf/msm/TaoAHH12
}}
==What Makes a Tweet Relevant for a Topic?==
Ke Tao, Fabian Abel, Claudia Hauff, Geert-Jan Houben
Web Information Systems, TU Delft
PO Box 5031, 2600 GA Delft, the Netherlands
{k.tao, f.abel, c.hauff, g.j.p.m.houben}@tudelft.nl
ABSTRACT
Users who rely on microblogging search (MS) engines to find relevant microposts for their queries usually follow their interests and rationale when deciding whether a retrieved post is of interest to them or not. While today's MS engines commonly rely on keyword-based retrieval strategies, we investigate if there exist additional micropost characteristics that are more predictive of a post's relevance and interestingness than its keyword-based similarity with the query. In this paper, we experiment with a corpus of Twitter messages and investigate sixteen features along two dimensions: topic-dependent and topic-independent features. Our in-depth analysis compares the importance of the different types of features and reveals that semantic features, and therefore an understanding of the semantic meaning of the tweets, play a major role in determining the relevance of a tweet with respect to a query. We evaluate our findings in a relevance classification experiment and show that by combining different features, we can achieve a precision and recall of more than 35% and 45% respectively.

1. INTRODUCTION
Microblogging services such as Twitter (http://twitter.com/) or Sina Weibo (http://www.weibo.com/) have become a valuable source of information, particularly for exploring, monitoring and discussing news-related information [7]. Searching for relevant information in such services is challenging, as the number of posts published per day can exceed several hundred million (http://blog.twitter.com/2011/06/200-million-tweets-per-day.html).
Moreover, users who search for microposts about a certain topic typically perform a keyword search. Teevan et al. [11] found that keyword queries on Twitter are significantly shorter than those issued for Web search: on Twitter people typically use 1.64 words (or 12.0 characters) to search, while on the Web they use, on average, 3.08 words (or 18.8 characters). This can be explained by the length of Twitter messages, which is limited to 140 characters, so that long queries easily become too restrictive. Short queries, on the other hand, may result in a large (or too large) number of matching microposts.
For these reasons, building search algorithms that are capable of identifying interesting and relevant microposts for a given topic is a non-trivial and crucial research challenge. In order to take a first step towards solving this challenge, in this paper we present an analysis of the following question: is a keyword-based retrieval strategy sufficient, or can we identify features that are more predictive of a tweet's relevance and interestingness? To investigate this question, we took advantage of last year's TREC (http://trec.nist.gov/) 2011 Microblog Track (http://sites.google.com/site/trecmicroblogtrack/), where for the first time an openly accessible search & retrieval Twitter data set with about 16 million tweets was published.
In the context of TREC, the ad-hoc search task on Twitter is defined as follows: given a topic (identified by a title) and a point in time pt, retrieve all interesting and relevant microposts from the corpus that were posted no later than pt. A subset of the tweets that were retrieved by the research groups participating in the benchmark were then judged by human assessors as either relevant to the topic or as non-relevant. For example, "Obama birth certificate" is one of the topics that is part of the TREC corpus. Given the temporal context, one can infer that this topic title refers to the discussions about Barack Obama's birth certificate: people were questioning whether Barack Obama was truly born in the United States.
We rely on the judged tweets for our analysis and investigate topic-dependent as well as topic-independent features. Examples of topic-dependent features are the retrieval score derived from retrieval strategies that are based on document and corpus statistics, as well as the semantic overlap score, which determines the extent of overlap between the semantic meaning of a search topic and a tweet. In addition to these topic-dependent features, we also studied a number of topic-independent features: syntactical features (such as the presence of URLs or hashtags in a tweet), semantic features (such as the diversity of the semantic concepts mentioned in a tweet) and social context features (such as the authority of the user who published the tweet).
The main contributions of our work can be summarized as follows:
• We present a set of strategies for the extraction of features from Twitter messages that allow us to predict the relevance of a post for a given topic.
• Given a set of more than 38,000 tweets that were manually labeled as relevant or not relevant for a set of 49 topics, we analyze the features and characteristics of relevant and interesting tweets.
• We evaluate the effectiveness of the different features for predicting the relevance of tweets for a topic and investigate the impact of the different features on the quality of the relevance classification. We also study to what extent the success of the classification depends on the type of topics (e.g. topics of short-term vs. topics of long-term interest) for which relevant tweets should be identified.
2. RELATED WORK
Since its launch in 2006, Twitter has attracted a lot of attention, both in the general public as well as in the research community. Researchers started studying microblogging phenomena to find out what kind of information is discussed on Twitter [7], how trends evolve on Twitter [8], or how one detects influential users on Twitter [12]. Applications have been researched that utilize microblogging data to enrich traditional news media with information from Twitter [6], to detect and manage emergency situations such as earthquakes [10], or to enhance search and ranking of Web sites which possibly have not been indexed yet by Web search engines.
So far, search on Twitter or other microblogging platforms such as Sina Weibo has not been studied extensively. Teevan et al. [11] compared the search behavior on Twitter with traditional Web search behavior. It was found that keyword queries that people issue to retrieve information from Twitter are, on average, significantly shorter than queries submitted to traditional Web search engines (1.64 words vs. 3.08 words). This finding indicates that there is a demand to investigate new algorithms and strategies for retrieving relevant information from microblogging streams.
Bernstein et al. [2] proposed an interface that allows for exploring tweets by means of tag clouds. However, their interface is targeted towards browsing the tweets that have been published by the people whom a user is following, not towards searching the entire Twitter corpus. Jadhav et al. [6] developed an engine that enriches the semantics of Twitter messages and allows for issuing SPARQL queries on Twitter streams. In previous work, we followed such a semantic enrichment strategy to provide faceted search capabilities on Twitter [1]. Duan et al. [5] investigated features such as Okapi BM25 relevance scores or Twitter-specific features (length of a tweet, presence or absence of a URL or hashtag, etc.) in combination with RankSVM to learn a ranking model for tweets (learning to rank). In an empirical study, they found that the length of a tweet and information about the presence of a URL in a tweet are important features to rank relevant tweets. In this paper, we re-visit some of the features proposed by Duan et al. [5] and introduce novel semantic measures that allow us to estimate whether a micropost is relevant to a given topic or not.

3. FEATURES OF MICROPOSTS
In this section, we provide an overview of the different features that we analyze to estimate the relevance of a Twitter message to a given topic. We present topic-sensitive features that measure the relevance with respect to the topic (keyword-based and semantic-based relevance) and topic-insensitive measures that do not consider the actual topic but solely exploit syntactical or semantic tweet characteristics. Finally, we also consider contextual features that, for example, characterize the creator of a tweet.

3.1 Keyword-based Relevance Features
keyword-based relevance score (Indri-based query relevance): To calculate the retrieval score for a pair of (topic, tweet), we employ the language modeling approach to information retrieval [13]. A language model θ_t is derived for each document (tweet). Given a query Q with terms Q = {q_1, ..., q_n}, the document language models are ranked with respect to the probability P(θ_t|Q), which according to Bayes' theorem can be expressed as:

  P(θ_t|Q) = P(Q|θ_t) P(θ_t) / P(Q)                        (1)
           ∝ P(θ_t) ∏_{q_i ∈ Q} P(q_i|θ_t).                (2)

This is the standard query-likelihood language modeling setup, which assumes term independence. Usually, the prior probability of a tweet P(θ_t) is considered to be uniform, that is, each tweet in the corpus is equally likely. The language models are multinomial probability distributions over the terms occurring in the tweets. Since a maximum likelihood estimate of P(q_i|θ_t) would result in a zero probability for any tweet that misses one or more of the query terms in Q, the estimate is usually smoothed with a background language model generated over all tweets in the corpus. We employed Dirichlet smoothing [13]:

  P(q_i|θ_t) = (c(q_i, t) + µ P(q_i|θ_C)) / (|t| + µ).     (3)

Here, µ is the smoothing parameter, c(q_i, t) is the count of term q_i in t, and |t| is the length of the tweet. The probability P(q_i|θ_C) is the maximum likelihood probability of term q_i occurring in the collection language model θ_C (derived by concatenating all tweets in the corpus).
Due to the very small probabilities of P(Q|θ_t), we utilize log(P(Q|θ_t)) as feature scores. Note that this score is always negative. The greater the score (that is, the less negative), the more relevant the tweet is to the query.
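To make the scoring concrete, here is a minimal Python sketch of the Dirichlet-smoothed log query likelihood of Equations (1)-(3). The helper names and the default µ = 2500 (a common choice in the smoothing literature) are ours; the paper itself relies on an Indri-based implementation.

```python
import math
from collections import Counter

def dirichlet_score(query_terms, tweet_terms, collection_tf, collection_len, mu=2500):
    """Log query likelihood log P(Q|theta_t) with Dirichlet smoothing (Eq. 3).

    query_terms / tweet_terms: lists of tokens; collection_tf: Counter of
    term frequencies over the whole corpus; collection_len: total number
    of tokens in the corpus; mu: Dirichlet smoothing parameter.
    """
    tweet_tf = Counter(tweet_terms)
    score = 0.0
    for q in query_terms:
        p_q_collection = collection_tf[q] / collection_len  # P(q|theta_C)
        if p_q_collection == 0:
            return float("-inf")  # term unseen in the corpus: zero probability
        p_q_tweet = (tweet_tf[q] + mu * p_q_collection) / (len(tweet_terms) + mu)
        score += math.log(p_q_tweet)  # sum of logs = log of the product in Eq. 2
    return score  # always negative; closer to 0 means more relevant

```

Tweets are then ranked for a query by descending score.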
3.2 Semantic-based Relevance Features
semantic-based relevance score: This feature is also a retrieval score, calculated according to Section 3.1, though with a different set of queries. Since the average length of search queries submitted to microblog search engines is lower than in traditional Web search, it is necessary to understand the information need behind the query. The search topics provided as part of the TREC data set contain abbreviations, parts of names, and nicknames. One example (cf. Table 1) is the first name "Jintao" (in the query: "Jintao visit US"), which refers to the President of the People's Republic of China. However, in tweets he is also referred to as "President Hu", "Chinese President", etc. If these semantic variants of a person's name and titles were considered when deriving an expanded query, a wider variety of potentially relevant tweets could be found. We utilize the well-known Named Entity Recognition (NER) service DBpedia Spotlight (http://spotlight.dbpedia.org/) to identify names and their synonyms in the original query.
Query: "Jintao visits US"
Entity | Annotated Text | Possible Concepts
Hu Jintao | Jintao | Hu, Jintao, Hu Jintao
Table 1: Example of entity recognition and possible concepts in the query

We merge the found concepts into an expanded query, which is then used as input to the retrieval approach described earlier.

isSemanticallyRelated: This is a boolean value that indicates whether there is a semantic overlap between the topic and the tweet. This requires us to employ DBpedia Spotlight on the topic as well as on the tweets. If there is an overlap in the identified DBpedia concepts, the value of this feature is true, otherwise it is false.
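Both semantic features can be derived from the same annotation call. The sketch below targets the public DBpedia Spotlight REST endpoint; the endpoint URL, the confidence threshold, and the way surface variants are derived from resource URIs are assumptions on our part, while the JSON keys follow Spotlight's documented response format.

```python
import requests

# Public endpoint; the paper used the service hosted at spotlight.dbpedia.org.
SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/en/annotate"

def annotate_entities(text, confidence=0.2):
    """Return the set of DBpedia resource URIs Spotlight finds in `text`."""
    resp = requests.get(
        SPOTLIGHT_URL,
        params={"text": text, "confidence": confidence},
        headers={"Accept": "application/json"},
    )
    resp.raise_for_status()
    resources = resp.json().get("Resources", [])  # absent when nothing is found
    return {r["@URI"] for r in resources}

def expand_query(topic_title):
    """Merge the topic title with name variants of the detected concepts,
    e.g. 'Jintao visit US' additionally yields 'Hu Jintao' via db:Hu_Jintao."""
    variants = {uri.rsplit("/", 1)[-1].replace("_", " ")
                for uri in annotate_entities(topic_title)}
    return topic_title + " " + " ".join(variants)

def is_semantically_related(topic_title, tweet_text):
    """Boolean overlap between the topic's and the tweet's DBpedia concepts."""
    return len(annotate_entities(topic_title) & annotate_entities(tweet_text)) > 0
```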
3.3 Syntactical Features
Syntactical features describe elements that are mentioned in a Twitter message. We analyze the following properties:

hasHashtag: This is a boolean property which indicates whether a given tweet contains at least one hashtag or not. Twitter users typically apply hashtags in order to facilitate the retrieval of the tweet. For example, by using a hashtag people can join a discussion on a topic that is represented via that hashtag. Users who monitor the hashtag will retrieve all tweets that contain it. Teevan et al. [11] showed that such monitoring behavior is a common practice on Twitter to retrieve relevant Twitter messages. Therefore, we investigate whether the occurrence of hashtags (possibly without any obvious relevance to the topic) is an indicator for the relevance and interestingness of a tweet.
Hypothesis H1: tweets that contain hashtags are more likely to be relevant than tweets that do not contain hashtags.

hasURL: Dong et al. [4] showed that people often exchange URLs via Twitter, so that information about trending URLs can be exploited to improve Web search and particularly the ranking of recently discussed URLs. Therefore, the presence of a URL (boolean property) can be an indicator for the relevance of a tweet.
Hypothesis H2: tweets that contain a URL are more likely to be relevant than tweets that do not contain a URL.

isReply: On Twitter, users can reply to the tweets of other people. This type of communication can, for example, be used to comment on a certain message, to answer a question or to chat with other people. Chen et al. [3] studied the characteristics of reply chains and discovered that one can distinguish between users who are merely interested in news-related information and users who are also interested in social chatter. For deciding whether a tweet is relevant for a news-related topic, we therefore assume that the boolean isReply feature, which indicates whether a tweet is a reply to another tweet, can be a valuable signal.
Hypothesis H3: tweets that are formulated as a reply to another tweet are less likely to be relevant than other tweets.

length: The length of a tweet, measured in the number of characters, may also be an indicator for the relevance or interestingness. We hypothesize that the length of a Twitter message correlates with the amount of information that is conveyed in the message.
Hypothesis H4: the longer a tweet, the more likely it is to be relevant and interesting.

The values of boolean properties are set to 0 (false) and 1 (true), while the length of a Twitter message is measured by the number of characters divided by 140, which is the maximum length of a Twitter message.
There are further syntactical features that could be explored, such as the mentioning of certain character sequences including emoticons, question marks, exclamation marks, etc. In line with the isReply feature, one could also utilize knowledge about the re-tweet history of a tweet, e.g. a boolean property that indicates whether the tweet is a copy of another tweet or a numeric property that counts the number of users who re-tweeted the message. However, in this paper we are merely interested in original messages that have not been re-tweeted yet (this is in line with the relevance judgments provided by TREC, which did not consider re-tweeted messages) and therefore also merely in features which do not require any knowledge about the history of a tweet. This allows us to estimate the relevance of a message as soon as it is published.
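The syntactical features reduce to simple predicates over the raw message. A possible extraction routine (the tweet field names are illustrative, loosely following the Twitter API):

```python
import re

def syntactical_features(tweet):
    """Boolean features coded as 0/1 and the length normalized by the
    140-character limit, as described above."""
    text = tweet["text"]
    return {
        "hasHashtag": 1 if re.search(r"#\w+", text) else 0,
        "hasURL": 1 if re.search(r"https?://\S+", text) else 0,
        "isReply": 1 if tweet.get("in_reply_to_status_id") else 0,
        "length": min(len(text), 140) / 140.0,
    }
```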
3.4 Semantic Features
In addition to the semantic relevance scores described in Section 3.2, one can also analyze the semantics of a Twitter message independently from the topic of interest. We therefore utilize again the DBpedia entity extraction provided by DBpedia Spotlight to extract the following features:

#entities: The number of DBpedia entities that are mentioned in a Twitter message may give further evidence about the potential relevance and interestingness of a tweet. We assume that the more entities can be extracted from a tweet, the more information it contains and the more valuable it is. For example, in the context of the discussion about birth certificates we find the following two tweets in our dataset:
t1: "Despite what her birth certificate says, my lady is actually only 27"
t2: "Hawaii (Democratic) lawmakers want release of Obama's birth certificate"
When reading the two tweets without having a particular topic or information need in mind, it seems that t2 has a higher likelihood to be relevant for some topic for the majority of the Twitter users than t1, as it conveys more entities that are known to the public and available on Wikipedia and DBpedia respectively. In fact, the entity extractor is able to detect one entity, db:Birth_certificate, for tweet t1, while it detects three additional entities for t2: db:Hawaii, db:Legislator and db:Barack_Obama.
Hypothesis H5: the more entities a tweet mentions, the more likely it is to be relevant and interesting.

#entities(type): Similarly to counting the number of entities that occur in a Twitter message, we also count the number of entities of specific types. The rationale behind this feature is that some types of entities might be a stronger indicator for relevance than others. The importance of a specific entity type may also depend on the topic.
For example, when searching for Twitter messages that report about wild fires in a specific area, location-related entities may be more interesting than product-related entities. In this paper, we count the number of entity occurrences in a Twitter message for five different types: locations, persons, organizations, artifacts and species (plants and animals).
Hypothesis H6: different types of entities are of different importance for estimating the relevance of a tweet.

diversity: The diversity of semantic concepts mentioned in a Twitter message can also be exploited as an indicator for the potential relevance and interestingness of a tweet. We therefore count the number of distinct types of entities that are mentioned in a Twitter message. For example, for the two tweets t1 and t2 mentioned earlier, the diversity score would be 1 and 4 respectively, as for t1 only one type of entity is detected (yago:PersonalDocuments), while for t2 also instances of db:Person (person), db:Place (location) and owl:Thing (the role db:Legislator is not further classified) are detected.
Hypothesis H7: the greater the diversity of concepts mentioned in a tweet, the more likely it is to be interesting and relevant.

sentiment: Naveed et al. [9] showed that tweets which contain negative emoticons are more likely to be re-tweeted than tweets which feature positive emoticons. The sentiment of a tweet may thus impact the perceived relevance of a tweet. Therefore, we classify the sentiment polarity of a tweet into positive, negative or neutral using Twitter Sentiment (http://twittersentiment.appspot.com/).
Hypothesis H8: the likelihood of a tweet's relevance is influenced by its sentiment polarity.
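A sketch of the entity-based features, assuming the detected Spotlight entities have already been mapped to the five type labels used in this paper (that mapping, and the helper's input format, are our assumptions):

```python
TYPES = ("person", "organization", "location", "artifact", "species")

def semantic_features(entity_types):
    """Entity-based features for one tweet.

    entity_types: list with the (already mapped) type label of each detected
    DBpedia entity, e.g. ['person', 'location'] for tweet t2's
    db:Barack_Obama and db:Hawaii.
    """
    features = {"#entities": len(entity_types)}
    for t in TYPES:
        features[f"#entities({t})"] = entity_types.count(t)
    # diversity = number of distinct types among the five considered ones
    features["diversity"] = len(set(entity_types) & set(TYPES))
    return features
```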
3.5 Contextual Features
In addition to the aforementioned features, which describe characteristics of the Twitter messages themselves, we also investigate features that describe the context in which a tweet was published. In our analysis, we investigate the social and temporal context:

social context: The social context describes the creator of a Twitter message. Different characteristics of the message creator may increase or decrease the likelihood of her tweets being relevant and interesting, such as the number of followers or the number of tweets from this user that have been re-tweeted. In this paper, we apply a light-weight measure to characterize the creator of a message: we count the number of tweets which the user has published.
Hypothesis H9: the higher the number of tweets that have been published by the creator of a tweet, the more likely it is that the tweet is relevant.

temporal context: The temporal context describes when a tweet was published. The creation time can be specified with respect to the time when a user is requesting tweets about a certain topic (query time), or it can be independent of the query time. For example, one could specify at which hour during the day the tweet was published or whether it was created during the weekend. In our analysis, we utilize the temporal distance (in seconds) between the query time and the creation time of the tweet.
Hypothesis H10: the lower the temporal distance between the query time and the creation time of a tweet, the more likely the tweet is relevant to the topic.

Contextual features may also refer to characteristics of Web pages that are linked from a Twitter message. For example, one could exploit the PageRank scores of the referenced Web sites to estimate the relevance of a tweet, or one could categorize the linked Web pages to discover the types of Web sites that usually attract attention on Twitter. We leave the investigation of such additional contextual features for future work.
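Both contextual features are direct lookups on tweet metadata; a minimal sketch with illustrative field names:

```python
def contextual_features(tweet, query_time):
    """Social and temporal context of a tweet.

    tweet carries the author's total tweet count and the creation timestamp
    (a datetime object here); field names are illustrative.
    """
    return {
        "social_context": tweet["user_statuses_count"],  # #tweets by creator
        "temporal_context": (query_time - tweet["created_at"]).total_seconds(),
    }
```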
4. FEATURE ANALYSIS
In this section, we describe and characterize the Twitter corpus with respect to the features that we presented in the previous section.

4.1 Dataset Characteristics
We use the Twitter corpus which was used in the microblog track of TREC 2011 (http://trec.nist.gov/data/tweets/). The original corpus consists of approximately 16 million tweets, posted over a period of two weeks (January 24 until February 8, inclusive). We utilized an existing language detection library (http://code.google.com/p/language-detection/) to identify English tweets and found that 4,766,901 tweets were classified as English. Employing NER on the English tweets resulted in a total of over six million named entities, among which we found approximately 0.14 million distinct entities. Besides the tweets, 49 topics were given as the targets of retrieval. TREC assessors judged the relevance of 40,855 topic-tweet pairs, which we use as ground truth in our experiments. 2,825 tweets were judged as relevant for a given topic, while the majority of the tweet-topic pairs (37,349) were marked as non-relevant.
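The filtering step above used the Java language-detection library; an equivalent sketch with `langdetect`, a Python port of that library:

```python
from langdetect import detect  # pip install langdetect
from langdetect.lang_detect_exception import LangDetectException

def english_tweets(tweets):
    """Yield only tweets classified as English, mirroring the paper's filtering."""
    for tweet in tweets:
        try:
            if detect(tweet["text"]) == "en":
                yield tweet
        except LangDetectException:
            pass  # empty or undecidable text is skipped
```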
Category | Feature | Relevant | Std. dev. | Non-relevant | Std. dev.
keyword relevance | keyword-based | -10.709 | 3.5860 | -14.408 | 2.6442
semantic relevance | semantic-based | -10.308 | 3.7363 | -14.264 | 3.1872
semantic relevance | isSemanticallyRelated | 25.3% | 43.5% | 4.6% | 22.6%
syntactical | hasHashtag | 19.1% | 39.2% | 19.3% | 39.9%
syntactical | hasURL | 81.9% | 38.5% | 54.1% | 49.5%
syntactical | isReply | 3.4% | 18.0% | 14.2% | 34.5%
syntactical | length (in characters) | 90.323 | 30.81 | 87.797 | 36.17
semantics | #entities | 2.367 | 1.605 | 1.880 | 1.777
semantics | #entities(person) | 0.276 | 0.566 | 0.188 | 0.491
semantics | #entities(organization) | 0.316 | 0.589 | 0.181 | 0.573
semantics | #entities(location) | 0.177 | 0.484 | 0.116 | 0.444
semantics | #entities(artifact) | 0.188 | 0.471 | 0.245 | 0.609
semantics | #entities(species) | 0.005 | 0.094 | 0.012 | 0.070
semantics | diversity | 0.795 | 0.788 | 0.597 | 0.802
semantics | sentiment (-1=neg, 1=pos) | -0.025 | 0.269 | 0.042 | 0.395
contextual | social context (#tweets by creator) | 12.287 | 19.069 | 12.226 | 20.027
contextual | temporal context (time distance in days) | 4.85 | 4.48 | 3.98 | 5.09
Table 2: The comparison of features between relevant tweets and non-relevant tweets (mean values, or percentages of true instances for boolean features, with standard deviations)
4.2 Feature Characteristics
In Table 2 we list the average values and standard deviations of the features, and the percentages of true instances for the boolean features respectively. It shows that relevant and non-relevant tweets exhibit, on average, different characteristics for several features.
As expected, the average keyword-based relevance score of tweets which are judged as relevant for a given topic is much higher than the one for non-relevant tweets: -10.709 in comparison to -14.408 (the higher the value the better, see Section 3.1). Similarly, the semantic-based relevance score, which exploits the semantic concepts mentioned in the tweets (see Section 3.2) while calculating the retrieval rankings, shows the same characteristic. The isSemanticallyRelated feature, which is a binary measure of the overlap between the semantic concepts mentioned in the query and the respective tweets, is also higher for relevant tweets than for non-relevant tweets. Hence, when we consider the topic-dependent features (keyword-based and semantic-based), we find first indicators that the hypotheses behind these features hold.
For the syntactical features we observe that, regardless of whether the tweets are relevant to a topic or not, the ratio of tweets that contain hashtags is almost the same (about 19%). Hence, it seems that the presence of a hashtag is not necessarily an indicator for relevance. However, the presence of a URL is potentially a very good indicator: 81.9% of the relevant tweets feature a URL, whereas only 54.1% of the non-relevant tweets contain a URL. A possible explanation for this difference is that tweets containing URLs tend to also feature an attractive short title, especially for breaking news, in order to attract people to follow the link. Moreover, the actual content of the linked Web site may also influence users when assessing the relevance of a tweet.
In Hypothesis H3 (see Section 3.3), we speculate that messages which are replies to other tweets are less likely to be relevant than other tweets. The results listed in Table 2 support this hypothesis: only 3.4% of the relevant tweets are replies, in contrast to 14.2% of the non-relevant tweets. The length of the tweets that are judged as relevant is, on average, 90.3 characters, which is slightly longer than for the non-relevant ones (87.8 characters).
The comparison of the topic-independent semantic features also reveals some differences between relevant and non-relevant tweets. Overall, relevant tweets contain more entities (2.4) than non-relevant tweets (1.9). Among the five most frequently mentioned types of entities, persons, organizations, and locations occur more often in relevant tweets than in non-relevant ones. On average, messages are therefore considered as more likely to be relevant or interesting for users if they contain information about people, involved organizations, or places. Artifacts (e.g. tangible things, software) and species (e.g. plants, animals) are more frequent in non-relevant tweets. However, counting the number of entities of type species seems to be a less promising feature, since the fraction of tweets which mention a species is fairly low.
The diversity of content mentioned in a Twitter message, i.e. the number of distinct types (only person, organization, location, artifact, and species are considered), is potentially a good feature: the semantic diversity is higher for the relevant tweets (0.8) than for the non-relevant ones (0.6). In addition to the entities that are mentioned in the tweets, we also conducted a sentiment analysis of the tweets (see Section 3.4). Although most of the tweets are neutral (sentiment score = 0), the average sentiment score for relevant tweets is negative (-0.025). This observation is in line with the finding made by Naveed et al. [9], who found that negative tweets are more likely to be re-tweeted.
Finally, we also attempted to determine the relationship between a tweet's likelihood of relevance and its context. With respect to the social context, however, we do not observe a significant difference between relevant and non-relevant tweets: users who publish relevant tweets are, on average, not more active than publishers of non-relevant tweets (12.3 vs. 12.2). For the temporal context, the average distance between the time when a user requests tweets about a topic and the creation time of tweets is 4.85 days for relevant tweets and 3.98 days for non-relevant tweets. However, the standard deviations of these scores are, at 4.53 days (relevant) and 4.39 days (non-relevant), fairly high. This indicates that the temporal context is not a reliable feature for our dataset. Preliminary experiments indeed confirmed the low utility of the temporal feature. However, this observation seems to be strongly influenced by the TREC dataset itself, which was collected within a short period of time (two weeks). In our evaluations, we therefore do not consider the temporal context and leave an analysis of the temporal features for future work.
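The comparison in Table 2 is a plain grouped aggregation. Assuming the judged (topic, tweet) pairs sit in a pandas DataFrame with one numeric column per feature (boolean features coded as 0/1) and a 0/1 `relevant` column (our assumed layout), it could be reproduced as follows:

```python
import pandas as pd

def feature_comparison(df: pd.DataFrame) -> pd.DataFrame:
    """Mean and standard deviation of every feature column, split by the
    relevance judgment, i.e. the aggregation behind Table 2."""
    numeric = df.drop(columns=["topic_id", "tweet_id"], errors="ignore")
    return numeric.groupby("relevant").agg(["mean", "std"]).T
```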
5. EVALUATION OF FEATURES FOR RELEVANCE PREDICTION
Having analyzed the dataset and the proposed features, we now evaluate the quality of the features for predicting the relevance of tweets for a given topic. We first outline the experimental setup before we present our results and analyze the influence of the different features on the performance for the different types of topics.

5.1 Experimental Setup
We employ logistic regression to classify tweets as relevant or non-relevant to a given topic. Due to the small size of the topic set (49 topics), we use 5-fold cross-validation to evaluate the learned classification models. For the final setup, 16 features were used as predictor variables (all features listed in Table 2 except for the temporal context). To conduct our experiments, we rely on the machine learning toolkit Weka (http://www.cs.waikato.ac.nz/ml/weka/). As the number of relevant tweets is considerably smaller than the number of non-relevant tweets, we employed a cost-sensitive classification setup to prevent the classifier from simply marking all tweets as non-relevant (the majority class). As the estimation for the negative class achieves a precision and recall both over 90%, we focus on the precision and recall of the relevance classification (the positive class) in our evaluation, as we aim to investigate the characteristics that make tweets relevant to a given topic.
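The paper runs this setup in Weka. An analogous sketch with scikit-learn is shown below, where class_weight='balanced' stands in for Weka's cost-sensitive wrapper; note that the folds here are drawn over instances, whereas assigning folds per topic would mirror the paper more closely.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# X: feature matrix (the 16 predictor variables), y: 1 = relevant, 0 = non-relevant.
def evaluate(X, y):
    # 'balanced' reweights errors on the rare positive class so the model
    # cannot win by predicting non-relevant for everything
    clf = LogisticRegression(class_weight="balanced", max_iter=1000)
    scores = cross_validate(clf, X, y, cv=5,
                            scoring=["precision", "recall", "f1"])
    return {m: scores[f"test_{m}"].mean()
            for m in ("precision", "recall", "f1")}
```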
Features | Precision | Recall | F-Measure
keyword relevance | 0.3040 | 0.2924 | 0.2981
semantic relevance | 0.3053 | 0.2931 | 0.2991
topic-sensitive | 0.3017 | 0.3419 | 0.3206
topic-insensitive | 0.1294 | 0.0170 | 0.0300
without semantics | 0.3363 | 0.4828 | 0.3965
all features | 0.3674 | 0.4736 | 0.4138
Table 3: Performance results of relevance predictions for different sets of features.

Feature Category | Feature | Coefficient
keyword-based | keyword-based | 0.1701
semantic-based | semantic-based | 0.1046
semantic-based | isSemanticallyRelated | 0.9177
syntactical | hasHashtag | 0.0946
syntactical | hasURL | 1.2431
syntactical | isReply | -0.5662
syntactical | length | 0.0004
semantics | #entities | 0.0339
semantics | #entities(person) | -0.0725
semantics | #entities(organization) | -0.0890
semantics | #entities(location) | -0.0927
semantics | #entities(artifact) | -0.3404
semantics | #entities(species) | -0.5914
semantics | diversity | 0.2006
semantics | sentiment | -0.5220
contextual | social context | -0.0042
Table 4: The feature coefficients were determined across all topics. The total number of topics is 49. The three features with the highest absolute coefficient are underlined.
5.2 Influence of Features on Relevance Prediction
Table 3 shows the performance of estimating the relevance of tweets based on different sets of features. Learning the classification model solely based on the keyword-based or semantic-based relevance scoring features leads to an F-Measure of 0.2981 and 0.2991 respectively. There is thus no notable difference between the two topic-sensitive features. However, by combining both features (see topic-sensitive in Table 3), the F-Measure increases, which is caused by a higher recall, increasing from 0.29 to 0.34. It appears that the keyword-based and semantic-based relevance scores complement each other.
As expected, when solely learning the classification model based on the topic-independent features, i.e. without measuring the relevance to the given topic, the quality of the relevance prediction is poor. The best performance is achieved when all features are combined. A precision of 36.74% means that more than a third of all tweets that our approach classifies as relevant are indeed relevant, while the recall level (47.36%) implies that our approach discovers nearly half of all relevant tweets. Since microblog messages are very short, a significant number of tweets can be read quickly by a user when presented in response to her search request. In such a setting, we believe such a classification accuracy to be sufficient. Overall, the semantic features seem to play an important role, as they lead to a performance improvement with respect to the F-Measure from 0.3965 to 0.4138. We will now analyze the impact of the different features in detail.
One of the advantages of the logistic regression model is that it is easy to determine the most important features of the model by considering the absolute weights assigned to them. For this reason, we have listed the relevant-tweet prediction model coefficients for all employed features in Table 4. The features influencing the model the most are:
• hasURL: Since the feature coefficient is positive, the presence of a URL in a tweet is more indicative of relevance than non-relevance. This means that hypothesis H2 (Section 3.3) holds.
• isSemanticallyRelated: The overlap between the identified DBpedia concepts in the topics and the identified DBpedia concepts in the tweets is the second most important feature in this model. This is an interesting observation, especially in comparison to the keyword-based relevance score, which is only the ninth most important feature among the evaluated ones. It implies that a standard keyword-based retrieval approach, which performs well for longer documents, is less suitable for microposts.
• isReply: This feature, which is true (= 1) if a tweet is written in reply to a previously published tweet, has a negative coefficient, which means that tweets which are replies are less likely to be in the relevant class than tweets which are not replies, confirming hypothesis H3 (Section 3.3).
• sentiment: The coefficient of the sentiment feature is similarly negative, which suggests that a negative sentiment is more predictive of relevance than a positive sentiment, in line with our hypothesis H8 (Section 3.4).
We note that the keyword-based similarity, while being positively aligned with relevance, does not belong to the most important features in this model. It is superseded by syntactic as well as semantic-based features. When we consider the non-topical features only, we observe that interestingness (independent of a topic) is related to the potential amount of additional information (i.e. the presence of a URL), the clarity of the tweet overall (a tweet in reply may only be understandable in the context of the surrounding tweets) and the different aspects covered in the tweet (as evident in the diversity feature). It should also be pointed out that the negative coefficients assigned to most topic-insensitive entity count features (#entities(X)) are in line with the results in Table 2.
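Since logistic regression is a log-linear model, the coefficients in Table 4 translate directly into odds ratios, which makes the ranking above concrete:

```python
import math

# Coefficients taken from Table 4; e^coef is the factor by which a one-unit
# increase of the feature multiplies the odds of the tweet being relevant.
for feature, coef in [("hasURL", 1.2431), ("isSemanticallyRelated", 0.9177),
                      ("isReply", -0.5662), ("sentiment", -0.5220)]:
    print(f"{feature:>22s}: odds ratio = {math.exp(coef):.2f}")
# hasURL:                odds ratio = 3.47  (a URL more than triples the odds)
# isSemanticallyRelated: odds ratio = 2.50
# isReply:               odds ratio = 0.57  (replies nearly halve the odds)
# sentiment:             odds ratio = 0.59
```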
5.3 Influence of Topic Characteristics on Relevance Prediction
In all reported experiments so far, we have considered the entire set of topics available to us. In this section, we investigate to what extent certain topic characteristics play a role for relevance prediction and to what extent those differences lead to a change in the logistic regression models.
Consider the following two topics: Taco Bell filling lawsuit (MB020; topic identifiers correspond to the ones used in the official TREC dataset) and Egyptian protesters attack museum (MB010). While the former has a business theme and is likely to be mostly of interest to American users, the latter topic belongs to the politics category and can be considered as being of global interest, as the entire world was watching the events in Egypt unfold. Due to these differences we defined a number of topic splits. A manual annotator then decided for each split dimension into which category the topic should fall.
Performance Measure | popular | unpopular | global | local | persistent | occasional
#topics | 24 | 25 | 18 | 31 | 28 | 21
#samples | 19803 | 21052 | 16209 | 25646 | 22604 | 18251
precision | 0.3596 | 0.3579 | 0.3442 | 0.3726 | 0.3439 | 0.4072
recall | 0.4308 | 0.5344 | 0.4510 | 0.4884 | 0.4311 | 0.5330
F-measure | 0.3920 | 0.4287 | 0.3904 | 0.4227 | 0.3826 | 0.4617

Feature Category | Feature | popular | unpopular | global | local | persistent | occasional
keyword-based | keyword-based | 0.1018 | 0.2475 | 0.1873 | 0.1624 | 0.1531 | 0.1958
semantic-based | semantic-based | 0.1061 | 0.1312 | 0.1026 | 0.1028 | 0.0820 | 0.1560
semantic-based | isSemanticallyRelated | 1.1026 | 0.5546 | 0.9563 | 0.8617 | 0.8685 | 1.0908
syntactical | hasHashtag | 0.1111 | 0.0917 | 0.1166 | 0.0843 | 0.0801 | 0.1274
syntactical | hasURL | 1.3509 | 1.1706 | 1.2355 | 1.2676 | 1.3503 | 1.0556
syntactical | isReply | -0.5603 | -0.5958 | -0.6466 | -0.5162 | -0.4443 | -0.7643
syntactical | length | 0.0013 | -0.0007 | 0.0003 | 0.0004 | 0.0016 | -0.0020
semantics | #entities | 0.0572 | 0.0117 | 0.0620 | 0.0208 | 0.0478 | -0.0115
semantics | #entities(person) | -0.2613 | 0.0552 | -0.5400 | 0.0454 | 0.1088 | -0.3932
semantics | #entities(organization) | -0.0952 | -0.1767 | -0.2257 | -0.0409 | -0.1636 | -0.0297
semantics | #entities(location) | -0.1446 | 0.0136 | -0.1368 | -0.1056 | -0.0583 | -0.1305
semantics | #entities(artifact) | -0.3442 | -0.3725 | -0.4834 | -0.3086 | -0.2260 | -0.4835
semantics | #entities(species) | -0.2567 | -0.9599 | -0.8893 | -0.4792 | -0.1634 | -18.8129
semantics | diversity | 0.1940 | 0.2695 | 0.2776 | 0.1943 | 0.1071 | 0.3867
semantics | sentiment | -0.7968 | -0.1761 | -0.6297 | -0.4727 | -0.3227 | -0.7411
contextual | social context | -0.002 | -0.0068 | -0.0020 | -0.0057 | -0.0034 | -0.0055
Table 5: Influence comparison of different features among different topic partitions. Three splits are shown here: popular vs. unpopular topics, global vs. local topics, and persistent vs. occasional topics. While the performance measures are based on 5-fold cross-validation, the derived feature weights for the logistic regression model were determined across all topics of a split. The total number of topics is 49. For each topic split, the three features with the highest absolute coefficient are underlined. The extreme negative coefficient for #entities(species) in the occasional topic split is an artifact of the small training size: in none of the relevant tweets did this concept type occur.
We investigated four topic splits, three splits with two partitions each and one split with five partitions:
• Popular/unpopular: The topics were split into popular (interesting to many users) and unpopular (interesting to few users) topics. An example of a popular topic is 2022 FIFA soccer (MB002); in total we found 24 popular topics. In contrast, topic NIST computer security (MB005) was classified as unpopular (as one of 25 such topics).
• Global/local: In this split, we considered the interest for the topic across the globe. The already mentioned topic MB002 is of global interest, since soccer is a highly popular sport in many countries, whereas topic Cuomo budget cuts (MB019) is mostly of local interest to users living or working in New York, where Andrew Cuomo is the current governor. We found 18 topics to be of global and 31 topics to be of local interest.
• Persistent/occasional: This split is concerned with the interestingness of the topic over time. Some topics persist for a long time, such as MB002 (the FIFA World Cup will be played in 2022), whereas other topics are only of short-term interest, e.g. Keith Olbermann new job (MB030). We assigned 28 topics to the persistent and 21 topics to the occasional topic partition.
• Topic themes: The topics were classified as belonging to one of five themes: business, entertainment, sports, politics or technology. While MB002 is a sports topic, MB019 for instance is considered to be a political topic.
Our discussion of the results focuses on two aspects: (i) the difference between the models derived for each of the two partitions, and (ii) the difference between these models (denoted M_splitName) and the model derived over all topics (M_allTopics) in Table 4. The results for the three binary topic splits are shown in Table 5.
Popularity: A comparison of the most important features of M_popular and M_unpopular shows few differences, with the exception of a single feature: sentiment. While sentiment, and in particular a negative sentiment, is the third most important feature in M_popular, it is ranked eighth in M_unpopular. We hypothesize that unpopular topics are also partially unpopular because they do not evoke strong emotions in the users. A similar reasoning can be applied when considering the amount of relevant tweets discovered for both topic splits: while on average 67.3 tweets were found to be relevant for popular topics, only 49.9 tweets were found to be relevant for unpopular topics (the average number of relevant tweets across the entire topic set is 58.44).
Global vs. local: This split did not result in models that are significantly different from each other or from M_allTopics, indicating that, at least for our currently investigated features, a distinction between global and local topics is not useful.
Temporal persistence: The same conclusion can be drawn about the temporal persistence topic split; for both models the same features are of importance, which in turn are similar to those of M_allTopics. However, it is interesting to see that the performance (regarding all metrics) is clearly higher for the occasional (short-term) topics in comparison to the persistent (long-term) topics. For topics that have a short lifespan, recall and precision are notably higher than for the other types of topics.
Topic themes: The results of the topic split according to the theme of the topic are shown in Table 6. Three topics did not fit in one of the five categories. Since the topic set is split into five partitions, the size of some partitions is extremely small, making it difficult to reach conclusive results. We can, though, detect trends, such as the fact that relevant tweets for business topics are less likely to contain hashtags (negative coefficient), while the opposite holds for entertainment topics (positive coefficient).
Performance Measure | business | entertainment | sports | politics | technology
#topics | 6 | 12 | 5 | 21 | 2
#samples | 4503 | 9724 | 4669 | 17162 | 1811
precision | 0.4659 | 0.3691 | 0.1918 | 0.3433 | 0.5109
recall | 0.7904 | 0.5791 | 0.1045 | 0.4456 | 0.4653
F-measure | 0.5862 | 0.4508 | 0.1353 | 0.3878 | 0.4870

Feature Category | Feature | business | entertainment | sports | politics | technology
keyword-based | keyword-based | 0.2143 | 0.2069 | 0.1021 | 0.1728 | 0.2075
semantic-based | semantic-based | 0.2287 | 0.2246 | 0.0858 | 0.0456 | 0.0180
semantic-based | isSemanticallyRelated | 1.3821 | 0.4088 | 1.0253 | 1.0689 | 2.1150
syntactical | hasHashtag | -0.8488 | 0.5234 | 0.3752 | -0.0403 | -0.1503
syntactical | hasURL | 2.0960 | 1.1429 | 1.2785 | 1.2085 | 0.4452
syntactical | isReply | -0.2738 | -0.4784 | -0.6747 | -0.9130 | -0.3912
syntactical | length | 0.0044 | 0.0011 | 0.0050 | -0.0009 | 0.0013
semantics | #entities | -0.2473 | -0.1470 | 0.0853 | 0.0537 | 0.1011
semantics | #entities(person) | -1.2929 | -0.1161 | -0.4852 | 0.0177 | 0.1307
semantics | #entities(organization) | -0.0976 | 0.0865 | -0.4259 | -0.0673 | -0.7318
semantics | #entities(location) | -1.3932 | -0.9327 | 0.3655 | -0.1169 | 0.0875
semantics | #entities(artifact) | -0.4003 | -0.1235 | -1.0891 | -0.2663 | -0.3943
semantics | #entities(species) | 0.0241 | -19.1819 | -31.0063 | -0.5570 | -0.6187
semantics | diversity | 0.5277 | 0.4540 | 0.3209 | 0.2037 | 0.1431
semantics | sentiment | -1.0070 | -0.3477 | -1.0766 | -0.5663 | -0.2180
contextual | social context | -0.0067 | -0.0086 | -0.0047 | -0.0041 | -0.0155
Table 6: In line with Table 5, this table shows the influence comparison of different features when partitioning the topic set according to five broad topic themes.
The semantic similarity has a large impact on all themes but entertainment. Another interesting observation is that sentiment, and in particular negative sentiment, is a prominent feature in M_business and in M_politics, but less so in the other models.
Finally, we note that there are also some features which have no impact at all, independent of the topic split employed: the length of the tweet and the social context of the user posting the message. The observation that certain topic splits lead to models that emphasize certain features also offers a natural way forward: if we are able to determine for each topic in advance to which theme or topic characteristic it belongs, we can select the model that fits the topic best.
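That model-selection idea amounts to training one model per partition and routing each incoming topic to the model of its partition; a minimal sketch under the assumption that per-topic features and judgments are available through (hypothetical) helpers:

```python
from sklearn.linear_model import LogisticRegression

def train_split_models(topics, theme_of, X_of, y_of):
    """One logistic regression model per topic theme (M_business, M_sports, ...).

    theme_of: topic -> theme label assigned by the manual annotator;
    X_of / y_of: topic -> feature rows and relevance labels (illustrative).
    At query time, an incoming topic is routed to the model of its theme.
    """
    models = {}
    for theme in set(theme_of[t] for t in topics):
        members = [t for t in topics if theme_of[t] == theme]
        X = [row for t in members for row in X_of(t)]
        y = [label for t in members for label in y_of(t)]
        models[theme] = LogisticRegression(class_weight="balanced",
                                           max_iter=1000).fit(X, y)
    return models
```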
6. CONCLUSIONS
In this paper, we have analyzed features that can be used as indicators of a tweet's relevance and interestingness to a given topic. To achieve this, we investigated features along two dimensions: topic-dependent features and topic-independent features. We evaluated the utility of these features with a machine learning approach that allowed us to gain insights into the importance of the different features for the relevance classification.
Our main discoveries about the factors that lead to relevant tweets are the following: (i) The learned models which take advantage of semantics and topic-sensitive features outperform those which do not take the semantics and topic-sensitive features into account. (ii) The length of tweets and the social context of the user posting the message have little impact on the prediction. (iii) The importance of a feature differs depending on the characteristics of the topics. For example, the sentiment-based feature is more important for popular than for unpopular topics, and the semantic similarity does not have a significant impact on entertainment topics.
The work presented here is beneficial for search & retrieval of microblogging data and contributes to the foundations of engineering search engines for microposts. In the future, we plan to investigate the social and the contextual features in depth. Moreover, we would like to investigate to what extent personal interests of the users (possibly aggregated from different Social Web platforms) can be utilized as features for personalized retrieval of microposts.

7. REFERENCES
[1] F. Abel, I. Celik, and P. Siehndel. Leveraging the Semantics of Tweets for Adaptive Faceted Search on Twitter. In ISWC '11, Springer, 2011.
[2] M. S. Bernstein, B. Suh, L. Hong, J. Chen, S. Kairam, and E. H. Chi. Eddi: Interactive Topic-based Browsing of Social Status Streams. In UIST '10, ACM, 2010.
[3] J. Chen, R. Nairn, and E. H. Chi. Speak Little and Well: Recommending Conversations in Online Social Streams. In CHI '11, ACM, 2011.
[4] A. Dong, R. Zhang, P. Kolari, J. Bai, F. Diaz, Y. Chang, Z. Zheng, and H. Zha. Time is of the Essence: Improving Recency Ranking Using Twitter Data. In WWW '10, ACM, 2010.
[5] Y. Duan, L. Jiang, T. Qin, M. Zhou, and H.-Y. Shum. An Empirical Study on Learning to Rank of Tweets. In COLING '10, Association for Computational Linguistics, 2010.
[6] A. Jadhav, H. Purohit, P. Kapanipathi, P. Ananthram, A. Ranabahu, V. Nguyen, P. N. Mendes, A. G. Smith, M. Cooney, and A. Sheth. Twitris 2.0: Semantically Empowered System for Understanding Perceptions From Social Data. In Semantic Web Challenge, 2010.
[7] H. Kwak, C. Lee, H. Park, and S. Moon. What is Twitter, a Social Network or a News Media? In WWW '10, ACM, 2010.
[8] M. Mathioudakis and N. Koudas. TwitterMonitor: Trend Detection over the Twitter Stream. In SIGMOD '10, ACM, 2010.
[9] N. Naveed, T. Gottron, J. Kunegis, and A. C. Alhadi. Bad News Travel Fast: A Content-based Analysis of Interestingness on Twitter. In WebSci '11, 2011.
[10] T. Sakaki, M. Okazaki, and Y. Matsuo. Earthquake Shakes Twitter Users: Real-time Event Detection by Social Sensors. In WWW '10, ACM, 2010.
[11] J. Teevan, D. Ramage, and M. R. Morris. #TwitterSearch: A Comparison of Microblog Search and Web Search. In WSDM '11, ACM, 2011.
[12] J. Weng, E.-P. Lim, J. Jiang, and Q. He. TwitterRank: Finding Topic-sensitive Influential Twitterers. In WSDM '10, ACM, 2010.
[13] C. Zhai and J. Lafferty. A Study of Smoothing Methods for Language Models Applied to Ad Hoc Information Retrieval. In SIGIR '01, ACM, 2001.