ish Digg) and Wikipedia. They propose models for cascade growth and estimate model parameters but give no size predictions.
   From the information spread point of view, a number of related studies have a largely descriptive focus, unlike our quantitative prediction goals. In [9], high correlation is observed between indegree, retweet and mention influence, while outdegree (the number of tweets sent by the user) is found to be heavily spammed. [21] reports similar findings on the relation among follower, mention and retweet influence. Several more results describe the specific means of information spread on Facebook [6, 2, 7].
   There is only a limited amount of related work on retweet count prediction. Cheng et al. [10] predict retweet count based on network features. Unlike our setting, where we predict immediately after the tweet is published, they consider prediction after the first few retweets. The network features used in their work are similar to the ones in the present paper and in our earlier work [24]. The main contribution of the present work is the investigation of content-based features and of the interaction between network and content features. Petrovic et al. [26] only predict whether a tweet will be retweeted at all, and give no evaluation on distinguishing between the messages of the same user. In another result very similar to the previous one, [20] give a batch evaluation, for all users and the entire time range. They also use logistic regression; their features include tf.idf and an LDA based topic model. Similar to us, they classify for ranges of retweet counts; however, they mention that their accuracy is very low for the mid-range. We include logistic regression among other classifiers as baseline methods in our work.
   From the content analysis point of view, Bakshy et al. [3, 4] investigate bit.ly urls but find little connection between influence and url content, unlike in our experiments, where message content elements prove to be valuable for predicting influence. There have been several studies focusing exclusively on the analysis of the tweet message text to solve the retweet count prediction problem. Besides the terms of the message, Naveed et al. [23] introduced features based on direct messages, mentions, hashtags, urls, exclamation marks, question marks, positive and negative sentiment, positive and negative emoticons, and valence, arousal and dominance lexicon features. Wang et al. [28] proposed deeper linguistic features such as verb tense, named entities, discourse relations and sentence similarity. Similar to [26], neither of these results attempts to distinguish between the tweets of the same user.
   Regarding the idea of combining author, network and content information, our work is related to Gupta et al. [18], who used these sources of information jointly for scoring tweets according to their credibility. Although credibility is related to social influence, predicting the credibility and the retweet cascade size of a message requires different background information. Hence, we employ different network and content features.

2.   DATA SET
   The dataset was collected by Aragón et al. [1] using the Twitter API; we extended it by a crawl of the user network. Our data set hence consists of two parts:

   • Tweet dataset: tweet text and user metadata on the Occupy Wall Street movement¹.
   • Follower network: the list of followers of the users who posted at least one message in the tweet dataset.

¹ http://en.wikipedia.org/wiki/Occupy_Wall_Street

   Table 1 shows the number of users and tweets in the dataset. One can see that a large part of the collected tweets are retweets. Table 2 contains the size of the crawled social network. Note that the average in- and outdegree is relatively high. Fig. 1 shows the temporal density of tweeting activity.

Figure 1: Temporal density of tweeting activity.

Table 1: Size of the tweet time series.
   Number of users        371,401
   Number of tweets     1,947,234
   Number of retweets   1,272,443

Table 2: Size of the follower network.
   Number of users            330,677
   Number of edges         16,585,837
   Average in/out degree           37

   For each tweet, our data contains

   • the tweet and user ID,
   • the timestamp of creation,
   • the hashtags used in the tweet, and
   • the tweet text content.

In case of a retweet, we have all this information not only on the actual tweet, but also on the original root tweet that had been retweeted. We define the root tweet as the first occurrence of a given tweet.

3.   RETWEET CASCADES

3.1   Constructing retweet cascades
   In case of a retweet, the Twitter API provides us with the ID of the original tweet. By collecting the retweets for a given original tweet ID, we may obtain the set of users who have retweeted a given tweet, together with the corresponding retweet timestamps. The Twitter API, however, does not tell us the actual path of the cascade if the original tweet was retweeted
several times. The information from the Twitter API on the tweet needs to be combined with the follower network to reconstruct the possible information pathways for a given tweet. However, it can happen that for a given retweeter more than one friend has retweeted the corresponding tweet before, and hence we do not know the exact information source of the retweeter. This retweet ambiguity problem is well described in [3]. In what follows, we consider all friends as possible information sources. In other words, for a given tweet we consider all directed edges of the follower network along which the information flow could have occurred (see Fig. 2 (a)).

Figure 2: Creation of retweet cascades: Figure (a) shows the computation of the cascade edges. Figures (b) and (c) show the possible solutions in case of missing cascade edges.

3.2   Restoring missing cascade edges
   For a given tweet, the computed edges define a retweet cascade. However, our dataset contains only a sample of the tweets on the given hashtags and hence may not be complete: it can happen that a few intermediate retweeters are missing from our data. As a result, the reconstructed cascade graphs are sometimes disconnected. As illustrated in Fig. 2 (b) and (c), we handle this problem in two different ways. One possible solution is to consider only the first connected component of the cascade (see Fig. 2 (b)). Another one is to connect each disconnected part to the root tweeter with one virtual cascade edge (see Fig. 2 (c)). In what follows, we work with cascades that contain virtual edges, therefore every retweeter is included in the cascade.
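As an illustration of the reconstruction described above, the following sketch (our own simplification, not code from the paper; the data layout and names such as followees are assumptions) builds the possible cascade edges of one root tweet and falls back to a virtual edge from the root tweeter whenever a retweeter has no earlier holder of the tweet among her followees:

def build_cascade(root_user, retweets, followees):
    """retweets: (user, timestamp) pairs for one root tweet;
    followees[u]: set of users that u follows."""
    edges = set()
    holders = {root_user}                      # users already holding the tweet
    for user, _ts in sorted(retweets, key=lambda p: p[1]):
        sources = followees.get(user, set()) & holders
        if sources:
            # every friend who retweeted earlier is a possible source (Fig. 2 (a))
            edges.update((s, user) for s in sources)
        else:
            # no observed source: virtual edge from the root tweeter (Fig. 2 (c))
            edges.add((root_user, user))
        holders.add(user)
    return edges

Processing the retweets in timestamp order means that each disconnected part of the observed cascade receives a single virtual edge through its earliest retweeter, which matches the virtual-edge variant described above.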
3.3   Examples of highly retweeted messages
   In Table 3, we give a few examples of highly retweeted messages, with the actual urls and names replaced by [url] and [name].

Table 3: Examples of some highly retweeted messages in the data set.
   message                                                      retweet counts
   @OWS Live #OWS We can do the same reducing burning
   of fossil fuels too !!                                             325
   Long Live The Peaceful Tea Party!! #gameon #college
   #twisters #ampat #sgp @OWS Live #ows #violence
   #stupid #liberal #usefulidiots #getaclue                           325
   @[user] we need our own banking system by the people
   for the people. #OccupyWallStreet and have the 99%
   put their money there                                              319
   The #NYPD officer who maced peaceful young women in
   the face got 10 vacation days docked. Not joking.
   [url] #ows                                                         143

4.   FEATURE ENGINEERING
   To train our models, we generate features for each root tweet in the data and then predict the future cascade size of the root tweet from these feature sets. For a given root tweet, we compute features about

   • the author user and her follower network (network features) and
   • the textual content of the tweet itself (content features).

Table 4 gives an overview of the feature templates used in our experiments.

4.1   Network Features
   We consider statistics about the user and her past cascades, as well as the influence and impressibility of her followers. We capture the influence and impressibility of a user from previously observed cascades by measuring the following quantities (a small computational sketch follows the list):

   • Number of tweets in different time frames: for a given root tweet that appeared at time t and a predefined time frame τ, we count the number of tweets generated by the corresponding user in the time interval [t − τ, t]. We set τ to 1, 6, 12, 24, 48 and 168 hours.
   • Average number of tweets in different time frames: we divide the number of tweets in a given time frame by τ.
   • User influence: for a given user, we compute the number of times one of her followers retweeted her, divided by the number of followers of the user.
   • User impressibility: for a given user, we compute the number of times she retweeted one of her followees, divided by the number of followees of the user.
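A minimal sketch of how these user statistics could be computed (our own illustration; the containers user_tweet_times, retweets_received, retweets_made, followers and followees are assumed to be precomputed and kept up to date):

import bisect

TIME_FRAMES = (1, 6, 12, 24, 48, 168)          # hours

def user_network_features(user, t, user_tweet_times, retweets_received,
                          retweets_made, followers, followees):
    """user_tweet_times[u]: sorted tweet timestamps of user u, in hours."""
    feats = {}
    times = user_tweet_times.get(user, [])
    for tau in TIME_FRAMES:
        # number of tweets of the user in [t - tau, t]
        n = bisect.bisect_right(times, t) - bisect.bisect_left(times, t - tau)
        feats['tweets_%dh' % tau] = n
        feats['tweets_%dh_avg' % tau] = n / tau
    n_followers = len(followers.get(user, ()))
    n_followees = len(followees.get(user, ()))
    # influence: retweets received from followers per follower
    feats['influence'] = retweets_received.get(user, 0) / n_followers if n_followers else 0.0
    # impressibility: retweets made of followees per followee
    feats['impressibility'] = retweets_made.get(user, 0) / n_followees if n_followees else 0.0
    return feats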
4.2   Content features
   The first step of content processing is text normalization. We converted the text to lower case, except for tokens that are fully upper cased, and replaced each token by its stem given by the Porter stemming algorithm. We replaced user mentions (starting with '@') and numbers by placeholder strings and removed punctuation marks.
   The content features are extracted from the normalized texts. The basic feature template in text analysis consists of the terms of the message. We used a simple whitespace tokenizer rather than a more sophisticated linguistic tokenizer, as previous studies reported its empirical advantage [19]. We employed unigrams, bigrams and trigrams of tokens, because longer phrases hurt the performance of the system in our preliminary experiments.
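The normalization step could look roughly as follows (a sketch under our own assumptions; any Porter stemmer implementation, such as the one in NLTK, can be plugged in, and the placeholder strings are illustrative):

import re
from nltk.stem.porter import PorterStemmer    # or any other Porter implementation

stemmer = PorterStemmer()

def normalize(text):
    out = []
    for tok in text.split():                   # simple whitespace tokenizer
        if tok.startswith('@'):
            out.append('[user]')               # placeholder for user mentions
        elif re.fullmatch(r'[0-9]+([.,][0-9]+)?', tok):
            out.append('[number]')             # placeholder for numbers
        else:
            tok = re.sub(r'[^\w#]', '', tok)   # drop punctuation marks
            if not tok:
                continue
            if tok.isupper():
                out.append(tok)                # keep fully upper-cased tokens
            else:
                out.append(stemmer.stem(tok.lower()))
    return out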
   Besides terms, we extracted the following features describing the orthography of the message:

   • Hashtags are used to mark specific topics; they can be appended after the tweet or placed inline in the content, marked by #. With hashtags, the user can hint at the topic categories of the tweet content, but too many hashtags can be irritating to the readers, as they only cause confusion.
   • Telephone number: if the tweet contains a telephone number, it is more likely to be spam or an advertisement.
   • Urls: the referred urls can lead the reader to text, sound and image information, such as media elements and journals, and thus they can attract interested readers. We distinguish between full and truncated urls. Truncated urls end with three dots; such a url was probably copied from another tweet's content, so somebody was already interested in it.
   • The like sign is an illustrator, an encouragement to others to share the tweet.
   • The presence of a question mark indicates uncertainty. On Twitter, questions are usually rhetorical—people do not seek answers on Twitter [19]. The author more likely wants to make the reader think about the message content.
   • The exclamation mark highlights a part of the tweet; it expresses emotions and opinions.
   • If numerical expressions are present, the facts are quantified and the tweet is more likely to have real information content. The actual values of the numbers were ignored.
   • Mentions: if a user is mentioned (referred to) in the tweet, the content of the tweet is probably connected to the mentioned user. It can have informal or private content.
   • Emoticons are short character sequences representing emotions. We clustered the emoticons into positive, negative and neutral categories.

   The last group of content features tries to capture the modality of the message:

   • Swear words influence the style and attractiveness of the tweet. The reaction to swearing can be ignoring the message or attacking back, neither of which is relevant in terms of retweet cascade size prediction. We extracted 458 swear words from http://www.youswear.com.
   • Weasel words and phrases² aim at creating the impression that a specific and/or meaningful statement has been made when in fact only a vague or ambiguous claim has been communicated. We used the weasel word lexicon of [27].
   • We employed the linguistic inquiry categories (LIWC) [25] of the tweets' words as well. These categories describe words from emotional, cognitive and structural points of view. For example, the word "ask" is in the Hear, Senses, Social and Present categories. Different LIWC categories can have different effects on the influence of the tweet in question.

² See http://en.wikipedia.org/wiki/Wikipedia:Embrace_weasel_words.

Table 4: Feature set.
   network        number of {followers, tweets, root tweets},
                  average {cascade size, root cascade size},
                  maximum {cascade size, root cascade size},
                  variance of {cascade sizes, root cascade sizes},
                  number of tweets generated within different time frames,
                  time average of the number of tweets in different time frames,
                  tweeter's influence and impressibility,
                  followers' average influence and impressibility
   terms          normalized unigrams, bigrams and trigrams
   orthographic   number of # with the values 0, 1, 2, ..., 4 or more than 4,
                  number of {like signs, ?, !, mentions},
                  number of full and truncated urls,
                  number of arabic numbers and phone numbers,
                  number of positive/negative/other emoticons
   modality       number of swear words and weasel phrases,
                  union of the inquiry categories of the words

4.3   N-grams
   Using all the content features, we built n-grams as consecutive sequences in the tweet text that may include simply three terms ("posted a photo"), @-mentions, hashtags, urls ("@OccupyPics Photo http://t.co/. . . " coded as [[user] Photo [url]]), numbers ("has [number] followers"), non-alphanumeric tokens ("right now !"), as well as markers for swear or weasel expressions ("[weasel word] people say"). We defined the following classes of n-grams, for n ≤ 3 (a small selection sketch follows the list):

   • Modality: the n-gram contains at least one swear or weasel word or expression (208,368 n-grams overall);
   • Orthographic: no swear or weasel word, but at least one orthographic term (2,751,935 overall);
   • Terms: n-grams formed only of terms, with no swear or weasel words and no orthographic features (771,196 overall).

   For efficiency, we selected the most frequent 1,000 n-grams from each class. The entire feature set hence consists of 3,000 trigrams.
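A compact sketch of how the n-gram classes and the per-class selection could be implemented (our own illustration; the marker tokens and the exact membership tests are assumptions based on the description above):

from collections import Counter

MODALITY_MARKERS = {'[swear word]', '[weasel word]'}
ORTHO_MARKERS = {'[user]', '[number]', '[url]', '!', '?'}   # hashtags handled below

def ngram_class(gram):
    if any(t in MODALITY_MARKERS for t in gram):
        return 'modality'
    if any(t in ORTHO_MARKERS or t.startswith('#') for t in gram):
        return 'orthographic'
    return 'terms'

def select_ngram_features(token_lists, per_class=1000, max_n=3):
    counts = {'modality': Counter(), 'orthographic': Counter(), 'terms': Counter()}
    for tokens in token_lists:                 # token_lists: normalized tweets
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                gram = tuple(tokens[i:i + n])
                counts[ngram_class(gram)][gram] += 1
    # keep the most frequent n-grams of each class
    return {cls: [g for g, _ in c.most_common(per_class)]
            for cls, c in counts.items()}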
5.   TEMPORAL TRAINING AND EVALUATION
   Here we describe the way we generate the training and test sets for the algorithms detailed in Section 6. First, for each root tweet we compute the corresponding network and content features. We create daily re-trained models: for a given day t, we train a model on all root tweets that were generated before t but appeared later than t − τ, where τ is the preset time frame. After training on the data before a given day, we compute our predictions for all root tweets that appeared on that day.
   In order to keep the features up to date, we recompute all network properties online, on the fly, and use the new values to give the predictions. By this method, we may immediately notice if a user starts gaining high attention or if a bursty event happens.
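The daily re-training loop can be sketched as follows (our own outline; make_model() stands for any estimator with a scikit-learn-style fit/predict interface, and the record field names are assumptions):

from datetime import timedelta

def daily_predictions(root_tweets, tau_days, make_model):
    """root_tweets: records with 'id', 'day', 'features' (numeric vector) and
    'cascade_size' (the target value known at training time)."""
    predictions = {}
    for day in sorted({rt['day'] for rt in root_tweets}):
        train = [rt for rt in root_tweets
                 if day - timedelta(days=tau_days) <= rt['day'] < day]
        test = [rt for rt in root_tweets if rt['day'] == day]
        if not train or not test:
            continue
        model = make_model()                              # fresh model every day
        model.fit([rt['features'] for rt in train],
                  [rt['cascade_size'] for rt in train])
        X_test = [rt['features'] for rt in test]
        for rt, pred in zip(test, model.predict(X_test)):
            predictions[rt['id']] = pred
    return predictions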




   We pay special attention to defining the values used for training and evaluation. For evaluation, we used the information until the end of the three-week data collection period, i.e. we used all the known tweets that belong to the given cascade. For training, however, we are only allowed to use and count the tweets up to the end of the training period. Since the testing period is longer, we linearly approximated the values for the remaining part of the testing period.
   Our goal is to predict the cascade size at the time when the root tweet is generated. One method we use is regression, which directly predicts the size of the retweet cascade. For regression, we only use the global error measures:

   • Mean Average Error (MAE);
   • Root Mean Squared Error (RMSE);
   • Root Relative Squared Error (RRSE).
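The paper does not spell out these measures; for reference, their standard definitions (with ŷ_i the predicted and y_i the true cascade size of the i-th test tweet, and ȳ the mean cascade size used as a naive baseline predictor) are:

\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_i - y_i\right|, \qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}, \qquad
\mathrm{RRSE} = \sqrt{\frac{\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}{\sum_{i=1}^{n}\left(\bar{y} - y_i\right)^2}}.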

   We also experiment with multiclass classification for ranges of the cascade size. The cascade size follows a power law distribution (see Fig. 3) and we defined three buckets: one with 0–10 retweeters (referred to as "low"), one with 11–100 ("medium") and a largest one with more than 100 ("high") retweeters participating in the cascade. We evaluate performance by AUC [13] averaged over the three classes. Note that AUC has a probabilistic interpretation: for the example of the "high" class, the value of the AUC is equal to the probability that a random highly retweeted message is ranked before a random non-highly retweeted one.

Figure 3: Cascade size distribution.

   By the probabilistic interpretation of AUC, we may realize that a classifier will perform well if it orders the users well, with little consideration of their individual messages. Since our goal is to predict the success of messages in time, and not the rather static user visibility and influence, we define new averaging schemes for predicting the success of individual messages.
   We consider the classification of the messages of a single user and define two aggregations of the individual AUC values. First, we simply average the AUC values of the users for each day (user average),

   AUC_{user} = \frac{1}{N} \sum_{i=1}^{N} AUC_i.                 (1)

Second, we weight the individual AUC values by the activity of the user (the number of tweets by the user on the actual day),

   AUC_{wuser} = \frac{\sum_{i=1}^{N} AUC_i \, T_i}{\sum_{i=1}^{N} T_i},    (2)

where T_i is the number of tweets by the i-th user.
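The two aggregations are straightforward to compute once the per-user AUC values are available; a minimal sketch (the container names are our own):

def user_level_auc(auc_by_user, tweets_by_user):
    """auc_by_user[u]: AUC over the messages of user u for the day;
    tweets_by_user[u]: number of tweets of u on that day (T_i in eq. (2))."""
    users = list(auc_by_user)
    auc_user = sum(auc_by_user[u] for u in users) / len(users)                   # eq. (1)
    total = sum(tweets_by_user[u] for u in users)
    auc_wuser = sum(auc_by_user[u] * tweets_by_user[u] for u in users) / total   # eq. (2)
    return auc_user, auc_wuser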
   We may also obtain regressors from the multiclass classification results. In order to make classification and regression comparable, we give a very simple transformation that replaces each class by a value that can be used as a regressor: we select and use the training set average value in each class as the ideal value for the prediction.
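A sketch of this transformation (our own illustration): each predicted class label is replaced by the average cascade size of that class in the training set.

from statistics import mean

def classes_to_regressor(train_sizes, train_labels, predicted_labels):
    # average cascade size per class, estimated on the training set
    class_mean = {c: mean(s for s, l in zip(train_sizes, train_labels) if l == c)
                  for c in set(train_labels)}
    return [class_mean[c] for c in predicted_labels]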
6.   RESULTS

Figure 4: Daily average AUC of classifiers trained with different sets of features, evaluated both as a global list (top) and as an average on the user level by equation (1), bottom. (Both panels plot AUC against the time frame size in days.)

   In this section, we train and evaluate first the classification and then the regression models to predict the future cascade size of tweets. We predict day by day, for each day in the testing period. For classification, we also evaluate on the user level by using equations (1) and (2). For classification, we also show the best performing features.
   As mentioned in Section 5, we may train our model with different τ. In Figure 4 we show the average AUC values obtained with different time frames. As Twitter trends change rapidly, we achieve the best average results if we train our algorithms on root tweets that were generated in the previous week (approximately seven days), both for the global and for the user level average evaluation.

6.1   Cascade size by multiclass classification
   First, we measure classifier performance by computing the average AUC values of the final results for the three size ranges. We were interested in how different classifiers perform and how different feature sets affect classifier performance. For this reason, we repeated our experiments with different feature subsets. Figure 5 shows our results. For each day, the network features give a strong baseline.

Figure 5: Daily average AUC of classifiers trained with different sets of features.

Table 5: Retweet size classification daily average performance of different feature sets. The ideal values are MAE=2.435, RMSE=15.94, RRSE=0.414.
                                Retweet range (AUC)     Weighted
   Features                    Low    Medium   High      Average    MAE     RMSE    RRSE
   network                     0.799  0.785    0.886      0.799     5.156   22.93   2.449
   network & modality          0.827  0.814    0.905      0.827     4.843   22.40   2.033
   network & orthographic      0.844  0.829    0.912      0.843     4.521   22.13   1.790
   network & terms             0.857  0.847    0.914      0.857     4.157   21.90   1.323
   network & all content       0.862  0.849    0.921      0.862     3.926   22.15   1.286



Table 6: Weighted average AUC over the low, medium and high retweet ranges for different classifiers. Note that the Multi-Layer Perceptron (MLP) did not terminate in 3 days for the large feature set.
   Weighted Average AUC       network   network & all content
   Random Forest               0.799            0.862
   Logistic Regression         0.605            0.689
   MLP                         0.783              n/a

Table 7: Retweet size regression daily average performance of different feature sets.
   Features                            MAE     RMSE    RRSE
   network, linear regression          3.225   14.30   0.909
   network, MLP                        3.015   14.91   0.716
   network, RepTree                    2.989   12.60   0.853
   network & modality, RepTree         3.099   13.86   0.867
   network & orthographic, RepTree     3.100   13.87   0.865
   network & terms, RepTree            3.090   13.86   0.868
   all, RepTree                        3.100   13.87   0.865
The combination of these features with the content results in a strong improvement in classifier performance. In Table 5 we summarize the average AUC values for the different feature subsets over all four datasets. Our results are consistent: in all cases, the content related features improve the performance. Finally, we give the performance of other classifiers in Table 6 and conclude the superiority of the Random Forest classifier [12]. We use the classifier implementations of Weka [29] and LibLinear [11].

6.2   Cascade size by regression
   We give regression results for the linear regression, multilayer perceptron and regression tree implementations of Weka [29] in Table 7. As seen when compared to the last three columns of Table 5, the regression methods outperform the multiclass classification results transformed to regressors. Note that for the transformation, we use class averages obtained from the training data. If, however, we could perfectly classify the three classes, the ideal error values would be MAE=2.435, RMSE=15.94, RRSE=0.414. We could not get close to the ideal values by regression either.

6.3   Cascade size on the user level
   Our main evaluation is found in Table 8, where we consider the user level average AUC values as described in Section 5. As expected, since the new evaluation metrics put more emphasis on distinguishing between the tweets of the same user, we see an even stronger gain from the modality and orthographic features.
Table 8: Retweet size classification daily average performance (AUC) of different feature sets evaluated on the user level as defined in equations (1) and (2).
                                   Low                  Medium                High                 Average
   Features                 Uniform  Weighted    Uniform  Weighted    Uniform  Weighted    Uniform  Weighted
   network                   0.684    0.712       0.752    0.800       0.746    0.796       0.719    0.756
   network & modality        0.700    0.722       0.751    0.796       0.737    0.756       0.726    0.757
   network & orthographic    0.702    0.731       0.753    0.797       0.768    0.782       0.730    0.764
   network & terms           0.705    0.732       0.757    0.800       0.767    0.786       0.733    0.766
   network & all content     0.740    0.783       0.763    0.812       0.769    0.820       0.752    0.797


6.4   Feature contribution analysis
   We selected the most important network features by running a LogitBoost classifier [14]. The best features all characterize the network. We list the first five, in order of importance:

   1. the number of followers of the root tweet user;
   2. the average cascade size of previous root tweets by the user;
   3. the number of root tweets of the user so far (retweets excluded);
   4. the average cascade size of previous tweets (including retweets) by the user;
   5. the number of tweets of the user so far.
6.5   Content feature contribution analysis
   We selected the most important content features by running logistic regression over the 3,000 trigrams described in Section 4.3. The selected features are complex expressions containing elements from the three major groups of linguistic feature sets; we list them in decreasing order of the absolute weight obtained by logistic regression:

   1. the three words [marriage between democracy], in this order;
   2. [at [hashtag occupywallstreet] [url]]: the word "at", followed by the hashtag "#occupywallstreet", and a url;
   3. [between democracy and];
   4. [capitalism is over];
   5. [[hashtag ows] pls];
   6. [[weasel word] marriage between]: the expression "marriage between" on the weasel word list, which counts as the third element of the trigram;
   7. [[hashtag zizek] at [hashtag occupywallstreet]];
   8. [[hashtag occupywallstreet] [url] [hashtag auspol]];
   9. [over [hashtag zizek] at];
  10. [calientan la]: means "heating up".

   Note that all these features have negative weight for the upper two classes and positive or close to zero weight for the lower class. Hence the appearance of these trigrams decreases the value obtained by the network feature based model. We may conclude that the use of weasel words and uninformative phrases reduces the chance of getting retweeted, as opposed to the sample highly retweeted messages in Table 3.

6.6   Frozen network features
   To illustrate the importance of the temporal training and evaluation framework and of the online update of the network features, we made an experiment in which we replaced the user features by static ones. The results are summarized in Table 9. Note that on the user level, all messages of a given user then have the same network features and hence classification is random with AUC=0.5. In contrast, the online updated network features are already capable of distinguishing between the messages of the same user, as seen in Tables 5 and 7.

Table 9: Retweet size classification with fixed user network features.
                                             Retweet range (AUC)     Weighted
   Features                                Low     Medium    High     Average
   static network                          0.798   0.779     0.868     0.797
   static network & all content            0.854   0.804     0.932     0.851
   static network per user                 0.5     0.5       0.5       0.5
   static network & all content per user   0.798   0.784     0.935     0.798

7.   CONCLUSIONS
   In this paper we investigated the possibility of predicting the future popularity of a recently appeared text message in Twitter's social networking system. Besides the typical user and network related features, we considered hashtag and linguistic analysis based ones as well. Our results not only confirm the possibility of predicting the future popularity of a tweet, but also indicate that deep content analysis is important to improve the quality of the prediction.
   In our experiments, we give high importance to the temporal aspects of the prediction: we predict immediately after the message is published, and we also evaluate on the user level. We consider user level evaluation key in temporal analysis, since the influence and popularity of a given user is relatively stable, while the retweet count of her particular messages may vary greatly in time. On the user level, we observe the importance of the linguistic elements of the content.

Acknowledgments
We thank Andreas Kaltenbrunner for providing us with the Twitter data set [1].

8.   REFERENCES
 [1] P. Aragón, K. E. Kappler, A. Kaltenbrunner, D. Laniado, and Y. Volkovich. Communication dynamics in Twitter during political campaigns: The case of the 2011 Spanish national election. Policy & Internet, 5(2):183–206, 2013.
 [2] E. Bakshy, D. Eckles, R. Yan, and I. Rosenn. Social influence in social advertising: evidence from field experiments. In Proceedings of the 13th ACM Conference on Electronic Commerce, pages 146–161. ACM, 2012.
 [3] E. Bakshy, J. M. Hofman, W. A. Mason, and D. J. Watts. Everyone's an influencer: quantifying influence on Twitter. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pages 65–74. ACM, 2011.
 [4] E. Bakshy, J. M. Hofman, W. A. Mason, and D. J. Watts. Identifying influencers on Twitter. In Fourth ACM International Conference on Web Search and Data Mining (WSDM), 2011.
 [5] E. Bakshy, B. Karrer, and L. A. Adamic. Social influence and the diffusion of user-created content. In Proceedings of the 10th ACM Conference on Electronic Commerce, pages 325–334. ACM, 2009.
 [6] E. Bakshy, I. Rosenn, C. Marlow, and L. Adamic. The role of social networks in information diffusion. In Proceedings of the 21st International Conference on World Wide Web, pages 519–528. ACM, 2012.
 [7] M. S. Bernstein, E. Bakshy, M. Burke, and B. Karrer. Quantifying the invisible audience in social networks. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 21–30. ACM, 2013.
 [8] D. Boyd, S. Golder, and G. Lotan. Tweet, tweet, retweet: Conversational aspects of retweeting on Twitter. In System Sciences (HICSS), 2010 43rd Hawaii International Conference on, pages 1–10. IEEE, 2010.
 [9] M. Cha, H. Haddadi, F. Benevenuto, and K. Gummadi. Measuring user influence in Twitter: The million follower fallacy. In 4th International AAAI Conference on Weblogs and Social Media (ICWSM), 2010.
[10] J. Cheng, L. Adamic, P. A. Dow, J. M. Kleinberg, and J. Leskovec. Can cascades be predicted? In Proceedings of the 23rd International Conference on World Wide Web, pages 925–936. International World Wide Web Conferences Steering Committee, 2014.
[11] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. The Journal of Machine Learning Research, 9:1871–1874, 2008.
[12] FastRandomForest. Re-implementation of the random forest classifier for the Weka environment. http://code.google.com/p/fast-random-forest/.
[13] J. Fogarty, R. S. Baker, and S. E. Hudson. Case studies in the use of ROC curve analysis for sensor-based estimates in human computer interaction. In Proceedings of Graphics Interface 2005, GI '05, pages 129–136, School of Computer Science, University of Waterloo, Waterloo, Ontario, Canada, 2005. Canadian Human-Computer Communications Society.
[14] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: A statistical view of boosting. Annals of Statistics, pages 337–374, 2000.
[15] R. Ghosh and K. Lerman. Predicting influential users in online social networks. arXiv preprint arXiv:1005.4882, 2010.
[16] V. Gómez, H. J. Kappen, and A. Kaltenbrunner. Modeling the structure and evolution of discussion cascades. In Proceedings of the 22nd ACM Conference on Hypertext and Hypermedia, pages 181–190. ACM, 2011.
[17] V. Gómez, H. J. Kappen, N. Litvak, and A. Kaltenbrunner. A likelihood-based framework for the analysis of discussion threads. World Wide Web, pages 1–31, 2012.
[18] A. Gupta, P. Kumaraguru, C. Castillo, and P. Meier. TweetCred: Real-time credibility assessment of content on Twitter. In Social Informatics, volume 8851 of Lecture Notes in Computer Science, pages 228–243. 2014.
[19] V. Hangya and R. Farkas. Filtering and polarity detection for reputation management on tweets. In Working Notes of CLEF 2013 Evaluation Labs and Workshop, 2013.
[20] L. Hong, O. Dan, and B. D. Davison. Predicting popular messages in Twitter. In Proceedings of the 20th International Conference Companion on World Wide Web, WWW '11, pages 57–58, New York, NY, USA, 2011. ACM.
[21] H. Kwak, C. Lee, H. Park, and S. Moon. What is Twitter, a social network or a news media? In Proceedings of the 19th International Conference on World Wide Web, pages 591–600. ACM, 2010.
[22] K. Lerman and R. Ghosh. Information contagion: An empirical study of the spread of news on Digg and Twitter social networks. In Proceedings of the 4th International Conference on Weblogs and Social Media (ICWSM), 2010.
[23] N. Naveed, T. Gottron, J. Kunegis, and A. C. Alhadi. Bad news travel fast: A content-based analysis of interestingness on Twitter. In Proceedings of the 3rd International Web Science Conference, WebSci '11. ACM, 2011.
[24] R. Palovics, B. Daroczy, and A. Benczur. Temporal prediction of retweet count. In Cognitive Infocommunications (CogInfoCom), 2013 IEEE 4th International Conference on, pages 267–270. IEEE, 2013.
[25] J. Pennebaker, C. Chung, M. Ireland, A. Gonzales, and R. Booth. The development and psychometric properties of LIWC2007. Technical report, University of Texas at Austin, 2007.
[26] S. Petrovic, M. Osborne, and V. Lavrenko. RT to win! Predicting message propagation in Twitter. In ICWSM, 2011.
[27] Gy. Szarvas, V. Vincze, R. Farkas, Gy. Móra, and I. Gurevych. Cross-genre and cross-domain detection of semantic uncertainty. Computational Linguistics, 38(2):335–367, 2012.
[28] A. Wang, T. Chen, and M.-Y. Kan. Re-tweeting from a linguistic perspective. In Proceedings of the Second Workshop on Language in Social Media, pages 46–55, 2012.
[29] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann Series in Data Management Systems. Morgan Kaufmann, second edition, June 2005.