=Paper=
{{Paper
|id=Vol-1542/paper1
|storemode=property
|title=Predicting User-specific Temporal Retweet Count Based on Network and Content Information
|pdfUrl=https://ceur-ws.org/Vol-1542/paper1.pdf
|volume=Vol-1542
|authors=Balint Daroczy,Robert Palovics,Vilmos Wieszner,Richard Farkas,Andras A. Benczur
|dblpUrl=https://dblp.org/rec/conf/recsys/DaroczyPWFB15
}}
==Predicting User-specific Temporal Retweet Count Based on Network and Content Information==
ish Digg) and Wikipedia. They propose models for cascade growth and estimate model parameters but give no size predictions. From the information spread point of view, a number of related studies have a largely descriptive focus, unlike our quantitative prediction goals. In [9], high correlation is observed between indegree, retweet and mention influence, while outdegree (the number of tweets sent by the user) is found to be heavily spammed. [21] reports similar findings on the relation among follower, mention and retweet influence. Several more results describe the specific means of information spread on Facebook [6, 2, 7].

There is only a limited amount of related work on retweet count prediction. Cheng et al. [10] predict retweet count based on network features. Unlike our result, where we predict immediately after the tweet is published, they consider prediction after the first few retweets. The network features used in their work are similar to the ones in the present paper and in our earlier work [24]. The main contribution of this work is the investigation of content-based features and the interaction between network and content features.

Petrovic et al. [26] predict whether a tweet will be retweeted at all, and give no evaluation on distinguishing between the messages of the same user. As another result very similar to the previous one, [20] give batch evaluation, for all users and the entire time range. They also use logistic regression; their features include tf.idf and an LDA based topic model. Similar to us, they classify for ranges of retweet counts; however, they mention that their accuracy is very low for the mid-range.

Figure 1: Temporal density of tweeting activity.

Table 1: Size of the tweet time series.
Number of users: 371,401
Number of tweets: 1,947,234
Number of retweets: 1,272,443

Table 2: Size of the follower network.
Number of users: 330,677
Number of edges: 16,585,837
Average in/out degree: 37
We include logistic regression among other classifiers as baseline methods in our work.

From the content analysis point of view, Bakshy et al. [3, 4] investigate bit.ly urls but find little connection between influence and url content, unlike in our experiments, where message content elements prove to be valuable for predicting influence. There have been several studies focusing exclusively on the analysis of the tweet message textual content to solve the retweet count prediction problem. Besides the terms of the message, Naveed et al. [23] introduced features of direct message, mention, hashtag, url, exclamation mark, question mark, positive and negative sentiment, positive and negative emoticons, and valence, arousal and dominance lexicon features. Wang et al. [28] proposed deeper linguistic features like verb tense, named entities, discourse relations and sentence similarity. Similar to [26], neither of these results attempts to distinguish between the tweets of the same user.

Regarding the idea of combining author, network and content information, our work is related to Gupta et al. [18], who used these sources of information jointly for scoring tweets according to their credibility. Although credibility is related to social influence, the prediction of the credibility and of the size of the retweet cascade of a message requires different background information. Hence, we employ different network and content features.

2. DATA SET

The dataset was collected by Aragón et al. [1] using the Twitter API, which we extended by a crawl of the user network. Our data set hence consists of two parts:

• Tweet dataset: tweet text and user metadata on the Occupy Wall Street movement¹.
• Follower network: the list of followers of users who posted at least one message in the tweet dataset.

¹ http://en.wikipedia.org/wiki/Occupy_Wall_Street

Table 1 shows the number of users and tweets in the dataset. One can see that a large part of the collected tweets are retweets. Table 2 contains the size of the crawled social network. Note that the average in- and outdegree is relatively high. Fig. 1 shows the temporal density of tweeting activity.

For each tweet, our data contains
• tweet and user ID,
• timestamp of creation,
• hashtags used in the tweet, and
• the tweet text content.

In case of a retweet, we have all this information not only on the actual tweet but also on the original root tweet that had been retweeted. We define the root tweet as the first occurrence of a given tweet.

3. RETWEET CASCADES

3.1 Constructing retweet cascades

In case of a retweet, the Twitter API provides us with the ID of the original tweet. By collecting retweets for a given original tweet ID, we may obtain the set of users who have retweeted a given tweet, with the corresponding retweet timestamps. The Twitter API, however, does not tell us the actual path of the cascade if the original tweet was retweeted several times. The information from the Twitter API on the tweet needs to be combined with the follower network to reconstruct the possible information pathways for a given tweet. However, it can happen that for a given retweeter, more than one friend has retweeted the corresponding tweet before, and hence we do not know the exact information source of the retweeter. The retweet ambiguity problem is well described in [3]. In what follows, we consider all friends as possible information sources. In other words, for a given tweet we consider all directed edges of the follower network in which information flow could occur (see Fig. 2 (a)).

Figure 2: Creation of retweet cascades. Figure (a) shows the computation of the cascade edges. In Figures (b) and (c) we show the possible solutions in case of missing cascade edges.

3.2 Restoring missing cascade edges

For a given tweet, the computed edges define a retweet cascade. However, our dataset contains only a sample of the tweets on the given hashtags and hence may not be complete: it can happen that a few intermediate retweeters are missing from our data. As a result, the reconstructed cascade graphs are sometimes disconnected. As detailed in Fig. 2 (b) and (c), we handle this problem in two different ways. One possible solution is to consider only the first connected component of the cascade (see Fig. 2 (b)). Another one is to connect each disconnected part to the root tweeter with one virtual cascade edge (see Fig. 2 (c)). In what follows, we work with cascades that contain virtual edges, therefore every retweeter is included in the cascade.

3.3 Examples of highly retweeted messages

In Table 3, we give a few examples of highly retweeted messages, with the actual urls and names replaced by [url] and [name].

Table 3: Examples of some highly retweeted messages in the data set.
message | retweet count
"@OWS Live #OWS We can do the same reducing burning of fossil fuels too !!" | 325
"Long Live The Peaceful Tea Party!! #gameon #college #twisters #ampat #sgp @OWS Live #ows #violence #stupid #liberal #usefulidiots #getaclue" | 325
"@[user] we need our own banking system by the people for the people. #OccupyWallStreet and have the 99% put their money there" | 319
"The #NYPD officer who maced peaceful young women in the face got 10 vacation days docked. Not joking. [url] #ows" | 143

4. FEATURE ENGINEERING

To train our models, we generate features for each root tweet in the data and then predict the future cascade size of the root tweet from these feature sets. For a given root tweet, we compute features about
• the author user and her follower network (network features) and
• the textual content of the tweet itself (content features).

Table 4 gives an overview of the feature templates used in our experiments.

4.1 Network Features

We consider statistics about the user and her cascades in the past, as well as the influence and impressibility of her followers. We capture the influence and impressibility of a user from previously observed cascades by measuring the following quantities:

• Number of tweets in different time frames: for a given root tweet that appeared at time t and a predefined time frame τ, we count the number of tweets generated by the corresponding user in the time interval [t − τ, t]. We set τ to 1, 6, 12, 24, 48 and 168 hours.
• Average number of tweets in different time frames: we divide the number of tweets in a given time frame by τ.
• User influence: for a given user, we compute the number of times one of her followers retweeted her, divided by the number of followers of the user.
• User impressibility: for a given user, we compute the number of times she retweeted one of her followees, divided by the number of followees of the user.

4.2 Content features

The first step of content processing is text normalization. We converted the text into lower case, except tokens that are fully upper-cased, and replaced each token by its stem given by the Porter stemming algorithm. We replaced user mentions (starting with '@') and numbers by placeholder strings and removed the punctuation marks.

The content features are extracted from the normalized texts. The basic feature template in text analysis consists of the terms of the message. We used a simple whitespace tokenizer rather than a more sophisticated linguistic tokenizer, as previous studies reported its empirical advantage [19]. We employed unigrams, bigrams and trigrams of tokens, because longer phrases just hurt the performance of the system in our preliminary experiments.

Besides terms, we extracted the following features describing the orthography of the message:

• Hashtags are used to mark specific topics; they can be appended after the tweet or appear inline in the content, marked by #. From the counts of hashtags, the user can guess the topic categories of the tweet content, but too many hashtags can be irritating to the readers as they just cause confusion.
• Telephone number: if the tweet contains a telephone number, it is more likely to be spam or an ad.
• Urls: the referred urls can navigate the reader to text, sound and image information, like media elements and journals, and thus can attract interested readers. We distinguish between full and truncated urls. A truncated url ends with three dots; it was probably copied from the content of another tweet, so somebody was interested in it.
• The like sign is an illustrator, an encouragement to others to share the tweet.
• The presence of a question mark indicates uncertainty. In Twitter, questions are usually rhetorical: people do not seek answers on Twitter [19]. The author more likely wants to make the reader think about the message content.
• The exclamation mark highlights a part of the tweet; it expresses emotions and opinions.
• If numerical expressions are present, the facts are quantified, and then the tweet is more likely to have real information content. The actual value of the numbers was ignored.

Table 4: Feature set.
network: number of {followers, tweets, root tweets}; average {cascade size, root cascade size}; maximum {cascade size, root cascade size}; variance of {cascade sizes, root cascade sizes}
time: number of tweets generated in different time frames; average of the number of tweets in different time frames; tweeter's influence and impressibility; followers' average influence and impressibility
terms: normalized unigrams, bigrams and trigrams
orthographic: number of # with the values 0, 1, 2, ..., 4 or more than 4; number of {like signs, ?, !, mentions}; number of full and truncated urls; number of arabic numbers and phone numbers; number of positive/negative/other emoticons
modality: number of swear words and weasel phrases; union of the inquiry categories of the words
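As a minimal, illustrative sketch of the per-user quantities of Section 4.1 (the function names and the flat data layout are our own assumptions, not the paper's implementation):

```python
# Hedged sketch of the Section 4.1 network features. Timestamps are
# assumed to be plain integers in seconds; the retweet counters are
# assumed to be precomputed from previously observed cascades.

def tweets_in_time_frame(timestamps, t, tau_hours):
    """Number of tweets by the user in the interval [t - tau, t]."""
    tau = tau_hours * 3600
    return sum(1 for ts in timestamps if t - tau <= ts <= t)

def average_tweets_in_time_frame(timestamps, t, tau_hours):
    """The count above divided by the frame length tau (per hour here)."""
    return tweets_in_time_frame(timestamps, t, tau_hours) / tau_hours

def user_influence(retweets_by_followers, num_followers):
    """Times a follower retweeted the user, divided by follower count."""
    return retweets_by_followers / num_followers if num_followers else 0.0

def user_impressibility(retweets_of_followees, num_followees):
    """Times the user retweeted a followee, divided by followee count."""
    return retweets_of_followees / num_followees if num_followees else 0.0
```

With τ set to 1, 6, 12, 24, 48 and 168 hours as in the paper, each root tweet would thus receive six count features and six averaged ones, plus the two ratio features of its author.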
• Mentions: if a user is mentioned (referred to) in the tweet, the content of the tweet is probably connected to the mentioned user. It can have informal or private content.
• Emoticons are short character sequences representing emotions. We clustered the emoticons into positive, negative and neutral categories.

The last group of content features tries to capture the modality of the message:

• Swear words influence the style and attractiveness of the tweet. The reaction to swearing can be ignorance or counter-attack, which is not relevant in terms of retweet cascade size prediction. We extracted 458 swear words from http://www.youswear.com.
• Weasel words and phrases² aim at creating the impression that a specific and/or meaningful statement has been made, when in fact only a vague or ambiguous claim has been communicated. We used the weasel word lexicon of [27].
• We employed the linguistic inquiry categories (LIWC) [25] of the tweets' words as well. These categories describe words from emotional, cognitive and structural points of view. For example, the word "ask" is in the Hear, Senses, Social and Present categories. Different LIWC categories can have different effects on the influence of the tweet in question.

² See http://en.wikipedia.org/wiki/Wikipedia:Embrace_weasel_words.

4.3 N-grams

By using all the content features, we built n-grams as consecutive sequences in the tweet text that may include simply three terms ("posted a photo"), @-mentions, hashtags, urls ("@OccupyPics Photo http://t.co/. . . " coded as [[user] Photo [url]]), numbers ("has [number] followers"), non-alphanumeric tokens ("right now !"), as well as markers for swear or weasel expressions ("[weasel word] people say"). We defined the following classes of n-grams, for n ≤ 3:

• Modality: the n-gram contains at least one swear or weasel word or expression (overall 208,368);
• Orthographic: no swear or weasel word, but at least one orthographic term (overall 2,751,935);
• Terms: n-grams formed only of terms, with no swear or weasel words and no orthographic features (overall 771,196).

For efficiency, we selected the most frequent 1,000 n-grams from each class. The entire feature set hence consists of 3,000 trigrams.

5. TEMPORAL TRAINING AND EVALUATION

Here we describe the way we generate training and test sets for our algorithms detailed in Section 6. First, for each root tweet we compute the corresponding network and content features. We create daily re-trained models: for a given day t, we train a model on all root tweets that have been generated before t but appeared later than t − τ, where τ is the preset time frame. After training on the data before a given day, we compute our predictions for all root tweets that appeared on that day.

In order to keep the features up to date, we recompute all network properties online, on the fly, and use the new values to give predictions. By this method, we may immediately notice if a user starts gaining high attention or if a bursty event happens.

We pay special attention to defining the values used for training and evaluation. For evaluation, we used the information till the end of the three-week data set collection period, i.e. we used all the known tweets that belong to the given cascade. However, for training, we are only allowed to use and count the tweets up to the end of the training period. Since the testing period is longer, we linearly approximated the values for the remaining part of the testing period.

Our goal is to predict the cascade size at the time when the root tweet is generated. One method we use is regression, which directly predicts the size of the retweet cascade. For regression, we only use the global error measures:

• Mean Absolute Error (MAE);
• Root Mean Squared Error (RMSE);
• Root Relative Squared Error (RRSE).

We also experiment with multiclass classification for ranges of the cascade size. The cascade size follows a power law distribution (see Fig. 3) and we defined three buckets: one with 0–10 retweeters participating in the cascade (referred to as "low"), one with 11–100 ("medium") and a largest one with more than 100 ("high"). We evaluate performance by AUC [13] averaged over the three classes. Note that AUC has a probabilistic interpretation: for the example of the "high" class, the value of the AUC is equal to the probability that a random highly retweeted message is ranked before a random non-highly retweeted one.

Figure 3: Cascade size distribution.

By the probabilistic interpretation of AUC, we may realize that a classifier will perform well if it orders the users well, with little consideration of their individual messages. Since our goal is to predict the messages in time, and not the rather static user visibility and influence, we define new averaging schemes for predicting the success of individual messages. We consider the classification of the messages of a single user and define two aggregations of the individual AUC values. First, we simply average the AUC values of the users for each day (user average):

    AUC_user = (1/N) · Σ_{i=1..N} AUC_i    (1)

Second, we weight the individual AUC values with the activity of the user (the number of tweets by the user for the actual day):

    AUC_wuser = ( Σ_{i=1..N} AUC_i · T_i ) / ( Σ_{i=1..N} T_i )    (2)

where T_i is the number of tweets by the i-th user.

We may also obtain regressors from the multiclass classification results. In order to make classification and regression comparable, we give a very simple transformation that replaces each class by a value that can be used as a regressor: we select and use the training set average value in each class as the ideal value for the prediction.

6. RESULTS

In this section, we train and evaluate first the classification and then the regression models to predict the future cascade size of tweets. We predict day by day, for each day in the testing period. For classification, we also evaluate on the user level by using equations (1) and (2). For classification, we show the best performing features as well. As mentioned in Section 5, we may train our model with different τ. In Figure 4 we show the average AUC values for different time frames. As Twitter trends change rapidly, we achieve the best average results if we train our algorithms on root tweets that were generated in the previous week (approximately seven days), both for the global and for the user level average evaluation.

Figure 4: Daily average AUC of classifiers trained with different sets of features, evaluated both as a global list (top) and as an average on the user level by equation (1), bottom.

6.1 Cascade size by multiclass classification

First, we measure classifier performance by computing the average AUC values of the final results for the three size ranges. We were interested in how different classifiers perform and how different feature sets affect classifier performance. For this reason, we repeated our experiments with different feature subsets. Figure 5 shows our results. For each day, the network features give a strong baseline.

Figure 5: Daily average AUC of classifiers trained with different sets of features.
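As a minimal sketch of the user-level aggregations of Section 5, equations (1) and (2), assuming the per-user AUC values AUC_i and daily tweet counts T_i are already available:

```python
# Sketch of the user-level AUC aggregation schemes of equations (1)-(2).
# Inputs are assumed precomputed: aucs[i] is AUC_i of the i-th user,
# tweet_counts[i] is T_i, that user's tweet count for the day.

def auc_user(aucs):
    """Equation (1): uniform average, AUC_user = (1/N) * sum_i AUC_i."""
    return sum(aucs) / len(aucs)

def auc_weighted_user(aucs, tweet_counts):
    """Equation (2): activity-weighted average,
    AUC_wuser = sum_i AUC_i * T_i / sum_i T_i."""
    weighted = sum(a * t for a, t in zip(aucs, tweet_counts))
    return weighted / sum(tweet_counts)
```

A user who tweets more on a given day thus contributes proportionally more to the weighted variant, which is what makes equation (2) emphasize active users.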
Table 5: Retweet size classification daily average performance of different feature sets. The first three columns are AUC values per retweet range. The ideal values are MAE=2.435, RMSE=15.94, RRSE=0.414.
Features | Low | Medium | High | Weighted Average | MAE | RMSE | RRSE
network | 0.799 | 0.785 | 0.886 | 0.799 | 5.156 | 22.93 | 2.449
network & modality | 0.827 | 0.814 | 0.905 | 0.827 | 4.843 | 22.40 | 2.033
network & orthographic | 0.844 | 0.829 | 0.912 | 0.843 | 4.521 | 22.13 | 1.790
network & terms | 0.857 | 0.847 | 0.914 | 0.857 | 4.157 | 21.90 | 1.323
network & all content | 0.862 | 0.849 | 0.921 | 0.862 | 3.926 | 22.15 | 1.286

The combination of these features with the content results in a strong improvement in classifier performance. In Table 5 we summarize the average AUC values for different feature subsets over all four datasets. Our results are consistent: in all cases, the content related features improve the performance. Finally, we give the performance of the other classifiers in Table 6 and conclude the superiority of the Random Forest classifier [12]. We use the classifier implementations of Weka [29] and LibLinear [11].

Table 6: Weighted average AUC over the low, medium and high retweet ranges of different classifiers. Note that the Multi-Layer Perceptron (MLP) did not terminate in 3 days for the large feature set.
Classifier | network | network & all content
Random Forest | 0.799 | 0.862
Logistic Regression | 0.605 | 0.689
MLP | 0.783 | n/a

6.2 Cascade size by regression

We give regression results by the linear regression, multilayer perceptron and regression tree implementations of Weka [29] in Table 7. As seen when compared to the last three columns of Table 5, regression methods outperform the multiclass classification results transformed to regressors. Note that for the transformation, we use class averages obtained from the training data. If, however, we could perfectly classify the three classes, the ideal error values would be MAE=2.435, RMSE=15.94, RRSE=0.414. We could not reach close to the ideal values by regression either.

Table 7: Retweet size regression daily average performance of different feature sets.
Features | MAE | RMSE | RRSE
network, linear regression | 3.225 | 14.30 | 0.909
network, MLP | 3.015 | 14.91 | 0.716
network, RepTree | 2.989 | 12.60 | 0.853
network & modality, RepTree | 3.099 | 13.86 | 0.867
network & orthographic, RepTree | 3.100 | 13.87 | 0.865
network & terms, RepTree | 3.090 | 13.86 | 0.868
all, RepTree | 3.100 | 13.87 | 0.865

6.3 Cascade size on the user level

Our main evaluation is found in Table 8, where we consider the user level average AUC values as described in Section 5. As expected, since the new evaluation metrics give more emphasis on distinguishing between the tweets of the same user, we see an even stronger gain of the modality and orthographic features.

Table 8: Retweet size classification daily average performance of different feature sets evaluated on the user level as defined in equations (1) and (2). Each retweet range reports a Uniform / Weighted AUC pair.
Features | Low (Uniform / Weighted) | Medium (Uniform / Weighted) | High (Uniform / Weighted) | Average (Uniform / Weighted)
network AUC | 0.684 / 0.712 | 0.752 / 0.800 | 0.746 / 0.796 | 0.719 / 0.756
network & modality AUC | 0.700 / 0.722 | 0.751 / 0.796 | 0.737 / 0.756 | 0.726 / 0.757
network & orthographic AUC | 0.702 / 0.731 | 0.753 / 0.797 | 0.768 / 0.782 | 0.730 / 0.764
network & terms AUC | 0.705 / 0.732 | 0.757 / 0.800 | 0.767 / 0.786 | 0.733 / 0.766
network & all content AUC | 0.740 / 0.783 | 0.763 / 0.812 | 0.769 / 0.820 | 0.752 / 0.797

6.4 Feature contribution analysis

We selected the most important network features by running a LogitBoost classifier [14]. The best features all characterize the network. We list the first five, in the order of importance:

1. The number of followers of the root tweet user;
2. The average cascade size of previous root tweets by the user;
3. The number of root tweets of the user so far (retweets excluded);
4. The average cascade size of previous tweets (including retweets) by the user;
5. The number of tweets of the user so far.

6.5 Content feature contribution analysis

We selected the most important content features by running logistic regression over the 3,000 trigrams described in Section 4.3. The features are complex expressions containing elements from the three major groups of linguistic feature sets, in the following order of absolute weight obtained by logistic regression:

1. Three words [marriage between democracy], in this order;
2. [at [hashtag occupywallstreet] [url]]: the word "at", followed by the hashtag "#occupywallstreet", and a url;
3. [between democracy and];
4. [capitalism is over];
5. [[hashtag ows] pls];
6. [[weasel word] marriage between]: the expression "marriage between" on the weasel word list, which counts as the third element of the trigram;
7. [[hashtag zizek] at [hashtag occupywallstreet]];
8. [[hashtag occupywallstreet] [url] [hashtag auspol]];
9. [over [hashtag zizek] at];
10. [calientan la]: means "heating up".

Note that all these features have negative weight for the upper two classes and positive or close to 0 for the lower class. Hence the appearance of these trigrams decreases the value obtained by the network feature based model. We may conclude that the use of weasel words and uninformative phrases reduces the chance of getting retweeted, as opposed to the sample highly retweeted messages in Table 3.

6.6 Frozen network features

To illustrate the importance of the temporal training and evaluation framework and of the online update of the network features, we made an experiment where we replaced the user features by static ones. The results are summarized in Table 9. Note that on the user level, all messages of a user will have the same network features and hence classification will be random with AUC=0.5. In contrast, online updated network features are already capable of distinguishing between the messages of the same user, as seen in Tables 5 and 7.

Table 9: Retweet size classification with fixed user network features.
Features | Low | Medium | High | Weighted Average
static network | 0.798 | 0.779 | 0.868 | 0.797
static network & all content | 0.854 | 0.804 | 0.932 | 0.851
static network per user | 0.5 | 0.5 | 0.5 | 0.5
static network & all content per user | 0.798 | 0.784 | 0.935 | 0.798

7. CONCLUSIONS

In this paper we investigated the possibility of predicting the future popularity of a recently appeared text message in Twitter's social networking system. Besides the typical user and network related features, we considered hashtag and linguistic analysis based ones as well. Our results not only confirm the possibility of predicting the future popularity of a tweet, but also indicate that deep content analysis is important to improve the quality of the prediction.

In our experiments, we give high importance to the temporal aspects of the prediction: we predict immediately after the message is published, and we also evaluate on the user level. We consider user level evaluation key in temporal analysis, since the influence and popularity of a given user are relatively stable, while the retweet counts of her particular messages may greatly vary in time. On the user level, we observe the importance of the linguistic elements of the content.

Acknowledgments

We thank Andreas Kaltenbrunner for providing us with the Twitter data set [1].

8. REFERENCES

[1] P. Aragón, K. E. Kappler, A. Kaltenbrunner, D. Laniado, and Y. Volkovich. Communication dynamics in twitter during political campaigns: The case of the 2011 spanish national election. Policy & Internet, 5(2):183–206, 2013.
[2] E. Bakshy, D. Eckles, R. Yan, and I. Rosenn. Social influence in social advertising: evidence from field experiments. In Proceedings of the 13th ACM Conference on Electronic Commerce, pages 146–161. ACM, 2012.
[3] E. Bakshy, J. M. Hofman, W. A. Mason, and D. J. Watts. Everyone's an influencer: quantifying influence on twitter. In Proceedings of the fourth ACM international conference on Web search and data mining, pages 65–74. ACM, 2011.
[4] E. Bakshy, J. M. Hofman, W. A. Mason, and D. J. Watts. Identifying influencers on twitter. In Fourth ACM International Conference on Web Search and Data Mining (WSDM), 2011.
[5] E. Bakshy, B. Karrer, and L. A. Adamic. Social influence and the diffusion of user-created content. In Proceedings of the 10th ACM conference on Electronic commerce, pages 325–334. ACM, 2009.
[6] E. Bakshy, I. Rosenn, C. Marlow, and L. Adamic. The role of social networks in information diffusion. In Proceedings of the 21st international conference on World Wide Web, pages 519–528. ACM, 2012.
[7] M. S. Bernstein, E. Bakshy, M. Burke, and B. Karrer. Quantifying the invisible audience in social networks. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 21–30. ACM, 2013.
[8] D. Boyd, S. Golder, and G. Lotan. Tweet, tweet, retweet: Conversational aspects of retweeting on twitter. In System Sciences (HICSS), 2010 43rd Hawaii International Conference on, pages 1–10. IEEE, 2010.
[9] M. Cha, H. Haddadi, F. Benevenuto, and K. Gummadi. Measuring user influence in twitter: The million follower fallacy. In 4th International AAAI Conference on Weblogs and Social Media (ICWSM), 2010.
[10] J. Cheng, L. Adamic, P. A. Dow, J. M. Kleinberg, and J. Leskovec. Can cascades be predicted? In Proceedings of the 23rd international conference on World wide web, pages 925–936. International World Wide Web Conferences Steering Committee, 2014.
[11] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. Liblinear: A library for large linear classification. The Journal of Machine Learning Research, 9:1871–1874, 2008.
[12] FastRandomForest. Re-implementation of the random forest classifier for the weka environment. http://code.google.com/p/fast-random-forest/.
[13] J. Fogarty, R. S. Baker, and S. E. Hudson. Case studies in the use of roc curve analysis for sensor-based estimates in human computer interaction. In Proceedings of Graphics Interface 2005, GI '05, pages 129–136, School of Computer Science, University of Waterloo, Waterloo, Ontario, Canada, 2005. Canadian Human-Computer Communications Society.
[14] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: A statistical view of boosting. Annals of Statistics, pages 337–374, 2000.
[15] R. Ghosh and K. Lerman. Predicting influential users in online social networks. arXiv preprint arXiv:1005.4882, 2010.
[16] V. Gómez, H. J. Kappen, and A. Kaltenbrunner. Modeling the structure and evolution of discussion cascades. In Proceedings of the 22nd ACM conference on Hypertext and hypermedia, pages 181–190. ACM, 2011.
[17] V. Gómez, H. J. Kappen, N. Litvak, and A. Kaltenbrunner. A likelihood-based framework for the analysis of discussion threads. World Wide Web, pages 1–31, 2012.
[18] A. Gupta, P. Kumaraguru, C. Castillo, and P. Meier. Tweetcred: Real-time credibility assessment of content on twitter. In Social Informatics, volume 8851 of Lecture Notes in Computer Science, pages 228–243. 2014.
[19] V. Hangya and R. Farkas. Filtering and polarity detection for reputation management on tweets. In Working Notes of CLEF 2013 Evaluation Labs and Workshop, 2013.
[20] L. Hong, O. Dan, and B. D. Davison. Predicting popular messages in twitter. In Proceedings of the 20th International Conference Companion on World Wide Web, WWW '11, pages 57–58, New York, NY, USA, 2011. ACM.
[21] H. Kwak, C. Lee, H. Park, and S. Moon. What is twitter, a social network or a news media? In Proceedings of the 19th international conference on World wide web, pages 591–600. ACM, 2010.
[22] K. Lerman and R. Ghosh. Information contagion: An empirical study of the spread of news on digg and twitter social networks. In Proceedings of the 4th International Conference on Weblogs and Social Media (ICWSM), 2010.
[23] N. Naveed, T. Gottron, J. Kunegis, and A. C. Alhadi. Bad news travel fast: A content-based analysis of interestingness on twitter. In Proceedings of the 3rd International Web Science Conference, WebSci '11. ACM, 2011.
[24] R. Palovics, B. Daroczy, and A. Benczur. Temporal prediction of retweet count. In Cognitive Infocommunications (CogInfoCom), 2013 IEEE 4th International Conference on, pages 267–270. IEEE, 2013.
[25] J. Pennebaker, C. Chung, M. Ireland, A. Gonzales, and R. Booth. The development and psychometric properties of liwc2007. Technical report, University of Texas at Austin, 2007.
[26] S. Petrovic, M. Osborne, and V. Lavrenko. Rt to win! predicting message propagation in twitter. In ICWSM, 2011.
[27] Gy. Szarvas, V. Vincze, R. Farkas, Gy. Móra, and I. Gurevych. Cross-genre and cross-domain detection of semantic uncertainty. Computational Linguistics, 38(2):335–367, 2012.
[28] A. Wang, T. Chen, and M.-Y. Kan. Re-tweeting from a linguistic perspective. In Proceedings of the Second Workshop on Language in Social Media, pages 46–55, 2012.
[29] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann Series in Data Management Systems. Morgan Kaufmann, second edition, June 2005.