DATA SET

The data set was collected by Aragon et al. [1] using the Twitter API, and we extended it by a crawl of the user network. Our data set hence consists of two parts:

Tweet dataset: tweet text and user metadata on the

¹ http://en.wikipedia.org/wiki/Occupy_Wall_Street

In case of a retweet, the Twitter API provides us with the ID of the original tweet. By collecting retweets for a given original tweet ID, we obtain the set of users who have retweeted the tweet, with the corresponding retweet timestamps. The Twitter API, however, does not tell us the actual path of the cascade if the original tweet was retweeted several times. The information from the Twitter API on the tweet needs to be combined with the follower network to reconstruct the possible information pathways for a given tweet. However, it can happen that for a given retweeter, more than one friend has retweeted the corresponding tweet before, and hence we do not know the exact information source of the retweeter. The retweet ambiguity problem is well described in [3]. In what follows we consider all friends as possible information sources. In other words, for a given tweet we consider all directed edges in the follower network along which information flow could occur (see Fig. 2 (a)).
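The edge selection described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `candidate_edges`, the `friends` dictionary (user to the set of users they follow) and the use of plain numeric timestamps are hypothetical and are not taken from the original data pipeline.

```python
def candidate_edges(root, root_time, retweets, friends):
    """Return all (source, retweeter) pairs along which the tweet could have spread.

    retweets: list of (user, timestamp) pairs for one original tweet.
    friends:  dict mapping a user to the set of users they follow (assumed layout).
    """
    # Time at which each participant first posted the message.
    posted = {root: root_time}
    for user, ts in retweets:
        if user != root:
            posted[user] = min(ts, posted.get(user, ts))

    edges = set()
    for user, ts in posted.items():
        for friend in friends.get(user, ()):
            # A friend is a possible source only if they posted strictly earlier.
            if friend in posted and posted[friend] < ts:
                edges.add((friend, user))
    return edges


friends = {"b": {"a"}, "c": {"a", "b"}, "d": {"x"}}
retweets = [("b", 2), ("c", 3), ("d", 4)]
edges = candidate_edges("a", 1, retweets, friends)
```

In this toy example user "c" follows both "a" and "b", who both posted earlier, so both edges are kept, which mirrors the ambiguity discussed above. User "d" has no friend in the cascade and therefore receives no incoming edge.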

Restoring missing cascade edges

For a given tweet, the computed edges define a retweet cascade. However, our data set contains only a sample of tweets on the given hashtags and hence may not be complete: a few intermediate retweeters may be missing from our data. As a result, the reconstructed cascade graphs are sometimes disconnected. As detailed in Fig. 2 (b) and (c), we handle this problem in two different ways. One possible solution is to consider only the first connected component of the cascade (see Fig. 2 (b)). Another is to connect each disconnected part to the root tweeter with one virtual cascade edge (see Fig. 2 (c)). In what follows, we work with cascades that contain virtual edges, so every retweeter is included in the cascade.
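A minimal sketch of the virtual-edge variant is given below. It rests on two assumptions that the text does not fix: a "disconnected part" is taken to be a weakly connected component that does not contain the root, and the virtual edge is attached to the earliest retweeter of that component. The name `add_virtual_edges` is hypothetical.

```python
from collections import defaultdict


def add_virtual_edges(root, posted, edges):
    """Return the virtual edges needed to attach every component to the root.

    posted: dict user -> time of the (re)tweet, including the root.
    edges:  set of (source, retweeter) pairs of the reconstructed cascade.
    """
    neighbours = defaultdict(set)
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)

    seen = set()
    virtual = set()
    # Visit the root first so that its own component never gets a virtual edge;
    # the remaining users are visited in order of their posting time.
    for start in [root] + sorted(posted, key=posted.get):
        if start in seen:
            continue
        # Collect the weakly connected component of `start`.
        stack = [start]
        seen.add(start)
        while stack:
            node = stack.pop()
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        if start != root:
            # `start` is the earliest member of its component (assumed anchor).
            virtual.add((root, start))
    return virtual


posted = {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5}
virtual = add_virtual_edges("a", posted, {("a", "b"), ("d", "e")})
```

Here the isolated user "c" and the component formed by "d" and "e" each receive exactly one virtual edge from the root "a".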

Examples of highly retweeted messages

In Table 3, we give a few examples of highly retweeted messages with the actual urls and names replaced by [url] and [name].

FEATURE ENGINEERING

To train our models, we generate features for each root tweet in the data and then we predict the future cascade size of the root tweet from these feature sets. For a given root tweet, we compute features about the author user and her follower network (network features) and the textual content of the tweet itself (content features).

The first step of content processing is text normalization. We converted the tokens to lower case, except those that are fully upper case, and replaced tokens by their stems given by the Porter stemming algorithm. We replaced user mentions (starting with '@') and numbers by placeholder strings and removed the punctuation marks.
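The normalization steps above might look roughly like the following sketch. Several points are assumptions rather than the authors' implementation: the placeholder strings `[user]`, `[number]` and `[url]` are borrowed from the n-gram examples later in the paper, url replacement is added for convenience, hashtags are not treated specially, and the Porter stemming step is deliberately left out to keep the example free of external libraries.

```python
import re


def normalize(text):
    """Whitespace-tokenize a tweet and normalize each token (no stemming)."""
    tokens = []
    for tok in text.split():
        if tok.startswith("@"):
            tokens.append("[user]")          # user mention placeholder
        elif tok.startswith(("http://", "https://")):
            tokens.append("[url]")           # url placeholder (assumption)
        else:
            core = re.sub(r"[^\w]", "", tok)  # remove punctuation marks
            if not core:
                continue                      # token was punctuation only
            if core.isdigit():
                tokens.append("[number]")    # number placeholder
            elif core.isupper():
                tokens.append(core)          # keep fully upper-case tokens
            else:
                tokens.append(core.lower())
    return tokens


example = normalize("@bob LOOK at 42 Photos!")
```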

The content features are extracted from the normalized texts. The basic feature template in text analysis consists of the terms of the message. We used a simple whitespace tokenizer rather than a more sophisticated linguistic tokenizer, as previous studies reported its empirical advantage [19]. We employed unigrams, bigrams and trigrams of tokens because longer phrases hurt the performance of the system in our preliminary experiments.
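As an illustration of the term features, the sketch below enumerates all unigrams, bigrams and trigrams of a whitespace-tokenized message. The function name `ngrams` and the tuple representation are illustrative choices, not part of the original system.

```python
def ngrams(tokens, max_n=3):
    """Return all consecutive n-grams of the token list for n = 1..max_n."""
    return [
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


grams = ngrams("posted a photo".split())
```

For a three-token message this yields three unigrams, two bigrams and one trigram.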

Besides terms, we extracted the following features describing the orthography of the message:

Hashtags are used to mark specific topics; they can be appended after the tweet or placed inline in the content, marked by #. From the hashtags, the reader can guess the topic categories of the tweet content, but too many hashtags can irritate readers as they only cause confusion.

Telephone number: If the tweet contains a telephone number, it is more likely to be spam or an advertisement.

Urls: The referenced urls can lead the reader to text, sound, and image content, such as media elements and journals, and can thus attract interested readers. We distinguish between full and truncated urls. Truncated urls end with three dots; such a url was probably copied from the content of another tweet, which suggests that somebody found it interesting.

The like sign is an illustrator, an encouragement to others to share the tweet.

The presence of a question mark indicates uncertainty. In Twitter, questions are usually rhetorical: people do not seek answers on Twitter [19]. The author more likely wants to make the reader think about the message content.

The exclamation mark highlights part of the tweet; it expresses emotions and opinions.

If numerical expressions are present, the facts are quantified, so the tweet is more likely to have real information content. The actual values of the numbers were ignored.

Mentions: If a user is mentioned (referred to) in the tweet, the content of the tweet is probably connected to the mentioned user. It can have informal or private content.

Emoticons are short character sequences representing emotions. We clustered the emoticons into positive, negative and neutral categories.
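Several of the orthographic features listed above can be extracted with simple string tests, as in the following sketch. The feature names, the small emoticon sets and the rule that a truncated url ends in three dots written as `...` are assumptions made for illustration; telephone numbers, the like sign and neutral emoticons are omitted because their exact patterns are not given in the text.

```python
import re

# Tiny illustrative emoticon lexicons (assumed, not the clusters used in the paper).
POSITIVE = {":)", ":-)", ":D", ";)"}
NEGATIVE = {":(", ":-(", ":'("}


def orthographic_features(text):
    """Compute a few orthographic features of a raw tweet text."""
    tokens = text.split()
    urls = [t for t in tokens if t.startswith(("http://", "https://"))]
    # Punctuation and digits inside urls should not trigger the other features.
    rest = " ".join(t for t in tokens if t not in urls)
    return {
        "hashtags": sum(t.startswith("#") for t in tokens),
        "mentions": sum(t.startswith("@") for t in tokens),
        "full_urls": sum(not u.endswith("...") for u in urls),
        "truncated_urls": sum(u.endswith("...") for u in urls),
        "question_mark": "?" in rest,
        "exclamation_mark": "!" in rest,
        "has_number": bool(re.search(r"\d", rest)),
        "positive_emoticons": sum(t in POSITIVE for t in tokens),
        "negative_emoticons": sum(t in NEGATIVE for t in tokens),
    }


features = orthographic_features("Join us #ows #nyc @bob http://t.co/ab... now! :)")
```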

The last group of content features tries to capture the modality of the message:

Swear words influence the style and attractiveness of the tweet. The reaction to swearing can be ignoring the message or attacking back, which is not relevant in terms of retweet cascade size prediction. We extracted 458 swear words from http://www.youswear.com.

Weasel words and phrases² are aimed at creating an impression that a specific and/or meaningful statement has been made when in fact only a vague or ambiguous claim has been communicated. We used the weasel word lexicon of [27].

We employed the linguistic inquiry categories (LIWC) [25] of the tweets' words as well. These categories describe words from emotional, cognitive and structural points of view. For example, the word "ask" is in the Hear, Senses, Social and Present categories. Different LIWC categories can have different effects on the influence of the tweet in question.

N-grams

By using all the content features, we built n-grams as consecutive sequences in the tweet text that may include simply three terms ("posted a photo"), @-mentions, hashtags, urls ("@OccupyPics Photo http://t.co/..." coded as [[user] Photo [url]]), numbers ("has [number] followers"), non-alphanumeric characters ("right now !") as well as markers for swear or weasel expressions ("[weasel word] people say"). We defined the following classes of n-grams, for n ≤ 3:

Modality: The n-gram contains at least one swear or weasel word or expression (overall 208,368);

Orthographic: No swear or weasel word but at least one orthographic term (overall 2,751,935);

Terms: N-grams formed only of terms, with no swear or weasel words and no orthographic features (overall 771,196).

For efficiency, we selected the most frequent 1,000 n-grams from each class. The entire feature set hence consists of 3,000 trigrams.

² See http://en.wikipedia.org/wiki/Wikipedia:Embrace_weasel_words.
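The class assignment and the frequency-based selection can be sketched as below. The marker `[swear word]` is assumed by analogy with the `[weasel word]` marker in the examples, and the test for an orthographic token (a bracketed placeholder or a non-alphanumeric token) is a simplification introduced here, not the authors' definition.

```python
from collections import Counter

MODALITY_MARKERS = {"[weasel word]", "[swear word]"}  # second marker is assumed


def ngram_class(ngram):
    """Assign an n-gram (a tuple of tokens) to one of the three classes."""
    if any(tok in MODALITY_MARKERS for tok in ngram):
        return "modality"
    if any(tok.startswith("[") or not tok.isalnum() for tok in ngram):
        return "orthographic"
    return "terms"


def select_features(ngram_lists, k=1000):
    """Keep the k most frequent n-grams of each class over all tweets."""
    counts = {"modality": Counter(), "orthographic": Counter(), "terms": Counter()}
    for grams in ngram_lists:
        for gram in grams:
            counts[ngram_class(gram)][gram] += 1
    return {cls: [g for g, _ in c.most_common(k)] for cls, c in counts.items()}


tweets = [
    [("posted", "a", "photo"), ("[weasel word]", "people", "say")],
    [("posted", "a", "photo"), ("right", "now", "!"), ("has", "[number]", "followers")],
]
selected = select_features(tweets, k=1)
```

With k = 1,000, as in the text, the three lists together hold at most 3,000 features.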

TEMPORAL TRAINING AND EVALUATION

Here we describe how we generate training and test sets for our algorithms detailed in Section 6. First, for each root tweet we compute the corresponding network and content features. We create daily re-trained models: for a given day t, we train a model on all root tweets that were generated before t but appeared later than t − τ, where τ is the preset time frame. After training on the data before a given day, we compute our predictions for all root tweets that appeared on that day.
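The daily re-training scheme amounts to a sliding window over the root tweets. The sketch below assumes that each root tweet carries an integer day index and that the window length `tau` is given in days; both the data layout and the name `daily_splits` are hypothetical.

```python
def daily_splits(tweets, tau, first_day, last_day):
    """Yield (day, training ids, test ids) for each day of the testing period.

    tweets: list of (day, tweet_id) pairs, with days as integer indices.
    tau:    length of the training window in days.
    """
    for t in range(first_day, last_day + 1):
        # Train on root tweets from the tau days before day t ...
        train = [tid for day, tid in tweets if t - tau <= day < t]
        # ... and predict for the root tweets that appear on day t.
        test = [tid for day, tid in tweets if day == t]
        yield t, train, test


tweets = [(0, "a"), (1, "b"), (2, "c"), (3, "d")]
splits = list(daily_splits(tweets, tau=2, first_day=2, last_day=3))
```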

In order to keep the features up to date, we recompute all network properties online, on the fly, and use the new values to give predictions. By this method, we may immediately notice if a user starts gaining high attention or if a bursty event happens.

We pay special attention to defining the values used for training and evaluation. For evaluation, we used the information until the end of the three-week data set collection period, i.e. we used all the known tweets that belong to the given cascade. However, for training, we are only allowed to use and count the tweets up to the end of the training period. Since the testing period is longer, we linearly approximated the values for the remaining part of the testing period.

Our goal is to predict cascade size at the time when the root tweet is generated. One method we use is regression, which directly predicts the size of the retweet cascade. For regression, we only use the global error measures:

We also experiment with multiclass classification for ranges of the cascade size. The cascade size follows a power law distribution (see Fig. 3), and we defined three buckets: one with 0...10 (referred to as "low"), one with 11...100 ("medium") and one with more than 100 ("high") retweeters participating in the cascade. We evaluate performance by AUC [13] averaged over the three classes. Note that AUC has a probabilistic interpretation: for the example of the "high" class, the value of the AUC is equal to the probability that a random highly retweeted message is ranked before a random non-highly retweeted one.
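The bucket boundaries and the probabilistic reading of AUC can be made concrete with a short sketch. The pairwise computation below is a direct but quadratic-time way of evaluating that probability, with ties counted as one half (a common convention, assumed here); the function names are illustrative.

```python
def size_class(cascade_size):
    """Map a cascade size to the 'low', 'medium' or 'high' bucket."""
    if cascade_size <= 10:
        return "low"
    if cascade_size <= 100:
        return "medium"
    return "high"


def pairwise_auc(scores, labels):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        return None  # AUC is undefined without both classes
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Scores for the "high" class; label 1 marks messages that were highly retweeted.
auc_high = pairwise_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

In the toy example three of the four positive-negative pairs are ordered correctly.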

By the probabilistic interpretation of AUC, we may realize that a classifier will perform well if it orders the users well with little consideration of their individual messages. Since our goal is to predict the messages in time and not the rather static user visibility and influence, we define new averaging schemes for predicting the success of individual messages.

We consider the classification of the messages of a single user and define two aggregations of the individual AUC values. First, we simply average the AUC values of users for each day (user average),

\[ AUC_{user} = \frac{1}{N} \sum_{i=1}^{N} AUC_i. \qquad (1) \]

Second, we weight the individual AUC values with the activity of the user (the number of tweets by the user on the given day),

\[ AUC_{wuser} = \frac{\sum_{i=1}^{N} AUC_i \, T_i}{\sum_{i=1}^{N} T_i}, \qquad (2) \]

where N is the number of users, AUC_i is the AUC value of the i-th user, and T_i is the number of tweets by the i-th user.
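The two user-level aggregations can be computed as in the following sketch, which takes a list of (AUC_i, T_i) pairs for one day. The input layout and the function names are assumptions made for illustration.

```python
def user_average_auc(per_user):
    """Plain average of the per-user AUC values (user average)."""
    return sum(auc for auc, _ in per_user) / len(per_user)


def weighted_user_average_auc(per_user):
    """Average of per-user AUC values weighted by the number of tweets T_i."""
    total_tweets = sum(tweets for _, tweets in per_user)
    return sum(auc * tweets for auc, tweets in per_user) / total_tweets


# Two users: AUC 0.5 with 1 tweet and AUC 1.0 with 3 tweets on the given day.
per_user = [(0.5, 1), (1.0, 3)]
plain = user_average_auc(per_user)
weighted = weighted_user_average_auc(per_user)
```

The more active user dominates the weighted value, while both users count equally in the plain average.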

We may also obtain regressors from the multiclass classification results. In order to make classification and regression comparable, we give a very simple transformation that replaces each class by a value that can be used as a regressor.

We select and use the training set average value in each class as the ideal value for the prediction.
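A minimal sketch of this transformation follows: the average cascade size of each class is computed on the training set, and every predicted class label is then replaced by that average. The helper names are hypothetical; the class boundaries are those of the three buckets defined earlier.

```python
def size_class(cascade_size):
    """Bucket a cascade size into 'low' (0..10), 'medium' (11..100) or 'high'."""
    if cascade_size <= 10:
        return "low"
    return "medium" if cascade_size <= 100 else "high"


def class_averages(train_sizes):
    """Average training cascade size within each class."""
    sums, counts = {}, {}
    for size in train_sizes:
        cls = size_class(size)
        sums[cls] = sums.get(cls, 0) + size
        counts[cls] = counts.get(cls, 0) + 1
    return {cls: sums[cls] / counts[cls] for cls in sums}


def to_regression(predicted_classes, averages):
    """Replace each predicted class label by its training set average."""
    return [averages[cls] for cls in predicted_classes]


averages = class_averages([0, 4, 20, 60, 300])
predictions = to_regression(["low", "high"], averages)
```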

In this section, we first train and evaluate the classification and then the regression models to predict the future cascade size of tweets. We predict day by day, for each day in the testing period. For classification, we also evaluate on the user level by using equations (1) and (2). For classification, we show the best performing features as well.

As mentioned in Section 5, we may train our model with different time frames. In Figure 4 we show the average AUC value for different time frames. As Twitter trends change rapidly, we achieve the best average results if we train our algorithms on root tweets that were generated in the previous week (approximately seven days), both for global and for user level average evaluation.

Cascade size by multiclass classification

First, we measure classifier performance by computing the average AUC values of the final results for the three size ranges. We were interested in how different classifiers perform and how different feature sets affect classifier performance. For this reason, we repeated our experiments with different feature subsets. Figure 5 shows our results. For each day, the network features give a strong baseline. Combining these features with the content features results in a strong improvement in classifier performance. In Table 5 we summarize the average AUC values for different feature subsets over all four datasets. Our results are consistent: in all cases, the content related features improve the performance. Finally, we give the performance of other classifiers in Table 6 and conclude the superiority of the Random Forest classifier [12]. We use the classifier implementations of Weka [29] and LibLinear [11].

Cascade size by regression

We give regression results by the linear regression, multilayer perceptron and the regression tree implementation of Weka [29] in Table 7. As seen when compared to the last three columns in Table 5, regression methods outperform multiclass classification results transformed to regressors. Note that for the transformation, we use class averages obtained from the training data. If however we could per

We selected the most important network features by running a LogitBoost classifier [14]. The best features were all

Content feature contribution analysis

We selected the most important content features by running logistic regression over the 3,000 trigrams described in Section 4.3. The features are complex expressions containing elements from the three major groups of linguistic feature sets. They are listed in the order of absolute weight obtained by logistic regression:

1. [marriage between democracy]: three words, in this order;
2. [at [hashtag occupywallstreet][url]]: the word "at", followed by the hashtag "#occupywallstreet" and a url;
3. [between democracy and];
4. [capitalism is over];
5. [[hashtag ows] pls];
6. [[weasel word] marriage between]: the expression "marriage between" on the weasel word list, which counts as the third element of the trigram;
7. [[hashtag zizek] at [hashtag occupywallstreet]];
8. [[hashtag occupywallstreet][url][hashtag auspol]];
9. [over [hashtag zizek] at];
10. [calientan la]: means "heating up".

Note that all these features have negative weight for the upper two classes and positive or close to 0 weight for the lower class. Hence the appearance of these trigrams decreases the value obtained by the network feature based model. We may conclude that the use of weasel words and uninformative phrases reduces the chance of getting retweeted, as opposed to the sample highly retweeted messages in Table 3.

Frozen network features

To illustrate the importance of the temporal training and evaluation framework and the online update of the network features, we ran an experiment where we replaced user features by static ones. The results are summarized in Table 9. Note that on the user level, all messages will have the same network features and hence classification will be random with AUC=0.5. In contrast, online updated network features are already capable of distinguishing between the messages of the same user, as seen in Tables 5 and 7.

CONCLUSIONS

In this paper we investigated the possibility of predicting the future popularity of a recently appeared text message in Twitter's social networking system. Besides the typical user and network related features, we consider hashtag and linguistic analysis based ones as well. Our results not only confirm the possibility of predicting the future popularity of a tweet, but also indicate that deep content analysis is important to improve the quality of the prediction.

In our experiments, we give high importance to the temporal aspects of the prediction: we predict immediately after the message is published, and we also evaluate on the user level. We consider user level evaluation key in temporal analysis, since the influence and popularity of a given user is relatively stable while the retweet count of her particular messages may vary greatly in time. On the user level, we observe the importance of linguistic elements of the content.

Acknowledgments

We thank Andreas Kaltenbrunner for providing us with the Twitter data set [1].

REFERENCES

[1] P. Aragon, K. E. Kappler, A. Kaltenbrunner, D. Laniado, and Y. Volkovich. Communication dynamics in twitter during political campaigns: The case of the 2011 spanish national election. Policy & Internet, 5(2):183-206, 2013.

[2] E. Bakshy, D. Eckles, R. Yan, and I. Rosenn. Social influence in social advertising: evidence from field experiments. In Proceedings of the 13th ACM Conference on Electronic Commerce, pages 146-161. ACM, 2012.

[3] E. Bakshy, J. M. Hofman, W. A. Mason, and D. J. Watts. Everyone's an influencer: quantifying influence on twitter. In Proceedings of the fourth ACM international conference on Web search and data mining, pages 65-74. ACM, 2011.

[4] E. Bakshy, J. M. Hofman, W. A. Mason, and D. J. Watts. Identifying influencers on twitter. In Fourth ACM International Conference on Web Search and Data Mining (WSDM), 2011.

[5] E. Bakshy, Karrer, and L. A. Adamic. Social influence and the diffusion of user-created content. In Proceedings of the 10th ACM conference on Electronic commerce, pages 325-334. ACM, 2009.

[6] E. Bakshy, I. Rosenn, Marlow, and L. Adamic. The role of social networks in information diffusion. In Proceedings of the 21st international conference on World Wide Web, pages 519-528. ACM, 2012.

[7] M. S. Bernstein, E. Bakshy, Burke, and Karrer. Quantifying the invisible audience in social networks. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 21-30. ACM, 2013.

[8] Boyd, Golder, and Lotan. Tweet, tweet, retweet: Conversational aspects of retweeting on twitter. In System Sciences (HICSS), 2010 43rd Hawaii International Conference on, pages 1-10. IEEE, 2010.

[9] Cha, Haddadi, Benevenuto, and Gummadi. Measuring user influence in twitter: The million follower fallacy. In 4th International AAAI Conference on Weblogs and Social Media (ICWSM), 2010.

[10] Cheng, L. Adamic, P. A. Dow, J. M. Kleinberg, and Leskovec. Can cascades be predicted? In Proceedings of the 23rd international conference on World wide web, pages 925-936. International World Wide Web Conferences Steering Committee, 2014.

[11] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. Liblinear: A library for large linear classification. The Journal of Machine Learning Research, 9:1871-1874, 2008.

[12] FastRandomForest. Re-implementation of the random forest classifier for the weka environment. http://code.google.com/p/fast-random-forest/.

[13] Fogarty, R. S. Baker, and S. E. Hudson. Case studies in the use of roc curve analysis for sensor-based estimates in human computer interaction. In Proceedings of Graphics Interface 2005, GI '05, pages 129-136, School of Computer Science, University of Waterloo, Waterloo, Ontario, Canada, 2005. Canadian Human-Computer Communications Society.

[14] Friedman, Hastie, and Tibshirani. Additive logistic regression: A statistical view of boosting. Annals of statistics, pages 337-374, 2000.

[15] Ghosh and Lerman. Predicting influential users in online social networks. arXiv preprint arXiv:1005.4882, 2010.

[16] Gomez, H. J. Kappen, and A. Kaltenbrunner. Modeling the structure and evolution of discussion cascades. In Proceedings of the 22nd ACM conference on Hypertext and hypermedia, pages 181-190. ACM, 2011.

[17] Gomez, H. J. Kappen, Litvak, and A. Kaltenbrunner. A likelihood-based framework for the analysis of discussion threads. World Wide Web, pages 1-31, 2012.

[18] Gupta, Kumaraguru, Castillo, and Meier. Tweetcred: Real-time credibility assessment of content on twitter. In Social Informatics, volume 8851 of Lecture Notes in Computer Science, pages 228-243. 2014.

[19] Hangya and R. Farkas. Filtering and polarity detection for reputation management on tweets. In Working Notes of CLEF 2013 Evaluation Labs and Workshop, 2013.

[20] Hong, Dan, and B. D. Davison. Predicting popular messages in twitter. In Proceedings of the 20th International Conference Companion on World Wide Web, WWW '11, pages 57-58, New York, NY, USA, 2011. ACM.

[21] Kwak, Lee, Park, and Moon. What is twitter, a social network or a news media? In Proceedings of the 19th international conference on World wide web, pages 591-600. ACM, 2010.

[22] Lerman and Ghosh. Information contagion: An empirical study of the spread of news on digg and twitter social networks. In Proceedings of 4th International Conference on Weblogs and Social Media (ICWSM), 2010.

[23] Naveed, Gottron, Kunegis, and A. C. Alhadi. Bad news travel fast: A content-based analysis of interestingness on twitter. In Proceedings of the 3rd International Web Science Conference, WebSci '11. ACM, 2011.

[24] Palovics, Daroczy, and Benczur. Temporal prediction of retweet count. In Cognitive Infocommunications (CogInfoCom), 2013 IEEE 4th International Conference on, pages 267-270. IEEE, 2013.

[25] Pennebaker, Chung, Ireland, Gonzales, and Booth. The development and psychometric properties of liwc2007. Technical report, University of Texas at Austin, 2007.

[26] Petrovic, Osborne, and Lavrenko. Rt to win! predicting message propagation in twitter. In ICWSM, 2011.

[27] Gy. Szarvas, V. Vincze, R. Farkas, Gy. Mora, and I. Gurevych. Cross-genre and cross-domain detection of semantic uncertainty. Computational Linguistics, 38(2):335-367, 2012.

[28] Wang, Chen, and M.-Y. Kan. Re-tweeting from a linguistic perspective. In Proceedings of the Second Workshop on Language in Social Media, pages 46-55, 2012.

[29] I. H. Witten and Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann Series in Data Management Systems. Morgan Kaufmann, second edition, June 2005.