=Paper=
{{Paper
|id=Vol-1395/paper_11
|storemode=property
|title=Connections between Twitter Spammer Categories
|pdfUrl=https://ceur-ws.org/Vol-1395/paper_11.pdf
|volume=Vol-1395
|dblpUrl=https://dblp.org/rec/conf/msm/EdwardsG15
}}
==Connections between Twitter Spammer Categories==
Gordon Edwards (g.n.edwards@sms.ed.ac.uk) and Amy Guy (Amy.Guy@ed.ac.uk), School of Informatics, University of Edinburgh, Edinburgh, Scotland, UK

ABSTRACT

Twitter has become a viable platform for spammers, who often form networks to further their reach. Troublesomely, targeted users become increasingly frustrated or, worse, view content resulting in computer virus infection. We build on previous work around detecting spam on Twitter, proposing that subcategorising spammers can increase our understanding of their connections in spammer networks and aid detection. After defining five subcategories of spammer and classifying users accordingly, we explore correlations between the categories of spammers and the categories of their followers and followees. We find that every spam subcategory follows a higher share of non-spam accounts than of any individual spam subcategory, and, unexpectedly, that every spammer subcategory is followed more by non-spammers than by spammers of any individual subcategory.

Keywords

Twitter, spammer categories, spam, social media, microposts, machine learning

1. INTRODUCTION

Twitter's popularity attracts spammers, providing them with a very publicly accessible user base. Twitter reports that less than 5% of its users are spammers, but that figure is likely to be higher in reality [2], especially with the more wide-ranging criteria for spam adopted in this paper. Spam can pose a security threat to users, or simply cause annoyance; either way it leaves them disillusioned with Twitter.

Users are not compelled to follow accounts they deem to be spam. However, the ability to quickly determine whether a new follower is a spammer is useful in deciding whether to follow back. Automatic detection could save users from wasting time checking each new follower, and spare them from potentially dangerous spam. Spammers can also reach users via a mention or a direct message; in this case investigating the tweet author safeguards against spam.

It is suggested in [7] that spammers collude within Twitter networks: if each account is a node in a graph, then from each node a spam account can be reached by traversing five edges with probability p = 0.63. Working together in networks helps spammers proliferate, as it is unlikely a whole network will be successfully taken down. Adding new accounts to the network as others are removed, each account can rely on follows from accounts within the network, giving a desirable but false impression of popularity. Detecting and classifying whole spammer networks at once could enable more efficient elimination of spam than continually assessing every individual account on the site.

Previous work considers various machine learning techniques for detecting spam, such as Random Forest and Naïve Bayes, applied either to live feeds or to research corpora [1, 4]. Broadly, it refers to two sets of features upon which users can be classified: content-based, such as the mean number of hashtags per tweet, and user-based, such as the number of followers of the authoring user [4].

The preceding literature frames spam classification as a binary process (not spam/spam). However, further investigation reveals recurring subtypes of spam, for example users advertising products or users disseminating pornography, suggesting a novel approach to classification. Aside from academic interest, classifying into subtypes means users could make more refined decisions about blocking content or users than Twitter's spam filtering currently allows. It also facilitates pinpointing the most harmful spam, such as tweets concealing viruses and phishing attacks.

Emergent trends, which we will examine, in how an account's followers and followees are distributed between the categories may increase confidence that the account belongs to a particular category.
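The five-hop reachability property reported in [7] can be illustrated with a small breadth-first search. The following is a toy sketch only; the follow graph and the helper name are hypothetical, not data or code from the paper:

```python
from collections import deque

def within_k_hops_of_spam(graph, start, spam, k=5):
    """BFS from `start`; True if some spam account (other than
    `start` itself) lies within k edges of it."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if node in spam and node != start:
            return True
        if dist == k:
            continue  # do not expand beyond k edges
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return False

# Hypothetical follow graph: a chain 0 -> 1 -> ... -> 5, where node 5 is spam.
graph = {0: [1], 1: [2], 2: [3], 3: [4], 4: [5], 5: []}
spam = {5}
print(within_k_hops_of_spam(graph, 0, spam))       # True: spam is exactly 5 edges away
print(within_k_hops_of_spam(graph, 0, spam, k=4))  # False: out of reach in 4 edges
```

Over a large sampled follow graph, the fraction of accounts for which this check succeeds would estimate the probability p reported in [7].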
Finding that one spammer is commonly connected to a particular type yields a fast way to discover accounts of that type, potentially to block or suspend. Connections between different spammer categories are not very dangerous in themselves, though they could lure a user into viewing further spam accounts, but they form a potential means of detecting spammer networks.

This paper, part of an ongoing research project, lays the groundwork for investigating the extent to which different categories of spammers are connected to each other and to genuine users. It establishes that these connections result from spammers' collusion within networks. We build on the work of [7], but, by contrast, do not confine ourselves to a single trending topic. In Section 2 we describe our defined subcategories of spam, training set, features, and classifier. We then summarise our findings in Section 3 and their limitations in Section 4.

Copyright © 2015 held by author(s)/owner(s); copying permitted only for private and academic purposes. Published as part of the #Microposts2015 Workshop proceedings, available online as CEUR Vol-1395 (http://ceur-ws.org/Vol-1395). #Microposts2015, May 18th, 2015, Florence, Italy. · 5th Workshop on Making Sense of Microposts · @WWW2015

2. CLASSIFICATION

2.1 Spam Subcategories

The Twitter API [6] offers the means to collect a sample of 1,420 users to form a training set, subsequently hand-labelled as spam and not spam. During this annotation process spam subcategories become apparent. Whilst not necessarily definitive, they are reasonably defensible. Though applicable to both users and tweets, we only use the categories in relation to users. They are defined below, with example tweets typical of each type of spammer. Their distribution is displayed in Figure 1.

• advertising: users who tweet extremely frequently, mostly, if not always, advertising products, or tweets advertising a product authored by such a user. Normally the tweets contain links, often shortened using a URL shortener.

• explicit: users who post exclusively, or almost exclusively, photos, videos, and links (perhaps shortened with a URL shortener) to websites of a pornographic or adult nature, or tweets that contain this kind of content.

• follower gain: users claiming the ability to boost other users' follower bases, frequently, in most of their tweets, asking users for retweets and to follow certain accounts. A tweet in this category claims that retweeting or following a mentioned (via @username) account will result in the receipt of followers.

• celebrity: users who plead relentlessly in their tweets for a follow back from a public figure. Ascertaining whether an individual tweet falls into this category is generally harder; examining the authoring user should be indicative, establishing whether a suspect tweet is a one-off for that user and therefore not representative.

• bot: accounts whose tweets are generated by a bot that auto-posts content from some source, or details usage of an online app. Tweets that fall into this category often contain a URL but, again, the authoring account may need to be examined to classify with certainty.

Figure 1: Class distribution of the dataset

2.2 Features

Feature representations of Twitter users can be formed, as in previous work, using content-based and user-based features [4]. Fifty features, 15 user-based and 35 content-based, sufficiently represent users. The content features require the tweet history of the user: their latest 200 tweets, or fewer if they do not have that many. Some features unique to this paper are:

User-based features:
• Levenshtein similarity between screen name and description (for a description of Levenshtein similarity see www.cs.tufts.edu/comp/150GEN/classpages/Levenshtein.html)
• Percentage of non-alphanumeric characters in the description

Content-based features:
• Mean number of new lines in the user's tweets
• Relative standard deviation of the number of new lines in the user's tweets

2.3 Classifier

The Random Forest classifier implementation in the Weka Java library [5] provides the basis for a classifier tailored to the spam subcategory classification task. Maximising spam recall desirably increases the probability of classifying a spammer's spam followers and followees (throughout this paper, "followees" are the accounts a user is following) into the correct subcategories. Thus the classifier first classifies users binarily as not spam or spam using Random Forest, treating all instances labelled with one of the spam subcategories as labelled spam. Then, if the resulting classification is not spam and the associated confidence is not less than a set threshold α, not spam is returned; instances initially classified as not spam with confidence c < α are assumed to be spam, to further increase spam recall. Otherwise, the instance is reclassified, again with the Random Forest classifier, trained on the dataset with the not spam instances filtered out, so one of the spam subcategories is necessarily returned. Conveniently, using Weka's AdaBoostM1 implementation further reduces misclassification due to class imbalance.

Ten-fold cross-validation, provided through Weka, allows the classifier to be evaluated, with the collected sample of 1,420 users forming the validation set:

               Recall  Precision  F-Measure
not spam        0.74     0.80       0.77
explicit        0.77     0.83       0.80
advertising     0.84     0.64       0.72
follower gain   0.56     0.90       0.69
bot             0.36     0.56       0.44
celebrity       0.78     0.74       0.76

The classifier performs poorly on the bot class, most often misclassifying it as advertising, so there can be no confidence in conclusions regarding that class. The misclassification is probably due to the inherent similarity between the behaviours of spammers in the two categories.

2.4 Results Reporting

For each class, a sample of 70 of its users is taken, and the tailored classifier is used to attain the mean class percentages among their followers and followees; 500 followers and 500 followees (or as many as there are) are sampled per user. Given more time and computational resources, a larger dataset could be formed. All percentages are rounded to the nearest integer.

Contingency tables are also constructed from the counts of (category, follower category) pairs and (category, followee category) pairs. These help reveal the extent to which spammers are connected to their followers and to their followees.

3. DISCUSSION OF RESULTS

Figure 2: Heat maps showing respectively the strength of connection between spammer subcategories and their follower subcategories, and between spammer subcategories and their followee subcategories.

Possible inaccuracies in the classifications, detailed in Section 4, mean care should be taken in drawing conclusions, and it is unlikely all of them will be infallible. The results report that genuine users have 73% not spam followers on average, 20% higher than the not spam follower share of advertising and bot accounts.

On average, about 30% of advertising accounts' followers belong to the same category, a share much higher than for any other spam subcategory. Around 23% of the accounts they follow are also advertising, again a share much higher than for the other spam subcategories, suggesting a significant degree of connection between advertising accounts, confirmed later.
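The contingency tables of Section 2.4 feed the Cramér's V statistic used later in this section. A minimal sketch of the standard computation (Pearson's χ² over observed versus expected counts, then V = sqrt(χ² / (n·(min(r, c) − 1)))), with hypothetical pair data rather than the paper's dataset:

```python
import math
from collections import Counter

def cramers_v(pairs):
    """Cramér's V for a list of (spammer category, follower category) pairs."""
    counts = Counter(pairs)
    rows = sorted({a for a, _ in pairs})
    cols = sorted({b for _, b in pairs})
    n = len(pairs)
    row_tot = {a: sum(counts[(a, b)] for b in cols) for a in rows}
    col_tot = {b: sum(counts[(a, b)] for a in rows) for b in cols}
    # Pearson chi-squared statistic over the contingency table.
    chi2 = sum(
        (counts[(a, b)] - row_tot[a] * col_tot[b] / n) ** 2
        / (row_tot[a] * col_tot[b] / n)
        for a in rows
        for b in cols
    )
    return math.sqrt(chi2 / (n * (min(len(rows), len(cols)) - 1)))

# Hypothetical data: perfect association gives V = 1, independence gives V = 0.
perfect = [("advertising", "advertising")] * 10 + [("bot", "bot")] * 10
independent = [("advertising", "advertising"), ("advertising", "bot"),
               ("bot", "advertising"), ("bot", "bot")] * 5
print(cramers_v(perfect))      # 1.0
print(cramers_v(independent))  # 0.0
```

Applied to the real (category, follower category) and (category, followee category) counts, this yields the association strengths reported in the discussion.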
ing and bot accounts. Tallying with our intuition, the fair conclusion to draw here given the classifier performance on Other subcategories appearing to have a high degree of intra- these follower classes for not spam is that genuine users will connection are explicit and celebrity. Accounts in the former have a noticeably higher share of not spam followers than have a higher share of explicit followers than any other fol- spammers, a trait that can increase the confidence that a lower subcategory, averaging at 20%, and also follow more user classified as not spam is indeed so. With a fair degree accounts of the same subcategory than the others, with a of confidence the results show that genuine users are likely share averaging around 42%. Accounts in the latter have to follow back around half of their genuine followers. The re- a higher share of celebrity followers than the other follower ported number of followers and followees for accounts that subcategories, averaging at 33%. Such accounts also follow spammers follow back is usually higher than for accounts more accounts of the same subcategory than the others, with they do not, implying that spammers target their connec- a share averaging around 42%. tions to popular accounts. However, accounts in the bot category have a higher share, The average share of not spam accounts followed across the averaging at 26%, of advertising followers than bot followers advertising, bot, celebrity, and follower gain categories, 60%, (averaging at only 6%) or any other subcategory of follower. is notably higher than that of any of the spam subcategories, Likewise the followees share is higher for advertising, av- showing their persistent e↵orts to gain genuine users’ at- eraging at 18%, than bot (averaging at only 4%) and the tention. However, perhaps surprisingly, on average 50% of other subcategories. This discrepancy could be due to the assumed to be spam, to further increase the spam recall. 
categories’ inherent similarity; arguably both have the same 24 · #Microposts2015 · 5th Workshop on Making Sense of Microposts · @WWW2015 Class Follower Recall Precision F1 5. SUMMARY AND FUTURE WORK advertising 0.51 0.60 0.59 This paper presents the findings of new research. By forming advertising bot 0.43 0.74 0.55 a training set of users and implementing a classifier tailored not spam 0.57 0.43 0.49 to the task, underpinned by Random Forest, users can be advertising 0.56 0.63 0.59 classified into the defined classes. Analysing the distribu- bot bot 0.44 0.55 0.49 tion of these classes in users’ followers and followees allows not spam 0.48 0.58 0.52 inferences to be made about the relationships between users, celebrity 0.63 0.50 0.56 crucially between spammers. We observe that many genuine celebrity not spam 0.37 0.93 0.53 users are falling into the trap of connecting with a range of explicit explicit 0.43 1.0 0.6 types of spammer. follower gain not spam 0.39 0.83 0.53 follower gain 1.0 0.50 0.67 We reveal that spammers mainly have their largest share not spam not spam 0.44 0.98 0.61 of connections devoted to non-spammers and their second largest to spammers of the same subcategory. However there Table 1: For each subcategory of spammer the performance are exceptions, with some subcategories connecting with a when the classifying each subcategory of follower. proportionally very much smaller number of spammers from the same category. Correlations are found between spam- mer subcategories and their follower and followee subcate- gories, showing that spammers are colluding with each other aim— to direct users to content—so there is incentive for in networks, with a significant degree of connection between them to connect with each other. As previously warned, spammers of the same category. 
given the categories are not definitive, advertising and bot could reasonably be merged into one category, probably re- Establishing connections between subcategories in a large ducing the classification error. contiguous network, starting from one account and branch- ing outwards, recursively analysing the followers and fol- We confirm the hypothesised relationships in the connec- lowees, could be a future extension. Visualising this network tions between spammers of the same subcategory using would be interesting, allowing clusters of spammers of di↵er- Cramér’s V correlation c [3]. Measuring the correlation be- ent subcategories to be determined. Also the subcategories tween two categorical random variables given a constructed could usefully be refined, and perhaps more introduced. contingency table, it ranges from 0, where the two random variables are independent, to 1, where they are equal. Let- ting X = Subcategory of spammer and Y = Subcategory 6. ACKNOWLEDGMENTS of follower, c = 0.39, showing that there is some asso- We thank Krzysztof Jerzy Geras, School of Informatics, Uni- ciation between a spammer subcategory and their follower versity of Edinburgh, for explaining to us how to find cor- subcategory. Similarly, if Subcategory of spammer and Y relations, which we subsequently found and included in this = Subcategory of followee, then c = 0.47, showing there paper. is an analogous correlation between a spammer subcategory and their followee subcategory. 7. REFERENCES [1] F. Benevenuto, G. Magno, T. Rodrigues, and The fairly strong positive correlations and attained percent- V. Almeida. Detecting spammers on twitter. In Annual age shares aforementioned evidence the degree of collusion Collaboration, Electronic messaging, Anti-Abuse and between spammers, and that those in the same subcategories Spam Conference (CEAS), 2010. are deliberately connecting to form networks — notable re- [2] J. Brustein. 
Twitter’s bot census didn’t actually lationships are present. Predicated on these correlations, happen. http://www.businessweek.com/articles/ the heat maps in Figure 2 show the strength of spammer 2014-08-12/twitters-bot-population% connections. Because it is a hallmark of spam, establish- -remains-a-mystery-and-a-problem. [Online; ing the presence of such connections aids spammer network accessed 14/11/2014]. detection and individual account classification. [3] P. Dattalo. Nominal association: Phi and cramer’s v. http://www.people.vcu.edu/~pdattalo/ 702SuppRead/MeasAssoc/NominalAssoc.%html, 2002. [Online; accessed 10/03/2015]. 4. LIMITATIONS [4] M. McCord and M. Chuah. Spam detection on twitter When the classifier is further tested by classifying a sample using traditional classifiers. In Proceedings of the 8th of followers of users from each of the categories, the perfor- International Conference on Autonomic and Trusted mance reported in Table 4 is worse than the cross-validation Computing, ATC’11, pages 175–186, Berlin, Heidelberg, in Section 2.3, likely due to large variations in the distribu- 2011. Springer-Verlag. tion as the sample is more deterministic than the validation set. Thus in Section 3 only sound conclusions respecting [5] N. Z. The University of Waikato. Weka. these figures were drawn, but improvements made in future http://www.cs.waikato.ac.nz/ml/weka/. work could allow further conclusions regarding the connec- [6] Twitter4j. Twitter4j. http://twitter4j.org/. tions between some of the combinations of categories not [7] S. Yardi, D. Romero, G. Schoenebeck, and D. Boyd. considered. A larger test sample, perhaps yielding di↵erent Detecting spam in a twitter network. In Volume 15, figures, would clearly be preferable but was not practicable Number 1 - 4 January 2010, First Monday given the time constraints. peer-reviewed journal. First Monday, 2010. 25 · #Microposts2015 · 5th Workshop on Making Sense of Microposts · @WWW2015