=Paper= {{Paper |id=Vol-1395/paper_11 |storemode=property |title=Connections between Twitter Spammer Categories |pdfUrl=https://ceur-ws.org/Vol-1395/paper_11.pdf |volume=Vol-1395 |dblpUrl=https://dblp.org/rec/conf/msm/EdwardsG15 }} ==Connections between Twitter Spammer Categories== https://ceur-ws.org/Vol-1395/paper_11.pdf
          Connections between Twitter Spammer Categories

                          Gordon Edwards                                              Amy Guy
                         School of Informatics                                  School of Informatics
                        University of Edinburgh                                University of Edinburgh
                        Edinburgh, Scotland, UK                                Edinburgh, Scotland, UK
                  g.n.edwards@sms.ed.ac.uk                                     Amy.Guy@ed.ac.uk


 ABSTRACT                                                          It is suggested in [7] that spammers collude within Twitter
 Twitter has become a viable platform for spammers, who            networks — that if each account is a node in a graph, then
 often form networks to further their reach. Troublesomely,        from each node a spam account can be reached by travers-
 targeted users become increasingly frustrated, or worse, view     ing five edges with probability p = 0.63. Working together
 content resulting in computer virus infection. We build on        in networks helps spammers proliferate, as it is unlikely a
 previous work around detecting spam on Twitter, propos-           whole network will successfully be taken down. Adding new
 ing that subcategorising spammers can increase our under-         accounts to their network as others are removed, each can
 standing of their connections in spammer networks and aid         rely on follows from accounts within the network. A desir-
 detection. After defining five subcategories of spammers and      able but false impression of popularity is thus given. Detect-
 classifying users accordingly, correlations between the cate-     ing and classifying whole spammer networks at once could
 gories of spammers and the categories of their followers and      enable more efficient elimination of spam, compared to as-
 followees are explored. We also find that every spam subcate-     sessing on a continual basis all individual accounts on the
 gory follows a higher share of non-spam accounts than of any      site.
 individual spam subcategory and, unexpectedly, that every
 spammer subcategory is followed by non-spammers more              Previous work considers various machine learning techniques
 than by any individual spam subcategory.                          for detecting spam, such as Random Forest and Naïve
                                                                   Bayes, either from live feeds or from research corpora [1,
                                                                   4]. Broadly, it refers to two sets of features upon which
 Keywords                                                          users can be classified: content-based, such as mean number
 Twitter, spammer categories, spam, social media, microp-          of hashtags per tweet, and user-based, such as number of
 osts, machine learning                                            followers of the authoring user [4].

 1.   INTRODUCTION                                                 The preceding literature frames spam classification as a bi-
 Twitter’s popularity attracts spammers, providing them            nary process (not spam/spam). However, further investiga-
 with a very publicly accessible user base. Twitter reported that  tion reveals recurring subtypes of spam—for example users
 less than 5% of its users are spammers, but that figure is        advertising products, or users disseminating pornography—
 likely to be higher in reality [2], especially with the more      providing a novel approach to classification. Aside from aca-
 wide-ranging criteria for spam adopted in this paper. Spam        demic interest, classifying into subtypes means users could
 can pose a security threat to users, or just cause annoyance      engage in more refined decisions about blocking of content
 — either way leaving them disillusioned with Twitter.             or users than Twitter’s spam filtering currently allows. It
                                                                   also facilitates pinpointing of the most harmful spam, such
 Users are not compelled to follow accounts they deem to be        as tweets concealing viruses and phishing attacks.
 spam. However, the ability to quickly determine if a new
 follower is a spammer is useful in deciding whether to follow     Emergent trends, which we will examine, in the distribution
 back. Automatic detection could save users from wasting           of an account’s followers and those they follow between the
 time checking each new follower, and spare them from po-          categories may increase confidence that it belongs to a par-
 tentially dangerous spam. Spammers can also reach users           ticular category. Finding that one spammer is commonly
 via a mention or a direct message; in this case investigating     connected to a particular type yields a fast way to discover
 the tweet author safeguards against spam.                         accounts of that type, potentially to block or suspend. Con-
                                                                   nections between different spammer categories are not very
                                                                   dangerous in themselves—though could lure a user to view-
                                                                   ing further spam accounts—but they form a potential means
                                                                   of detecting spammer networks.
 Copyright © 2015 held by author(s)/owner(s); copying permitted
 only for private and academic purposes.                           This paper, part of an ongoing research project, lays the
 Published as part of the #Microposts2015 Workshop proceedings,
 available online as CEUR Vol-1395 (http://ceur-ws.org/Vol-1395)   groundwork for investigating the extent to which different
                                                                   categories of spammers are connected to others, and to gen-
 #Microposts2015, May 18th, 2015, Florence, Italy.                 uine users. It establishes that these connections result from




· #Microposts2015 · 5th Workshop on Making Sense of Microposts · @WWW2015
 spammers’ collusion within networks. We build on the work
 of [7] but, in contrast, do not confine ourselves to a single
 trending topic. In Section 2 we describe our spam subcategories,
 training set, features, and classifier. We then summarise our
 findings in Section 3 and their limitations in Section 4.

 2. CLASSIFICATION
 2.1 Spam Subcategories
 The Twitter API [6] offers the means to collect a sample
 of 1,420 users to form a training set, to subsequently hand-
 label as spam and not spam. During this annotation pro-
 cess spam subcategories become apparent. Whilst not nec-
 essarily definitive, they are reasonably defensible. Though                Figure 1: Class distribution of the dataset
 applicable to users and tweets, we only use the categories
 in relation to users. They are defined below with example               usage of an online app. Tweets that fall into this category
 tweets typical of each type of spammer. Their distribution              often contain a URL, but, again, to be certain
 is displayed in Figure 1.                                               in classification the authoring account may need to be
    • advertising: users who tweet extremely frequently,                 examined.
      mostly, if not always, advertising products, or tweets
      advertising a product authored by such a user. Nor-
      mally the tweets contain links, often shortened using
      a URL shortener.
                                                                   2.2    Features
                                                                   Feature representations of Twitter users can be formed, as
                                                                   per previous work, using content-based and user-based fea-
                                                                   tures [4]. Fifty features, 15 user-based and 35 content-based,
    • explicit: users who post exclusively, or almost so, pho-     sufficiently represent users. The content features require the
      tos, videos, and links, perhaps shortened with a URL         tweet history of the user: their latest 200 tweets, or fewer if
      shortener, to websites of a pornographic or adult nature,    they do not have that many. Some features unique to this
      or tweets that contain this kind of content.                 paper are:

                                                                    User-based Features
                                                                    Screen name and description Levenshtein similarity1
    • follower gain: users claiming the ability to boost            Percentage of non-alphanumeric characters in description
      other users’ follower bases, frequently, in most of their
      tweets, asking users for retweets and to follow certain       Content-based Features
      accounts. A tweet in this category claims that retweet-       Mean number of new lines in the user’s tweets
      ing or following a mentioned (via @username) account          Relative standard deviation of the number of new lines in
      will result in the receipt of followers.                      the user’s tweets

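The four features listed above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the similarity is assumed to be one minus the edit distance divided by the longer string's length (the paper does not state its normalisation), whitespace is assumed to count as non-alphanumeric, and the relative standard deviation is assumed to use the population standard deviation.

```python
import statistics


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalisation: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def non_alnum_percentage(text: str) -> float:
    """Percentage of characters that are not letters or digits."""
    if not text:
        return 0.0
    return 100.0 * sum(not ch.isalnum() for ch in text) / len(text)


def newline_stats(tweets: list[str]) -> tuple[float, float]:
    """Mean and relative standard deviation of new lines per tweet."""
    counts = [t.count("\n") for t in tweets]
    if not counts:
        return 0.0, 0.0
    mean = statistics.fmean(counts)
    # Relative standard deviation = standard deviation / mean (0 if mean is 0).
    rsd = statistics.pstdev(counts) / mean if mean else 0.0
    return mean, rsd
```

For a user's screen name and description, `levenshtein_similarity(screen_name, description)` would give the first user-based feature, and `newline_stats` applied to the tweet history (up to 200 tweets) would give both content-based features.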

                                                                   2.3    Classifier
                                                                   The Random Forest classifier implementation in the Weka
                                                                   Java library [5] provides the basis for implementing a clas-
                                                                   sifier tailored to the spam subcategory classification task.
    • celebrity: users who plead relentlessly for the follow       Maximising the spam recall desirably increases the probability
      back of a public figure in their tweets. Ascertaining        of classifying a spammer’s spam followers and followees2
      whether an individual tweet falls into this category is      into the subcategories correctly. Thus, the classifier first bi-
      generally harder. Examining the authoring user should        narily classifies users as not spam and spam, using the Ran-
      be indicative — ascertaining whether a suspect tweet         dom Forest classifier — considering all instances labelled as
      is a unique occurrence for that user and therefore not       one of the spam subcategories as labelled spam. Then, if
      representative.                                              the outputted classification is not spam and the associated
                                                                   confidence is not less than a set threshold3 , not spam is re-
                                                                   turned. Otherwise, the instance is reclassified, again with
                                                                   1
                                                                     Description of Levenshtein similarity:
                                                                   www.cs.tufts.edu/comp/150GEN/classpages/
                                                                   Levenshtein.html
                                                                   2
                                                                     For the purposes of this paper “followees” refer to the ac-
                                                                   counts which a user is following.
    • bot: accounts whose tweets are generated by a bot            3
                                                                     Given threshold α, instances initially classified with the
      that auto-posts content from some source, or details         binary classifier not spam, with confidence c, c < α, are




 the Random Forest classifier, applied to the dataset with the
 not spam instances filtered out, so one of the spam subcategories
 is necessarily returned. Conveniently, using Weka’s AdaBoostM1
 implementation further reduces misclassification due to class
 imbalance.
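The two-stage decision described in this section can be sketched as follows. This is only an illustration of the control flow: the two classifier arguments are hypothetical stand-ins for the trained binary and subcategory Random Forest models (they are not Weka calls), the function name is an assumption, and boosting is omitted.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical interfaces: the binary model returns (label, confidence),
# the subcategory model returns one of the spam subcategory labels.
BinaryClassifier = Callable[[Sequence[float]], Tuple[str, float]]
SubcategoryClassifier = Callable[[Sequence[float]], str]


def classify_user(features: Sequence[float],
                  binary_clf: BinaryClassifier,
                  subcategory_clf: SubcategoryClassifier,
                  alpha: float) -> str:
    """Return 'not spam' only when the binary stage is confident enough."""
    label, confidence = binary_clf(features)
    if label == "not spam" and confidence >= alpha:
        return "not spam"
    # Either predicted spam, or 'not spam' with confidence below alpha:
    # treat as spam and let the second stage pick a subcategory.
    return subcategory_clf(features)
```

Raising the threshold sends more borderline accounts to the second stage, which trades not spam recall for spam recall, the trade-off the section argues for.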

 Ten-fold cross-validation, provided through Weka, allows the
 classifier to be evaluated, with the collected sample of 1,420
 users forming the validation set:

                   Recall    Precision     F-Measure
  not spam         0.74      0.80          0.77
  explicit         0.77      0.83          0.80
  advertising      0.84      0.64          0.72
  follower gain    0.56      0.90          0.69
  bot              0.36      0.56          0.44
  celebrity        0.78      0.74          0.76
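As a reading aid for the table above, the F-measure is the harmonic mean of precision and recall. The helper below is a small sketch with a hypothetical function name. Recomputing from the rounded table values can differ from the reported figure in the last digit, since the reported figures were presumably computed from unrounded values.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# The 'not spam' row: precision 0.80, recall 0.74.
not_spam_f = round(f_measure(0.80, 0.74), 2)  # 0.77, matching the table
```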

 The classifier performs poorly on the class bot, most often
 misclassifying its instances as advertising, so no confident
 conclusions can be drawn about that class. The misclassification
 is probably due to the inherent similarity between the
 behaviours of spammers in these two categories.

 2.4    Results Reporting                                          Figure 2: Heat maps showing respectively the strength of
 For each class, a sample of 70 of its users is taken and the      connection between spammer subcategories and their follower
 tailored classifier is used to obtain the mean class              subcategories, and between spammer subcategories
 percentages of followers and followees; 500 of each (or as        and their followee subcategories.
 many as there are) are sampled. Given more time and com-
 putational resources, a larger dataset could be formed. All
 the percentages are rounded to the nearest integer.               a spammer’s followers are genuine users for each subcat-
                                                                   egory. Users are either consciously following spammers—
 Contingency tables are also constructed given the counts          perhaps advertising accounts hoping to find good deals or
 of (category, follower category) pairs and (category, followee    celebrity accounts because they are interested in the associ-
 category) pairs. These help reveal the extent to which spam-      ated celebrity—or through ignorance, lacking a tool to warn
 mers are connected to their followers and to their followees.     them. No one spam category is a landslide winner in attain-
                                                                   ing genuine followers though.
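The reporting steps of Section 2.4 can be sketched as below, assuming each sampled user's followers (or followees) have already been labelled by the classifier. The function names and the category ordering are illustrative assumptions, not taken from the paper.

```python
from collections import Counter

CATEGORIES = ["not spam", "advertising", "bot",
              "celebrity", "explicit", "follower gain"]


def class_percentages(labels: list[str]) -> dict[str, float]:
    """Percentage of one user's connections falling in each category."""
    counts = Counter(labels)
    total = len(labels)
    return {c: 100.0 * counts[c] / total if total else 0.0
            for c in CATEGORIES}


def mean_class_percentages(per_user_labels: list[list[str]]) -> dict[str, int]:
    """Mean percentage per category over users, rounded to an integer."""
    per_user = [class_percentages(ls) for ls in per_user_labels if ls]
    if not per_user:
        return {c: 0 for c in CATEGORIES}
    return {c: round(sum(p[c] for p in per_user) / len(per_user))
            for c in CATEGORIES}


def contingency_table(pairs: list[tuple[str, str]]) -> list[list[int]]:
    """Counts of (category, connection category) pairs, rows by category."""
    counts = Counter(pairs)
    return [[counts[(row, col)] for col in CATEGORIES] for row in CATEGORIES]
```

Here `per_user_labels` would hold one list of predicted follower (or followee) labels per sampled user, and `pairs` would hold one (category, follower category) or (category, followee category) pair per sampled connection.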
 3.    DISCUSSION OF RESULTS                                       On average, about 30% of advertising followers belong to
 Possible inaccuracies in classifications detailed in Section 4    the same category — a share much higher than any other
 mean care should be taken in drawing conclusions, and it          spam subcategory. Also around 23% of the accounts fol-
 is unlikely all of them will be infallible. The results report    lowed are advertising—again, a share much higher than the
 that genuine users have 73% not spam followers on average,        other spam subcategories—suggesting a significant degree of
 20% higher than the not spam followers share of advertis-         connection between advertising accounts, confirmed later.
 ing and bot accounts. Tallying with our intuition, the fair
 conclusion to draw here given the classifier performance on       Other subcategories appearing to have a high degree of intra-
 these follower classes for not spam is that genuine users will    connection are explicit and celebrity. Accounts in the former
 have a noticeably higher share of not spam followers than         have a higher share of explicit followers than any other fol-
 spammers, a trait that can increase the confidence that a         lower subcategory, averaging at 20%, and also follow more
 user classified as not spam is indeed so. With a fair degree      accounts of the same subcategory than the others, with a
 of confidence the results show that genuine users are likely      share averaging around 42%. Accounts in the latter have
 to follow back around half of their genuine followers. The re-    a higher share of celebrity followers than the other follower
 ported number of followers and followees for accounts that        subcategories, averaging at 33%. Such accounts also follow
 spammers follow back is usually higher than for accounts          more accounts of the same subcategory than the others, with
 they do not, implying that spammers target their connec-          a share averaging around 42%.
 tions to popular accounts.
                                                                   However, accounts in the bot category have a higher share,
 The average share of not spam accounts followed across the        averaging at 26%, of advertising followers than bot followers
 advertising, bot, celebrity, and follower gain categories, 60%,   (averaging at only 6%) or any other subcategory of follower.
 is notably higher than that of any of the spam subcategories,     Likewise the followees share is higher for advertising, av-
 showing their persistent efforts to gain genuine users’ at-        eraging at 18%, than bot (averaging at only 4%) and the
 tention. However, perhaps surprisingly, on average 50% of         other subcategories. This discrepancy could be due to the
 assumed to be spam, to further increase the spam recall.          categories’ inherent similarity; arguably both have the same




  Class            Follower        Recall     Precision     F1     5.   SUMMARY AND FUTURE WORK
  advertising      advertising     0.51       0.60          0.59   This paper presents the findings of new research. By forming
  advertising      bot             0.43       0.74          0.55   a training set of users and implementing a classifier tailored
  advertising      not spam        0.57       0.43          0.49   to the task, underpinned by Random Forest, users can be
  bot              advertising     0.56       0.63          0.59   classified into the defined classes. Analysing the distribu-
  bot              bot             0.44       0.55          0.49   tion of these classes in users’ followers and followees allows
  bot              not spam        0.48       0.58          0.52   inferences to be made about the relationships between users,
  celebrity        celebrity       0.63       0.50          0.56   crucially between spammers. We observe that many genuine
  celebrity        not spam        0.37       0.93          0.53   users are falling into the trap of connecting with a range of
  explicit         explicit        0.43       1.0           0.6    types of spammer.
  follower gain    not spam        0.39       0.83          0.53
  not spam         follower gain   1.0        0.50          0.67   We reveal that spammers mainly have their largest share
  not spam         not spam        0.44       0.98          0.61   of connections devoted to non-spammers and their second
                                                                   largest to spammers of the same subcategory. However there
 Table 1: For each subcategory of spammer the performance          are exceptions, with some subcategories connecting with a
 when classifying each subcategory of follower.                    proportionally very much smaller number of spammers from
                                                                   the same category. Correlations are found between spam-
                                                                   mer subcategories and their follower and followee subcate-
                                                                   gories, showing that spammers are colluding with each other
 aim— to direct users to content—so there is incentive for         in networks, with a significant degree of connection between
 them to connect with each other. As previously warned,            spammers of the same category.
 given the categories are not definitive, advertising and bot
 could reasonably be merged into one category, probably re-        Establishing connections between subcategories in a large
 ducing the classification error.                                  contiguous network, starting from one account and branch-
                                                                   ing outwards, recursively analysing the followers and fol-
 We confirm the hypothesised relationships in the connec-          lowees, could be a future extension. Visualising this network
 tions between spammers of the same subcategory using              would be interesting, allowing clusters of spammers of different
 Cramér’s V correlation c [3]. Measuring the correlation be-      subcategories to be determined. Also the subcategories
 tween two categorical random variables given a constructed        could usefully be refined, and perhaps more introduced.
 contingency table, it ranges from 0, where the two random
 variables are independent, to 1, where they are perfectly
 associated. Letting X = Subcategory of spammer and Y = Subcategory
                                                                   6.   ACKNOWLEDGMENTS
 of follower, c = 0.39, showing that there is some asso-           We thank Krzysztof Jerzy Geras, School of Informatics, Uni-
 ciation between a spammer subcategory and their follower          versity of Edinburgh, for explaining to us how to find cor-
 subcategory. Similarly, if X = Subcategory of spammer and Y       relations, which we subsequently found and included in this
 = Subcategory of followee, then c = 0.47, showing there           paper.
 is an analogous correlation between a spammer subcategory
 and their followee subcategory.                                   7.   REFERENCES
                                                                   [1] F. Benevenuto, G. Magno, T. Rodrigues, and
 The fairly strong positive correlations and the percentage            V. Almeida. Detecting spammers on twitter. In Annual
 shares reported above are evidence of the degree of collusion         Collaboration, Electronic messaging, Anti-Abuse and
 between spammers, and that those in the same subcategories            Spam Conference (CEAS), 2010.
 are deliberately connecting to form networks — notable re-        [2] J. Brustein. Twitter’s bot census didn’t actually
 lationships are present. Predicated on these correlations,            happen. http://www.businessweek.com/articles/
 the heat maps in Figure 2 show the strength of spammer                2014-08-12/twitters-bot-population
 connections. Because it is a hallmark of spam, establish-             -remains-a-mystery-and-a-problem. [Online;
 ing the presence of such connections aids spammer network             accessed 14/11/2014].
 detection and individual account classification.                  [3] P. Dattalo. Nominal association: Phi and cramer’s v.
                                                                       http://www.people.vcu.edu/~pdattalo/
                                                                       702SuppRead/MeasAssoc/NominalAssoc.html, 2002.
                                                                       [Online; accessed 10/03/2015].
 4.     LIMITATIONS                                                [4] M. McCord and M. Chuah. Spam detection on twitter
 When the classifier is further tested by classifying a sample
                                                                       using traditional classifiers. In Proceedings of the 8th
 of followers of users from each of the categories, the perfor-
                                                                       International Conference on Autonomic and Trusted
 mance reported in Table 1 is worse than the cross-validation
                                                                       Computing, ATC’11, pages 175–186, Berlin, Heidelberg,
 in Section 2.3, likely due to large variations in the distribu-
                                                                       2011. Springer-Verlag.
 tion as the sample is more deterministic than the validation
 set. Thus in Section 3 only sound conclusions respecting          [5] N. Z. The University of Waikato. Weka.
 these figures were drawn, but improvements made in future             http://www.cs.waikato.ac.nz/ml/weka/.
 work could allow further conclusions regarding the connec-        [6] Twitter4j. Twitter4j. http://twitter4j.org/.
 tions between some of the combinations of categories not          [7] S. Yardi, D. Romero, G. Schoenebeck, and D. Boyd.
 considered. A larger test sample, perhaps yielding different           Number 1 - 4 January 2010, First Monday
 figures, would clearly be preferable but was not practicable          Number 1 - 4 January 2010, First Monday
 given the time constraints.                                           peer-reviewed journal. First Monday, 2010.
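The Cramér's V statistic used in Section 3 can be computed from a contingency table of counts as the square root of chi-squared divided by n(k - 1), where n is the total count and k is the smaller of the numbers of rows and columns. The sketch below follows that standard definition. The function name is hypothetical, and dropping empty rows and columns before computing is an assumption made here to avoid division by zero, not something the paper specifies.

```python
import math


def cramers_v(table: list[list[int]]) -> float:
    """Cramér's V for a contingency table of non-negative counts."""
    # Drop empty rows and columns, whose expected counts would be zero.
    rows = [r for r in table if sum(r) > 0]
    cols = [c for c in zip(*rows) if sum(c) > 0]
    n = sum(sum(c) for c in cols)
    k = min(len(rows), len(cols))
    if n == 0 or k < 2:
        return 0.0
    row_totals = [sum(r) for r in rows]
    col_totals = [sum(c) for c in cols]
    chi2 = 0.0
    for i, row_total in enumerate(row_totals):
        for j, col_total in enumerate(col_totals):
            expected = row_total * col_total / n
            observed = cols[j][i]
            chi2 += (observed - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * (k - 1)))
```

A table whose rows are proportional to each other gives 0, and a table in which each spammer subcategory connects to exactly one follower subcategory gives 1. The reported values of 0.39 and 0.47 lie between these extremes.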



