Twitter Spam Account Detection by Effective Labeling

Federico Concone, Giuseppe Lo Re, Marco Morana, and Claudio Ruocco

University of Palermo, Viale delle Scienze, ed. 6 - 90128 Palermo, Italy
{firstname.lastname}@unipa.it

Abstract. In recent years, the widespread diffusion of Online Social Networks (OSNs) has enabled new forms of communication that make it easier for people to interact remotely. Unfortunately, one of the first consequences of such popularity is the increasing number of malicious users who sign up and use OSNs for non-legitimate activities. In this paper we focus on spam detection, and present some preliminary results of a system that aims at speeding up the creation of a large-scale annotated dataset for spam account detection on Twitter. To this aim, two algorithms capable of capturing typical spammer behaviors, i.e., the sharing of malicious urls and of recurrent content, are exploited. Experimental results on a dataset of about 40,000 users show the effectiveness of the proposed approach.

Keywords: Social Network Security · Spam Detection · Twitter Data Analysis.

1 Introduction

Online Social Networks (OSNs) are platforms through which a multitude of people can interact remotely. Nowadays, different types of OSNs are available, each with its own characteristics and functionalities, depending on the purpose and audience for which it is intended. The simplicity of use of these tools, together with the diffusion of smart personal devices that allow continuous access to the network, encourages users to overcome some communication barriers typical of real life. As a result, people are inclined to share personal information, even with entities (people or other systems) that are actually unknown.

Although the number of OSNs is ever increasing, much research has focused on Twitter because the information content of tweets is usually very high, being strictly related to popular events that involve many people in different parts of the world [9, 10]. Moreover, it is extremely easy to access the Twitter stream thanks to the API platform, which provides broad access to the public data that users have chosen to share.

Among the different analyses concerning Twitter, spam account detection is one of the most investigated and relevant. In general terms, spammers are entities, real users or automated bots, whose aim is to repeatedly share messages that include unwanted content for commercial or offensive purposes [13], e.g., links to malicious sites, in order to spread malware, phishing attacks, and other harmful activities [5].

Spam detection is part of the unending fight between cops and robbers. In order to discourage malicious behaviors, social networks are continuously transforming and, as a consequence, spammers have also evolved, adopting more sophisticated techniques that make it easier to evade security mechanisms [23]. Since the design of new spam detection techniques requires stable and annotated datasets to assess their performance, such dynamism makes the datasets in the literature quickly obsolete and almost useless. Moreover, providing the ground-truth for a huge amount of data is a time-consuming task that, in most cases, is still performed manually.

In this paper, we present the preliminary results of a work that aims at modeling Twitter spammers' behavior in order to speed up the creation of large-scale annotated datasets. The system consists of different software modules whose purpose is to capture certain aspects of the spammers' modus operandi.
In particular, we focused on two common characteristics, namely the sharing of malicious urls and the presence of messages with the same information content.

The remainder of the paper is organized as follows: related work is outlined in Section 2; the spammer detection system is described in Section 3; experimental results are presented in Section 4; conclusions follow in Section 5.

2 Related Work

In recent years, spam detection on Twitter has been investigated in many works. The different ways in which malicious users operate can be categorized according to the method they adopt to disseminate illegitimate information [13]. Generally, a spam campaign is created by exploiting a number of fake, compromised, and Sybil accounts that operate in conjunction with social bots. For each of these threats, various detection techniques have been proposed [21].

One general and very simple idea consists in attracting and deceiving possible attackers by means of an isolated and monitored environment. To this aim, some works propose the use of honeypots to analyze spamming activities. In [14], for instance, the authors present a social honeypot able to collect spam profiles from social networking communities. Every time an attacker attempts to interact with the honeypot, an automated bot is used to retrieve some observable features, e.g., the number of friends, of the malicious users. This set is then analyzed to create a profile of spammers and train the corresponding classifiers. Despite the advantages of performing a dynamic analysis in a controlled environment, the effort of creating a honeypot for each element to be analyzed is usually too high [7].

For this reason, most works focus on static machine learning approaches capable of capturing relevant features about users and their interactions. In [8], three classifiers, i.e., Random Forest, Decision Tree, and Bayesian Networks, are used to learn nineteen features that reflect spammers' behaviors. Another work using a machine learning approach to identify malicious accounts is presented in [2]. The authors developed a browser plug-in, called TSD (Twitter Sybils Detector), capable of classifying a Twitter profile as human, Sybil, or hybrid according to a set of seventeen features. Such a system provides good results when distinguishing humans from Sybils, but performance degrades when dealing with hybrid profiles. This limitation is common to several works, suggesting that statistical features alone are not sufficient to correctly distinguish multiple classes of users, since spammers change their behavior over time to bypass security measures.

A strategy that is becoming increasingly popular is to use urls as a key element to recognize a spammer [4]. A system exploiting urls to detect spammers within social networks is Monarch [20]. Here, three different modules capture urls submitted to web services, extract a feature set (e.g., domain tokens, path tokens, path length), and label a specific url as spam or non-spam. In addition, supplementary data, such as IP addresses and routing information, are collected by means of DNS resolution and IP analysis.

All these techniques require two preliminary phases: collecting a great number of tweets, and labeling each element of the set as "spam" or "non-spam". One of the first long-term data collection works is [15]. The dataset, captured by means of a honeypot, contains a total of 5.5 million tweets associated with both legitimate and malicious users.
HSpam14 [18] is probably the most widely used dataset for spam detection on Twitter. This dataset contains the IDs of 14 million tweets obtained by searching for some trending topics. These identifiers should be used to access the original tweets through the standard Twitter APIs. Unfortunately, although it was released just a few years ago, we observed that most of the requests fail because of different errors, i.e., user account suspended, tweet ID not found, and account protected. Conversely, the dataset described in [3] consists of over 600 million public tweets, 6.5 million of which are labeled as spam and 6 million as non-spam. The labeling is performed according to the output of Trend Micro's Web Reputation Service, which checks whether a url is malicious or not; if so, the corresponding tweet is labeled as Twitter spam. Differently from HSpam14, this dataset contains the tweets themselves and a fixed set of 12 features, but does not report the tweet IDs that could be used to access other relevant information.

3 Twitter Dataset Labeling

In this section we present a novel approach that aims at supporting the labeling of large-scale Twitter datasets. The design of a smart labeling technique requires the definition of some criteria that make it possible to distinguish between spammers and trustworthy users [1, 17]. The official Twitter documentation defines spam activity as a series of behaviors and actions that negatively affect other users and violate the rules of the social network.

Fig. 1. Overview of the proposed automatic labeling schema.

Considering that malicious behaviors are constantly evolving, it is not possible to provide a definitive set of them, but we can identify some strategies that are common to most spammers. The first point to consider is the publication of malicious urls that direct to phishing sites or induce users to download unwanted software [6]. Detecting such links is not simple because spammers adopt strategies that obfuscate the target url, thus deceiving the end user. For this reason, despite the possible countermeasures, links are the easiest way to disseminate malicious content [8]. Currently, both because of the tweets' character limit and the diffusion of url blacklist services, a popular approach for spreading malicious links is the usage of url shortening services. Twitter, for instance, provides an automatic service (t.co) that allows users to share long urls in a tweet while respecting the maximum number of characters. However, since all shortened urls look the same, users may not be aware of the actual destination address.

Another typical spammer behavior is to repeatedly publish duplicate messages, or messages with the same information content. This strategy is often complemented by exploiting a set of topics that are highly interesting to the user community. Generally, OSNs allow legitimate users to report suspicious behaviors in order to let the administrators verify whether a given account is malicious or not. However, manually detecting this type of behavior is time-consuming and resource-intensive.

On the basis of these characteristics, the labeling schema we propose is based on two phases: url analysis, and similar tweets discovery. As summarized in Fig. 1, a tweet is first analyzed to verify whether it contains links or not. Either way, the result of this check provides only a preliminary outcome that needs to be further investigated. Thus, the next phase consists of a timeline (e.g., the last 200 tweets) analysis for every user. If both results are consistent, i.e., both agree in considering the user as a spammer or as genuine, then the account is labeled accordingly. Otherwise, the automatic labeling fails and a manual annotation step is required.
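To make the overall schema concrete, the following minimal Python sketch reproduces the agreement logic of Fig. 1. The two predicates, url_analysis and timeline_analysis, are hypothetical placeholders for the modules described in Sections 3.1 and 3.2; each is assumed to return True when its module considers the user a spammer.

def label_account(user, url_analysis, timeline_analysis):
    # Phase 1: url analysis of the user's tweets (Section 3.1).
    url_says_spam = url_analysis(user)
    # Phase 2: near-duplicate analysis of the timeline (Section 3.2).
    timeline_says_spam = timeline_analysis(user)
    if url_says_spam == timeline_says_spam:
        # The two modules agree: accept the automatic label.
        return "spammer" if url_says_spam else "genuine"
    # The modules disagree: automatic labeling fails.
    return "manual annotation required"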
3.1 URL Analysis

Not surprisingly, tweets containing links are more likely to get retweeted, which is the primary goal of most spammers. For this reason, the presence of a url in a tweet is frequently indicative of potential spam activity.

Most of the works in the literature perform link analysis by exploiting blacklist services, e.g., Google Safe Browsing (GSB), that are able to determine whether a given url is malicious or not. Unfortunately, the effectiveness of such a solution is quite limited because these services usually take an average of four days to add a new website to the blacklist, while most of the accesses to a tweet occur within two days from its publication [12]. Even the url shortening and safe-browsing services integrated with Twitter present some limitations. This service, for instance, is not able to detect a malicious link that has been shortened twice or more. Another point to be considered is that, by relying on these tools only, a user continuously sharing the same safe link, or the same kind of content, although being a spammer, would never be recognized.

For these reasons, the url analyzer we propose takes into account a greater number of features related to the link sharing activity. In particular, three factors are considered while analyzing a tweet: i) the presence of malicious urls according to GSB, ii) the total number of urls, T, and iii) the ratio R_UT = U/T between the number of unique urls, U, and T. The value of T also makes it possible to discard those users that have not published a sufficient number of urls in their timelines.

Preliminary experiments showed that accounts satisfying one of the two following conditions can be labeled as spammers by this module: i) at least one malicious url is found by GSB; ii) R_UT <= 0.25 and T >= 50. Otherwise, the account is considered genuine.
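A minimal Python sketch of this rule follows; is_malicious is a hypothetical wrapper around a Google Safe Browsing lookup (the real service requires an API key), and urls is assumed to be the list of expanded urls extracted from the user's timeline.

def url_analysis(urls, is_malicious):
    total = len(urls)  # T: total number of urls published by the user
    # Condition i): at least one url is flagged as malicious by GSB.
    if any(is_malicious(url) for url in urls):
        return True  # spammer
    # Condition ii): R_UT = U/T <= 0.25 with at least T = 50 urls,
    # i.e., a few distinct urls shared over and over again.
    if total >= 50:
        unique = len(set(urls))  # U: number of unique urls
        if unique / total <= 0.25:
            return True  # spammer
    return False  # genuine according to this module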
3.2 Finding Similar Tweets

In many applications, it is often necessary to divide data into homogeneous groups, clusters, whose elements share the same characteristics. Several clustering techniques have been proposed in the literature [22]. The second phase of our annotation schema is based on a clustering approach, known as near-duplicates clustering, intended for grouping items, i.e., tweets, that are identical copies or differ only slightly from each other, e.g., by a few characters. The aim of this phase is to measure the degree of similarity between the tweets contained in the timeline of each user. Near-duplicate tweets can be found by using the MinHash and Locality-Sensitive Hashing (LSH) [11] algorithms.

Nevertheless, a few steps need to be performed before the application of these two algorithms. The first step aims to represent tweets as sets of tokens. These can be defined either as sequences of consecutive characters, called shingles, or as the single words composing the document; the latter is the option we used. The second step includes different pre-processing operations, summarized in Table 1, that are needed in order to improve the performance of MinHash and LSH, as suggested in [18].

Table 1. Tweet pre-processing.

Remove all non-English tweets: Because of the language dependency of some tokenization algorithms, e.g., stemming, only English tweets have been kept.
Remove mentions: Mentions are not semantically significant, as they only allow users to redirect their tweets to specific users.
Convert text to lower case: There are no semantic differences between words written in lowercase or uppercase.
Apply stemming: Group words having the same stem (root).
Remove # and common symbols: The character #, as well as punctuation marks, are frequently used and can negatively affect near-duplicates detection.
Expansion of urls: Follow all the redirections of the urls included in the tweets.
Remove stop-words: Stop-words, such as conjunctions, articles and prepositions, can be omitted without altering the meaning of the tweet.
Normalize accented characters: Convert accented letters into the corresponding non-accented versions.

According to their model, we chose to remove all those elements which do not contribute to the semantics of the tweet, such as punctuation marks and stop-words. Moreover, we added some further steps, such as url expansion and stemming. For instance, the tweet:

@helloworld I'm writing this #tweet. Trying tokenization. bit.ly/1hxXbR7

would be transformed into:

write tweet try token google.it

The last step involves the choice of K, i.e., the number of consecutive elements to be considered as a single token. This choice deeply impacts the system performance since the higher K is, the lower the number of documents that will share the same token [16], and vice versa. A good rule is to set K equal to 1, 2, or 3 for small to medium sized documents, whilst 4 or 5 are reasonable values for very large documents. Since tweets are very short documents, we chose K = 1, while the authors of [18] used all the aforementioned values.

Representing every document as a set of tokens makes it easier to compute the similarity between documents. A simple similarity metric is the Jaccard similarity, i.e., the ratio between the size of the intersection of the two documents and the size of their union. Since the Jaccard similarity can only be applied to two objects at a time, it is required to analyze every possible pair of documents in order to create clusters of similar items. When dealing with a high number of documents, this process is computationally expensive, the number of comparisons being given by the binomial coefficient:

\binom{N}{2} = \frac{N!}{2!\,(N-2)!} = \frac{N(N-1)}{2} \approx \frac{N^2}{2}.    (1)

Furthermore, a second factor that cannot be ignored is that the number of tokens depends both on the amount of documents to be analyzed and on their size. To overcome such limitations, the MinHash algorithm permits approximating the Jaccard similarity by using hash functions. The idea is to summarize the large sets of tokens into smaller representations, i.e., signatures, so that two documents D1 and D2 can be considered similar if their signatures Hash(D1) and Hash(D2) are similar.

Algorithm 1 describes the MinHash signature generation when using N hash functions. For every hash function h_i and for every token t_j a value h_i(t_j) is computed. Then, the i-th element of the signature is:

s_i = \min_{j} h_i(t_j).    (2)

Input: set of tokens S; N independent hash functions
Output: <H_m(1), H_m(2), ..., H_m(N)>
for i = 1 : N do
    H_m(i) <- infinity
end
forall token in S do
    for i = 1 : N do
        if Hash_i(token) < H_m(i) then
            H_m(i) <- Hash_i(token)
        end
    end
end
Algorithm 1: MinHash signature.

Although MinHash solves the problem of comparing large datasets by compressing every document into a signature, we still need to perform pairwise comparisons in an efficient way. This is the reason behind the usage of Locality-Sensitive Hashing (LSH), which exploits a hash table and maximizes the probability of similar documents being hashed into the same bucket. Essentially, LSH groups all the MinHash signatures into a matrix, then splits it into B bands, each composed of R rows. A hash value is then computed for every document and every band. If two documents fall into the same bucket for at least one band, they are considered potential near-duplicates and can be further inspected through the real or approximate Jaccard similarity.
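A minimal Python sketch of Algorithm 1 and of the banding step follows. The paper does not prescribe how the N independent hash functions are obtained; a common choice is assumed here, i.e., a single base hash salted with N random 64-bit masks. The values N = 200 and B = 50 anticipate the parameters selected in Section 4.

import hashlib
import random
from collections import defaultdict

N = 200  # number of hash functions (see Section 4)
B = 50   # number of bands; hence R = N // B = 4 rows per band

random.seed(0)
MASKS = [random.getrandbits(64) for _ in range(N)]

def base_hash(token):
    # Stable 64-bit hash of a token.
    return int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")

def minhash_signature(tokens):
    # Algorithm 1: H_m(i) starts at infinity and keeps the minimum
    # value of Hash_i over all tokens.
    signature = [float("inf")] * N
    for token in tokens:
        h = base_hash(token)
        for i in range(N):
            value = h ^ MASKS[i]  # Hash_i(token)
            if value < signature[i]:
                signature[i] = value
    return signature

def lsh_candidate_pairs(signatures):
    # `signatures` maps a tweet id to its MinHash signature. Tweets
    # whose signatures coincide on all R rows of at least one band
    # share a bucket and become candidate near-duplicates.
    rows = N // B
    candidates = set()
    for b in range(B):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            buckets[tuple(sig[b * rows:(b + 1) * rows])].append(doc_id)
        for ids in buckets.values():
            for i in range(len(ids)):
                for j in range(i + 1, len(ids)):
                    candidates.add((ids[i], ids[j]))
    return candidates

Candidate pairs returned by the banding step can then be verified against the similarity threshold J before being merged into clusters.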
By applying MinHash and LSH, the tweets contained in the users' timelines are grouped into sets of clusters. The process of labeling a user as spammer or genuine depends on the characteristics of these clusters. To this aim, different features describing the size and the number of the clusters were considered. In the next section, the experimental results achieved while varying the feature set are presented.

4 Experimental Analysis

The first set of experiments aimed at finding the best set of parameters for MinHash and LSH, i.e., the quadruple (N, K, B, J), where N is the number of hash functions, K is the number of consecutive tokens, B is the number of bands, and J is the minimum Jaccard similarity required to consider two tweets similar. Whereas N was set to 200 as suggested in the literature, the other parameters were selected by varying their values as follows: K = {1, 2, 3}, B = {5, 10, 20, 40, 50}, and J = {0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8}.

In order to evaluate the results achieved by each quadruple, a reference dataset was used. In particular, we exploited the dataset in [19], which is composed of pairs of tweets manually labeled with a similarity score that varies from 1 (dissimilar) to 5 (equal). A pairwise similarity criterion was used to transform these labels into a ground-truth about clusters of tweets: for instance, if a tweet t1 is considered similar to t2, and t2 is also similar to t3, then t1 and t3 are similar and the three tweets belong to the same cluster. Furthermore, to ensure a high degree of similarity among tweets belonging to the same cluster, we considered only those pairs whose similarity score is at least 3, i.e., "strong near-duplicates".

The performance of MinHash and LSH was evaluated in terms of precision, recall, and f-score. Fig. 2 shows the f-score obtained for each triple (K, B, J). According to these experiments, the best values are K = 1, B = 50, and J = 0.5, which allow achieving an f-score of 0.69.

Fig. 2. Calibration phase for MinHash and LSH algorithms.

Once the parameters had been set, the next experiments were intended to select the most suitable set of features to distinguish spammers from genuine users. For instance, assuming that the timeline of a user is composed of N tweets, we can expect that for a genuine user the number of clusters is close to N, whilst for a spammer this number would be M, with M << N. Thus, in order to obtain a compact representation of the spammer and genuine classes, feature vectors have been created by considering the mean, variance, and standard deviation of (i) the size of the largest cluster, (ii) the mean size of the clusters, (iii) the number of clustered tweets, (iv) the size of the smallest cluster, and (v) the number of generated clusters.
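For clarity, a minimal sketch of the per-user feature extraction is reported below. Here clusters is assumed to be the list of near-duplicate clusters produced for one timeline, each cluster being a list of tweets; f3 is computed as the number of tweets falling into non-singleton clusters, which is one possible reading of "number of clustered tweets".

def cluster_features(clusters):
    sizes = [len(cluster) for cluster in clusters]
    if not sizes:
        return 0, 0.0, 0, 0, 0
    f1 = max(sizes)                      # size of the largest cluster
    f2 = sum(sizes) / len(sizes)         # mean size of the clusters
    f3 = sum(s for s in sizes if s > 1)  # number of clustered tweets
    f4 = min(sizes)                      # size of the smallest cluster
    f5 = len(sizes)                      # number of generated clusters
    return f1, f2, f3, f4, f5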
For these experiments we relied on a subset of the data in HSpam14 [18], which contains 14 million labeled tweets. However, since our aim is to label users, we sampled some of the tweets in HSpam14, retrieved information about their authors, and then labeled the authors according to the original tweet's label. Tests were run while varying the ratio between genuine users and spammers and using different subsets of features (see Fig. 3). Results showed that the best values of accuracy and f-score are obtained when the number of clusters (f5), the average size (f2), and the maximum size (f1) of the clusters are considered, ignoring the remaining two features.

Fig. 3. Accuracy and f-score achieved while varying the ratio between spammers and genuine users in the range [25, 80]. For each experiment, the following features were combined: size of the largest cluster (f1), mean size of clusters (f2), number of clustered tweets (f3), size of the smallest cluster (f4), and number of generated clusters (f5).

Finally, in order to assess the overall performance of the automatic labeling procedure, a dataset was collected using the Twitter APIs. As a first step, the Twitter stream was queried to obtain a set of relevant tweets. Tweet collection was performed at regular intervals and exploited a set of keywords including both trending topics and "spammy" words, such as those related to money gain and adult content [18]. For each tweet, the author and the list of followers were extracted, together with standard tweet-related data, such as the tweet identifier, creation date, and so on. Extending the search to the followers of potential spammers allowed us to increase the probability of finding spammers. The complete list of authors and followers was then processed to obtain the latest 200 tweets contained in each timeline. As a result of this procedure we collected almost 8 million tweets and 40 thousand users.
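A hedged sketch of this collection step is shown below, assuming the third-party tweepy library (v3.x method names) and valid API credentials; KEYWORDS stands for the mixed list of trending topics and "spammy" terms mentioned above, while pagination and error handling are omitted.

import tweepy

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

def collect_timeline(user_id):
    # Latest 200 tweets of the user's timeline, as analyzed in Section 3.
    timeline = api.user_timeline(user_id=user_id, count=200)
    return [(t.id, t.created_at, t.text) for t in timeline]

def collect(keywords):
    users = set()
    for keyword in keywords:
        for tweet in api.search(q=keyword, count=100):
            users.add(tweet.user.id)
            # Extend the search to the author's followers to increase
            # the probability of finding spammers.
            users.update(api.followers_ids(user_id=tweet.user.id))
    return {uid: collect_timeline(uid) for uid in users}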
The dataset was analyzed by applying the proposed procedure, which automatically detected about 20 thousand legitimate users and about 2 thousand spammers. The outcomes of the labeling process are shown in Table 2.

Table 2. Output of the detection/labeling process on the collected dataset.

Number of users collected: 40,823
Automatically labeled as genuine: 20,007
Automatically labeled as spammers: 2,190
Number of tweets collected: 8,010,147
Containing urls: 2,330,558
Containing hashtags: 1,640,521
Containing user mentions: 4,334,056

These results were compared with a ground-truth obtained by manually labeling the collected users, and the proposed approach achieved an average accuracy of about 80%. In particular, we measured that the accuracy of the automatic system reaches a maximum value of 95% when detecting true genuine users, while this percentage is lower when dealing with spammers (about 70%). These values are not surprising and reflect the fact that the activities carried out by genuine users are quite predictable, while spammers frequently vary their modus operandi in order to elude spam detection systems.

5 Conclusion

In this paper we presented a system able to capture some common behaviors of Twitter spammers, i.e., the habit of sharing malicious urls and the presence of patterns in spammers' tweets. Since the design of any new spam detection technique requires stable and annotated datasets to assess its performance, the idea is to recognize these common behaviors in order to provide researchers with a tool capable of performing automatic annotation of large-scale datasets.

Although malicious urls can be detected by relying on third-party blacklisting services, we noticed that these systems alone are not sufficient to detect every form of link-based spam content. Thus, a url analyzer taking into account a greater number of features has been described. Regarding the analysis of recurring topics and near-duplicate content, a combination of the MinHash and Locality-Sensitive Hashing algorithms has been presented. Different experiments were performed in order to determine the best set of parameters for both techniques, and to identify a set of features which permits distinguishing between spammers and genuine users.

Results showed that about half of the accounts contained in the dataset can be automatically labeled by means of the proposed approach, with an average accuracy of about 80%. Such a result is very relevant for large-scale datasets and confirms the suitability of the proposed approach to speed up the annotation of huge collections of Twitter data.

As future work, we want to provide an analysis tool able to find further similarities in the subset of users who need to be manually labeled. To this aim, we are investigating efficient algorithms that could allow us to group similar users, analyze a few examples per group, and then extend the label to the whole set.

References

1. Agate, V., De Paola, A., Lo Re, G., Morana, M.: A simulation framework for evaluating distributed reputation management systems. In: Distributed Computing and Artificial Intelligence, 13th International Conference. pp. 247-254. Springer International Publishing, Cham (2016)
2. Alsaleh, M., Alarifi, A., Al-Salman, A.M., Alfayez, M., Almuhaysin, A.: TSD: Detecting sybil accounts in Twitter. In: 2014 13th International Conference on Machine Learning and Applications. pp. 463-469 (Dec 2014). https://doi.org/10.1109/ICMLA.2014.81
3. Chen, C., Zhang, J., Chen, X., Xiang, Y., Zhou, W.: 6 million spam tweets: A large ground truth for timely Twitter spam detection. In: 2015 IEEE International Conference on Communications (ICC). pp. 7065-7070 (June 2015). https://doi.org/10.1109/ICC.2015.7249453
4. Chen, C., Wen, S., Zhang, J., Xiang, Y., Oliver, J., Alelaiwi, A., Hassan, M.M.: Investigating the deceptive information in Twitter spam. Future Gener. Comput. Syst. 72(C), 319-326 (Jul 2017). https://doi.org/10.1016/j.future.2016.05.036
5. Concone, F., De Paola, A., Lo Re, G., Morana, M.: Twitter analysis for real-time malware discovery. In: 2017 AEIT International Annual Conference. pp. 1-6 (Sep 2017). https://doi.org/10.23919/AEIT.2017.8240551
6. De Paola, A., Gaglio, S., Lo Re, G., Morana, M.: A hybrid system for malware detection on big data. In: IEEE INFOCOM 2018 - IEEE Conference on Computer Communications Workshops. pp. 45-50 (April 2018). https://doi.org/10.1109/INFCOMW.2018.8406963
7. De Paola, A., Favaloro, S., Gaglio, S., Lo Re, G., Morana, M.: Malware detection through low-level features and stacked denoising autoencoders. In: Proceedings of the Second Italian Conference on Cyber Security (ITASEC) (2018)
8. Fazil, M., Abulaish, M.: A hybrid approach for detecting automated spammers in Twitter. IEEE Transactions on Information Forensics and Security pp. 1-1 (2018). https://doi.org/10.1109/TIFS.2018.2825958
9. Gaglio, S., Lo Re, G., Morana, M.: Real-time detection of Twitter social events from the user's perspective. In: 2015 IEEE International Conference on Communications (ICC). pp. 1207-1212 (June 2015). https://doi.org/10.1109/ICC.2015.7248487
10. Gaglio, S., Lo Re, G., Morana, M.: A framework for real-time Twitter data analysis. Computer Communications 73, Part B, 236-242 (2016). https://doi.org/10.1016/j.comcom.2015.09.021
11. Gionis, A., Indyk, P., Motwani, R.: Similarity search in high dimensions via hashing. In: Proceedings of the 25th International Conference on Very Large Data Bases. pp. 518-529. VLDB '99, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (1999)
12. Grier, C., Thomas, K., Paxson, V., Zhang, M.: @spam: the underground on 140 characters or less. In: Proceedings of the 17th ACM Conference on Computer and Communications Security. pp. 27-37. ACM (2010)
13. Kaur, R., Singh, S., Kumar, H.: Rise of spam and compromised accounts in online social networks: A state-of-the-art review of different combating approaches. Journal of Network and Computer Applications 112, 53-88 (2018). https://doi.org/10.1016/j.jnca.2018.03.015
14. Lee, K., Caverlee, J., Webb, S.: The social honeypot project: Protecting online communities from spammers. In: Proceedings of the 19th International Conference on World Wide Web. pp. 1139-1140. WWW '10, ACM, New York, NY, USA (2010). https://doi.org/10.1145/1772690.1772843
15. Lee, K., Eoff, B.D., Caverlee, J.: Seven months with the devils: A long-term study of content polluters on Twitter. In: ICWSM. pp. 185-192 (2011)
16. Leskovec, J., Rajaraman, A., Ullman, J.D.: Mining of Massive Datasets. Cambridge University Press (2014)
17. Lua, E.K., Chen, R., Cai, Z.: Social trust and reputation in online social networks. In: 2011 IEEE 17th International Conference on Parallel and Distributed Systems. pp. 811-816 (Dec 2011). https://doi.org/10.1109/ICPADS.2011.123
18. Sedhai, S., Sun, A.: HSpam14: A collection of 14 million tweets for hashtag-oriented spam research. In: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 223-232. SIGIR '15, ACM, New York, NY, USA (2015). https://doi.org/10.1145/2766462.2767701
19. Tao, K., Abel, F., Hauff, C., Houben, G.J., Gadiraju, U.: Groundhog day: Near-duplicate detection on Twitter. In: Proceedings of the 22nd International Conference on World Wide Web. pp. 1273-1284. ACM (2013)
20. Thomas, K., Grier, C., Ma, J., Paxson, V., Song, D.: Design and evaluation of a real-time url spam filtering service. In: 2011 IEEE Symposium on Security and Privacy (SP). pp. 447-462. IEEE (2011)
21. Wu, T., Wen, S., Xiang, Y., Zhou, W.: Twitter spam detection: Survey of new approaches and comparative study. Computers & Security (2017). https://doi.org/10.1016/j.cose.2017.11.013
22. Xu, R., Wunsch, D.: Survey of clustering algorithms. IEEE Transactions on Neural Networks 16(3), 645-678 (2005)
23. Yang, C., Harkreader, R., Gu, G.: Empirical evaluation and new design for fighting evolving Twitter spammers. IEEE Transactions on Information Forensics and Security 8(8), 1280-1293 (Aug 2013). https://doi.org/10.1109/TIFS.2013.2267732