Identifying Twitter Bots Using a Convolutional Neural Network
Notebook for PAN at CLEF 2019

Michael Färber (1), Agon Qurdina (2), and Lule Ahmedi (2)

(1) Karlsruhe Institute of Technology (KIT), Germany, michael.faerber@kit.edu
(2) University of Prishtina, Kosovo, agon.qurdina@studentet.uni-pr.edu, lule.ahmedi@uni-pr.edu

Abstract. In this paper, we present an approach for identifying Twitter bots based on their written tweets using a convolutional neural network. We experiment with various embedding methods (pretrained and trained on the training data set) and convolutional neural network architectures and compare their performance. When evaluating our best-performing approach on the actual test data set of the CLEF 2019 Bots Profiling Subtask (English language), we obtain an accuracy of 90.34%. We therefore see convolutional neural networks as a promising machine learning technique for Twitter bot detection.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2019, 9-12 September 2019, Lugano, Switzerland.

1 Introduction

With over 1 billion active users, Twitter is among the most frequently used social media platforms. It has taken on an important role as a medium for disseminating information and opinions. However, due to its potential to influence readers' opinions, botnets are considered a threat to society and democracy. For instance, it has been reported that botnets were active in the context of public votes, such as the Brexit vote [9] and the 2016 federal elections in the United States [6]. Botnets have also been used for making the ideologies of terrorism attractive [5]. It is estimated that about 15% of all active accounts on Twitter (i.e., 48 million) are bot accounts [14]. Developing approaches to detect Twitter bots automatically is therefore important.

In this paper, we consider the following task, proposed as a shared task at CLEF 2019: "Given a Twitter feed, determine whether its author is a bot or a human." [4,13] In the past, several approaches to Twitter bot identification have been proposed (see Section 2). However, most of them do not use the content of the tweets, but rather the tweets' metadata and what can be derived from it (such as the publication frequency over time). We present an approach based on convolutional neural networks where all tweets of a user are used as input.

This paper is structured as follows. In Section 2, we present related work on bot identification. In Section 3, we describe our approach for Twitter bot identification. In Section 4, we present the evaluation setup, the evaluation data set, and the evaluation results. We conclude in Section 5 with a summary and an outlook.

2 Related Work

Given that millions of user accounts on social media platforms are bots, classifying user accounts in social media has already been approached in various regards. In 2016, Adewole et al. [1] listed 65 articles about malicious account detection approaches for social networks, including social tagging and other social network variations. In the following, we differentiate between non-content-based, content-based, and hybrid approaches.

Non-content-based approaches to bot identification. Non-content-based approaches use patterns in the behaviour of Twitter accounts. Chavoshi et al. [3], for instance, perform correlation analysis, exploiting the fact that highly synchronous cross-user activities indicate whether an account is a bot or not.
In [2], Beskow and Carley show that measured differences between bot and human conversation networks can be used to increase the accuracy of bot detection. Mazza et al. [12], one of the most recent works on bot detection, use patterns of retweeting activity, with the specific goal of detecting malicious retweeting bots. Recently, Lundberg et al. [11] used only tweet metadata in order to provide a language-independent approach to bot detection.

Content-based approaches to bot identification. Content-based approaches target the classification of Twitter accounts based on the accounts' tweets. In [10], the authors use content-based features and contrast pattern features for bot detection. With this approach, they obtain classification results with an AUC of over 0.90.

Hybrid approaches to bot identification. In their paper [8], Kudugunta and Ferrara propose a hybrid approach that exploits both tweet content and metadata (e.g., retweet and reply count, or number of hashtags) to detect whether a given tweet was posted by a human or a bot. The framework uses a deep neural network based on a contextual long short-term memory (LSTM) architecture and exhibits a promising performance with an AUC of over 0.96 for bot detection at the tweet level. Note that in this paper, we obtain an AUC score of 0.86 based solely on the content of the tweets. To the best of our knowledge, we are the first ones approaching such a task using a convolutional neural network model.

3 Approach

In this section, we present our approach for Twitter bot identification. The source code of our implementation is available online at https://github.com/agon-qurdina/author-profiling.

Figure 1. Architecture of our system used for classifying sentences.

3.1 Preprocessing

The natural input for our model would be the set of tweets of a given user. However, we decided to use a user's entire feed of tweets as a single input to the model. Thus, we merged all of the author's tweets into a single block of text, which we call an "article." Next, given a set of articles as input, we preprocessed them in the following steps (a minimal code sketch of the pipeline follows the list):

Text Cleaning. New lines were replaced by spaces. We also expanded contractions and removed stop words, HTML tags, and special characters from the articles' content.

Texts to Sequences. The articles' content was tokenized and a word dictionary (size: 155,227 unique words) was generated.

Sequence Padding. We applied post-padding with a fixed sequence length l. Due to the way we concatenated the tweets into a single article, we were left with a limited number of training samples. Thus, in order to keep as much information as possible, we set l = 11000. By choosing this high value for the sequence length, only a small percentage of the sequences (284 of 2873) were truncated.
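The steps above map directly onto the Keras text utilities (the framework used in Section 4.2). The following is a minimal sketch of the pipeline under that assumption; the cleaning rules, the toy users_tweets input, and the clean_article helper are illustrative placeholders rather than the authors' released code, and contraction expansion and stop-word removal are omitted for brevity.

```python
# Minimal sketch of the preprocessing pipeline (Section 3.1).
# clean_article and users_tweets are illustrative placeholders.
import re

from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer

SEQUENCE_LENGTH = 11000  # the fixed sequence length l

def clean_article(text):
    """Apply the text cleaning steps: new lines, HTML tags, special chars."""
    text = text.replace("\n", " ")               # new lines -> spaces
    text = re.sub(r"<[^>]+>", " ", text)         # remove HTML tags
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)  # remove special characters
    return text.lower()

# Toy stand-in for the real data: one list of tweets per user.
users_tweets = [
    ["Buy now!!! http://spam.example", "Follow me for daily deals"],
    ["Had a great day at the park.", "Reading a good book tonight."],
]

# One "article" per user: all tweets merged into a single block of text.
articles = [clean_article(" ".join(tweets)) for tweets in users_tweets]

# Texts to sequences: tokenize and build the word dictionary.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(articles)  # 155,227 unique words on the real corpus
sequences = tokenizer.texts_to_sequences(articles)

# Sequence padding: post-pad (and, where needed, truncate) to length l.
X = pad_sequences(sequences, maxlen=SEQUENCE_LENGTH,
                  padding="post", truncating="post")
```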
3.2 Basic Model Architecture

Our basic architecture is shown in Figure 1. We use a CNN architecture that is based on Kim's approach for sentence classification [7]. This architecture has been widely applied to various tasks in the past. The CNN consists of two one-dimensional convolutional layers, each followed by a max pooling layer, with a dense layer processing the output of the second convolutional layer. The model is completed by a final output layer that uses the sigmoid activation function to return a binary output (namely, the classification into human or bot). In the following, we describe the architecture in more detail:

Input layer. Considering that the article's words are embedded into d-dimensional vectors, the input to the model is a matrix I ∈ R^(l×d), where l is the chosen sequence length (i.e., the length of the articles). Recall that l = 11000 in our setting.

First convolutional layer. The first transformation this embedded input goes through is a convolutional layer with f one-dimensional filters of length k. Thus, the layer weights can be considered a matrix W_c ∈ R^(f×k). We used f = 64 filters and a filter length of k = 4. In our context, having one-dimensional filters of length 4 means that for each word in an article, its three adjacent words are considered as the context of the word. The output of the convolutional layer then is C = conv(I, W_c), where conv is the convolutional operation applied to the input I using the weight matrix W_c. This operation includes applying the ReLU activation function to the convolution outputs. Also, a dropout function is used to prevent overfitting. We used a dropout rate of 0.5, which means that during training 50% of the layer's outputs (randomly chosen) are set to 0.

First max pooling layer. The above output C is then used as the input to a one-dimensional max pooling layer. The purpose of this layer is to extract only the most important features of the convolution outputs. This is done by keeping only the maximum value within each pool of size p. As we chose p = 4, the output of this layer can be written as M = maxpool(C, p). This operation reduces the number of features by a factor of four.

Second convolutional layer and max pooling layer. The output M of the max pooling layer is the input of the second convolutional layer, and the whole process described above is applied to this input to obtain a final output M_2.

Fully connected layer. Given the two-dimensional matrix of values from the last step, the next layer in the network is a fully connected one with 256 hidden neurons. In order for the convolutional output to serve as input to this layer, its dimensionality needs to be reduced. This was done using the flatten method, which keeps all of the values but flattens them into a single long vector. A ReLU activation function and a dropout layer with a rate of 0.5 were used here.

Output layer. The last layer is a fully connected layer with one neuron. The sigmoid activation function is used to provide a binary output.

3.3 Architecture Variations

We developed and evaluated the following architecture variations (a code sketch covering both follows the list):

1. The first variation uses one-dimensional max pooling layers of size 2 after each of the convolutional layers. That means the number of features, after passing from the convolutional layer through the max pooling layer, is halved. Considering that we start with a sequence length of 11,000, the fully connected layer contains more than 11 million weights to train. We call this variation the Large Model in the remainder of this paper.

2. The second variation uses one-dimensional max pooling layers of size 4 after each of the convolutional layers, which reduces the number of features generated by the convolutional layers by a factor of 4. It does so by only keeping the maximum value from 4 adjacent features. If the same input size is used, this results in a significantly lower number of weights in the fully connected layer (around 16,000) compared to the Large Model. We call this variation the Simple Model.
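To make the layer stack concrete, the following is a minimal Keras sketch of the described architecture, with the max pooling size as the parameter distinguishing the two variations. It reflects our reading of the description rather than the authors' released code (see the repository linked in Section 3); details such as the convolution padding mode, which also affect the reported weight counts, are assumptions.

```python
# Sketch of the basic CNN architecture (Sections 3.2 and 3.3).
from keras.layers import (Conv1D, Dense, Dropout, Embedding, Flatten,
                          MaxPooling1D)
from keras.models import Sequential

VOCAB_SIZE = 155227 + 1  # word dictionary size plus the padding index
EMBEDDING_DIM = 300      # embedding dimension d
SEQUENCE_LENGTH = 11000  # sequence length l

def build_model(pool_size):
    """pool_size = 2 gives the Large Model, pool_size = 4 the Simple Model."""
    model = Sequential()
    # Embedding layer: word indices -> an l x d input matrix I.
    model.add(Embedding(VOCAB_SIZE, EMBEDDING_DIM,
                        input_length=SEQUENCE_LENGTH))
    # First convolutional layer: f = 64 filters of length k = 4,
    # C = conv(I, W_c) with ReLU, followed by dropout and max pooling.
    model.add(Conv1D(64, 4, activation="relu"))
    model.add(Dropout(0.5))
    model.add(MaxPooling1D(pool_size=pool_size))
    # Second convolutional and max pooling layer, producing M_2.
    model.add(Conv1D(64, 4, activation="relu"))
    model.add(Dropout(0.5))
    model.add(MaxPooling1D(pool_size=pool_size))
    # Flatten M_2 into one long vector for the fully connected layer.
    model.add(Flatten())
    model.add(Dense(256, activation="relu"))
    model.add(Dropout(0.5))
    # Output layer: one sigmoid neuron for the human/bot decision.
    model.add(Dense(1, activation="sigmoid"))
    return model

model = build_model(pool_size=4)  # the Simple Model variation
```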
4 Evaluation

4.1 Data Set

The actual CLEF 2019 Bots and Gender Profiling Task test data set is hidden and only used for official system submissions. Thus, we used the training and validation data sets of the CLEF 2019 Bots and Gender Profiling Task for training and testing our models before the CLEF submissions. In Table 1, we outline noteworthy statistics about the used data set. Note that the data set provided by CLEF is balanced, i.e., the number of bot profiles and human profiles is the same.

Table 1. Characteristics of the used evaluation data set.

Dataset      Number of articles   In %
Train        2873                 69.85
Validation   1240                 30.15

For the official system submissions, the performance of any system is evaluated based on accuracy. We therefore also use this evaluation metric for our own experiments.

4.2 Evaluation Settings

We developed our models using Keras v2.1.2 with a TensorFlow v1.0.0 backend. Training the models was performed on a machine with 64 GB memory and a GeForce GTX 1080 Ti GPU.

We implemented and evaluated our basic model using a word-based embedding method, where an embedding vector is generated for each unique word in the text corpus. The embedding formed the first layer of our model and was trained alongside the model itself. The texts are truncated to a sequence length of 11,000 words, and we use a 300-dimensional embedding. Thus, this layer on its own adds a further 46 million parameters to be trained in the model's training process. Training the embedding as part of the model allows the generated sequences of word indices to serve directly as the input to our model; the first layer of the network is then responsible for generating the word vectors for each of the words of the input.

We fine-tuned the hyperparameters of our basic model using the dedicated validation data set. In the end, we used the parameters shown in Table 2. Note that these optimal hyperparameters performed best on both architecture variations. A code sketch of the resulting training configuration is given after Table 2.

Table 2. Hyperparameters.

Parameter                   Value
Conv. filter size           64
Conv. kernel size           4
MaxPooling1D pool size      4
Dropout rate                0.5
Dense layer units           256
Layer activation function   ReLU
Optimizer                   Adam
Learning rate               Adaptive (0.001 → 0.00001)
Loss function               Binary cross-entropy
Batch size                  32
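The settings of Table 2 translate into Keras roughly as follows, continuing the model sketch from Section 3.3. The ReduceLROnPlateau callback is one plausible realization of the adaptive learning rate (0.001 → 0.00001); the exact schedule, the epoch count, and the X_train/y_train variables are illustrative assumptions not stated in the paper.

```python
# Sketch of the training configuration from Table 2.
from keras.callbacks import ReduceLROnPlateau
from keras.optimizers import Adam

# Adam optimizer with the initial learning rate; binary cross-entropy
# matches the single sigmoid output neuron; accuracy is the CLEF metric.
model.compile(optimizer=Adam(lr=0.001),
              loss="binary_crossentropy",
              metrics=["accuracy"])

# One plausible reading of "Adaptive (0.001 -> 0.00001)": reduce the
# learning rate when the validation loss plateaus, down to the minimum.
reduce_lr = ReduceLROnPlateau(monitor="val_loss", factor=0.1,
                              patience=2, min_lr=0.00001)

# X_train/y_train and X_val/y_val: padded sequences (Section 3.1) and
# binary bot/human labels for the train and validation splits.
model.fit(X_train, y_train,
          validation_data=(X_val, y_val),
          batch_size=32,   # batch size from Table 2
          epochs=50,       # illustrative; not reported in the paper
          callbacks=[reduce_lr])
```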
4.3 Evaluation Results

Evaluating Embeddings on the Basic Architecture. Table 3 presents the evaluation results for the used embedding dimensions. The embedding dimensions gave similar results, with some slight differences in the model accuracy in favor of the model generating 300-dimensional vectors for each word. Thus, we decided to go with that embedding in the final model for the CLEF test runs.

Table 3. Results for the different embedding dimensions.

Embedding         Accuracy
100-dimensional   84.03%
300-dimensional   85.65%

Evaluating the Architecture Variations. We trained the extended models with the same hyperparameters as our basic model and used word2vec as the method of generating the word embeddings, with a dimension size of 300, based on our results from Table 3. The evaluation results for the two architecture variations are shown in Table 4.

Table 4. Results for the architecture variations.

Architecture   # Trainable Weights   Train Accuracy   Validation Acc.   Precision   Recall   F1 Score   AUC Score
Large Model    11,000,000            95.58%           79.68%            72.94%      94.35%   82.28%     79.68%
Simple Model   16,000                97.67%           85.65%            97.02%      73.55%   83.67%     85.65%

Although the training accuracies were similar, the validation accuracies were very different. Even though the Large Model is a deeper model (having more trainable weights and more predictive power), the big difference between its training and validation accuracies led us to believe that it was overfitting on the training data. One possible reason for this might be that the model is too complex for the limited number of training samples (articles) in our task. The table also shows that the Simple Model performs better on most of the other metrics as well (all except recall). Thus, the model chosen for the CLEF submission was the Simple Model.

Evaluating on CLEF's Test Data Set. Applying our basic model (using our simpler architecture variation, a word2vec embedding layer generating 300-dimensional vectors, and the hyperparameters shown in Table 2) to the official CLEF test data set via official approach submissions, we obtained an accuracy of 90.34%. Based on the accuracy alone, our model performed even better on the test data than in our experiments with the validation data. Comparing our approach to the approaches of other teams at CLEF 2019 with respect to bot identification for the English language (ignoring the other tasks proposed within the CLEF 2019 Bots and Gender Profiling Task), our team ranked in the top half of all participating teams.

5 Conclusion

In this paper, we presented a convolutional neural network architecture for determining whether a Twitter feed is written by a bot or a human. In our experiments, we found that a convolutional neural network using a flatten layer and a 300-dimensional word2vec embedding method performs best among our methods. In the official CLEF 2019 Bots and Gender Profiling Task test runs, we obtained a relatively high accuracy of 90.34%. In the future, besides evaluating a deeper convolutional neural network, we plan to develop further approaches based on LSTMs.
References

1. Adewole, K.S., Anuar, N.B., Kamsin, A., Varathan, K.D., Razak, S.A.: Malicious accounts: Dark of the social networks. Journal of Network and Computer Applications 79, 41–67 (2017)
2. Beskow, D.M., Carley, K.M.: Bot Conversations are Different: Leveraging Network Metrics for Bot Detection in Twitter. In: Proceedings of the IEEE/ACM 2018 International Conference on Advances in Social Networks Analysis and Mining. pp. 825–832. ASONAM'18 (2018)
3. Chavoshi, N., Hamooni, H., Mueen, A.: DeBot: Twitter Bot Detection via Warped Correlation. In: Proceedings of the IEEE 16th International Conference on Data Mining. pp. 817–822. ICDM'16 (2016)
4. Daelemans, W., Kestemont, M., Manjavacas, E., Potthast, M., Rangel, F., Rosso, P., Specht, G., Stamatatos, E., Stein, B., Tschuggnall, M., Wiegmann, M., Zangerle, E.: Overview of PAN 2019: Author Profiling, Celebrity Profiling, Cross-domain Authorship Attribution and Style Change Detection. In: Crestani, F., Braschler, M., Savoy, J., Rauber, A., Müller, H., Losada, D., Heinatz, G., Cappellato, L., Ferro, N. (eds.) Proceedings of the Tenth International Conference of the CLEF Association (CLEF 2019). Springer (2019)
5. García, S., Grill, M., Stiborek, J., Zunino, A.: An empirical comparison of botnet detection methods. Computers & Security 45, 100–123 (2014)
6. Howard, P.N., Woolley, S., Calo, R.: Algorithms, bots, and political communication in the US 2016 election: The challenge of automated political communication for election law and administration. Journal of Information Technology & Politics 15(2), 81–93 (2018)
7. Kim, Y.: Convolutional Neural Networks for Sentence Classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. pp. 1746–1751. EMNLP'14 (2014)
8. Kudugunta, S., Ferrara, E.: Deep neural networks for bot detection. Information Sciences 467, 312–322 (2018)
9. Llewellyn, C., Cram, L., Favero, A., Hill, R.L.: For Whom the Bell Trolls: Troll Behaviour in the Twitter Brexit Debate. CoRR abs/1801.08754 (2018)
10. Loyola-González, O., Monroy, R., Rodríguez, J., López-Cuevas, A., Mata-Sánchez, J.I.: Contrast pattern-based classification for bot detection on Twitter. IEEE Access 7, 45800–45817 (2019)
11. Lundberg, J., Nordqvist, J., Laitinen, M.: Towards a language independent Twitter bot detector. In: Proceedings of the Digital Humanities in the Nordic Countries. pp. 308–319. DHN'19 (2019)
12. Mazza, M., Cresci, S., Avvenuti, M., Quattrociocchi, W., Tesconi, M.: RTbust: Exploiting temporal patterns for botnet detection on Twitter. CoRR abs/1902.04506 (2019)
13. Rangel, F., Rosso, P.: Overview of the 7th Author Profiling Task at PAN 2019: Bots and Gender Profiling. In: Cappellato, L., Ferro, N., Losada, D., Müller, H. (eds.) CLEF 2019 Labs and Workshops, Notebook Papers. CEUR-WS.org (Sep 2019)
14. Varol, O., Ferrara, E., Davis, C.A., Menczer, F., Flammini, A.: Online Human-Bot Interactions: Detection, Estimation, and Characterization. In: Proceedings of the Eleventh International Conference on Web and Social Media. pp. 280–289. ICWSM'17 (2017)