A Four Feature Types Approach for Detecting
         Bot and Gender of Twitter Users
                   Notebook for PAN at CLEF 2019

                                   Johan Fernquist

                           Swedish Defence Research Agency
                              johan.fernquist@foi.se


        Abstract. The main idea of our classification model for the PAN
        Bot and Gender profiling task 2019 was to combine different feature types
        in order to detect differences in writing style that distinguish
        bots, females and males from each other. We included both word and
        character TF-IDF features together with compression and tweet features.
        As classification algorithm we used the CatBoost method. We trained
        two models, one for the English data and one for the Spanish data.
        We achieved the highest accuracy with our English model. For both
        languages, the models were better at distinguishing bots from humans
        than at distinguishing females from males.

        Keywords: Bot detection · Gender profiling · Twitter


1     Introduction
For several years, bots have been used for a large variety of purposes. Initially
their purpose was to automate otherwise unwieldy online processes which could
not be done manually, but they have now become commonly known for mostly being
used for commercial purposes such as directing Internet users to advertisements
and posting spam in different social media channels. Bots are also often used to
further illegal activity such as collecting data from users for criminal gain. Bot
detection is therefore important for a variety of security purposes. Bot detection
has for example been used when monitoring large events such as elections,
with the aim of preventing influence operations [4]. Gender profiling from text is
an important step in author profiling and can also be used for marketing and
commercial purposes.
    In this notebook, we present the steps necessary for reproducing the
model we used in the PAN [2] 2019 Bot and Gender profiling [13] task. We also
briefly describe what we hope to capture with the different types of features.
The concept of the model is illustrated in Figure 1.
    Copyright © 2019 for this paper by its authors. Use permitted under Creative
    Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2019, 9-12
    September 2019, Lugano, Switzerland.
1.1   Previous work
There is a lot of research on bot detection, such as [5], which combines a large
variety of features that have been seen to perform well in previous research.
In [6], two types of features are used: meta-features and tweet features. In [15], a
total of 1,150 different features are used to train a supervised machine learning
model to detect bots. For example, the features consist of part-of-speech (POS)
tags, time features such as statistics of the times between consecutive tweets,
retweets and mentions, and the entropy of words in a tweet. We took these feature
types into consideration when developing our own model. Gender classification from
text is a well-researched problem. In [7], it is shown that POS tags are an important
type of feature for gender classification.


2     Model


                            Fig. 1. Work flow of model


    The model uses four different feature types: term frequency-inverse document
frequency (TF-IDF) on both word and character level, compression features
and tweet features. The different feature types and the supervised
classification model used are described below.
    After doing some tests with a two-step classifier (first bot or human, and then
female or male), we decided to create a classification model which has only one
classification step and classifies bot, female or male directly. This decision was
made since we did not want to train several models, and during our testing phase
we did not see any performance improvement from the two-step classification
model.
    We created individual TF-IDF models (both for characters and words) for the
English and the Spanish datasets. The list of pronouns used in the tweet features
was also language dependent.
2.1   Data preprocessing
Both the training and the test data for the task were stored in XML format. Every
tweet for every user was preprocessed by removing all markers for tabs and citation
characters (”).
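
The preprocessing step could be sketched as follows. This is only an
illustration under stated assumptions, not our original code: the XML layout
(an author element containing document elements), the set of quote characters
and the function names are hypothetical, and "markers for tabs" is interpreted
as literal tab characters.

```python
import xml.etree.ElementTree as ET
from typing import List

# Characters to strip: tabs plus straight and curly double quotes (assumed set).
REMOVED_CHARACTERS = ("\t", '"', "\u201c", "\u201d")


def preprocess_tweet(tweet: str) -> str:
    """Remove tab and citation (quote) characters from a single tweet."""
    for character in REMOVED_CHARACTERS:
        tweet = tweet.replace(character, "")
    return tweet


def load_user_tweets(xml_text: str) -> List[str]:
    """Parse one user's XML and return the preprocessed tweets.

    Assumes every tweet is the text of a <document> element.
    """
    root = ET.fromstring(xml_text)
    return [preprocess_tweet(node.text or "") for node in root.iter("document")]


# Hypothetical example input for one user.
example = """<author lang="en"><documents>
<document><![CDATA[He said \u201chello\u201d\tto me]]></document>
<document><![CDATA[RT @user: "quoted" text]]></document>
</documents></author>"""
tweets = load_user_tweets(example)
```

Note that removing a tab without replacing it by a space joins the two
neighbouring words, which follows the description above literally.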

2.2   Features
We calculated the feature types for the training and test users and then
concatenated the feature vectors of every feature type for each user. Each of the
feature types is described in detail below.

Term frequency-inverse document frequency (words) TF-IDF is a statistical
model which calculates the importance of words in a corpus and gives a higher
value to words that occur often but in few documents. For a complete description
of TF-IDF, see [11]. For this task we trained a TF-IDF model on all
our training data. Then, for each user, we concatenated all
their tweets into one string and calculated the TF-IDF values for every training
and test user. The TF-IDF model kept n-grams of length 1 to 3, and for reasons of
time efficiency a maximum of 2000 features was used. Since it seems unlikely
that the significance of a term grows linearly with its number of occurrences,
sublinear term frequency scaling was applied, as well as smooth inverse document
frequency (meaning that every document frequency is increased by 1, as if one
extra document containing every term had been seen). The TF-IDF features
were computed with Python’s scikit-learn [9]. With the word TF-IDF features,
we hope to capture that bots, females and males tend to discuss different types
of topics, and that bots might have a more compressed feature vector, i.e. use
several terms more often and have a smaller variety of words than humans.
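
The weighting described above can be illustrated with a small pure-Python
sketch. It is not the scikit-learn implementation: the whitespace tokenisation
and the function names are assumptions, and the 2000-feature cap and vector
normalisation are left out. In scikit-learn, the settings in the text would
presumably correspond to the TfidfVectorizer parameters ngram_range=(1, 3),
max_features=2000, sublinear_tf=True and smooth_idf=True.

```python
import math
from collections import Counter
from typing import Dict, List


def word_ngrams(text: str, max_n: int = 3) -> List[str]:
    """Return all word n-grams of length 1..max_n (whitespace tokenisation)."""
    tokens = text.lower().split()
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def tfidf_weights(documents: List[str], max_n: int = 3) -> List[Dict[str, float]]:
    """Sublinear-tf, smooth-idf weight of every n-gram in every document."""
    counts = [Counter(word_ngrams(doc, max_n)) for doc in documents]
    n_docs = len(documents)
    # Document frequency: in how many documents each n-gram occurs.
    doc_freq = Counter(term for count in counts for term in count)
    weights = []
    for count in counts:
        doc_weights = {}
        for term, tf in count.items():
            sublinear_tf = 1.0 + math.log(tf)
            # Smoothing adds 1 to the document count and to each document frequency.
            smooth_idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
            doc_weights[term] = sublinear_tf * smooth_idf
        weights.append(doc_weights)
    return weights


# One "document" per user: all tweets of that user concatenated.
weights = tfidf_weights(["a a b", "b c"])
```

With sublinear scaling, the term "a" occurring twice gets a term frequency
factor of 1 + ln 2 instead of 2, so repeated terms are rewarded less than
linearly.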

Term frequency-inverse document frequency (characters) For the TF-IDF
features on characters, the approach is the
same as described above, but instead of calculating the importance of words, the
model calculates the importance of combinations of characters. We included
character n-grams of length 1 to 4. Since we did not want to include uncommon
character n-grams, we set a minimum document frequency of 20 percent, with a
maximum of 2000 features. By using TF-IDF on characters as features, we mainly
hope to catch the different uses of blank space and of different symbols in
conjunction with letters and digits.
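
How the minimum document frequency and the feature cap restrict the character
vocabulary can be sketched as follows. This is a simplified illustration with
hypothetical function names; ranking the surviving n-grams by their total
count before applying the cap is an assumption about the selection rule.

```python
from collections import Counter
from typing import List


def char_ngrams(text: str, min_n: int = 1, max_n: int = 4) -> List[str]:
    """Return all character n-grams of length min_n..max_n."""
    return [
        text[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(text) - n + 1)
    ]


def select_vocabulary(documents: List[str], min_df: float = 0.2,
                      max_features: int = 2000) -> List[str]:
    """Keep n-grams found in at least min_df of the documents, capped in number."""
    total_counts: Counter = Counter()
    doc_freq: Counter = Counter()
    for document in documents:
        grams = char_ngrams(document)
        total_counts.update(grams)
        doc_freq.update(set(grams))  # count each n-gram once per document
    frequent = [g for g in total_counts if doc_freq[g] / len(documents) >= min_df]
    # Most frequent n-grams first; ties broken alphabetically for determinism.
    frequent.sort(key=lambda g: (-total_counts[g], g))
    return frequent[:max_features]


# Five identical documents and one outlier: "xy" occurs in 1 of 6 documents.
vocabulary = select_vocabulary(["ab", "ab", "ab", "ab", "ab", "xy"])
```

In this toy corpus the n-grams of "xy" appear in fewer than 20 percent of the
documents and are therefore dropped from the vocabulary.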

Compression features Compression features were obtained by compressing the
concatenated tweets of each user and calculating different statistics of the
compressed tweets. The reason for including compression features was
the assumption that bots might communicate in a more monotonous and
repetitious way than humans. Spambots in particular are likely to
post the same tweet over and over again, perhaps without changing the content
at all. We wanted the compression features to catch that kind of behavior by
detecting a difference in compression ratios between human and bot accounts.
    We used Python’s zipfile module to compress each user’s concatenated
tweets with the three compression methods Deflated, BZIP2 and LZMA,
which are all included in the module. To obtain the compression feature vector
for a user, we concatenated the following entities, giving us 19 features:

 – Original size (size of all concatenated tweets of a user before compression)
 – Compressed size for every compression method
 – Mean, median, population standard deviation, standard deviation, max value
   and min value of the compressed sizes
 – Normalized compression (each compressed size divided by the original size)
 – Mean, median, population standard deviation, standard deviation, max value
   and min value of the normalized compressions
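
A minimal sketch of how such a 19-dimensional vector could be computed with
the zipfile module is given below. The in-memory archive, the UTF-8 byte
length as "original size", the archive member name and the function names are
assumptions made for illustration, and the input text is assumed to be
non-empty.

```python
import io
import statistics
import zipfile
from typing import List

# The three compression methods named in the text.
METHODS = (zipfile.ZIP_DEFLATED, zipfile.ZIP_BZIP2, zipfile.ZIP_LZMA)


def compressed_size(data: bytes, method: int) -> int:
    """Size in bytes of data after compression inside an in-memory ZIP archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=method) as archive:
        archive.writestr("tweets.txt", data)
    with zipfile.ZipFile(buffer) as archive:
        return archive.getinfo("tweets.txt").compress_size


def summary(values: List[float]) -> List[float]:
    """Mean, median, population std, sample std, max and min."""
    return [
        statistics.mean(values),
        statistics.median(values),
        statistics.pstdev(values),
        statistics.stdev(values),
        max(values),
        min(values),
    ]


def compression_features(text: str) -> List[float]:
    """Return 1 + 3 + 6 + 3 + 6 = 19 compression features for one user."""
    data = text.encode("utf-8")
    original = len(data)  # assumed to be larger than zero
    sizes = [compressed_size(data, method) for method in METHODS]
    ratios = [size / original for size in sizes]
    return [original, *sizes, *summary(sizes), *ratios, *summary(ratios)]


# A repetitive, spam-like account (hypothetical text).
spam_features = compression_features("Buy followers now! http://example.com " * 200)
```

For a highly repetitive text such as the one above, the normalized compression
values are expected to be far smaller than for varied text, which is the
signal the features are meant to expose.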


Tweet features The tweet features consist of a variety of features connected to
the attributes of a user’s way of tweeting and the content of the tweets. We have
previously done bot detection on tweets in [5], but in
this task we have no time stamps for the tweets and no metadata for the users, and
therefore some of the features differ from our previous method. We have also
included some additional features such as part-of-speech tags and pronouns.
    Several of the attributes calculated for a user are vectors, and these
vectors are represented as features by calculating statistics of the vector.
The statistics calculated for the vectors are always mean, median, population
standard deviation, standard deviation, maximum value and minimum value.
All tweet features are listed below:

 – Retweet ratio (number of tweets that are retweets divided by number of
   posted tweets)
 – The character length of all tweets concatenated
 – Shannon entropy [14] of all tweets concatenated
 – Number of unique words for all tweets
 – Number of tweets that have been truncated during the crawling process.
   They always end with a character showing three dots (...).
 – Number of different characters the tweets start with
 – Number of unique starting characters (including only letters and numbers)
 – Whether or not the user always starts the tweet with a mention of another
   user
 – Number of different characters the tweets end with
 – Number of tweets mentioning the word bot
 – Number of unique hashtags used divided by the total number of hashtags
   used
 – Number of unique hashtags used divided by the total number of tweets
 – Number of unique users mentioned divided by the total number of mentioned
   users
 – Number of unique users mentioned divided by the total number of tweets
 – Number of unique tweets published divided by the total number of tweets
 – Number of unique 30-character beginnings of tweets
 – Number of unique 8-character beginnings of tweets
 – Number of tweets that contain no hashtags, mentions or URLs and are not
   retweets, divided by the total number of tweets
 – Number of unique emojis used
 – Number of unique emojis used divided by the total number of emojis used
 – Number of unique characters to end tweets with
 – Number of unique URLs in tweets
 – Number of unique URLs in tweets divided by the total number of URLs in
   tweets
 – Number of unique domains linked to
 – Number of unique domains linked to divided by the total number of linked
   domains
 – Statistics of number of URLs per tweet
 – Statistics of length of tweets
 – Statistics of number of mentions per tweet
 – Statistics of Shannon entropy per tweet
 – Statistics of number of hashtags per tweet
 – Statistics of number of words per tweet
 – Statistics of number of pronouns per tweet
 – Statistics of number of upper case letters per tweet
 – Statistics of number of lower case letters per tweet
 – Statistics of number of blank spaces per tweet
 – Statistics of number of digits per tweet
 – Statistics of number of line breaks per tweet
 – Statistics of number of tweets between two tweets including a hashtag
 – Statistics of number of tweets between two tweets including a URL
 – Statistics of number of tweets between two tweets including a mention
 – Statistics of number of tweets between two tweets including a question mark
 – Statistics of number of tweets between two tweets being retweets
 – Statistics of the Levenshtein distance between consecutive tweets. Read more
   about the Levenshtein distance in [1]
 – Statistics of the part-of-speech (POS) vector, where every element in
   the vector corresponds to the number of occurrences of a specific POS tag.
   POS tagging is done with the Natural Language Toolkit [8].

Some features’ denominators are increased by 1 to prevent division by zero. The
feature vector for the tweet features consists of 139 features.
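
A few of the listed tweet features can be sketched as follows. The sketch
rests on several assumptions that the text does not settle: entropy is
computed over characters, retweets are recognised by the prefix "RT ", the
hashtag ratio is one of the features whose denominator is increased by 1, and
the sample standard deviation of a single value is set to zero. All function
names are hypothetical.

```python
import math
import re
import statistics
from collections import Counter
from typing import List, Sequence


def summary_statistics(values: Sequence[float]) -> List[float]:
    """Mean, median, population std, sample std, max and min of a vector."""
    if not values:
        return [0.0] * 6  # assumption: empty vectors give all-zero statistics
    sample_std = statistics.stdev(values) if len(values) > 1 else 0.0
    return [statistics.mean(values), statistics.median(values),
            statistics.pstdev(values), sample_std, max(values), min(values)]


def shannon_entropy(text: str) -> float:
    """Shannon entropy (bits per character) of the character distribution."""
    total = len(text)
    return sum((count / total) * math.log2(total / count)
               for count in Counter(text).values())


def gaps_between(flags: Sequence[bool]) -> List[int]:
    """Number of tweets lying between consecutive tweets whose flag is True."""
    positions = [index for index, flag in enumerate(flags) if flag]
    return [later - earlier - 1 for earlier, later in zip(positions, positions[1:])]


def retweet_ratio(tweets: Sequence[str]) -> float:
    """Share of tweets that are retweets (tweets assumed to be non-empty list)."""
    return sum(tweet.startswith("RT ") for tweet in tweets) / len(tweets)


def unique_hashtag_ratio(tweets: Sequence[str]) -> float:
    """Unique hashtags divided by (total hashtags + 1) to avoid division by zero."""
    hashtags = [tag for tweet in tweets for tag in re.findall(r"#\w+", tweet)]
    return len(set(hashtags)) / (len(hashtags) + 1)


tweets = ["RT @a: hi #x", "hello #x #y", "plain", "again #y"]
hashtag_gaps = gaps_between(["#" in tweet for tweet in tweets])
gap_statistics = summary_statistics(hashtag_gaps)
entropy_statistics = summary_statistics([shannon_entropy(t) for t in tweets])
```

Each vector-valued attribute contributes six numbers to the feature vector in
this way, while scalar attributes such as the retweet ratio contribute one.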
    With the tweet features, we hope to distinguish bots and humans from
each other in many ways. We went through the labeled data manually and could
for example see that there were often accounts which always started their tweets
with a mention, or always retweeted someone. In our labeled data we also saw
that women used emojis more frequently, which motivated us to implement
the features regarding emojis. Under the hypothesis that a bot wants to contact
and be seen by as many users as possible (for commercial purposes, for example),
the features concerning the use of mentions and hashtags are important. If
an account is used for generating traffic to a website (which could be likely for a
spambot), the number of different URLs posted would be reasonably small, but
they should occur in several tweets. The Levenshtein feature, the statistics of the
number of tweets between two tweets including hashtags, URLs etc. and the entropy
features are all used for detecting how much the content changes between tweets. It
seems reasonable that a bot would not change the content of its tweets as much
as a human would.
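
The Levenshtein feature can be illustrated with a standard dynamic-programming
implementation of the edit distance, applied to each pair of consecutive
tweets. This is a generic textbook sketch with hypothetical function names,
not necessarily the implementation we used.

```python
from typing import List, Sequence


def levenshtein(first: str, second: str) -> int:
    """Minimum number of single-character edits turning first into second."""
    # previous[j] holds the distance between the processed prefix of `first`
    # and the first j characters of `second`.
    previous = list(range(len(second) + 1))
    for i, char_first in enumerate(first, start=1):
        current = [i]
        for j, char_second in enumerate(second, start=1):
            current.append(min(
                previous[j] + 1,                               # deletion
                current[j - 1] + 1,                            # insertion
                previous[j - 1] + (char_first != char_second),  # substitution
            ))
        previous = current
    return previous[-1]


def consecutive_distances(tweets: Sequence[str]) -> List[int]:
    """Edit distance between every tweet and the tweet that follows it."""
    return [levenshtein(a, b) for a, b in zip(tweets, tweets[1:])]


# A spam-like account that barely changes its tweets (hypothetical text).
distances = consecutive_distances(["Win a prize 1", "Win a prize 2", "Win a prize 2"])
```

Small and nearly constant distances, as in this example, indicate an account
whose content hardly changes from one tweet to the next.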

2.3    Classification algorithm
Initially we used the Random Forest algorithm for classification, but we later
discovered that CatBoost gave us better performance. The CatBoost algorithm
is based on gradient boosting over decision trees and
is further described in [3].
    For the CatBoost classifier, we trained the model on our training data,
and to prevent the model from overfitting, we used the test set as validation
data. Since the evaluation metric for the PAN Bot and Gender task would be
accuracy, we chose accuracy as the metric for selecting the best final model after
a total of 5000 iterations.
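
A training setup along these lines could look as follows with the catboost
Python package. Only the 5000 iterations, the accuracy metric and the use of a
validation set for model selection come from the text; the MultiClass loss,
the remaining defaults and the function name are assumptions.

```python
# Settings taken from the text, plus an assumed loss for the three classes
# bot, female and male.
CATBOOST_PARAMS = {
    "iterations": 5000,
    "loss_function": "MultiClass",  # assumption: not stated in the text
    "eval_metric": "Accuracy",
    "use_best_model": True,         # keep the iteration with the best validation score
}


def train_model(train_features, train_labels, validation_features, validation_labels):
    """Fit a CatBoost classifier and keep the best iteration on the validation set."""
    # Imported here so the sketch can be read without catboost installed.
    from catboost import CatBoostClassifier

    model = CatBoostClassifier(**CATBOOST_PARAMS)
    model.fit(
        train_features,
        train_labels,
        eval_set=(validation_features, validation_labels),
        verbose=False,
    )
    return model
```

With use_best_model enabled, the returned model is truncated to the iteration
that scored best on the validation data, which corresponds to selecting the
best final model by accuracy.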


3     Experiment and results
We parsed every tweet of every training and test user. Our TF-IDF models
were trained on the tweets from our training users, and we then calculated the
TF-IDF, compression and tweet features for all of our users. This gave us a
total of 4158 features calculated for each of the users. We let the CatBoost
model learn for 5000 iterations, training on our training users and validating on
our test users. We then saved the model giving us the best accuracy on the
validation set. This was done for English and Spanish separately and resulted in
two different models. These two models were then applied to the dataset provided
in the TIRA [10] environment, and the results are shown in Table 1. The Low
Dimensionality Statistical Embedding (LDSE) baseline described in [12] is also
included in the table.
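
The total of 4158 features is consistent with the block sizes given in
Section 2.2, assuming that both TF-IDF models reach their maximum of 2000
features. The sketch below checks this arithmetic and shows the concatenation
of the four blocks into one vector per user; the block names and the function
are hypothetical.

```python
from typing import Dict, List

# Number of features per feature type, in concatenation order.
FEATURE_BLOCK_SIZES = {
    "word_tfidf": 2000,   # maximum number of word n-gram features
    "char_tfidf": 2000,   # maximum number of character n-gram features
    "compression": 19,
    "tweet": 139,
}


def concatenate_blocks(blocks: Dict[str, List[float]]) -> List[float]:
    """Join the four feature blocks of one user into a single feature vector."""
    vector: List[float] = []
    for name, expected_size in FEATURE_BLOCK_SIZES.items():
        block = blocks[name]
        if len(block) != expected_size:
            raise ValueError(f"{name}: expected {expected_size} features, got {len(block)}")
        vector.extend(block)
    return vector


total_features = sum(FEATURE_BLOCK_SIZES.values())  # 2000 + 2000 + 19 + 139
user_vector = concatenate_blocks(
    {name: [0.0] * size for name, size in FEATURE_BLOCK_SIZES.items()}
)
```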

                  Table 1. Result of bot and gender classification

                     English               Spanish
              Human/bot Female/male Human/bot Female/male Average
    Own model 0.9496    0.8273      0.9061    0.7667      0.8624
    LDSE      0.9054    0.7800      0.8372    0.6900      0.8032


    It is clear that our model performs better on the English data than on
the Spanish data. There might be several reasons for this. Since the TF-IDF
features are the only language dependent features, the signals regarding gender
profiling that they capture in English might be harder to catch in Spanish. It
might also be the case that the Spanish dataset is more complex, making
the problem harder to solve. The bot or human classification seems to be an
easier task for our models than the gender classification, for both
languages. It is also clear that our model performs better than the LDSE
baseline on all of the different tasks.


References
 1. Black, P.E.: Dictionary of algorithms and data structures. National Institute of
    Standards and Technology Gaithersburg (2004)
 2. Daelemans, W., Kestemont, M., Manjavancas, E., Potthast, M., Rangel, F.,
    Rosso, P., Specht, G., Stamatatos, E., Stein, B., Tschuggnall, M., Wiegmann, M.,
    Zangerle, E.: Overview of PAN 2019: Author Profiling, Celebrity Profiling,
    Cross-domain Authorship Attribution and Style Change Detection. In: Crestani,
    F., Braschler, M., Savoy, J., Rauber, A., Müller, H., Losada, D., Heinatz, G.,
    Cappellato, L., Ferro, N. (eds.) Proceedings of the Tenth International
    Conference of the CLEF Association (CLEF 2019). Springer (Sep 2019)
 3. Dorogush, A.V., Ershov, V., Gulin, A.: Catboost: gradient boosting with
    categorical features support. arXiv preprint arXiv:1810.11363 (2018)
 4. Fernquist, J., Kaati, L.: Online monitoring of large events. In: 2019 IEEE
    International Conference on Intelligence and Security Informatics (ISI). IEEE
    (2019)
 5. Fernquist, J., Kaati, L., Schroeder, R.: Political bots and the swedish general
    election. In: 2018 IEEE International Conference on Intelligence and Security
    Informatics (ISI). pp. 124–129. IEEE (2018)
 6. Gilani, Z., Farahbakhsh, R., Tyson, G., Wang, L., Crowcroft, J.: Of bots and
    humans (on twitter). In: Proceedings of the 2017 IEEE/ACM International
    Conference on Advances in Social Networks Analysis and Mining 2017. pp.
    349–354. ASONAM ’17, ACM, New York, NY, USA (2017).
    https://doi.org/10.1145/3110025.3110090,
    http://doi.acm.org/10.1145/3110025.3110090
 7. Isbister, T., Kaati, L., Cohen, K.: Gender classification with data independent
    features in multiple languages. In: 2017 European Intelligence and Security
    Informatics Conference (EISIC). pp. 54–60. IEEE (2017)
 8. Loper, E., Bird, S.: NLTK: The natural language toolkit. In: Proceedings of the
    ACL Workshop on Effective Tools and Methodologies for Teaching Natural
    Language Processing and Computational Linguistics. Philadelphia: Association
    for Computational Linguistics (2002)
 9. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O.,
    Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A.,
    Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: Machine
    learning in Python. Journal of Machine Learning Research 12, 2825–2830 (2011)
10. Potthast, M., Gollub, T., Wiegmann, M., Stein, B.: TIRA Integrated Research
    Architecture. In: Ferro, N., Peters, C. (eds.) Information Retrieval Evaluation in
    a Changing World - Lessons Learned from 20 Years of CLEF. Springer (2019)
11. Rajaraman, A., Ullman, J.D.: Mining of massive datasets. Cambridge University
    Press (2011)
12. Rangel, F., Franco-Salvador, M., Rosso, P.: A low dimensionality representation
    for language variety identification. In: International Conference on Intelligent
    Text Processing and Computational Linguistics. pp. 156–169. Springer (2016)
13. Rangel, F., Rosso, P.: Overview of the 7th Author Profiling Task at PAN 2019:
    Bots and Gender Profiling. In: Cappellato, L., Ferro, N., Losada, D., Müller, H.
    (eds.) CLEF 2019 Labs and Workshops, Notebook Papers. CEUR-WS.org (Sep
    2019)
14. Shannon, C.E.: A mathematical theory of communication. Bell system technical
    journal 27(3), 379–423 (1948)
15. Varol, O., Ferrara, E., Davis, C.A., Menczer, F., Flammini, A.: Online
    human-bot interactions: Detection, estimation, and characterization. CoRR
    abs/1703.03107 (2017), http://arxiv.org/abs/1703.03107