An Evolutionary Approach to Build User Representations for Profiling of Bots and Humans in Twitter
Notebook for PAN at CLEF 2019

Roberto López-Santillán1, Luis Carlos González-Gurrola1, Manuel Montes-y-Gómez2, Graciela Ramírez-Alonso1, and Olanda Prieto-Ordaz1

1 Facultad de Ingeniería, Universidad Autónoma de Chihuahua, Circuito No. 1, Nuevo Campus Universitario, Apdo. postal 1552, Chihuahua, Chih., México, C.P. 31240
{jrlopez,lcgonzalez,galonso,oordaz}@uach.mx
2 Departamento de Ciencias Computacionales, Instituto Nacional de Astrofísica, Óptica y Electrónica, Tonantzintla, Puebla, México
mmontesg@inaoep.mx

Abstract In this work we describe a novel term-weighting scheme to produce document representations of Twitter posts for the Author Profiling task at PAN 2019. The purpose of this task is to predict the type and gender of an author: bot/human and bot/female/male, respectively. The novelty of our approach lies in using an evolutionary algorithm to combine traditional statistical features, e.g. tf and idf, into an improved weighting scheme. Results suggest that our proposal outperforms the baseline, which uses the mean of the word embeddings as the document representation, by up to 6% in accuracy when predicting author type in the English language.

1 Introduction

Bots are autonomous programs that aim to pose as humans on online platforms. They can be employed in a variety of applications, ranging from customer-service activities to attempts to influence mass opinion. Take for instance the 2016 U.S. presidential election, where political organizations influenced voters through the use of these software agents [4]. The problem of correctly distinguishing bots from human beings has been addressed with a variety of approaches, including computing statistics of their behavior, summarized for example in the "friend to followers ratio", "friend count" and "follower count" [13].
However, since bots improve constantly, more sophisticated approaches are needed, with the capacity to analyze even the discourse that bots create.

Author Profiling (AP) consists in predicting characteristics of authors based on information that they create or share. Given the nature of bots, it is then natural to approach their recognition as an AP task. PAN at CLEF [3], a yearly competition on AP, proposed for 2019 to identify bots in two prediction problems: type of writer (human or bot) and gender (bot, female or male), based on anonymized Twitter posts [11,10]. For this competition we addressed the problem using Word Embeddings (WEs) and Genetic Programming (GP), the latter to calculate a weighting scheme so that the former could be aggregated into Document Embeddings (DEs). The DEs were then corrected by subtracting their first principal component, obtained with Principal Component Analysis (PCA), to boost prediction accuracy. Combining WEs to produce DEs has been attempted before, using aggregation functions such as the sum, mean and median [1]. In contrast, we propose custom aggregation formulas evolved through GP, which to the best of our knowledge have been used for the AP problem only in our more comprehensive work [5] and in the present report.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2019, 9-12 September 2019, Lugano, Switzerland.

In this document we report the results attained by our approach in the training and competition stages at PAN 2019 [9]. Next we describe our methodology, followed by the results and conclusions.

2 Methodology

Our proposed approach to the bot/human distinction, depicted in Figure 1, uses an ensemble of novel Machine Learning (ML) and Natural Language Processing (NLP) algorithms.
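The core operation of this ensemble, detailed in Section 2.3, is composing a document embedding as a weighted average of word embeddings. A minimal sketch of that aggregation follows; the tiny 3-dimensional vectors and the per-term weights are illustrative stand-ins, not our trained embeddings or the evolved GP weights.

```python
# Sketch: compose a document embedding (DE) as a weighted average
# of word embeddings (WEs). Toy values for illustration only.

def compose_de(word_embeddings, weights):
    """Weighted average of word vectors (the scheme of Section 2.3)."""
    dim = len(next(iter(word_embeddings.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for term, vector in word_embeddings.items():
        w = weights.get(term, 0.0)
        for i, v in enumerate(vector):
            total[i] += w * v
        weight_sum += w
    return [t / weight_sum for t in total]

# Toy 3-dimensional embeddings for the terms in one user's tweets.
wes = {"hello": [1.0, 0.0, 0.0],
       "world": [0.0, 1.0, 0.0],
       "bot":   [0.0, 0.0, 1.0]}
# Toy per-term weights (in the paper these come from the evolved GP formula).
gp_weights = {"hello": 1.0, "world": 2.0, "bot": 1.0}

print(compose_de(wes, gp_weights))  # [0.25, 0.5, 0.25]
```

A term with a larger weight pulls the document vector toward its own embedding, which is exactly the effect the evolved weighting formulas exploit.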
This technique has attained competitive results on other datasets, which raises the question of whether it could also deliver a practical outcome in the PAN 2019 AP task. We have already tested a variant of this methodology on the AP datasets from PAN 2013-2018, achieving promising results, as shown in the work of López-Santillán et al. [5]. Next we elaborate on the main steps of this approach.

2.1 Word Embeddings

The first step in our technique is to construct a vocabulary of Word Embeddings (WEs) from the training dataset. To do so, we used the Skip-gram variant of word2vec (w2v) [6] to produce our own WEs set. We also tested other WEs algorithms, such as GloVe [8] and fastText [2], but preliminary results showed a better performance of w2v in our task. Furthermore, we tried WEs pre-trained on large corpora such as Wikipedia, available for the English and Spanish languages. Nonetheless, WEs trained on the PAN 2019 dataset delivered better accuracy.

2.2 Computing Statistical Features of the Dataset

In parallel with the previous step, we calculate several statistics from the training data. These variables represent the importance of terms within the dataset from different perspectives. Table 1 lists the features we considered (below the table we explain what each variable captures). At the end of this stage there are seven dictionaries, each mapping the words in the vocabulary of the dataset to their values according to the corresponding formula.

Figure 1: The method is an ensemble of Word Embeddings (WEs) and Genetic Programming (GP) that produces Document Embeddings (DEs) to distinguish bots from humans, and females from males.

Table 1: Statistical features computed over the dataset.
Variable (used to evolve)   Mathematical formula
X0                          tf_{t,d} = tf(t, d)
X1                          idf_{t,d} = (1 + log tf_{t,d}) · log(N / df_t)
X2                          tf_{t,len(d)} = tf(t, d) / Σ_{i=1}^{D} len(d_i)
X3                          wdist_{w_0,w_last} = w_last − w_0
X4                          ddist_{d_0,d_last} = d_last − d_0
X5                          iGain_{tfidf} = Σ_{x,y} P(x, y) · ln( P(x, y) / (P(x) P(y)) )
X6                          tf-idf_{t,d} = tf(t, d) · (1 + log tf_{t,d}) · log(N / df_t)
X7                          W = F / H

• X0 is the term frequency by user, where t is a word from the vocabulary and d is the set of documents (tweets) of that user.
• X1 is the idf value, which represents the importance of the term across the dataset; N is the total number of documents (tweets) in the dataset and df_t is the number of documents where the term appears, so words are weighted according to how often they appear across documents.
• X2 is the term frequency by user, divided by the sum of the lengths of all documents (tweets) where the term appears, where D is the number of documents containing the term.
• X3 computes the distance (number of words) between the first and last appearance of a term by user, where w_0 is the initial position and w_last the last position.
• X4 is the distance (number of documents) between the first and last appearance of a term by user, where d_0 is the initial document (tweet) and d_last the last document where the term appears.
• X5 establishes the mutual information (dependence) between x (features from a tf-idf sparse matrix) and y (the target), where P(x, y) is the joint probability of x and y, P(x) is the probability of the feature vector, and P(y) the probability of the target.
• X6 is the product of the term frequency in a document and its idf value. This means that a term appearing frequently in just a few documents (tweets) receives a greater tf-idf value than a term appearing very often in many documents.
• X7 determines the importance of terms according to the most likely topic a document (tweet) belongs to.
Topics are extracted from the dataset using the Non-Negative Matrix Factorization (NMF) method, where F is a tf-idf sparse matrix of features, W is an array of shape (topics, terms_in_vocabulary) containing the values of all terms according to each topic, and H is a non-negative matrix that, multiplied by W, yields the original features F.

We designed the last variable (X7) specifically for AP tasks. As already mentioned, a more extensive study over a larger number of datasets has shown promising results for this feature in the AP problem, as detailed in López-Santillán et al. [5].

2.3 Composing Document Embeddings: An Evolutionary Approach

Once the WEs and the statistical values of all terms in the training-data vocabulary are computed, we need a strategy to represent all the tweets of each user. The representation of larger chunks of text is a matter of current interest in the NLP community. Our approach composes Document Embeddings (DEs), which summarize all posts of a user in a single vector, by aggregating the WEs of all terms in the tweets through a Weighted Average Scheme (WAS). To calculate the term weights for the WAS, we opted for Genetic Programming (GP), an evolutionary algorithm that uses symbolic regression to evolve mathematical equations approximating a target, minimizing the error with each new generation of individuals (equations). GP encodes the equations as trees, where the terminal nodes are variables and constants. We implemented the GP phase using the gplearn library for Python [12]. Table 2 shows the parameters employed by GP. A different equation was evolved for each language/target combination. For instance, Equation 1 is the evolved formula used to aggregate WEs into DEs when predicting gender (bot/female/male) in the Spanish language.

Table 2: Parameters of the Genetic Programming process.
Parameter          Value
Population size    1000
Generations        100
Function set       add, sub, mul, div, sqrt, log, sin, cos, tan, neg, max, min, abs
Metric             mean absolute error
Tournament size    20
Constant range     [-10, 10]
Tree depth range   2-6
Init. method       ramped half-and-half
Crossover prob.    0.9
Mutation prob.     0.1

GPWeight[w_n] = sub(sqrt(min(neg(cos(log(X2_{w_n}))), add(log(X2_{w_n}), X4_{w_n}))), neg(X5_{w_n}))    (1)

where X2_{w_n} is the term frequency over the documents containing the word, X4_{w_n} is the term dispersion in number of documents (number of Twitter posts between the first and last appearance of the term), and X5_{w_n} is the information-gain value of the term for predicting gender.

Finally, we produce the DEs of each user by computing a weighted average of the WEs of all terms in that user's posts, using the corresponding evolved formula to calculate the term weights. Equation 2 shows how we integrate the DEs for each user in the dataset. It is worth noting that each statistical value (detailed in Section 2.2) was tested individually as well as collectively; the evolutionary combination of these statistics performed better every time.

DEs-GP-Fmla = ( Σ_{n=1}^{i} WE_n · GPWeight[w_n] ) / ( Σ_{n=1}^{i} GPWeight[w_n] )    (2)

where WE_n is the Word Embedding of the n-th term in the tweets of a user and GPWeight[w_n] is the GP-computed weight of that term.

2.4 Boosting the Accuracy of the DEs

Inspired by the work of Arora et al. [1], we used Principal Component Analysis (PCA) to obtain the first principal component of the DEs; we multiply it by its transpose and by the DEs themselves, and then subtract this product from the original DE vectors. Arora et al. proposed this step after finding a boost of up to 10% in accuracy on textual-similarity tasks. Equation 3 shows this final processing of the DEs.

Table 3: Distribution of users in the provided dataset.
Language   # users (training)   # users (validation)
English    2880                 1240
Spanish    2080                 920

Table 4: Parameters of the SVM classifier.

Classifier               Target   Parameter   Value
Support Vector Machine   Type     C           10
                                  degree      1
                                  gamma       1.0
                                  kernel      'rbf'
                                  coef0       0.0
                         Gender   C           1
                                  degree      1
                                  gamma       1.0
                                  kernel      'rbf'
                                  coef0       0.0

DEs = DEs − (DEs · firstComp(DEs) · firstComp(DEs)^T)    (3)

Although we did not achieve the 10% boost in accuracy reported by Arora et al., results in the training phase showed a small but clear increase in accuracy; the smaller gain is probably because the AP problem is more difficult.

2.5 Classification

The provided dataset includes data in the English and Spanish languages, with entries divided into training and validation partitions. Table 3 details the distribution of samples; note the 70%/30% proportion between the training and validation partitions. Moreover, each user contributes 100 Twitter posts.

We approached the classification stage using a Support Vector Machine (SVM) implemented in the Scikit-learn library for Python [7]. We trained the SVM classifier on the training partition and then performed a grid search of hyper-parameters using the validation partition. Table 4 details the best parameters found for each target. The SVM classifier received the training and validation samples in the form of the DEs built as detailed in Section 2.3.

3 Results

Since the dataset was already provided with training and validation partitions, we evaluated our method using 70% of the data for training and 30% for testing. For comparison, we also implemented a common baseline, which composes DEs using the simple mean of the WEs. Table 5 shows the accuracy attained in the training stage by both the evolutionary weighting scheme and the baseline (simple mean) when predicting the type of user (bot or human) and gender (bot, female or male).
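The first-component removal of Equation 3 can be illustrated without any library by estimating the dominant direction of the DE matrix with power iteration and subtracting each DE's projection onto it. This is a minimal sketch of the idea under that assumption, not our actual PCA-based implementation; the toy DEs are made-up values.

```python
# Sketch of Equation 3: subtract the first principal component from the DEs.
# The dominant direction is estimated here by power iteration on X^T X
# (our actual implementation obtains it via PCA).
import random

def first_component(rows, iters=200, seed=0):
    """Dominant right singular vector of the matrix `rows` (power iteration)."""
    dim = len(rows[0])
    rng = random.Random(seed)
    u = [rng.random() for _ in range(dim)]
    for _ in range(iters):
        proj = [sum(r[i] * u[i] for i in range(dim)) for r in rows]   # X u
        u = [sum(p * r[i] for p, r in zip(proj, rows)) for i in range(dim)]  # X^T (X u)
        norm = sum(x * x for x in u) ** 0.5
        u = [x / norm for x in u]
    return u

def remove_first_component(rows):
    """DEs = DEs - (DEs . u) u^T, the correction of Equation 3."""
    u = first_component(rows)
    out = []
    for r in rows:
        p = sum(a * b for a, b in zip(r, u))
        out.append([a - p * b for a, b in zip(r, u)])
    return out

# Toy DEs whose dominant variation lies along the first axis.
des = [[3.0, 0.1], [2.9, -0.2], [3.1, 0.05]]
corrected = remove_first_component(des)
u = first_component(des)
# After the correction, every DE is numerically orthogonal to u.
print(max(abs(sum(a * b for a, b in zip(r, u))) for r in corrected))
```

The subtraction removes the direction shared by most DEs, which, as in Arora et al. [1], tends to carry little discriminative information.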
Moreover, for the competition phase we trained our SVM classifier on all the available data (training + validation). Table 6 shows the performance of our approach on the actual competition test data.

Table 5: Accuracy on the training dataset when predicting type (bot, human) and gender (bot, female and male).

Method                Language   Type     Gender
Evol.-weighted-avg.   English    0.9105   0.7903
                      Spanish    0.9130   0.7120
                      Average    0.9118   0.7511
Baseline-mean         English    0.8589   0.7742
                      Spanish    0.9087   0.7207
                      Average    0.8838   0.7474

Table 6: Accuracy on the competition-test dataset when predicting type (bot, human) and gender (bot, female and male).

Method                Language   Type     Gender
Evol.-weighted-avg.   English    0.8867   0.7773
                      Spanish    0.8544   0.7100
                      Average    0.8706   0.7437

4 Conclusion

As the results show, our approach outperforms an accepted, popular baseline. A decrease in accuracy was expected for the competition stage, as seen in Table 6. We have already tested a similar version of the approach described in this paper on several datasets (the AP tasks at PAN 2013-2018) and attained competitive results; similar performance is expected for the PAN 2019 dataset. Although we cannot guarantee the same performance, or even a score in the top quartile of participants, we believe our approach can be implemented as an engineering application that could deliver good results in the real world.

5 Acknowledgments

This work was supported by CONACYT project FC-2410.

References

1. Arora, S., Liang, Y., Ma, T.: A simple but tough-to-beat baseline for sentence embeddings (2017)
2. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5, 135–146 (2017)
3.
Daelemans, W., Kestemont, M., Manjavancas, E., Potthast, M., Rangel, F., Rosso, P., Specht, G., Stamatatos, E., Stein, B., Tschuggnall, M., Wiegmann, M., Zangerle, E.: Overview of PAN 2019: Author Profiling, Celebrity Profiling, Cross-domain Authorship Attribution and Style Change Detection. In: Crestani, F., Braschler, M., Savoy, J., Rauber, A., Müller, H., Losada, D., Heinatz, G., Cappellato, L., Ferro, N. (eds.) Proceedings of the Tenth International Conference of the CLEF Association (CLEF 2019). Springer (Sep 2019)
4. Howard, P.N., Woolley, S., Calo, R.: Algorithms, bots, and political communication in the US 2016 election: The challenge of automated political communication for election law and administration. Journal of Information Technology & Politics 15(2), 81–93 (2018), https://doi.org/10.1080/19331681.2018.1448735
5. López-Santillán, R., Montes-y-Gómez, M., González-Gurrola, L.C., Ramírez-Alonso, G., Prieto-Ordaz, O.: A genetic programming strategy to produce document embeddings for author profiling tasks. Under review (2019)
6. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Burges, C.J.C., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 26, pp. 3111–3119. Curran Associates, Inc. (2013), http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf
7. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825–2830 (2011)
8. Pennington, J., Socher, R., Manning, C.D.: GloVe: Global vectors for word representation. In: EMNLP (2014)
9.
Potthast, M., Gollub, T., Wiegmann, M., Stein, B.: TIRA Integrated Research Architecture. In: Ferro, N., Peters, C. (eds.) Information Retrieval Evaluation in a Changing World - Lessons Learned from 20 Years of CLEF. Springer (2019)
10. Rangel, F., Franco-Salvador, M., Rosso, P.: A low dimensionality representation for language variety identification. In: Proceedings of the 17th International Conference on Computational Linguistics and Intelligent Text Processing (CICLing 2016). Springer-Verlag (2016)
11. Rangel, F., Rosso, P.: Overview of the 7th Author Profiling Task at PAN 2019: Bots and Gender Profiling. In: Cappellato, L., Ferro, N., Losada, D., Müller, H. (eds.) CLEF 2019 Labs and Workshops, Notebook Papers. CEUR Workshop Proceedings, CLEF and CEUR-WS.org (Sep 2019)
12. Stephens, T.: gplearn documentation. pp. 1–55 (Apr 2019), https://buildmedia.readthedocs.org/media/pdf/gplearn/stable/gplearn.pdf
13. Van Der Walt, E., Eloff, J.: Using machine learning to detect fake identities: Bots vs humans. IEEE Access 6, 6540–6549 (2018)