Profiling Fake News Spreaders on Twitter
Notebook for PAN at CLEF 2020

Álvaro López and Pasqual Martí
Universitat Politècnica de València
allochi@prhlt.upv.es, pasmargi@inf.upv.es

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece.

Abstract. The rise of fake news makes social media a dangerous and powerful tool because of the large impact it can have on society. Fake news must be detected automatically to avoid the mass manipulation of people. In this paper we present our team's participation in the PAN 2020 shared task Profiling Fake News Spreaders on Twitter. We propose a deep learning model that classifies Twitter authors as fake news spreaders or not according to their tweets. We mainly tackle the classification of Spanish and English authors separately, although we also considered bilingual models to solve the task. Our best model obtained an accuracy of 0.755 on Spanish and 0.68 on English.

1 Introduction

Global access to the internet has connected people from every part of the world and every social background. When it first appeared, the Internet was intended to be an independent medium, free from the pressure of governments and big influential companies. Its users found in social media platforms a door to all sorts of free information. However, because of its reach, the Internet is also full of false information, misleading either intentionally or unintentionally. Malicious agents take advantage of social media to spread fake news. Fake news capitalizes on society's polarization and appeals to users' feelings, which makes it easy to share without questioning its credibility. This makes fake news a big threat to our society and a powerful tool to manipulate people, ultimately having a great impact on their lives.

Currently, one of the most popular social media platforms is Twitter, a platform in which users share tweets: short written messages that may contain pictures, videos or links to other websites. In this work, we are presented with the task of classifying Twitter users into two classes depending on whether they are fake news spreaders or not. The aim is to build a system that quickly detects this type of user before the news is spread or, at least, stops them from spreading more.

In this paper, we present our participation in the PAN 2020 shared task Profiling Fake News Spreaders on Twitter (4). Our approach is based on deep learning classification models for each tweet language, as well as a bilingual model that classifies tweets of both languages.

2 Related work

The detection of fake news is a complex topic which can be addressed in very different ways in order to make the most of the available information. In (6), the authors use data mining techniques to exploit all the available information before classification. They use not only a linguistic-based approach but also data from the author's profile, their network of contacts on social media (to detect bots) and other users' opinions in their posts.

The type of model used for the classification varies too. In (1), a Naive Bayes classifier is used to tackle the task in a way similar to spam filters. In (2), the authors use SVM classifiers with linguistic features such as n-grams, punctuation, psycholinguistic features (aided by a lexicon), readability and syntax. Finally, in (5) the authors use a deep learning model that combines linguistic information extracted from the news with an RNN and information extracted from user features with a fully connected network, building an ensemble of models for classification.
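To make the surveyed feature-based approaches concrete, the following is a minimal sketch of the kind of n-gram classifier used in works such as (1) and (2). It assumes scikit-learn; the placeholder data, the character n-gram range and the linear SVM are illustrative choices, not a reproduction of any of the cited systems.

```python
# Minimal n-gram baseline in the spirit of the surveyed feature-based systems.
# Assumes scikit-learn; `tweets` and `labels` are hypothetical placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

tweets = ["example tweet one", "example tweet two"]   # placeholder data
labels = [0, 1]                                       # 0 = not fake, 1 = fake

baseline = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5)),  # character n-grams
    LinearSVC(),
)
baseline.fit(tweets, labels)
print(baseline.predict(["a new unseen tweet"]))
```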
We have observed a diversity of approaches, which may derive from the heterogeneity of fake news sources. The medium from which the data is extracted greatly influences the length and format of the news, since the information is presented differently in a tweet than in a web post, for instance. Besides that, the metadata regarding the author and his or her social media network differs too. Because of that, there is no single predominant approach, although it must be noted that most of them take linguistic features into account.

3 Our approach

In order to classify the authors we are given some of their tweets, all written in either English or Spanish. We approached the problem in two different ways: on the one hand, training a classifier that takes all the tweets of an author as a sequence and assigns a class to the author; on the other hand, training a single-tweet classifier and then performing a vote over the classification of each tweet to decide the class of the author.

Between both approaches, we decided to use the single-tweet classifier for two main reasons. Firstly, a solid single-tweet classifier does not depend on the number of tweets available at inference time, so we can classify new authors correctly even if they have fewer or more tweets than the ones in our training set. Secondly, with the sequence-of-tweets approach the number of training samples is drastically reduced, since the number of single tweets is significantly higher than the number of authors.

In the next sections, we describe the models implemented following the chosen approach and how their training was performed. We present not only independent models for English and Spanish but also a bilingual model that deals with the problem of classifying authors whose language is unknown, as well as bilingual authors.

4 Models

In this section we describe the model architectures employed and the main idea behind them. The models used can be grouped into three classes based on the type of encoding used for the tweets: untrained embedding, pretrained neural net language model, and pretrained multilingual encoder.

The first type is model 1, which uses a new, untrained embedding layer to encode the tweet words. After this embedding layer it has two bidirectional LSTM layers with 32 units, followed by a fully connected part with one dense layer of 128 neurons and a final layer with 1 neuron and a sigmoid activation function for classification (0 not fake, 1 fake). For the other layers we employed the ReLU activation function and batch normalization between all the layers. To avoid overfitting we used a dropout layer between the dense layers with a drop factor of 0.5. With this model we tried to train our own embeddings for English and Spanish from scratch. However, as we show in the results section, we could not achieve good results because we did not have enough data to train the embeddings; consequently, the models with these new embeddings could not learn well because of the poor encoding of the words.
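As an illustration, the following Keras sketch reflects the description of model 1 above: a freshly initialized embedding, two bidirectional LSTM layers of 32 units, a 128-neuron dense layer and a sigmoid output, with batch normalization and a dropout of 0.5 between the dense layers. The vocabulary size and embedding dimension are assumptions, since the paper does not report them, and the exact placement of every batch normalization layer is simplified.

```python
# Sketch of model 1: untrained embedding + two BiLSTM layers + dense classifier.
# VOCAB_SIZE and EMBED_DIM are assumed values, not reported in the paper.
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE = 20000   # assumed vocabulary size (one token encoder per language)
EMBED_DIM = 128      # assumed embedding dimension

model_1 = tf.keras.Sequential([
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),                      # trained from scratch
    layers.Bidirectional(layers.LSTM(32, return_sequences=True)),
    layers.Bidirectional(layers.LSTM(32)),
    layers.BatchNormalization(),
    layers.Dense(128, activation="relu"),
    layers.BatchNormalization(),
    layers.Dropout(0.5),                                           # dropout between dense layers
    layers.Dense(1, activation="sigmoid"),                         # 0 = not fake, 1 = fake
])
model_1.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=2e-4, momentum=0.9),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
```

The optimizer settings correspond to the training configuration described in Section 5.2.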
The second type of models are those with a pretrained embedding based on a feed-forward neural net language model (nnlm) (8), obtained from TensorFlow Hub (https://tfhub.dev/google/collections/tf2-preview-nnlm/1). The embedding output is a 128-dimensional vector encoding the full tweet, because this module internally combines the word embeddings into a single vector that encodes the sentence. With this pretrained embedding we aimed to improve the results by taking advantage of the better representation of the tweets that it provides, thanks to the large amount of data (the English Google News 200B and the Spanish Google News 50B corpora) and resources used to train it.

We trained two models with this kind of embedding. The first one is model 3, a basic fully connected net with two dense layers of 32 neurons and a final layer with 1 neuron and a sigmoid activation function for the classification. Before these three layers we employed batch normalization and dropout with a drop factor of 0.5 to avoid overfitting and to regularize the training of the parameters. With these regularization techniques we tried to avoid destroying our embedding, since it was fine-tuned during the training phase.

The second model trained with the nnlm-based embedding is model 4. It uses a slightly more complex architecture, as can be seen in Figure 1. It also uses dense layers, but we introduce skip connections that jump over each dense layer to give the error more paths to back-propagate through. To implement the skip connections we used concatenation layers, and to apply regularization we used batch normalization after each dense layer and dropout after each concatenation layer, with a drop rate of 0.5.

Finally, the last type of models are those with a pretrained multilingual encoder, obtained from TensorFlow Hub (https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/3). This encoder is a CNN-based model trained over 16 languages for tasks such as text classification, text clustering, semantic textual similarity retrieval and cross-lingual text retrieval; all the model details can be found in (7). The main idea of this model is to encode sentences of different languages but with the same meaning into vectors with a similar encoding. In this case the encoder generates a 512-dimensional vector that encodes the whole input sentence. By using this type of sentence encoder, we have been able to train models not only for English and Spanish but also for both languages at the same time, aiming to get results similar to those of the single-language models.

We trained three models with the multilingual encoder. The first one is model 5. This model shares exactly the same architecture as model 4 (shown in Figure 1), but replaces the nnlm encoder with the CNN multilingual encoder. We used the same architecture to be able to compare the two encoders directly and because the capacity of this model is also big enough for this encoding. The next model is model 6, which is an extension of model 5: it has the same layers but with double the number of neurons in the fully connected layers. With this modification we tried to learn more patterns in the data thanks to the bigger model and the strong regularization of the dropout layers, which forces sparsity.
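Below is a minimal sketch of the architecture shared by models 4 and 5 (Figure 1), here instantiated with the multilingual encoder as in model 5. The exact wiring of the concatenation-based skip connections is inferred from Figure 1 and the description above, so it should be read as an approximation rather than the authors' exact implementation.

```python
# Sketch of the models 4/5 architecture: a pretrained sentence encoder followed by
# dense blocks with concatenation-based skip connections, batch norm and dropout.
# The skip wiring is inferred from Figure 1; it is not an exact reproduction.
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # noqa: F401, registers ops required by the multilingual encoder
from tensorflow.keras import layers

ENCODER_URL = "https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/3"

def dense_skip_block(x, units):
    """Dense layer whose output is concatenated with its input (skip connection)."""
    h = layers.Dense(units, activation="relu")(x)
    h = layers.BatchNormalization()(h)
    x = layers.Concatenate()([x, h])
    return layers.Dropout(0.5)(x)

inputs = tf.keras.Input(shape=(), dtype=tf.string)                 # one tweet as a raw string
x = hub.KerasLayer(ENCODER_URL, output_shape=[512])(inputs)        # 512-dim sentence vector
for units in (32, 64, 128):                                        # dense layers of Figure 1
    x = dense_skip_block(x, units)
x = layers.Dense(256, activation="relu")(x)
x = layers.BatchNormalization()(x)
outputs = layers.Dense(1, activation="sigmoid")(x)                 # 0 = not fake, 1 = fake

model_5 = tf.keras.Model(inputs, outputs)
model_5.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-4),        # training setup of Section 5.2
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
```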
The last model trained is the bilingual one, model 7. To approach the harder problem of multilingual classification we take advantage of the multilingual encoder, so that we can work with nearly the same representation of a tweet no matter the language it is written in. However, since we are training for a bilingual task, in order to force the network to learn some finer characteristics of each language we trained a bilingual model with two parallel networks, each pretrained for a single language. For that, we used two pretrained versions of model 5, one for English and one for Spanish, and connected the output of the CNN encoder to both nets; the resulting output tensors of these nets are then concatenated and passed through a simple fully connected net with a dense layer of 128 neurons and ReLU activation, and an output dense layer with 1 neuron and a sigmoid activation function. We also used batch normalization before the two dense layers.

[Figure 1 shows the model 4 architecture as a layer diagram: an input layer, the sentence encoder, and a stack of dense layers, each followed by batch normalization, a concatenation (skip connection) and dropout, ending in a final dense output layer.]

Fig. 1: Model 4 architecture. The dense layer units are 32, 64, 128, 256 and 1 (from top to bottom). The drop rate for the dropout is 0.5. The activation functions are ReLU except for the last dense layer, which uses a sigmoid.

5 Experimental setup

5.1 Data processing and partitioning

In this section we describe how we split and processed the data in order to train the models. The dataset has tweets from 600 authors written in English or Spanish. For this task the language of an author's tweets is known in advance, so we are able to build independent models for each language. Nevertheless, we also confront the task with a multilingual model that works with tweets of both languages.

As can be seen in Table 1, we have a perfectly balanced dataset at the author level, so we decided to make a random partition with 80% of the authors for the train split and the other 20% for validation. Since the models were tested on the TIRA (3) platform, we avoided creating a test split so as to have more training data. From these partitions we extracted all the tweets, each labelled with its author's label, and stored them in one file per partition, obtaining a dataset to train the single-tweet classifiers. All these tweets were processed to fit on a single line by removing the line-break token ('\n'). After this we obtained the data balance shown in Table 2 for English and Table 3 for Spanish.

Language   Class 1   Class 0   Total
English    150       150       300
Spanish    150       150       300
Total      300       300       600

Table 1: Statistics of the dataset authors by label and language.

Partition    Class 1       Class 0       Total
Train        12400 (52%)   11600 (48%)   24000
Validation   2600 (43%)    3400 (57%)    6000
Total        15000         15000         30000

Table 2: Statistics of the dataset tweets by label for the English data. The percentages shown are computed over the corresponding split (train or validation).

We implemented two input pipelines because not all the models previously described accept the same input data format. The first pipeline, used for model 1, takes the tweet strings, tokenizes them by splitting on blank spaces and then uses a token encoder to get the id corresponding to each word token. We used one token encoder with the English vocabulary and another for Spanish, both of them taking their vocabulary from the set of words in the tweets of the train partition.
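The paper does not name the concrete token encoder class that was used, so the following is a minimal stand-in with the same behaviour: splitting on blank spaces and mapping every word seen in one language's training tweets to an integer id.

```python
# Sketch of the first input pipeline: whitespace tokenization plus a per-language
# vocabulary built from the training tweets. The concrete encoder class used by
# the authors is not named in the paper; this is a simple stand-in.

def build_vocab(train_tweets):
    """Map every token seen in the training tweets to an integer id (0 is kept for unknown tokens)."""
    vocab = {}
    for tweet in train_tweets:
        for token in tweet.split():          # split on blank spaces
            if token not in vocab:
                vocab[token] = len(vocab) + 1
    return vocab

def encode(tweet, vocab):
    """Turn a tweet string into a list of token ids, mapping unseen words to 0."""
    return [vocab.get(token, 0) for token in tweet.split()]

# Usage with hypothetical data: one vocabulary per language, built on the train split.
vocab_en = build_vocab(["first english tweet", "second english tweet"])
print(encode("an unseen english tweet", vocab_en))   # -> [0, 0, 2, 3]
```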
Partition    Class 1       Class 0       Total
Train        11800 (49%)   12200 (51%)   24000
Validation   3200 (53%)    2800 (47%)    6000
Total        15000         15000         30000

Table 3: Statistics of the dataset tweets by label for the Spanish data. The percentages shown are computed over the corresponding split (train or validation).

For the rest of the models we used the other input pipeline, which simply forwards the tweet strings to the model. This is because the pretrained encoders (nnlm and multilingual) perform their own processing from the sentence string to the feature vector.

5.2 Training

As explained in the description of the models, we applied abundant regularization with dropout and batch normalization. This is because we observed that without it the models quickly overfitted, reaching training accuracies over 90% while having validation accuracies just over 60%. With all the regularization we achieved a very controlled training, with very little overfitting appearing only at very late epochs.

Regarding models 1, 3 and 4, the ones with the new embedding and the nnlm encoder, we trained them for 5000 epochs, which was possible thanks to their training speed of about two seconds per epoch. Once again, to avoid overfitting and obtain a more stable training we used the SGD optimizer with a momentum of 0.9 and a low learning rate of 0.0002, which is lowered by a learning rate scheduler to 0.0001 from epoch 1000 and to 0.00005 from epoch 3000. For the models with the CNN-based multilingual encoder we had to reduce the number of epochs because they required more computation than the previous ones. In this case we trained for 1000 epochs using the Adam optimizer with a learning rate of 0.0002.

6 Results

To obtain the voting result that classifies an author, we implemented two voting techniques based on the probability of each tweet being fake: the averaged vote and the product vote.

The averaged vote is the more standard way of performing the voting process: the author is classified as a likely fake news spreader if the average of the probabilities of their tweets is higher than a given threshold.

The product vote is slightly more sophisticated. It takes again the probability of each tweet being fake given by the model, and also computes the probability of each tweet being true (P(true) = 1 - P(fake)). Then we compute the product of each group of probabilities and obtain the author label by choosing the class with the highest product. This method is shown in Equation 1, where m is the total number of tweets for the author. Note that in this case we do not use any threshold to get the label; we just take the class that maximizes the product. With this voting technique we aim to get better results when the model outputs very high probabilities, since those probabilities reduce the product value less.

    \mathrm{author\_pred} = \arg\max\left( \prod_{i=0}^{m-1} P(\mathrm{tweet}_i = \mathrm{true}),\; \prod_{i=0}^{m-1} P(\mathrm{tweet}_i = \mathrm{fake}) \right)    (1)
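As an illustration of the two voting schemes, the sketch below takes the per-tweet probabilities of being fake produced by a tweet classifier and returns the author label. The log-probability form of the product vote is a numerical-stability detail added here, not something described in the paper.

```python
# Sketch of the two author-level voting schemes over per-tweet P(fake) values.
import numpy as np

def average_vote(p_fake, threshold=0.5):
    """Averaged vote: the author is a spreader if the mean P(fake) exceeds the threshold."""
    return int(np.mean(p_fake) > threshold)

def product_vote(p_fake):
    """Product vote (Equation 1): pick the class whose per-tweet probabilities have the
    largest product. Sums of log-probabilities are used here for numerical stability."""
    p_fake = np.clip(np.asarray(p_fake, dtype=float), 1e-12, 1 - 1e-12)
    log_prod_fake = np.sum(np.log(p_fake))
    log_prod_true = np.sum(np.log(1.0 - p_fake))
    return int(log_prod_fake > log_prod_true)        # 1 = fake news spreader, 0 = not

# Hypothetical per-tweet probabilities for one author:
p = [0.9, 0.6, 0.4, 0.8]
print(average_vote(p), product_vote(p))              # -> 1 1
```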
After making inference on the validation set with the best model of each type we obtained the results in Table 4. Note that for all the results shown we obtained the same value with both voting techniques described, except in two cases (marked with *) where we got modestly better results using the product vote. The table also shows the false negatives (FN) and the false positives (FP) in order to allow a better analysis.

             English           Spanish           Bilingual
Model   Accuracy  FN/FP   Accuracy  FN/FP   Accuracy  FN/FP
1       0.57      26/0    0.47      32/0    -         -
3       0.58      8/17    0.75      9/6     -         -
4       0.67      9/11    0.75*     10/5    -         -
5       0.73      9/7     0.73      11/5    0.68*     18/21
6       0.73      13/3    0.73      9/7     0.66      24/17
7       -         -       -         -       0.66      27/14

Table 4: Results on the validation split for every model and language. In the case of the bilingual results, the accuracy is computed over the combination of the English and Spanish validation splits. *In these cases the product voting obtains better results.

Analyzing the results of model 1, if we look at the accuracies obtained and the FN/FP counts we can see that this model always predicts the "true" (not fake) class, since the obtained accuracy matches the proportion of true samples in the validation sets for English and Spanish (see Tables 2 and 3). This is caused by the model embedding, which is not pretrained and could not be trained properly with the available data and resources: if the embedding is poor, the results of the net will also be poor.

Looking at models 3 and 4, we start to get better results, mainly for the Spanish data, which seems to be an easier task than the English one. In the case of model 3 and the English data we gain only one point of accuracy over the class proportion of the partition, but looking at the FN/FP counts we see that the main source of errors are the false positives, so this model is not simply voting for the majority class. A bigger step is achieved by model 4, which has a more complex architecture and gets significantly better results on the English data.

Finally, looking at models 5, 6 and 7, the ones with the multilingual encoder, we see a large increase for the English data and a smaller decrease for the Spanish data; overall, classifying each language separately still yields better results than the bilingual classification of models 6 and 7. Note that model 6 is just a bigger version of model 5 but does not achieve any better results and, looking at the bilingual classification, model 5 gets better results with the product vote. Also note that model 7 gets the same result as model 6 despite having the pretrained models 5 and 6 as the two branches of its net.

We also analyzed the output probabilities of the models to see whether the predictions are made with a good level of confidence. In Figure 2 we show the counts of the probabilities output by model 5 when making inference on the validation partition for both languages. Comparing both languages, we can confirm that the models obtained better results for Spanish because the probabilities of the fake samples are more clearly separated from those of the true ones in the right histogram. Also, the range of probabilities is wider for Spanish, so its predictions are more confident in some cases.

If we had more data for this histogram analysis, we could adjust the threshold of the averaged vote to minimize the number of false negatives, lowering it from 0.5 and making a trade-off between false positives and false negatives while taking the shape of the histogram into account. We made some experiments changing the threshold and got better results in some cases but, since the validation partition is very small and we cannot ensure that it is a representative set of data, we decided not to implement this in the final model.

(a) English. (b) Spanish.
Fig. 2: Probability count for the predictions made for each class and dataset language.
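For completeness, below is a minimal sketch of how histograms like those of Figure 2 can be produced from the per-tweet model outputs; the arrays are random placeholders standing in for the actual predictions and labels.

```python
# Sketch of the Figure 2 analysis: histograms of predicted P(fake), split by true class.
# `probs` and `labels` are random placeholders, not the model's actual outputs.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
probs = rng.random(6000)                 # placeholder predicted probabilities of being fake
labels = rng.integers(0, 2, 6000)        # placeholder ground-truth tweet labels

plt.hist(probs[labels == 0], bins=20, alpha=0.6, label="class 0 (not fake)")
plt.hist(probs[labels == 1], bins=20, alpha=0.6, label="class 1 (fake)")
plt.xlabel("predicted probability of being fake")
plt.ylabel("count")
plt.legend()
plt.show()
```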
7 Test set submission on TIRA

After analyzing the results we selected model 5 as the one to submit to the TIRA platform for the test inference. We submitted it during the early bird submission period and obtained an accuracy of 0.755 on Spanish and 0.68 on English. Compared with the results of our experiments, this is more or less the expected outcome, with higher accuracy on Spanish than on English.

8 Conclusions

Detecting fake news spreaders automatically is a really hard task. Although fully automatic detection of malicious users may still be far away, a robust model with a low number of false negatives would have a big impact, providing a first filter of potential fake news spreaders for humans to review more carefully later, saving time and resources.

We have seen that the difficulty of the task also depends on the language, probably not because of the language itself but because of the different ways of producing fake news depending on the language the target audience speaks. In this case it seems that for the Spanish language it is easier to find patterns to detect fake news. Despite the results, we have seen that the task is approachable, and we are convinced that with a bigger dataset and more research we could obtain good enough results.

References

[1] Granik, M., Mesyura, V.: Fake news detection using naive Bayes classifier. In: 2017 IEEE First Ukraine Conference on Electrical and Computer Engineering (UKRCON). pp. 900-903 (2017)
[2] Pérez-Rosas, V., Kleinberg, B., Lefevre, A., Mihalcea, R.: Automatic detection of fake news. CoRR abs/1708.07104 (2017), http://arxiv.org/abs/1708.07104
[3] Potthast, M., Gollub, T., Wiegmann, M., Stein, B.: TIRA Integrated Research Architecture. In: Ferro, N., Peters, C. (eds.) Information Retrieval Evaluation in a Changing World. Springer (Sep 2019)
[4] Rangel, F., Giachanou, A., Ghanem, B., Rosso, P.: Overview of the 8th Author Profiling Task at PAN 2020: Profiling Fake News Spreaders on Twitter. In: Cappellato, L., Eickhoff, C., Ferro, N., Névéol, A. (eds.) CLEF 2020 Labs and Workshops, Notebook Papers. CEUR-WS.org (Sep 2020)
[5] Ruchansky, N., Seo, S., Liu, Y.: CSI: A hybrid deep model for fake news. CoRR abs/1703.06959 (2017), http://arxiv.org/abs/1703.06959
[6] Shu, K., Sliva, A., Wang, S., Tang, J., Liu, H.: Fake news detection on social media: A data mining perspective (2017)
[7] Yang, Y., Cer, D., Ahmad, A., Guo, M., Law, J., Constant, N., Ábrego, G.H., Yuan, S., Tar, C., Sung, Y., Strope, B., Kurzweil, R.: Multilingual universal sentence encoder for semantic retrieval. CoRR abs/1907.04307 (2019), http://arxiv.org/abs/1907.04307
[8] Bengio, Y., Ducharme, R., Vincent, P., Jauvin, C.: A neural probabilistic language model. Journal of Machine Learning Research 3, 1137-1155 (2003)