=Paper=
{{Paper
|id=Vol-2696/paper_193
|storemode=property
|title=Ensemble of ELECTRA for Profiling Fake News Spreaders
|pdfUrl=https://ceur-ws.org/Vol-2696/paper_193.pdf
|volume=Vol-2696
|authors=Kaushik Amar Das,Arup Baruah,Ferdous Ahmed Barbhuiya,Kuntal Dey
|dblpUrl=https://dblp.org/rec/conf/clef/DasBBD20
}}
==Ensemble of ELECTRA for Profiling Fake News Spreaders==
Ensemble of ELECTRA for Profiling Fake News Spreaders
Notebook for PAN at CLEF 2020

Kaushik Amar Das1, Arup Baruah1, Ferdous Ahmed Barbhuiya1, and Kuntal Dey2
1 Indian Institute of Information Technology, Guwahati
{kaushikamardas,arup.baruah}@gmail.com, ferdous@iiitg.ac.in
2 Accenture Tech Labs, Bangalore, India
kuntal.dey@accenture.com

Abstract. This paper presents an ensemble classifier that uses ELECTRA models for the task of identifying possible fake news spreaders on Twitter in the PAN at CLEF 2020 lab. Our ensemble is created using 15 models which have been fine-tuned on the task dataset. Our approach scored an accuracy of 0.70 and 0.69 on the English and Spanish test sets respectively.

1 Introduction

Fake news is news that is circulated with the aim of deceiving users and manipulating them into forming specific opinions. With the growth of social media platforms such as Facebook and Twitter, it is now easier than ever to spread fake news. The problem is aggravated further when users knowingly or unknowingly share articles that contain false or misleading information. Numerous sites, such as snopes.com and politifact.com, use expert analysis to fact-check and debunk fake articles. The problem of fake news has also been actively tackled by the research community. To list a few, the works in [5,6] studied the incorporation of emotional features into Long Short-Term Memory (LSTM) networks for detecting fake news. The authors in [10] introduced a system called DeClarE which combines evidence collected from the web, language style, and trustworthiness of the sources for analysing the credibility of textual claims. The work in [17] investigated the use of user profiles as potential features for improving fake news detection systems.

With the aim of further investigating this problem, PAN at CLEF 2020 introduced the task of Profiling Fake News Spreaders on Twitter [13]. The objective of this task is to identify whether a Twitter user is a possible fake news spreader given a collection of their tweets. The task is available in English and Spanish; we participated in both languages.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece.

The rest of this paper is organised as follows. First, the dataset for the task is discussed in §2. Our approach is described in §3. The performance of our classifier is analysed in §4 before concluding in §5.

2 Dataset

The dataset [15] for the task of Profiling Fake News Spreaders is provided in two languages: English and Spanish. Both datasets contain the same number of samples: each has 300 authors, of which 150 are labelled as possible fake news spreaders while the rest are labelled as non-spreaders. A collection of 100 tweets is given for each of these authors, on which we trained our classification systems. The dataset is perfectly balanced, as illustrated in Figure 1. Additionally, the dataset is anonymized [14] to protect the tweet authors' privacy: identifiers like user handles and URLs have been replaced with '#USER#' and '#URL#' tokens respectively. Some examples of the data are given in Figure 2. Some other noteworthy features of the dataset are listed below.

– By counting the number of unique tweets within the collection of 100 tweets given for each author, we found that, overall, only 343 authors have all unique tweets. For each language, about half of the authors had some duplicates.
– In the entire dataset, the shortest tweet has 1 word while the longest has 86 words. The tweets contain an average of around 15 words.
– While many authors (284, to be exact) did not use any emojis, the rest used emojis in at least one of their tweets.

We do not know anything about the test set, since our models were evaluated using the TIRA system [11]. TIRA uses blind evaluation, a paradigm in which it runs our models on a hidden test set without exposing any information about it to the participants. We submitted our models adhering to the guidelines given by the organizers.

Figure 1. Data distribution: 150 fake news spreaders and 150 non-spreaders in each language.

English: 'Journaling Benefit: How Journaling Can Help Create Mental Calmness and Clarity #URL# #HASHTAG#... #URL#'
Spanish: 'Abuelito pagará 2 mil pesos por daños al vehículo que lo atropelló #URL#' (an elderly man will pay 2,000 pesos for damage to the vehicle that ran him over)

Figure 2. Example tweets from the dataset.

3 Methodology

Our approach is an ensemble classifier built using the ELECTRA model [3]. We chose this model because of its small size, fast training speed, and promising benchmark scores, which made it possible for us to experiment with ensembles using modest computing resources. In this section, we briefly describe the ELECTRA model before moving on to the details of the classifier. In the rest of this section, author and data sample are used interchangeably, since each data sample in the dataset is an author. The code used in this work is available on GitHub (https://github.com/cozek/profiling-fake-news-spreaders).

3.1 ELECTRA

At present, the state of the art in natural language processing is held by large Transformer-based [18] models trained using the BERT technique [4], for example RoBERTa [8], T5 [12], etc. These models are trained in an unsupervised manner on vast amounts of text data and can be fine-tuned for downstream tasks such as text classification, question answering, etc. The unsupervised task often used for training such models is the prediction of masked tokens: a small percentage of the input tokens, typically 15%, is corrupted with a [MASK] token, and the model is trained to correctly predict these masked tokens. This pre-training task has the disadvantage of only learning from a small portion of the text sequence, which is computationally inefficient.

The ELECTRA training method [3], which stands for Efficiently Learning an Encoder that Classifies Token Replacements Accurately, aims to address this inefficiency while retaining the capabilities of BERT. It does so with a novel pre-training task, called replaced token detection, in which a model is trained to distinguish real input tokens from synthetic but plausible replacements. The model has to predict, for each input token, whether it is the real token or a replaced one, thereby learning from the entire sequence instead of just a small percentage of it. As a result, ELECTRA performs competitively with other state-of-the-art models while using only about 25% of their compute.

3.2 Token limit of ELECTRA

Most BERT-style transformers have a token limit of 512, and models trained using ELECTRA have the same limitation. This makes it difficult to directly feed the given data samples into our ELECTRA-based classification system.
This is because each author, i.e. each data sample, comes with a collection of 100 tweets whose tokens altogether exceed the token limit. An obvious method would be to truncate and reduce the number of tokens, but we avoid doing so for two reasons. Firstly, within an author's set of tweets, not all tweets of a fake news spreader need be fake, and vice versa. Secondly, truncation would discard a lot of information. Hence, to address these issues, we randomly sample n tweets from each author's set of tweets. The intuition and exact implementation are explained in §3.3.

3.3 Random Sampling of an Author's Tweets

Intuition. The intuition behind random sampling is that we do not know which of the tweets from an author's set are relevant for the classification task. So, if at each training epoch we randomly sample an author's tweets to feed into the model, the model has the chance to look at enough of an author's tweets to learn whether the author is a fake news spreader or not.

Implementation. While constructing a batch of samples to feed into the model, n tweets are randomly selected from each author's collection of 100. These are then concatenated with special classification tokens as shown in Figure 3. The chosen collection of random tweets for each author is not fixed and is randomly chosen again at every epoch; therefore, tweets chosen in a previous epoch might be chosen again. We use n = 14 so that the token limit is never exceeded, even in edge cases where the tweets might be longer (with an average of about 15 words per tweet, 14 tweets amount to roughly 210 words, comfortably within the 512-token limit).

If T^a is the set of tweets of an author a and t = {t_1, t_2, ..., t_n} is a subset of n tweets randomly selected from T^a at the i-th training epoch, such that n ≤ |T^a|, then C_i^a is the concatenation of the tweets in t, defined as

    C_i^a = <S> t_1 <\S> t_2 ... <\S> t_{n-1} <\S> t_n    (1)

Here, <S> and <\S> are special tokens defined in ELECTRA's vocabulary as CLS_TOKEN and SEP_TOKEN respectively. CLS_TOKEN marks the sequence for classification; SEP_TOKEN separates the tweets.

Figure 3. Concatenation method.
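To make the sampling and concatenation step concrete, the following is a minimal Python sketch of Equation (1). It is not the authors' released code: the function name, the use of random.sample, and the literal [CLS]/[SEP] strings are illustrative assumptions (in practice the token strings would be taken from the ELECTRA tokenizer's vocabulary).

```python
import random

def sample_and_concatenate(tweets, n=14, cls_token="[CLS]", sep_token="[SEP]"):
    """Randomly pick n of an author's tweets and join them as in Eq. (1):
    CLS t_1 SEP t_2 ... SEP t_n. The sampling is repeated at every epoch,
    so a tweet chosen in one epoch may be chosen again in the next."""
    chosen = random.sample(tweets, k=min(n, len(tweets)))
    return f"{cls_token} " + f" {sep_token} ".join(chosen)

# Stand-in author with 100 anonymized tweets (the real data uses #USER#/#URL# tokens).
author_tweets = [f"example tweet number {i} #URL#" for i in range(100)]
print(sample_and_concatenate(author_tweets)[:80])
```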
3.4 Ensemble Classifier

One obvious drawback of the random sampling described in §3.3 is that looking at only a small random portion of an author's tweets may not be enough to make a correct decision. To mitigate this problem, we use an ensemble. Our proposed ensemble is built using 15 fine-tuned models, each of which is built on top of a pre-trained ELECTRA model. Each model of the ensemble looks at a different random sample of an author's tweets and makes a prediction. The final prediction is determined by majority voting, where the label with the highest frequency is chosen as the final label. This ensures that a wide range of an author's tweets is considered before coming to a decision. Ensembling also has the effect of lowering the variance of the model [16]. The architecture of the models in the ensemble and the training routine are described below.

Figure 4. A single classification model of the ensemble: the randomly sampled and concatenated tweets of an author are fed into pre-trained ELECTRA, followed by a dense layer and a softmax layer. The dense layer is tanh-activated and has a dropout of 0.1; the softmax layer makes the prediction.

Model Architecture. The ensemble is made of 15 fine-tuned models, each with the architecture given in Figure 4. We use only 15 models because adding more does not improve the ensemble [16]. In each model, the 256-dimensional embeddings produced by pre-trained ELECTRA are fed into a tanh-activated dense layer with 256 in-features and 256 out-features. After applying a dropout of 0.1 to the output of the dense layer, the resulting representation is fed into a softmax layer which makes the prediction. The weights of the dense layer and the softmax layer are randomly initialized in each model of the ensemble.

Separate ensembles were built for the two languages in the dataset, one for English and one for Spanish. For English, we used the pre-trained model google/electra-small-discriminator from the HuggingFace Transformers library (https://huggingface.co) [19]. Since no official pre-trained model was available for Spanish, we used the pre-trained model skimai/electra-small-spanish from the HuggingFace community models hub.

Model Training and Inference. The same training routine is applied to each of the models in the ensemble. Each model is fine-tuned with a small learning rate of ≈ 1e-3 using a cross-entropy loss for 20 epochs. 90% of the data is used as the training set and the remaining 10% as the validation set, with the percentage of each class preserved in both. Early stopping was used to stop training if validation accuracy did not improve for 4 consecutive epochs. The model is optimized using the Ranger optimizer, which is a combination of LookAhead [20] and RAdam [7]; the (α, k) parameters of the optimizer are set to (0.5, 5). During both training and inference, the random sampling approach described in §3.3 is used to feed data into the model, in batches of 50. As mentioned above, the final label is determined by majority voting.
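For illustration, below is a minimal PyTorch/HuggingFace sketch of one ensemble member and of the majority vote. This is our own approximation under stated assumptions, not the authors' implementation: the class and function names, the use of the first-token embedding as the pooled representation, and the Counter-based vote are illustrative choices, and the training loop (cross-entropy loss, Ranger optimizer, early stopping) is omitted.

```python
from collections import Counter

import torch
import torch.nn as nn
from transformers import AutoModel

class EnsembleMember(nn.Module):
    """One model of the ensemble: pre-trained ELECTRA, a tanh-activated
    dense layer (256 -> 256) with dropout 0.1, and a softmax output."""

    def __init__(self, model_name="google/electra-small-discriminator", num_labels=2):
        super().__init__()
        self.electra = AutoModel.from_pretrained(model_name)
        hidden = self.electra.config.hidden_size  # 256 for ELECTRA-small
        self.dense = nn.Linear(hidden, hidden)
        self.dropout = nn.Dropout(0.1)
        self.out = nn.Linear(hidden, num_labels)

    def forward(self, input_ids, attention_mask):
        encoded = self.electra(input_ids=input_ids, attention_mask=attention_mask)
        pooled = encoded.last_hidden_state[:, 0]        # embedding at the CLS position
        hidden = self.dropout(torch.tanh(self.dense(pooled)))
        return torch.softmax(self.out(hidden), dim=-1)  # class probabilities

def majority_vote(labels):
    """Final label for an author: the most frequent label among the 15 models."""
    return Counter(labels).most_common(1)[0][0]

# e.g. per-model labels for one author (1 = fake news spreader, 0 = not spreader)
print(majority_vote([1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1]))  # -> 1
```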
4 Results

Figure 5. Validation accuracy of each of the 15 models in the English and Spanish ensembles (accuracy axis from roughly 0.75 to 0.85).

The accuracy on the validation set of each model in the ensemble is plotted in Figure 5. As apparent from the plot, the accuracy of the models in the ensemble varies widely. This is expected, since each model started off with different randomly initialized weights and was trained on randomly sampled data. The results obtained by running the ELECTRA ensembles on the validation set and the test set are given in Table 1. The ensembles scored an accuracy of 0.70 and 0.69 on the English and Spanish test sets respectively, so both ensembles had almost the same test accuracy. The ensemble for English had a validation accuracy of 0.87, while the ensemble for Spanish had a validation accuracy of 0.77. The notable difference between validation and test accuracy suggests that the ensembles suffered from over-fitting. Another explanation could be that random sampling did not select the tweets that the classifier needed to differentiate correctly.

Table 1. Results
Language | Validation Set Accuracy | Test Set Accuracy
English  | 0.87                    | 0.70
Spanish  | 0.77                    | 0.69

5 Conclusion

This paper explored the application of ensembled ELECTRA models for the task of Profiling Fake News Spreaders on Twitter. Random sampling was used in an attempt to overcome the limit on the maximum number of tokens supported by transformer models. In future work, it would be interesting to explore transformer models that do not have such limitations, for example the Longformer [2]. Also, this study did not make use of many features of the data. We found that the data had duplicates (see §2) that could have been removed during preprocessing; this might have improved the classifier's performance by preventing duplicates from being sampled. Another promising avenue for future work would be to enhance the proposed classifier with emotional signals from the text using lexicons such as EmoLex [9] and SentiSense [1].

References

1. de Albornoz, J.C., Plaza, L., Gervás, P.: SentiSense: An easily scalable concept-based affective lexicon for sentiment analysis. In: LREC (2012)
2. Beltagy, I., Peters, M.E., Cohan, A.: Longformer: The long-document transformer. arXiv:2004.05150 (2020)
3. Clark, K., Luong, M.T., Le, Q.V., Manning, C.D.: ELECTRA: Pre-training text encoders as discriminators rather than generators (2020)
4. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: NAACL-HLT (2019)
5. Ghanem, B., Rosso, P., Rangel, F.: An emotional analysis of false information in social media and news articles. ACM Trans. Internet Technol. 20(2) (Apr 2020), https://doi.org/10.1145/3381750
6. Giachanou, A., Rosso, P., Crestani, F.: Leveraging emotional signals for credibility detection. In: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 877–880. SIGIR'19, Association for Computing Machinery, New York, NY, USA (2019), https://doi.org/10.1145/3331184.3331285
7. Liu, L., Jiang, H., He, P., Chen, W., Liu, X., Gao, J., Han, J.: On the variance of the adaptive learning rate and beyond. arXiv preprint arXiv:1908.03265 (2019)
8. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A robustly optimized BERT pretraining approach (2019)
9. Mohammad, S.M., Turney, P.D.: Crowdsourcing a word-emotion association lexicon. ArXiv abs/1308.6297 (2013)
10. Popat, K., Mukherjee, S., Yates, A., Weikum, G.: DeClarE: Debunking fake news and false claims using evidence-aware deep learning. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. pp. 22–32. Association for Computational Linguistics, Brussels, Belgium (Oct-Nov 2018), https://www.aclweb.org/anthology/D18-1003
11. Potthast, M., Gollub, T., Wiegmann, M., Stein, B.: TIRA Integrated Research Architecture. In: Ferro, N., Peters, C. (eds.) Information Retrieval Evaluation in a Changing World. Springer (Sep 2019)
12. Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P.J.: Exploring the limits of transfer learning with a unified text-to-text transformer (2019)
13. Rangel, F., Giachanou, A., Ghanem, B., Rosso, P.: Overview of the 8th Author Profiling Task at PAN 2020: Profiling Fake News Spreaders on Twitter. In: Cappellato, L., Eickhoff, C., Ferro, N., Névéol, A. (eds.) CLEF 2020 Labs and Workshops, Notebook Papers. CEUR-WS.org (Sep 2020)
14. Rangel, F., Rosso, P.: On the implications of the General Data Protection Regulation on the organisation of evaluation tasks. Language and Law = Linguagem e Direito 5(2), 95–117 (2019)
15. Rangel, F., Rosso, P., Ghanem, B., Giachanou, A.: Profiling fake news spreaders on Twitter (Feb 2020), https://doi.org/10.5281/zenodo.3692319
16. Risch, J., Krestel, R.: Bagging BERT models for robust aggression identification. In: Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying. pp. 55–61. European Language Resources Association (ELRA), Marseille, France (May 2020), https://www.aclweb.org/anthology/2020.trac-1.9
17. Shu, K., Wang, S., Liu, H.: Understanding user profiles on social media for fake news detection. In: 2018 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR). pp. 430–435 (2018)
18. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need (2017)
19. Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Brew, J.: HuggingFace's Transformers: State-of-the-art natural language processing. ArXiv abs/1910.03771 (2019)
20. Zhang, M.R., Lucas, J., Hinton, G., Ba, J.: Lookahead optimizer: k steps forward, 1 step back (2019)