=Paper=
{{Paper
|id=Vol-2936/paper-52
|storemode=property
|title=GPLSI team at CheckThat! 2021: Fine-tuning BETO and RoBERTa
|pdfUrl=https://ceur-ws.org/Vol-2936/paper-52.pdf
|volume=Vol-2936
|authors=Robiert Sepúlveda-Torres,Estela Saquete
|dblpUrl=https://dblp.org/rec/conf/clef/Sepulveda-Torres21
}}
==GPLSI team at CheckThat! 2021: Fine-tuning BETO and RoBERTa==
GPLSI team at CheckThat! 2021: Fine-tuning BETO and RoBERTa
Robiert Sepúlveda-Torres, Estela Saquete
Department of Software and Computing Systems, University of Alicante, Apdo. de Correos 99, E-03080 Alicante, Spain
rsepulveda@dlsi.ua.es (R. Sepúlveda-Torres); stela@dlsi.ua.es (E. Saquete)
ORCID: 0000-0002-2784-2748 (R. Sepúlveda-Torres); 0000-0002-6001-5461 (E. Saquete)
CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania

Abstract
CheckThat! Lab is a challenging lab aimed at tackling the disinformation problem. The GPLSI team from the University of Alicante (Spain) participated in two tasks of the CheckThat! Lab, namely Task 1 (Check-Worthiness Estimation) and Task 3 (Fake News Detection). We attained second and fifth place in the Spanish and English versions of Subtask 1A, respectively. Our systems use models based on transfer learning, such as RoBERTa and BETO, and the best results were achieved by fine-tuning these models. However, our results for Subtask 3A are low compared to those of the team that achieved the best result. We included some external features in the models for Subtasks 1A and 3A, but they did not improve the results. In future work, we will experiment with incorporating other external features into the models with the aim of improving the results of the tasks.

Keywords: Check-worthiness, Transfer learning models, Fake news detection

1. Introduction

Fake news has existed for a long time, but with the exponential rise in the consumption of news through digital media, disinformation has become one of the main problems in modern society [1]. Moreover, the huge volume of news in digital media makes it impossible to evaluate its veracity manually in a reasonable time frame [2]. The scientific community is currently using artificial intelligence to address the problem by, for example, developing large-scale datasets with the aim of creating automated fact-checking systems [3]. In this context, CheckThat! Lab emerges as part of the Cross-Language Evaluation Forum (CLEF). CheckThat! Lab's goal is to foster the development of technologies that allow the automatic verification of claims [4]. This article reports on the participation of the GPLSI team in Subtask 1A (Check-worthiness of tweets) and Subtask 3A (Multi-class fake news detection of news articles) of CheckThat! Lab. A detailed description of CheckThat! Lab is provided in [4]. The subtasks are summarized below:

1. Subtask 1A consists of predicting whether a given tweet is worth fact-checking. It is offered in Arabic, Bulgarian, English, Turkish, and Spanish; the GPLSI team participates in the English and Spanish versions. Subtask 1A uses Mean Average Precision (MAP) as the official evaluation measure, and reciprocal rank and P@k for k ∈ {1, 3, 5, 10, 20, 30} are reported as well [5]. (A small illustrative sketch of these ranking metrics is given after this list.)

2. Subtask 3A consists of detecting fake news as a four-class classification problem. Given the title and body text of a news article, the goal is to determine whether the main claim made in the article is true, partially false, false, or other [6]. Subtask 3A is offered only in English and uses macro F1 as the evaluation measure. The categories are as follows:

• False - The main claim made in an article is untrue.
• Partially False - The main claim of an article is a mixture of true and false information. The article contains partially true and partially false information, but cannot be considered 100% true.
• True - This rating indicates that the primary elements of the main claim are demonstrably true.
• Other - An article that cannot be categorised as true, false, or partially false due to lack of evidence about its claims.
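As noted in item 1 above, Subtask 1A is evaluated with ranking metrics over tweets sorted by predicted check-worthiness. The following is a minimal, illustrative sketch of average precision, reciprocal rank, and P@k for a single ranked list; it is our own simplification, not the official CheckThat! scorer, and the function and variable names are ours. MAP and MRR are obtained by averaging these values over all ranked lists.

```python
from typing import List

def precision_at_k(labels_ranked: List[int], k: int) -> float:
    """Fraction of check-worthy tweets (label 1) among the top-k ranked tweets."""
    return sum(labels_ranked[:k]) / k

def average_precision(labels_ranked: List[int]) -> float:
    """Average of P@i over every position i that holds a check-worthy tweet."""
    hits, precisions = 0, []
    for i, label in enumerate(labels_ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / len(precisions) if precisions else 0.0

def reciprocal_rank(labels_ranked: List[int]) -> float:
    """1 / rank of the first check-worthy tweet (0.0 if there is none)."""
    for i, label in enumerate(labels_ranked, start=1):
        if label == 1:
            return 1.0 / i
    return 0.0

# Example: gold labels (1 = check-worthy) sorted by a model's predicted score.
ranked = [0, 1, 1, 0, 1, 0]
print(average_precision(ranked), reciprocal_rank(ranked),
      [precision_at_k(ranked, k) for k in (1, 3, 5)])
```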
2. Related Work

Automatic detection of misinformation and fake news is a complex task in which Artificial Intelligence (AI) and Natural Language Processing (NLP) play a key role. Due to the complexity of the task, different subtasks are being dealt with [7], both in traditional media [8] and in social media [9]. Within these subtasks, automatic fact-checking is one of the most challenging problems, and related competitions are very important as new datasets are designed to create models. In 2018, the two main competitions were: i) CheckThat! Lab, the CLEF-2018 Fact Checking Lab (http://alt.qcri.org/clef2018-factcheck/), for the automatic identification and verification of claims in political debates [10]; the dataset delivered by this competition was obtained from FactCheck.org and annotates statements with true/half-true/false values [11]; and ii) the Fact Extraction and VERification (FEVER) workshop (http://fever.ai/) on fact extraction and verification, which provides a dataset of 220K claims verified against Wikipedia [12, 13]; the statements in this corpus are annotated as supports/refutes/notEnoughInfo.

CheckThat! Lab 2021 is the fourth edition of the lab, and Task 3 on Fake News Detection is new for this edition. Task 1 on Check-Worthiness has been present in previous editions, but it is new in the Spanish, Turkish, and Bulgarian versions. In the previous year (2020), the winning team in the English version of this task was the Accenture team [14], which fine-tuned the RoBERTa model and reached a MAP score of 0.8064. The second-ranked team was Alex, which concatenated the RoBERTa model output with tweet metadata [15], obtaining a MAP score of 0.8034. There have been efforts similar to CheckThat! Lab's approach to tackling the problem of disinformation, such as [16, 17], which developed approaches for automated fact-checking. In recent years, the use of transfer learning models has become popular for tackling the main tasks within Natural Language Processing (NLP). Some of the most successful models in this context are BERT and RoBERTa for English and BETO for Spanish [18, 19, 20]. For example, RoBERTa has been used to predict the stance relationship between the headline and body text of an article [21]. Considering the literature, our participation in CheckThat! Lab makes use of these transfer learning models to address Subtasks 1A and 3A.

3. Neural Models

The models used in this research are based on the BERT model. BERT is a multi-layer bidirectional Transformer encoder designed to be pre-trained from unlabeled text. This pre-trained model has the advantage of being fine-tunable via a single additional output layer, a feature that facilitates the creation of state-of-the-art models in various NLP tasks [18]. The RoBERTa model is used for Subtask 1A and Subtask 3A in English, and the BETO model is used for Subtask 1A in Spanish.

3.1. English-version Subtask: RoBERTa model

RoBERTa (Robustly optimized BERT approach) is a pre-training model based on BERT [19].
RoBERTa includes the following modifications: eliminating next-sentence prediction; training on a greater volume of data; enlarging the batch size; and lengthening the input sequence. This implementation attains state-of-the-art results on the General Language Understanding Evaluation (GLUE) benchmark and the Reading Comprehension Dataset From Examinations (RACE). In this research, we use the RoBERTa large model architecture, with 24 self-attention layers, a hidden size of 1024, and 355M parameters [19].

3.2. Spanish-version Subtask: BETO model

BETO also uses BERT's architecture but includes a series of optimizations similar to those performed in the RoBERTa model. The BETO model was pre-trained on Wikipedia texts and all OPUS Project sources [22] in Spanish. This model achieved better results than the multilingual models based on BERT in most of the tasks present in the GLUE benchmark. The model has 12 self-attention layers with a hidden size of 768 and a total of 110M parameters [20].

3.3. Architecture modifications

Some researchers include additional features and blend them with the output of the last layer of the transfer learning models [23, 24]. This strategy could improve the predictions of models based on transfer learning. In the tasks where the GPLSI team participates, we have experimented with varying our model so as to bring it closer to the domain of each task. Figure 1 shows the internal architecture of our classifier when external features are included.

Figure 1: Architecture modifications with external features

In Subtask 1A, for both the English and Spanish versions, we extract features related to the presence and quantity of numbers and dates in the tweets. The Stanza Python library is used to extract these features. Stanza performs part-of-speech and morphological feature tagging, dependency parsing, and named entity recognition [25]. Another change that has been tested for both versions of Subtask 1A is the inclusion of features from Linguistic Inquiry and Word Count (LIWC). LIWC is a resource for detecting meaning in a wide variety of experimental settings, including showing the focus of attention, emotionality, social relationships, thinking styles, and individual differences [26, 27]. The LIWC dictionaries have been translated into several languages, including Spanish, German, Italian, and Portuguese. We used the Spanish translation of LIWC for the Spanish-version Subtask and the original LIWC for the English-version Subtask.
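Since Figure 1 itself is not reproduced here, the following minimal PyTorch sketch gives our own reading of the modification it depicts: the transformer's pooled output is concatenated with a small vector of external features (e.g., number/date indicators or LIWC counts) before the final classification layer. The class name, feature dimension, and dropout value are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class FeatureAugmentedClassifier(nn.Module):
    """Illustrative head: blend external features with the transformer output (cf. Figure 1)."""

    def __init__(self, hidden_size: int, n_external: int, n_classes: int, dropout: float = 0.2):
        # hidden_size is 1024 for RoBERTa-large and 768 for BETO (Sections 3.1 and 3.2).
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size + n_external, n_classes)

    def forward(self, pooled_output: torch.Tensor, external: torch.Tensor) -> torch.Tensor:
        # Concatenate the pooled [CLS] representation with the external feature vector,
        # then classify the combined vector.
        combined = torch.cat([self.dropout(pooled_output), external], dim=-1)
        return self.classifier(combined)

# Illustrative usage: a batch of 4 tweets with 3 external features each
# (e.g., has_number, number_count, date_count) on top of RoBERTa-large representations.
head = FeatureAugmentedClassifier(hidden_size=1024, n_external=3, n_classes=2)
logits = head(torch.randn(4, 1024), torch.randn(4, 3))
```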
4. Experiments

The objective of the experiments is to find the model and the hyperparameters that best fit the tasks outlined. This section presents the experiments carried out based on the models described previously.

4.1. Subtask 1A experiments

This subtask aims to determine whether a tweet should be checked or not; hence the input to each model is a tweet and, in some experiments, the input also includes external features. The input tweet is pre-processed and emoji are extracted to decrease the number of out-of-vocabulary words in the models being used. The six proposed experiments for this subtask use techniques that are well recognized by the scientific community and are described next:

1. Our baseline system with the RoBERTa or BETO model: this experiment fine-tunes the corresponding model on the corpus of the task under evaluation. In line with [1], the BETO and RoBERTa baseline systems share the same hyperparameters: a maximum sequence length of 125, a batch size of 4, a learning rate of 1.5e-5, and 3 training epochs.

2. This experiment performs a Bayesian search to optimize the hyperparameters. We used the Weights & Biases library to automate hyperparameter tuning and explore the space of possible models; this library enables the visualization and comparison of the results of each model [28]. The search configuration is shown in Table 1 (a minimal sweep sketch is given after this list).

Table 1: Hyperparameter tuning
Parameter | Values
Number of training epochs | 2, 3, or 4
Dropout | 0.1, 0.2, 0.3, 0.4, or 0.5
Batch size | 2, 4, or 8
Learning rate | 1e-5, 1.5e-5, 2e-5, 2.5e-5, or 3e-5

3. RoBERTa or BETO best model with number and date indicators: the output of the last layer of RoBERTa or BETO is concatenated with the number and date indicators.

4. RoBERTa or BETO best model with LIWC features: the output of the last layer of RoBERTa or BETO is concatenated with the LIWC features.

5. RoBERTa or BETO best model with oversampling: the training set is extended with examples of the least represented classes to balance the dataset.

6. RoBERTa or BETO best model with undersampling: examples of the most represented classes are removed to balance the dataset.
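As promised above, this is a minimal sketch of how the Table 1 search space could be expressed as a Weights & Biases Bayesian sweep (experiment 2). The project name, the body of the training function, and the choice of development-set MAP as the sweep metric are our assumptions; the paper does not specify these details.

```python
import wandb

# Search space from Table 1, written as a W&B Bayesian sweep configuration (assumed layout).
sweep_config = {
    "method": "bayes",
    "metric": {"name": "dev_map", "goal": "maximize"},  # assumption: MAP on the development set
    "parameters": {
        "num_train_epochs": {"values": [2, 3, 4]},
        "dropout": {"values": [0.1, 0.2, 0.3, 0.4, 0.5]},
        "train_batch_size": {"values": [2, 4, 8]},
        "learning_rate": {"values": [1e-5, 1.5e-5, 2e-5, 2.5e-5, 3e-5]},
    },
}

def train() -> None:
    """Hypothetical training function: fine-tune BETO/RoBERTa with the sampled
    hyperparameters in run.config and log the development-set score."""
    run = wandb.init()
    # ... fine-tune with run.config.num_train_epochs, run.config.learning_rate, etc. ...
    run.log({"dev_map": 0.0})  # placeholder value

sweep_id = wandb.sweep(sweep_config, project="checkthat2021-subtask1a")  # hypothetical project name
wandb.agent(sweep_id, function=train, count=20)  # number of trials is illustrative
```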
4.2. Subtask 3A experiments

The first two experiments of the previous subtask are used in the same way for this subtask. Subtask 3A is conducted in English and therefore only the RoBERTa model is used. In addition, as this task classifies news according to its veracity, the RoBERTa model processes both the title and the body text of the article, representing them as a single sequence of concatenated words. A third experiment is added that tries to improve on the results of the two previous experiments: it splits the four-class classification into two binary classifications and one three-class classification.

5. Results and Discussion

The experiments were implemented using the Simple Transformers (https://simpletransformers.ai/, accessed on 20 May 2021) and PyTorch (https://pytorch.org/, accessed on 20 May 2021) libraries. The experiments are trained on the training set and their performance is evaluated on the development set provided by the organizers of CheckThat! Lab. In our experiments, the output of the BETO and RoBERTa models passes through a series of layers to classify the tweet or the news item. The layers used, in order, are two Dropout layers and a Linear layer; after the first Dropout layer, we apply the Tanh activation function. Tweet classification is binary, so the output layer has a single neuron and binary cross-entropy is used as the loss function. The news items, however, are classified into four classes, so the output layer has 4 neurons and cross-entropy is used as the loss function.

5.1. Subtask 1A

The Spanish-version of Subtask 1A has a dataset of 2,495 tweets for training, 1,247 for development, and 1,248 for testing. The English-version of Subtask 1A has a dataset of 822 tweets for training, 140 for development, and 350 for testing.

The BETO baseline system obtains good results in the macro F1 metric; however, the evaluation metric of this subtask is MAP and, in this case, the results are quite close to the baseline provided by the organizers. Similarly to the BETO baseline system, the RoBERTa baseline system achieves good results on the macro F1 metric. However, its MAP is 4 points lower than the 0.8064 reached by the best competitor of CheckThat! Lab 2020.

Experiment 2 is the experiment with the best results in both languages. With the help of the Weights & Biases library, a good configuration was found, although it cannot be guaranteed to be the best one given that the search was not exhaustive. In the Spanish-version, the selected hyperparameters are a maximum sequence length of 125, a batch size of 8, a learning rate of 1e-5, a dropout rate of 0.2, and training for 2 epochs. For the English-version, they are a maximum sequence length of 125, a batch size of 4, a learning rate of 1.5e-5, a dropout rate of 0.2, and training for 3 epochs. Table 2 shows the results.

Table 2: CheckThat! Spanish-version and English-version of Subtask 1A experiments with the BETO and RoBERTa models on the development dataset
No | Experiment | Spanish MAP | Spanish Macro F1 | English MAP | English Macro F1
1 | BETO and RoBERTa baseline system | 0.485 | 0.709 | 0.762 | 0.702
2 | Hyperparameter tuning | 0.549 | 0.712 | 0.825 | 0.750
3 | Best model with number and date indicators | 0.500 | 0.633 | 0.652 | 0.706
4 | Best model with LIWC features | 0.497 | 0.685 | 0.624 | 0.624
5 | Best model with oversampling | 0.387 | 0.693 | 0.772 | 0.735
6 | Best model with undersampling | 0.455 | 0.564 | 0.795 | 0.709

Experiment 3 indicates that the features obtained do not help to identify whether a tweet should be checked or not. Experiment 4 worsened the results for both languages. Experiments 5 and 6 also did not improve on the results achieved by simply fine-tuning the basic model; in both cases, the two metrics dropped in relation to experiment 2.

The GPLSI team reached second place in the Spanish-version of Subtask 1A when predicting the test set; the difference in the MAP metric with respect to the first-place team in this subtask is less than one point. Table 3 shows the results. In the English-version of Subtask 1A, the GPLSI team reached fifth place. In this case, the difference between the first-place team and GPLSI was greater than that observed in the Spanish-version of the same subtask. Table 4 shows the results.

Table 3: CheckThat! Spanish-version of Subtask 1A results
Team | MAP | MRR | RP | P@1 | P@3 | P@5 | P@10 | P@20 | P@30
GPLSI | 0.529 | 0.500 | 0.533 | 0.000 | 0.667 | 0.600 | 0.800 | 0.750 | 0.620

Table 4: CheckThat! English-version of Subtask 1A results
Team | MAP | MRR | RP | P@1 | P@3 | P@5 | P@10 | P@20 | P@30
GPLSI | 0.132 | 0.167 | 0.158 | 0.000 | 0.000 | 0.000 | 0.200 | 0.150 | 0.140

To sum up, the numerous experiments that were conducted failed to improve on the results achieved by experiment 2, indicating the power of models based on transfer learning for this task. Evidently, we have not been able to find appropriate external features for this subtask. The systems developed in experiment 2 are used to predict the test set for both languages.

5.2. Subtask 3A

The training dataset available for this subtask contains 900 news items [29]. In order to evaluate the models being fine-tuned, this dataset was divided into a training set and a development set with a 0.7/0.3 split; the two sets maintain similar percentages of examples from each class. The test set has 365 news stories. Our baseline system uses the hyperparameters described in experiment 1 of the previous section, only changing the maximum sequence length to 512.
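As an illustration of how such a baseline configuration could be expressed with the Simple Transformers library mentioned in Section 5, the sketch below fine-tunes RoBERTa with the hyperparameters of experiment 1 and a maximum sequence length of 512. The training data, checkpoint name, and column layout are placeholders; the team's actual training script is not part of the paper.

```python
import pandas as pd
from simpletransformers.classification import ClassificationModel

# Placeholder training data: article text (title and body concatenated) plus an
# integer label encoding the four Subtask 3A classes.
train_df = pd.DataFrame(
    {"text": ["<title> ... <body> ..."], "labels": [0]}  # illustrative rows only
)

# Baseline hyperparameters from experiment 1, with max_seq_length raised to 512 for Subtask 3A.
model_args = {
    "max_seq_length": 512,
    "train_batch_size": 4,
    "learning_rate": 1.5e-5,
    "num_train_epochs": 3,
    "overwrite_output_dir": True,
}

model = ClassificationModel("roberta", "roberta-large", num_labels=4, args=model_args)
model.train_model(train_df)
predictions, raw_outputs = model.predict(["<title> ... <body> ..."])
```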
Task 3A is considered more complicated than Subtask 1A because it is necessary to find patterns that classify the news into 4 classes. The macro F1 of the baseline obtained by fine-tuning RoBERTa is quite low, which corroborates the complexity of this subtask. Experiment 2 carried out a deep hyperparameter tuning and, as the results show, the improvement is negligible; in other tasks, hyperparameter tuning of this kind would normally improve the baseline significantly.

Table 5: CheckThat! Subtask 3A English experiments with the RoBERTa model on the development dataset
No | Experiment | Macro F1
1 | Our baseline system | 0.516
2 | Hyperparameter tuning | 0.520
3 | Best model with three classifiers | 0.548

The last experiment is depicted in Figure 2. The strategy involves placing the majority class in the first classifier and the minority classes in the subsequent classifiers. In the first classifier, we predict False versus the remaining classes (Partially False, True, and Other). In the second classifier, we predict False, Partially False, and the remaining classes (True and Other). Finally, the third classifier distinguishes True from Other. Each classifier specializes in predicting a group of classes, and the remaining classes are passed on to subsequent classifiers.

Figure 2: Explanation of the classification pipeline

• The first classifier achieves a macro F1 of 0.802 with the following hyperparameters: maximum sequence length of 512; batch size of 2; learning rate of 2e-5; dropout rate of 0.2; training for 3 epochs.
• The second classifier achieves a macro F1 of 0.666 with the following hyperparameters: maximum sequence length of 512; batch size of 2; learning rate of 2e-5; dropout rate of 0.2; training for 4 epochs.
• The third classifier achieves a macro F1 of 0.727 with the following hyperparameters: maximum sequence length of 512; batch size of 4; learning rate of 1e-5; dropout rate of 0.2; training for 3 epochs.

The GPLSI team ranked 16th in this subtask; we failed to find a competitive model that could obtain state-of-the-art results.

6. Conclusions

We participated in two of the three CheckThat! tasks. The results achieved in the Spanish-version of Subtask 1A are considered good and confirm that, depending on the task, fine-tuning pre-trained models may be a good option. In the Spanish-version we ranked second with a MAP score of 0.529, and in the English-version we ranked fifth with a MAP score of 0.132. On the other hand, the results obtained in Subtask 3A leave considerable room for improvement, and to date no enhancement over the models used has been found. However, the classifier cascade technique improved the classification; in this research, the majority classes are handled by the first classifiers. We included some external features in the models used, but the results obtained do not improve on the fine-tuning experiment of each model. In future work, we will experiment with other reference neural models and look for specific features that can improve the results for the most complicated tasks.

Acknowledgments

This research work has been partially funded by Generalitat Valenciana through the project "SIIA: Tecnologias del lenguaje humano para una sociedad inclusiva, igualitaria, y accesible" (PROMETEU/2018/089), and by the Spanish Government through the projects "Modelang: Modeling the behavior of digital entities by Human Language Technologies" (RTI2018-094653-B-C22) and "INTEGER - Intelligent Text Generation" (RTI2018-094649-B-I00).
This paper is also based upon work from COST Action CA18231 "Multi3Generation: Multi-task, Multilingual, Multi-modal Language Generation".

References

[1] R. Sepúlveda-Torres, A. Bonet-Jover, E. Saquete, "Here Are the Rules: Ignore All Rules": Automatic Contradiction Detection in Spanish, Applied Sciences 11 (2021) 3060.
[2] G. Tsipursky, F. Votta, K. M. Roose, Fighting Fake News and Post-Truth Politics with Behavioral Science: The Pro-Truth Pledge, Behavior and Social Issues 27 (2018) 47–70.
[3] J. Thorne, A. Vlachos, C. Christodoulopoulos, A. Mittal, FEVER: a large-scale dataset for fact extraction and verification, arXiv preprint arXiv:1803.05355 (2018) 809–819.
[4] P. Nakov, G. Da San Martino, T. Elsayed, A. Barrón-Cedeño, R. Míguez, S. Shaar, F. Alam, F. Haouari, M. Hasanain, N. Babulkov, A. Nikolov, G. K. Shahi, J. M. Struß, T. Mandl, The CLEF-2021 CheckThat! lab on detecting check-worthy claims, previously fact-checked claims, and fake news, in: D. Hiemstra, M. Moens, J. Mothe, R. Perego, M. Potthast, F. Sebastiani (Eds.), Advances in Information Retrieval - 43rd European Conference on IR Research, ECIR 2021, Virtual Event, March 28 - April 1, 2021, Proceedings, Part II, volume 12657 of Lecture Notes in Computer Science, Springer, 2021, pp. 639–649.
[5] S. Shaar, M. Hasanain, B. Hamdan, Z. S. Ali, F. Haouari, A. Nikolov, M. Kutlu, Y. S. Kartal, F. Alam, G. Da San Martino, A. Barrón-Cedeño, R. Míguez, T. Elsayed, P. Nakov, Overview of the CLEF-2021 CheckThat! lab task 1 on check-worthiness estimation in tweets and political debates, in: Working Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum, CLEF '2021, Bucharest, Romania (online), 2021.
[6] G. K. Shahi, J. M. Struß, T. Mandl, Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection, in: Working Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum, CLEF '2021, Bucharest, Romania (online), 2021.
[7] E. Saquete, D. Tomás, P. Moreda, P. Martínez-Barco, M. Palomar, Fighting post-truth using natural language processing: A review and open challenges, Expert Systems with Applications 141 (2020) 112943.
[8] A. Bonet-Jover, A. Piad-Morffis, E. Saquete, P. Martínez-Barco, M. Ángel García-Cumbreras, Exploiting discourse structure of traditional digital media to enhance automatic fake news detection, Expert Systems with Applications (2020) 114340.
[9] K. Shu, A. Sliva, S. Wang, J. Tang, H. Liu, Fake News Detection on Social Media, ACM SIGKDD Explorations Newsletter 19 (2017) 22–36.
[10] P. Nakov, L. Màrquez, A. Barrón-Cedeño, W. Zaghouani, T. Elsayed, R. Suwaileh, P. Gencheva, CLEF-2018 lab on automatic identification and verification of claims in political debates, in: Proceedings of CLEF-2018, 2018.
[11] A. Barrón-Cedeño, T. Elsayed, R. Suwaileh, L. Màrquez, P. Atanasova, W. Zaghouani, S. Kyuchukov, G. Da San Martino, P. Nakov, Overview of the CLEF-2018 CheckThat! lab on automatic identification and verification of political claims, task 2: Factuality, in: L. Cappellato, N. Ferro, J.-Y. Nie, L. Soulier (Eds.), CLEF 2018 Working Notes. Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, CEUR-WS.org, Avignon, France, 2018.
[12] J. Thorne, A. Vlachos, O. Cocarascu, C. Christodoulopoulos, A. Mittal, The fact extraction and verification (FEVER) shared task, in: Proceedings of the First Workshop on Fact Extraction and VERification (FEVER), Association for Computational Linguistics, 2018, pp. 1–9. URL: http://aclweb.org/anthology/W18-5501.
[13] J. Thorne, A. Vlachos, C. Christodoulopoulos, A. Mittal, FEVER: a large-scale dataset for fact extraction and verification, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Association for Computational Linguistics, 2018, pp. 809–819. URL: http://aclweb.org/anthology/N18-1074. doi:10.18653/v1/N18-1074.
[14] E. Williams, Accenture at CheckThat! 2020: If you say so: Post-hoc fact-checking of claims using transformer-based models, Technical Report, 2020.
[15] A. Nikolov, G. Da San Martino, I. Koychev, P. Nakov, Team Alex at CLEF CheckThat! 2020: Identifying Check-Worthy Tweets With Transformer Models, Technical Report, 2020.
[16] A. Hanselowski, Avinesh PVS, B. Schiller, F. Caspelherr, Description of the system developed by team Athene in the FNC-1, 2017.
[17] A. Alonso-Reina, R. Sepúlveda-Torres, E. Saquete, M. Palomar, Team GPLSI: Approach for automated fact checking, in: Proceedings of the Second Workshop on Fact Extraction and VERification (FEVER), 2019, pp. 110–114.
[18] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2018). arXiv:1810.04805.
[19] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A Robustly Optimized BERT Pretraining Approach (2019). arXiv:1907.11692.
[20] J. Cañete, G. Chaperon, R. Fuentes, J. Pérez, Spanish pre-trained BERT model and evaluation data, PML4DC at ICLR 2020 (2020).
[21] R. Sepúlveda-Torres, M. Vicente, E. Saquete, E. Lloret, M. Palomar, Exploring summarization to enhance headline stance detection, in: International Conference on Applications of Natural Language to Information Systems, Springer, 2021, pp. 243–254.
[22] J. Tiedemann, Parallel data, tools and interfaces in OPUS, in: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), European Language Resources Association (ELRA), Istanbul, Turkey, 2012, pp. 2214–2218.
[23] Z. Zhang, Y. Wu, H. Zhao, Z. Li, S. Zhang, X. Zhou, X. Zhou, Semantics-aware BERT for Language Understanding, Technical Report, 2020.
[24] W. M. Lim, H. T. Madabushi, UoB at SemEval-2020 Task 12: Boosting BERT with Corpus Level Information (2020).
[25] P. Qi, Y. Zhang, Y. Zhang, J. Bolton, C. D. Manning, Stanza: A Python natural language processing toolkit for many human languages, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 2020.
[26] J. W. Pennebaker, R. L. Boyd, K. Jordan, K. Blackburn, The development and psychometric properties of LIWC2015, Technical Report, 2015.
[27] Y. R. Tausczik, J. W. Pennebaker, The Psychological Meaning of Words: LIWC and Computerized Text Analysis Methods, Journal of Language and Social Psychology 29 (2010) 24–54.
[28] L. Biewald, Experiment tracking with Weights and Biases, 2020. URL: https://www.wandb.com/, software available from wandb.com.
[29] G. K. Shahi, J. M. Struß, T. Mandl, Task 3: Fake news detection at CLEF-2021 CheckThat!, 2021. URL: https://doi.org/10.5281/zenodo.4714517. doi:10.5281/zenodo.4714517.