NLP&IR@UNED at CheckThat! 2021:
Check-worthiness estimation and fake news detection
using transformer models
Juan R. Martinez-Rico1 , Juan Martinez-Romo1,2 and Lourdes Araujo1,2
1
  NLP & IR Group, Dpto. Lenguajes y Sistemas Informáticos, Universidad Nacional de Educación a Distancia (UNED),
Madrid 28040, Spain
2
  Instituto Mixto de Investigación - Escuela Nacional de Sanidad (IMIENS)


                                         Abstract
                                         This article describes the different approaches used by the NLPIR@UNED team in the CLEF2021 Check-
                                         That! Lab to tackle the tasks 1A-English, 1A-Spanish and 3A-English. The goal of Task 1A in English is to
                                         determine which tweets within a set of COVID-19 related tweets are worth checking. Task 1A in Spanish
                                         is similar but in this case the tweets are related to political issues in Spain. In both tasks, transformer
                                         models have been used to identify check-worthy tweets, obtaining the first place in the task in English
                                         and the fourth place in the task in Spanish. Task 3A is focused on determining the veracity of a news
                                         article. It is a multi-class classification problem with four possible values: true, partially false, false, and
                                         other. For this task we have used two different approaches: a gradient-boosting classifier with TF-IDF
                                         and LIWC features, and a transformer model fed with the first tokens of each news article. We got the
                                         fourth place out of 25 participants in this task.

                                         Keywords
                                         check-worthiness, fake news detection, transformer models


1. Introduction
Despite the efforts carried out in recent times to combat the proliferation of fake news, these
have not stopped growing, taking advantage of events conducive to its dissemination, such as
the current pandemic, or the events that occurred in the last presidential elections in the United
States. Therefore, the existence of initiatives such as this CheckThat! Lab[1][2], which give
researchers in this area of natural language processing the opportunity to propose and share
different ideas that can help mitigate this problem, are appreciated.
   In this article, we present the approaches used by our team in the tasks of check-worthiness
and fake news detection. Since transformer models have become a fundamental tool that adapts
to many of the tasks related to natural language processing obtaining state-of-the-art results,
we have chosen to take them as our first option in each of the tasks. However, in Task3a we


CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania
" jrmartinezrico@invi.uned.es (J. R. Martinez-Rico); juaner@lsi.uned.es (J. Martinez-Romo); lurdes@lsi.uned.es
(L. Araujo)
 0000-0003-1867-9739 (J. R. Martinez-Rico); 0000-0002-6905-7051 (J. Martinez-Romo); 0000-0002-7657-4794
(L. Araujo)
                                       © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
    CEUR
    Workshop
    Proceedings
                  http://ceur-ws.org
                  ISSN 1613-0073       CEUR Workshop Proceedings (CEUR-WS.org)
decided to also use more classical approaches since the size of the news articles to be checked
exceeded the input sequence size that is reasonable to define in a transformer model.
   We have organized the rest of the article as follows: in section 2 we briefly describe the
transformer models, the approach we have used in tasks 1A-English and 1A-Spanish and we
comment on the results obtained, in section 3 we explain our approach in the fact-checking
task and discuss the results obtained, and section 4 contains our conclusions and future work.


2. Transfomers for Check-Worthiness
2.1. Previous Approaches in the Check-Worthiness Task
Among the approaches that have been used to tackle this task we can highlight the initial
work carried out by [3] where they make use of classifiers such as Random Forest, SVM or
Multinomial Naive Bayes, and features based on TF-IDF representations, parts of speech tags,
sentiment scores, and entity types. To the aforementioned methods [4] add features such as
average embedding vector of the sentence, linguistic features that count the number of words
in the sentence that belong to a certain lexicon, contextual features such as the position of a
sentence with respect to others in a segment of text, discourse features such as the detection of
contradictions, and as a classifier uses a Deep Feed-Forward Neural Network. Already within
this Check That! Lab we have seen in past editions the use of recurrent neural networks by [5]
where each token is represented in three ways: through embeddings, and with part of speech
tags and syntactic dependencies encoded as one-hot vectors. In the same edition [6] makes
use of character n-gram features with a k-nearest neighbors classifier. More recently in this
same Lab, transformer models began to be used for the check-worthiness task by many of the
participants [7][8][9]. In the next section we will see a short description of this architecture.

2.2. The Transformer Model
Since its appearance as an alternative to neural machine translation models, transformer
models[10] have become a preferred model when compared to other natural language processing
techniques, not only in machine translation, but in other tasks such as sequence classification,
summarization, named entity recognition, text generation, extractive question answering or
language modeling.
   A transformer is a deep learning model that “translates” input sequences into output sequences
using an encoder-decoder architecture. It uses an attention mechanism to identify the most
relevant parts of the input and output sequences. Previous models such as RNNs also use an
attention mechanism but are limited by their sequential nature when processing input data.
Transformers, by relying solely on the attention mechanism, do not need to process the input
sequences in a specific order, allowing them to process these sequences in parallel and thus
reducing training times.
   The model is fed with training data in the form of sequence pairs (input, target). The first is
applied in the encoder block and the second in the decoder block.
   In recurrent models, sequences are introduced token by token, thus providing the relative
position of each of these tokens in the sequence. Since transformers do not process sequences
in this way, this positional information is provided to the model as a mask added to the input
and target sequences.
   The encoder block is made up of a stack of n identical encoders, each of them with a self-
attention layer and a feed fordward neural network. The decoder block is made up of the same
number n of decorders and each of them is composed of a self-attention layer, an encoder-decoder
attention layer and a feed forward neural network.
   The self-attention layers allow to identify within the same sequence, which tokens are more
relevant for another token that is being considered at that moment. On the contrary, the
encoder-decoder attention layer relates tokens of the input and target sequences. The attention
layers are not monolithic, but are composed of several attention heads that focus on different
portions of the sequence.
   The output of the encoder block is the one that feeds all the encoder-decoder attention layers
of the decoder block, while the output of the decoder block links with a linear layer and this
with a softmax layer that maps each position of the target sequence with the output vocabulary.
   What is described above is the original model however, after its presentation a large number of
models derived from the transformer architecture have appeared. For example, one of the most
successful is BERT[11], which basically eliminates the decoder block present in transformers,
and in its training the input sequences are masked in such a way that it processes them
bidirectionally.
   Another point to highlight is that as part of these architectural-models a series of data-models
pre-trained in an unsupervised manner with large datasets have been released. This allows us
to easily apply transfer-learning to different tasks such as those mentioned at the beginning of
this section.
   Next, we will describe how we have used some of these models in the check-worthiness and
fake news detection tasks.

2.3. Task 1A English
The objective of Task1a-English[12] is, given a set of tweets in English language related to the
COVID-19 topic, to identify which tweets are worth checking by assigning a score to each of
them.
   To tackle this task we eliminated any metadata present in the tweets and have focused only
on the textual information provided.
   Taking into account that all the tweets to be evaluated are about COVID-19, we have searched
a well-known repository of pre-trained models1 , and we have found one that is trained in tweets
related to this topic.
   Finally, we have used the BERTweet model[13], a BERT-architecture model initially pre-
trained with 850 million tweets in English using the RoBERTa[14] pre-training procedure, to
which the same authors performed a second 40-epoch pre-training with 23 million English
tweets related to the COVID-19 topic.
   To check if actually using a pre-trained model for the same topic and document type had a
superior behavior to other pre-trained models and architectures in more neutral datasets, we

   1
       https://huggingface.co/transformers/
Table 1
Task 1A English - Transformer models analysis: results on dev dataset
   Model                                    Epochs     Batch Size   MAP          F1      P-R      ROC
   bertweet-covid19-base-uncased                5           16      0.849    0.767       0.848    0.874
   bertweet-covid19-base-cased                  5           16      0.845    0.790       0.843    0.879
   bertweet-base                                5           10      0.842    0.774       0.841    0.873
   roberta-base                                 5            8      0.793    0.709       0.791    0.836
   funnel-transformer/small                     3            8      0.785    0.654       0.784    0.783
   funnel-transformer/small-base                3            8      0.785    0.654       0.784    0.783
   funnel-transformer/intermediate              3            8      0.761    0.637       0.759    0.768
   funnel-transformer/intermediate-base         3            8      0.761    0.637       0.759    0.768
   distilbert-base-cased                        5            8      0.752    0.688       0.749    0.790
   funnel-transformer/medium                    5            8      0.737    0.707       0.731    0.820
   funnel-transformer/medium-base               5            8      0.737    0.707       0.731    0.820
   bert-base-cased                              5            8      0.733    0.672       0.729    0.774
   bert-base-multilingual-cased                 5            8      0.726    0.636       0.722    0.786
   albert-base-v2                               5           16      0.694    0.677       0.691    0.756
   distilbert-base-multilingual-cased           5            8      0.680    0.697       0.673    0.764


Table 2
Task 1A English - Selected transformer models: results on dev dataset
       Model                             Epochs     Batch Size   MAP        F1        P-R      ROC
       bertweet-covid19-base-uncased        6          14        0.862   0.800        0.861    0.874
       bertweet-covid19-base-cased          5          14        0.860   0.797        0.859    0.883


implemented a grid search procedure in which we varied the number of periods, the size of the
lot and the model/architecture used. The rest of the hyperparameters have been kept in the
default values that each model has.
   Among the transformer models we have tested are BERT, ALBERT[15], RoBERTa, DistilBERT[16],
and Funnel-Transformer[17]. Table 1 shows the best results obtained for each model for the
mean average precision, F1, precision-recall curve and ROC curve measurements, sorted by
mean average precision.
   As we can see, the best behavior is obtained with the model that is pre-trained in tweets
related to the COVID-19 topic.
   Therefore we select the first two models bertweet-covid19-base-uncased and bertweet-covid19-
base-cased and we test various values of the epsilon parameter obtaining the best results with
the value 2.5 × 10−9 . These results are shown in Table 2.
   We also found that although we always initialized the Python, NumPy, and PyTorch random
number generators with the same seeds, the same results did not always appear for a given
set of parameters. Therefore, to make the final shipments, we do not join the training and dev
datasets to have a larger one with which to train the models, but we train the models with
Table 3
Task 1A Spanish - Transformer models analysis: results on dev dataset

 Model                                                     Epochs       Batch Size   MAP      F1     P-R     ROC
 Electra mrm8488-electricidad-base-discriminator              3            16        0.495   0.384   0.492   0.885
 BERT Geotrend-bert-base-es-cased                             3             8        0.474   0.439   0.472   0.874
 BERT dccuchile-bert-base-spanish-wwm-cased                   3            16        0.467   0.458   0.465   0.879
 RoBERTa mrm8488-RuPERTa-base                                 3             8        0.376   0.341   0.372   0.836
 Electra mrm8488-electricidad-base-generator                  5             8        0.325   0.130   0.318   0.830


the training dataset and evaluate them with the dev dataset, repeatedly executing the same
configurations of parameters and selecting the test files to send from the best results obtained
on the dev dataset, assuming that an initial random configuration that behaved well in the dev
dataset would also do so in the test dataset.

2.4. Task 1A Spanish
In this version of Task 1A, the set of tweets is defined in Spanish language and these tweets are
related to issues of Spanish politics.
   As in Task 1A English, we have used several transformer models to evaluate which one best
suits these types of tweets. The tested models have been BERT, Electra[18] and RoBERTa.
   After a preliminary grid search with different pre-trained models in Spanish and different
values of batch size and epochs, keeping the rest of the hyperparameters in their default values,
we obtained the results shown in Table 3. The best results are shown for each pre-trained model.
   Since the model Electra mrm8488-electricity-base-discriminator2 is the one with a slightly
higher result, it is the one we selected for a more exhaustive search for parameters. This Electra
model is pre-trained with 20GB of the Spanish-language Oscar corpus[19].
   We also realized, extracting the vocabulary from this pre-trained model, that among the first
1000 tokens there were 971 unused tokens of type [unusedNNN].
   To see if these tokens could be useful, we pulled all the out-of-vocabulary tokens of the
training dataset. From this set of words, we manually selected those that seemed most relevant
to us and had three or more appearances, mainly the names of politicians, political parties, the
media, and hashtags used in electoral campaigns. In total, the list consisted of 197 tokens.
   With this list, we create a dictionary to group tokens that correspond to the same concept.
For example, #PINParental, pin and parental were matched with the same PINParental token.
   In this dictionary, we substitute the tokens on the right side by tokens [unusedNNN] to obtain
a match between the out-of-vocabulary tokens with the unused tokens of the model, and both
in the training loop and in the evaluation loop we did the replacement of the out-of-vocabulary
tokens using this dictionary.
   Unfortunately, the results obtained with this strategy were not as expected, obtaining better
results without substituting out-of-vocabulary tokens. The best results obtained after repeated

   2
       https://huggingface.co/mrm8488/electricidad-base-discriminator
Table 4
Task 1A Spanish - Selected transformer models: results on dev dataset

 Model                                                  Epochs    Batch Size   MAP        F1       P-R    ROC
 mrm8488-elect-base-discr. without replacement            3          12        0.514   0.480      0.512   0.878
 mrm8488-elect-base-discr. without replacement            3          14        0.510   0.472      0.506   0.892
 mrm8488-elect-base-discr. without replacement            3          16        0.509   0.390      0.506   0.892
 mrm8488-elect-base-discr. with replacement               3          18        0.466   0.277      0.463   0.870
 mrm8488-elect-base-discr. with replacement               6          18        0.458   0.417      0.456   0.839
 mrm8488-elect-base-discr. with replacement               4          10        0.452   0.419      0.449   0.872


Table 5
Task 1A - Submission official results
         Task          MAP     MRR       RP     P@1       P@3     P@5     P@10    P@20         P@30
         1A Spanish   0.492    1.000    0.475   1.000     1.000   1.000   0.800   0.800        0.620
         1A English   0.224    1.000    0.211   1.000     0.667   0.400   0.300   0.200        0.160


runs with different batch sizes and epochs are shown in Table 4, along with the best results
obtained by substituting tokens.
   To send the submissions to this version in Spanish of subtask 1A, the same strategy was used
as in the English version: training the model repeatedly for the same parameters and send the
configurations with the best values in the dev dataset.

2.5. Task 1A Results
Finally, two submissions were made for the Spanish version of Task 1A and three submissions
for the English version. The official evaluation measure was mean average precision (MAP). In
Spanish we obtained the fourth position among six participants while in English we obtained
the first position among ten participants. The results are shown in Table 5.


3. Fake News Detection Task
3.1. Previous Approaches in the Fake News Detection Task
The approaches to the detection of fake news that have been made so far can be divided into
three groups: knowledge-based methods, content-based method and context-based methods.
   In the former, each claim is compared with a source of evidence that supports that claim.
The source of evidence can be a knowledge graph[20] in which case we must extract subject-
predicate-object triples from the claim and verify their existence in the graph, or we can be use
as a source of evidence the information retrieved from a query to a search engine[21], having
then to compare the information obtained with the claim using techniques such as similarity,
stance detection, contradiction detection, etc.
   Content-based methods only use the textual information present in the document. The
features obtained can be latent, such as word or sentence embeddings, or explicit such as TF-IDF
vectors, bag of words vectors, word counts[22], psycho-linguistic features[23], etc. Transformers
and RNNs can also be considered as a content-based method that uses latent features.
   In context-based features the information surrounding the claim is used to verify its degree
of truthfulness. Examples of these features can be those based on propagation[24], based on the
user’s reputation[25], based on their profile[26], etc.

3.2. Task 3A - English
For the fake news detection task in English[27], from a set of news articles we have to classify
each item in one of the following categories: true, partially true, false, or other[28][29][30],
taking into account the main claim of the news article.
  The organizers provided three different training datasets[31], so we joined these three datasets
and left 20% as a dev dataset for a total of 760 training instances and 190 validation instances.
  To tackle this task we have used two different approaches. The first of them is, as in the tasks
dedicated to determining the check-worthiness of a sentence, to use transformer models to
check if the latent features that these models extract from the documents can be related to their
veracity.
  The second approach is to use the more classical ensemble methods together with various
types of features such as TF-IDF and LIWC.

3.2.1. Transformer approach
A grid search has been carried out with four different transformer models: ALBERT, BERT,
DistilBERT and Funnel-Transformer, and different batch sizes and number of epochs.
   Given that one of the limitations of the transformer models is the length of the sequence that
they accept as input, we have assumed that the relevant information for each news article is
likely to be found at the beginning of it. In this way we have extracted the first 150 and 200
tokens as input for the models. We have also tried to use the first 150 tokens of the article title
as input. As some instances had no title, in those cases we have used the first 150 tokens of the
article text. The four possible class values have been converted to integer values so that they
could be processed correctly.
   The Table 6 shows the best results obtained for each transformer model. Given that this is
a multi-class classification, we have used precision, coverage and F1 as evaluation measures,
taking this last measure as the main one. As can be seen, the title of the article does not seem to
contain enough information about its veracity, and a longer sequence length provides better
results, as expected.

3.2.2. Ensemble approach
In this second approach we use the random forest[32] and gradient boosting[33] classifiers. We
extracted the text of each article and processed it with the LIWC2015[34] text analysis tool,
Table 6
Task 3A - Transformer models analysis: results on dev dataset
        Model                                     Epochs      Batch Size    Input      Prec.   Rec.     F1
        albert-base-v2                                9           8        Text 200    0.445   0.424   0.427
        funnel-transformer-intermediate               7           8        Text 200    0.436   0.409   0.402
        albert-base-v2                                8           8        Text 150    0.418   0.398   0.397
        funnel-transformer-intermediate               9           8        Text 150    0.405   0.394   0.387
        bert-base-cased                               9           8        Text 200    0.383   0.386   0.382
        distilbert-base-cased                         6           8        Text 200    0.397   0.371   0.374
        bert-base-cased                              10           8        Text 150    0.370   0.368   0.362
        distilbert-base-cased                         9           8        Text 150    0.351   0.345   0.345
        distilbert-base-cased                         6           8        Title 150   0.354   0.367   0.344
        bert-base-cased                               6           8        Title 150   0.375   0.375   0.340
        funnel-transformer-intermediate               8           8        Title 150   0.423   0.329   0.322
        albert-base-v2                                6           8        Title 150   0.335   0.341   0.316


obtaining a total of 93 discrete features3 such as Analytic, Clout, Authentic, Tone, etc. The use of
LIWC in this task is motivated by the premise that false articles may have certain linguistic
features that are not present in legitimate articles, and this can be reflected in the results offered
by this tool. We also extract the TF-IDF vectors as features from the text of the articles.
   To build the latest feature set, for each article we do a Google search using the article title as
query terms.
   From the first 20 results obtained, we extract the domain names from each URL and concate-
nate them, separating them with spaces, constructing text strings with the shape “www.politifact.com
www.reuters.com www.nytimes.com apnews.com ...”. With these strings we also build a TF-IDF
representation. Thus, we assume that if domain names of sites dedicated to fact-checking appear
among the first 20 results, that article is at least suspected of containing some controversy. On
the other hand, if the domain names are from prestigious media, the original article, true or
false, may be important.
   To select the proper configuration, we keep the LIWC features fixed, and we try to optionally
concatenate the text TF-IDF features and the domain names TF-IDF features.
   In Random Forest the number of estimators has been established at 100, the maximum depth
of the tree at 1000 and as a criterion to evaluate the split quality gini has been used. In Gradient
Boosting the number of estimators has also been set to 100 and as a loss function deviance has
been used. The result of these tests is shown in Table 7.
   As can be seen, the Gradient Boosting classifier is superior to Random Forest in all feature
configurations. It is also able to take advantage of the information provided by all the concate-
nated features, while the Random Forest classifier obtains the best result when only the LIWC
features are used.


    3
        These are all the features that this tool provides.
Table 7
Task 3A - Ensemble models and features analysis: results on dev dataset
                Model                    Domain   Text    LIWC     Prec.    Rec.     F1
                Gradient Boosting         true    true    true     0.428    0.369   0.366
                Gradient Boosting         false   true    true     0.419    0.366   0.364
                Gradient Boosting         false   false   true     0.420    0.346   0.338
                Gradient Boosting         true    false   true     0.393    0.343   0.334
                Random Forest             false   false   true     0.386    0.335   0.319
                Random Forest             false   true    true     0.574    0.325   0.303
                Random Forest             true    true    true     0.524    0.306   0.277
                Random Forest             true    false   true     0.462    0.274   0.226


Table 8
Task 3A - Submissions official results
               Model                                             Prec.     Rec.      F1
               Gradient Boosting + Domain + Text + LIWC          0.5055    0.4805   0.4680
               Albert-base + sequence lenght 150                 0.4653    0.4109   0.4237
               Albert-base + sequence lenght 200                 0.3779    0.3742   0.3691


3.3. Task 3A Results
In this task we have made three submissions. The first one has been generated by Gradient
Boosting with the three types of features: LIWC, domain names TF-IDF and text TF-IDF. The
second submission we have done with Albert transformer with albert-base language model and
the article text as input with a sequence length of 150. Moreover, for primary submission we
have used the same type of transformer but with a sequence length of 200.
   With the best of these submissions we have achieved an F1-macro measure of 0.468 which
places us in fourth position among 25 participants.
   Table 8 shows our reproduction of the results obtained by the three submissions. Unlike
what happened in the dev dataset, with the test dataset the best model has been the Gradient
Boosting classifier that uses the features based on LIWC, domain names TF-IDF and text TF-IDF.
This tells us that although transformer models can perform well in the fake news detection task
with little or no feature engineering, the use of text analysis tools like LIWC along with other
handcrafted features can still be useful for profiling fake news.


4. Conclusions and Future Work
In this edition of CheckThat! Lab, our team has explored the two main tasks in detecting fake
news: the selection of sentences or tweets to verify and the verification of these elements
themselves.
   Regarding the check-worthiness task, we have verified that the transformer models can
extract the latent features present in the tweets more efficiently than other methods, although it
is necessary to carefully choose the appropriate data model for the task, with large performance
differences between some models and others.
   Our participation in the English version of this task has been very positive, obtaining the
first position, while in the Spanish version we have been in fourth place. We have also detected
that in Spanish the mean average precision on the dev dataset (0.495) was much lower than that
obtained in English (0.849). This may be due to the fact that the dataset used is not specifically
pre-trained on tweets or on Spanish politics.
   In the task of detecting fake news we have participated with two different approaches. On
the one hand, we have used transformer models trying to extract linguistic features that identify
fraudulent articles, and expecting good behavior from them. On the other hand, we have used
a fairly simple Gradient Boosting classifier that uses linguistic features extracted through the
LIWC tool, TF-IDF text features, and a TF-IDF representation of domain names retrieved from a
Google search. We have used this second system as contrastive submission since its results were
inferior to those of the transformer models. However, in the test dataset the best performance
was obtained with this last model.
   Being our first participation in a fake news detection task, the result was positive, obtaining
fourth place among 25 participants.
   We think that although it can always be improved, the check-worthiness task can be ap-
proached reasonably well by means of transformers models, so our future work will be mainly
devoted to investigating alternative methods to those used in this laboratory to tackle the task
of fact-checking and detection of fake news, for example using knowledge methods to verify
claims.


Acknowledgments
This work has been partially supported by the Spanish Ministry of Science and Innovation
within the DOTT-HEALTH Project (MCI/AEI/FEDER, UE) under Grant PID2019-106942RB-C32,
as well as project EXTRAE II (IMIENS 2019) and the research network AEI RED2018-102312-T
(IA-Biomed).


References
 [1] P. Nakov, G. Da San Martino, T. Elsayed, A. Barrón-Cedeño, R. Míguez, S. Shaar, F. Alam,
     F. Haouari, M. Hasanain, N. Babulkov, A. Nikolov, G. K. Shahi, J. M. Struß, T. Mandl, The
     CLEF-2021 CheckThat! Lab on Detecting Check-Worthy Claims, Previously Fact-Checked
     Claims, and Fake News, in: Proceedings of the 43rd European Conference on Information
     Retrieval, ECIR ’21, Lucca, Italy, 2021, pp. 639–649. URL: https://link.springer.com/chapter/
     10.1007/978-3-030-72240-1_75.
 [2] P. Nakov, G. Da San Martino, T. Elsayed, A. Barrón-Cedeño, R. Míguez, S. Shaar, F. Alam,
     F. Haouari, M. Hasanain, N. Babulkov, A. Nikolov, G. K. Shahi, J. M. Struß, T. Mandl,
     S. Modha, M. Kutlu, Y. S. Kartal, Overview of the CLEF-2021 CheckThat! Lab on Detecting
     Check-Worthy Claims, Previously Fact-Checked Claims, and Fake News, in: Proceedings of
     the 12th International Conference of the CLEF Association: Information Access Evaluation
     Meets Multiliguality, Multimodality, and Visualization, CLEF ’2021, Bucharest, Romania
     (online), 2021.
 [3] N. Hassan, C. Li, M. Tremayne, Detecting check-worthy factual claims in presidential
     debates, in: Proceedings of the 24th acm international on conference on information and
     knowledge management, 2015, pp. 1835–1838.
 [4] P. Gencheva, P. Nakov, L. Màrquez, A. Barrón-Cedeño, I. Koychev, A context-aware
     approach for detecting worth-checking claims in political debates, in: Proceedings of the
     International Conference Recent Advances in Natural Language Processing, RANLP 2017,
     2017, pp. 267–276.
 [5] C. Hansen, C. Hansen, J. G. Simonsen, C. Lioma, The Copenhagen Team Participation in
     the Check-Worthiness Task of the Competition of Automatic Identification and Verification
     of Claims in Political Debates of the CLEF-2018 CheckThat! Lab (2018) 8.
 [6] B. Ghanem, M. Montes-y Gomez, F. Rangel, P. Rosso, UPV-INAOE-Autoritas - Check That:
     Preliminary Approach for Checking Worthiness of Claims (2018) 6.
 [7] E. Williams, P. Rodrigues, V. Novak, Accenture at CheckThat! 2020: If you say so: Post-hoc
     fact-checking of claims using transformer-based models, arXiv:2009.02431 [cs] (2020).
     URL: http://arxiv.org/abs/2009.02431, arXiv: 2009.02431.
 [8] A. Nikolov, G. D. S. Martino, I. Koychev, P. Nakov, Team Alex at CLEF CheckThat! 2020:
     Identifying Check-Worthy Tweets With Transformer Models, arXiv:2009.02931 [cs] (2020).
     URL: http://arxiv.org/abs/2009.02931, arXiv: 2009.02931.
 [9] G. S. Cheema, S. Hakimov, R. Ewerth, Check_square at CheckThat! 2020: Claim Detection
     in Social Media via Fusion of Transformer and Syntactic Features, arXiv:2007.10534 [cs]
     (2020). URL: http://arxiv.org/abs/2007.10534, arXiv: 2007.10534.
[10] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, \. Kaiser, I. Polo-
     sukhin, Attention is all you need, in: Advances in neural information processing systems,
     2017, pp. 5998–6008.
[11] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional
     transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[12] S. Shaar, M. Hasanain, B. Hamdan, Z. S. Ali, F. Haouari, M. K. Alex Nikolov, F. A. Yavuz
     Selim Kartal, G. Da San Martino, A. Barrón-Cedeño, R. Míguez, T. Elsayed, P. Nakov,
     Overview of the CLEF-2021 CheckThat! Lab Task 1 on Check-Worthiness Estimation in
     Tweets and Political Debates, in: Working Notes of CLEF 2021—Conference and Labs of
     the Evaluation Forum, CLEF ’2021, Bucharest, Romania (online), 2021.
[13] D. Q. Nguyen, T. Vu, A. T. Nguyen, BERTweet: A pre-trained language model for English
     Tweets, arXiv preprint arXiv:2005.10200 (2020).
[14] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoy-
     anov, RoBERTa: A Robustly Optimized BERT Pretraining Approach, arXiv:1907.11692
     [cs] (2019). URL: http://arxiv.org/abs/1907.11692, arXiv: 1907.11692.
[15] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, R. Soricut, ALBERT: A Lite BERT for
     Self-supervised Learning of Language Representations, arXiv:1909.11942 [cs] (2020). URL:
     http://arxiv.org/abs/1909.11942, arXiv: 1909.11942.
[16] V. Sanh, L. Debut, J. Chaumond, T. Wolf, DistilBERT, a distilled version of BERT: smaller,
     faster, cheaper and lighter, arXiv:1910.01108 [cs] (2020). URL: http://arxiv.org/abs/1910.
     01108, arXiv: 1910.01108.
[17] Z. Dai, G. Lai, Y. Yang, Q. V. Le, Funnel-Transformer: Filtering out Sequential Redundancy
     for Efficient Language Processing, arXiv:2006.03236 [cs, stat] (2020). URL: http://arxiv.org/
     abs/2006.03236, arXiv: 2006.03236.
[18] K. Clark, M.-T. Luong, Q. V. Le, C. D. Manning, ELECTRA: Pre-training Text Encoders as
     Discriminators Rather Than Generators, arXiv:2003.10555 [cs] (2020). URL: http://arxiv.
     org/abs/2003.10555, arXiv: 2003.10555.
[19] P. J. O. Suárez, L. Romary, B. Sagot, A Monolingual Approach to Contextualized Word
     Embeddings for Mid-Resource Languages, Proceedings of the 58th Annual Meeting of the
     Association for Computational Linguistics (2020) 1703–1714. URL: http://arxiv.org/abs/
     2006.06202. doi:10.18653/v1/2020.acl-main.156, arXiv: 2006.06202.
[20] G. L. Ciampaglia, P. Shiralkar, L. M. Rocha, J. Bollen, F. Menczer, A. Flammini, Computa-
     tional Fact Checking from Knowledge Networks, PLOS ONE 10 (2015) e0128193. URL: http:
     //dx.plos.org/10.1371/journal.pone.0128193. doi:10.1371/journal.pone.0128193.
[21] G. Karadzhov, P. Nakov, L. Marquez, A. Barron-Cedeno, I. Koychev, Fully Automated Fact
     Checking Using External Sources, arXiv:1710.00341 [cs] (2017). URL: http://arxiv.org/abs/
     1710.00341, arXiv: 1710.00341.
[22] J. T. Hancock, L. E. Curry, S. Goorha, M. Woodworth, On lying and being lied to: A linguistic
     analysis of deception in computer-mediated communication, Discourse Processes 45 (2007)
     1–23. Publisher: Taylor & Francis.
[23] R. Mihalcea, C. Strapparava, The lie detector: Explorations in the automatic recognition of
     deceptive language, in: Proceedings of the ACL-IJCNLP 2009 Conference Short Papers,
     Association for Computational Linguistics, 2009, pp. 309–312.
[24] J. Zhang, L. Cui, Y. Fu, F. B. Gouza, Fake news detection with deep diffusive network
     model, arXiv preprint arXiv:1805.08751 (2018).
[25] P. Nakov, T. Mihaylova, L. Màrquez, Y. Shiroya, I. Koychev, Do not trust the trolls:
     Predicting credibility in community question answering forums, in: Proceedings of the
     International Conference Recent Advances in Natural Language Processing, RANLP 2017,
     2017, pp. 551–560.
[26] K. Shu, X. Zhou, S. Wang, R. Zafarani, H. Liu, The Role of User Profile for Fake News Detec-
     tion, arXiv:1904.13355 [cs] (2019). URL: http://arxiv.org/abs/1904.13355, arXiv: 1904.13355.
[27] G. K. Shahi, J. M. Struß, T. Mandl, Overview of the CLEF-2021 CheckThat! Lab Task 3
     on Fake News Detection, in: Working Notes of CLEF 2021—Conference and Labs of the
     Evaluation Forum, CLEF ’2021, Bucharest, Romania (online), 2021.
[28] G. K. Shahi, A. Dirkson, T. A. Majchrzak, An exploratory study of covid-19 misinformation
     on twitter, Online Social Networks and Media 22 (2021) 100104. Publisher: Elsevier.
[29] G. K. Shahi, D. Nandini, FakeCovid – A Multilingual Cross-domain Fact Check News
     Dataset for COVID-19, in: Workshop Proceedings of the 14th International AAAI Confer-
     ence on Web and Social Media, 2020. URL: http://workshop-proceedings.icwsm.org/pdf/
     2020_14.pdf.
[30] G. K. Shahi, AMUSED: An Annotation Framework of Multi-modal Social Media Data,
     arXiv preprint arXiv:2010.00502 (2020).
[31] G. K. Shahi, J. M. Struß, T. Mandl, Task 3: Fake news detection at CLEF-2021 CheckThat!,
     2021. URL: https://doi.org/10.5281/zenodo.4714517. doi:10.5281/zenodo.4714517.
[32] L. Breiman, Random forests, Machine learning 45 (2001) 5–32. Publisher: Springer.
[33] J. H. Friedman, Stochastic gradient boosting, Computational Statistics & Data Analysis
     38 (2002) 367–378. URL: https://linkinghub.elsevier.com/retrieve/pii/S0167947301000652.
     doi:10.1016/S0167-9473(01)00065-2.
[34] J. W. Pennebaker, R. L. Boyd, K. Jordan, K. Blackburn, The development and psychometric
     properties of LIWC2015, Technical Report, 2015.