“To trust a LIAR”: Does Machine Learning Really
Classify Fine-grained, Fake News Statements?
Mark Mifsud1 , Colin Layfield1 , Joel Azzopardi2 and John Abela1
1 Dept of Computer Information Systems, Faculty of ICT, University of Malta, Msida, Malta
2 Dept of Artificial Intelligence, Faculty of ICT, University of Malta, Msida, Malta


Abstract
Fake news refers to deceptive online content and is a problem that causes social harm [1]. Early detection of fake news is therefore a critical but challenging problem. In this paper we attempt to determine whether state-of-the-art models, trained on the LIAR dataset [2], can be leveraged to reliably classify short claims according to 6 levels of veracity that range from “True” to “Pants on Fire” (absolute lies). We investigate the application of the transformer models BERT [3], RoBERTa [4] and ALBERT [5], which have previously performed very well on several natural language processing tasks, including text classification. A simple neural network (FcNN) was also used to enhance each model’s result by utilising the sources’ reputation scores¹. We achieved higher accuracy than previous studies that used more data or more complex models. Yet, after evaluating the models’ behaviour, numerous flaws emerged. These include bias and the fact that the models do not really capture veracity, which makes them prone to adversarial attacks. We also consider the possibility that purely language-based fake news classification of such short statements is an ill-posed problem.

Keywords
Fake News, Natural Language Processing, Artificial Intelligence, Deep Learning, Transformer Models




1. Introduction
Social media has made the creation and spreading of information easier, quicker and cheaper
than ever before. This has resulted in an epidemic of what is termed ‘fake news’: content
that deliberately gives false information to deceive and manipulate, often with negative
consequences [1].

1.1. Fake News Classification
Since fake news spreads fast, early detection is necessary to limit its spread and,
consequently, the harm it causes. The use of Machine Learning (ML) techniques for Natural
Language Processing (NLP) is one way to build classifiers that could serve as potential early detectors.
Two distinct approaches that use NLP-based models [6] are:

               ¹ All the source code used is available at: https://github.com/MarkMifsud/To-Tust-A-Liar
OHARS’21: Second Workshop on Online Misinformation- and Harm-Aware Recommender Systems, October 2, 2021,
Amsterdam, Netherlands
mark.mifsud.16@um.edu.mt (M. Mifsud); colin.layfield@um.edu.mt (C. Layfield); joel.azzopardi@um.edu.mt
(J. Azzopardi); john.abela@um.edu.mt (J. Abela)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org



   1. Model-Based approaches: A ML model is used to find some reliable features in a dataset
      that correlate to the classification label.
   2. Feature-Based approaches: The ML model relies on pre-defined linguistic or textual
      features assumed to indicate deception.

Our approach is model-based, since transformer models are used to learn features that correlate
short statements with their respective class labels.

1.2. Transformers
Transformers are deep learning architectures that have changed the face of NLP research in
recent years. A transformer is initially trained on a large corpus of text to build a ‘language
model’. This model represents words as vectors, yet unlike traditional word embeddings (like
Word2Vec), the representation of each word is sensitive to the context within which the word
occurs. The trained transformer can then be used for various NLP tasks, such as Question
Answering, Named Entity Recognition, Text Classification and others [3].
   Google’s BERT [3] was the first to gain popularity, since it performed very well on a number
of NLP benchmark tasks. Facebook later released RoBERTa which, although sharing many
features with BERT, was pre-trained using different algorithms on an English corpus 10 times
larger, and also supports a larger vocabulary [4]. RoBERTa outperformed BERT in multiple
instances. ALBERT, by Google and Toyota, is an optimised version of the original BERT that
is highly scalable, making larger architectures possible. ALBERT, too, performed better than
previous attempts on many NLP tasks [5].

1.3. The LIAR Dataset
The LIAR dataset [2] contains 12,836 short claims by prominent players in US politics, extracted
from politifact.com. Statements are labelled as either True, Mostly-True, Half-True, Barely-True
(mostly false), False or Pants-on-Fire (6 classes). This makes the measure of veracity more finely
grained than a binary (real or fake) label [2], which is appropriate since statements can contain a
mix of true and false claims.
   Each statement in LIAR also comes with textual metadata, including the job title of the speaker,
the speaker’s affiliation, state of origin and the context in which the claim was uttered (namely
an interview, a debate or another event) [2]. The speaker’s reputation is a set of numeric values
giving the total number of that speaker’s claims under each category of truthfulness. These
values are important in some studies, where they are referred to as the speaker’s history,
credibility or reputation.

1.4. Previous Works on LIAR
Among the studies leading to this one, two are particularly relevant to our approach. Kirilin &
Strube (2018) [7] achieved the best accuracy score (45.7%) on the 6-way classification, while Liu
et al (2019) [8] were among the first to use BERT to classify LIAR entries.
   Kirilin & Strube represented the statements and the textual metadata as FastText [9] word
embeddings and subsequently used LSTMs to carry out the classification. The reputation score



and the classification result were then combined using an attention mechanism. A dense neural
network performs a final classification [7].
   Liu et al used BERT-base to classify the statements and textual metadata (except for the
speakers’ names). BERT’s output vector and the reputation were then utilised in an attention
network, followed by a simple neural network, to carry out the classification. This entire layout
was repeated twice: the first stage produced a coarse-grained (true or false) classification, and
its output was passed to the second stage, together with the initial input, to derive a final 6-way
classification [8].

1.5. Ill-posed & Ill-conditioned Problems
The mathematical definition of a well-posed problem is attributed to mathematician Jacques
Hadamard [10]. Pattern classification problems can be viewed as well-posed or ill-posed
problems in the sense of Hadamard [11].
  A problem is well-posed if it satisfies the following 3 criteria [12]:

   1. It has a solution
   2. It has only one, uniquely defined solution; and
   3. The solution’s behaviour changes continuously with the initial conditions

   Thus, an ill-posed problem is one that fails one or more of the above criteria [12]. Such
problems require some modification before they can be solved or approximated, which may
include additional data, measurements or boundaries [12].
   When considering a few of the short statements from the LIAR dataset, it becomes apparent
that many statements can have multiple possible truth-levels based on context, time or who the
speaker is. For instance, the veracity of the statement: “I am pro-life, he is not” will depend on
the speaker, the subject and their opinions at a given time.
   These multiple possible solutions violate the second criterion for a well-posed problem, giving a
strong reason to believe that purely language-based classification of fake news is an ill-posed
problem. Another class of problems is called ill-conditioned. Such problems may not satisfy
the definition of ill-posed problems but are considered similarly unstable for practical purposes,
since a small change in the input results in a large change in the output [12].


2. Aims and Objectives
The principal aim of this research was to build a fake news classifier that matches, or exceeds,
previous classifiers’ accuracy scores on the LIAR dataset. The objectives towards this aim included:

   1. To analyse the effectiveness of transformers, given that they are the current state-of-the-
      art NLP models, for the classification task.
   2. To investigate whether better trained or larger transformers can achieve a higher accuracy.
   3. To evaluate whether the overall classification result can be enhanced by adding neural
      network layers that use both the transformer’s output and the source’s reputation score.




 Two other objectives were added with the aim of determining the reliability of the resulting
models.

   4. To investigate the bias, behaviour and learning of the resulting models.
   5. To investigate the argument that fake news classification, performed using only NLP (with-
      out the classifier having knowledge of the real world), is an ill-posed or ill-conditioned
      problem.


3. Design & Implementation
3.1. The Classification Models
Three BERT variants were used in order to determine if the differences in transformer size,
pre-training or optimisations matter (as initially hypothesised). These were:

   1. BERT-base, the smallest of the models used.
   2. RoBERTa-Large: a larger model pre-trained on a larger amount of data.
   3. ALBERT-Large-V2: comparable to a larger version of BERT that was pre-trained for a
      longer time.

   6-way classification for different levels of truthfulness was performed on the statements in
LIAR without the use of any metadata.
Fine-tuning of the transformers was managed manually owing to limited disk space. 80%
of LIAR’s data was used for fine-tuning (training), 10% for validation and 10% for
testing. LIAR comes already split into these segments, allowing a fair comparison with
results reported in other studies.
   BERT and RoBERTa converged after two epochs at learning rates of 1.8×10⁻⁵ and 2.2×10⁻⁵ respec-
tively. ALBERT took 4 epochs at a learning rate of 2.2×10⁻⁵. The batch size for training all models
was 64.

3.1.1. Using the Reputation Score
To use both the statement and the reputation scores for classification we created FcNN, a Fully
Connected Neural Network. This was necessary because NLP transformers cannot directly perform
classification with numerical data such as the reputation score [8]. The FcNN has
24 nodes in its first hidden layer, 12 in the second hidden layer, and an output layer of 6 nodes,
each corresponding to one of the classes. All layers use the tanh activation function, since the
values of the transformers’ output vector range from approximately -1 to 1, matching the upper
and lower limits of the hyperbolic tangent function.
   Each of the transformers produces a classification vector consisting of 6 values in its final
layer, which can be extracted programmatically. These are input to the FcNN together with the
6 values of the reputation score (Figure 1), after the latter are normalised (divided by 200, a
value close to the largest reputation score). The FcNN is then trained at a learning rate of
9×10⁻⁴. To avoid overfitting, the result was checked every 500 epochs of training and the best-
fitting model was used. In every run, the FcNN managed to fit in fewer than 9,000 epochs.
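The FcNN forward pass described above (6 transformer outputs plus 6 normalised reputation counts, through hidden layers of 24 and 12 tanh nodes, to 6 outputs) can be sketched as follows. The weights here are random placeholders rather than trained parameters, and the input values are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder weights: the paper specifies only the layer sizes
# (12 -> 24 -> 12 -> 6) and the tanh activation, not trained parameters.
W1, b1 = rng.standard_normal((12, 24)) * 0.1, np.zeros(24)
W2, b2 = rng.standard_normal((24, 12)) * 0.1, np.zeros(12)
W3, b3 = rng.standard_normal((12, 6)) * 0.1, np.zeros(6)

def fcnn_forward(transformer_out_6, reputation_6):
    """Concatenate the transformer's 6 output values with the 6 reputation
    counts (normalised by 200) and run the FcNN forward pass."""
    x = np.concatenate([transformer_out_6, np.asarray(reputation_6) / 200.0])
    h1 = np.tanh(x @ W1 + b1)
    h2 = np.tanh(h1 @ W2 + b2)
    return np.tanh(h2 @ W3 + b3)   # 6 scores, one per veracity class

out = fcnn_forward(np.array([0.1, -0.8, 0.3, 0.05, 0.9, -0.2]),
                   [0, 30, 0, 51, 37, 14])
```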



Figure 1: Our Transformer+FcNN architecture.


   In a separate attempt, the FcNN was applied on its own, for classification using only the reputation
vector. This provided a baseline against which the other models could be compared (Table 3).
   None of the textual metadata was used, because a speaker’s name, job or similar details were
considered unrelated to a statement’s truthfulness. Furthermore, these values either repeat
frequently, are often null, or are not normalised (non-atomic, with different spellings for
the same value). Because of this, we were concerned that they would bias the models unnecessarily.

3.2. Quantifying the Classifiers’ Bias
Relying too heavily on an individual’s reputation may result in labelling liars instead of lies [7].
To test whether this was the case with our models, a small set of 226 statements was used as a test
set for our baseline, FcNN-only model, which utilises only the reputation score.
  This test set’s 226 statements are truths from liars and lies from mostly-honest speakers.
They were chosen by computing each speaker’s honesty ratio P, a measure of how honest a
speaker is, based on the classification of each speaker’s claims, such that:

                       P = 1.5 (True − PantsOnFire) + (MostlyTrue − False)                     (1)
   The numerical difference between a speaker’s true and pants-on-fire statements is multiplied
by 1.5 to give it a higher weighting. Speakers with values close to zero (balanced liars) were ignored.
Those with scores less than -15 are considered liars, so we take their truthful statements. Speakers



scoring more than 4 are honest ones, for which we take their lies. These cut-off points were
chosen because speakers with an honesty ratio between -15 and 4 were ones with relatively
few claims. The asymmetry results from the fact that the label scheme is itself skewed towards
falsehood (3 false labels, 1 neutral and 2 true). Thus, the set of 226 statements was collected.
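The honesty ratio of Eq. (1) and the cut-offs just described can be sketched as follows. The category key names are hypothetical, and the selection labels are for illustration only:

```python
def honesty_ratio(c: dict) -> float:
    """Honesty ratio P from Eq. (1): the true/pants-on-fire difference is
    weighted by 1.5, the mostly-true/false difference by 1."""
    return 1.5 * (c.get("true", 0) - c.get("pants_on_fire", 0)) \
         + (c.get("mostly_true", 0) - c.get("false", 0))

def bias_test_pick(counts: dict):
    """Apply the cut-offs described above: P < -15 -> liar (keep their truths),
    P > 4 -> honest speaker (keep their lies), otherwise ignore the speaker."""
    p = honesty_ratio(counts)
    if p < -15:
        return "take-truthful-statements"
    if p > 4:
        return "take-lies"
    return None
```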
   While only the FcNN-only model (trained normally on LIAR’s training set) was used to classify
these 226 statements, it was expected that even the FcNN models using the transformers’ output
would be prone to the same bias, if confirmed.

3.3. Investigating the Effect of Data Quality on Learning
We also trained the same models on datasets of different data quality from LIAR and compared
the results. This was meant to reveal the effect that data quality has on the models’ learning,
and whether the models are truly able to learn the intended classification task.
   For this task, two variations of LIAR were created. The first is called Shuffled-LIAR and was
obtained by randomly shuffling the spoken claims attribute among all entries in the training set,
while leaving every other attribute (column) untouched. By having a dataset with randomised
text and all other attributes untouched, we can better determine how much the text really affects
the result. If the same results on the actual, unshuffled set are also achievable on a completely
random set, then one can conclude that the results are accidental and hence insignificant.
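The Shuffled-LIAR construction can be sketched as below, assuming each training entry is a record with a hypothetical `statement` field; only that field is permuted, every other attribute stays with its original row:

```python
import random

def make_shuffled_liar(rows, seed=0):
    """Randomly permute only the statement text across the training rows,
    leaving every other attribute (label, speaker, reputation, ...) in place."""
    rows = [dict(r) for r in rows]                 # copy; don't mutate the input
    statements = [r["statement"] for r in rows]
    random.Random(seed).shuffle(statements)
    for r, s in zip(rows, statements):
        r["statement"] = s
    return rows
```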
   Additionally, the Cleaned-LIAR dataset was created in order to allow training and testing
on data of better quality (fewer errors) [13]. This was done by compensating for flaws found in
LIAR². Cleaned-LIAR omits 207 entries that were discovered not to be stated claims at all [13].
For example, some entries are test data, many indicate whether a speaker changed opinion
(known as flip-flops on Politifact.com), while others are in Spanish (so the words used would
not be in the vocabulary of transformers trained on an English corpus).
   The spelling and grammar of the statements were also corrected manually, under the as-
sumption that since the transformers were pre-trained on good quality text and have a limited
vocabulary, classification may receive a boost from these corrections. If accuracy is not improved
when training on this set, this may suggest that the flawed entries were responsible for the
higher accuracy on the original (unchanged) LIAR.
   On Cleaned-LIAR, BERT was trained for 2 epochs at a 2×10⁻⁵ learning rate. RoBERTa and
ALBERT were trained at a learning rate of 1.2×10⁻⁵ for 3 and 2 epochs respectively. The models
failed to fit on Shuffled-LIAR.

3.4. Testing the Ill-conditioned Property
If instances of the same basic model produce highly varied classifications when trained on
gradually differing data (and tested on the same test data), this may indicate that the
problem is ill-conditioned (at least in the way the problem is treated here).
    Five copies of the same transformer were trained with training data that varies proportionally
each time. The dataset’s original training and validation portions were joined and their order
    ² Spelling and other mistakes in LIAR mostly result from how the data was scraped from politifact.com to produce
the dataset.



randomised. The resulting set was then stratified and split into 5 folds (parts), such that all
folds contain a virtually identical number of statements and variety of labels (truth levels).
This keeps the label balance identical across folds. For each of the five
training runs, a different combination of 4 folds would be used for training, and the fifth would
be used for validation. The test set was the same in each of the 5 runs.
   This classification was performed with both LIAR and Cleaned-LIAR separately, using the
BERT-base model. Then this was all repeated with RoBERTa-Large. For comparison, the same
procedure was repeated using the two transformers to carry out a 5-way sentiment analysis on
the Stanford Sentiment Treebank (SST-5) dataset [14]. This would offer a baseline. Assuming
sentiment analysis is well-conditioned, fake-news classification would give a similar variability
to sentiment only if it is well-conditioned too.
   The same hyperparameters were used to train the transformers (2×10⁻⁵ learning rate, batch
size 64, for 2 epochs).
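The stratified 5-fold split described above can be sketched as follows. This round-robin scheme is one simple way to keep the label balance virtually identical across folds; the authors' exact procedure may differ:

```python
from collections import defaultdict

def stratified_folds(labels, k=5):
    """Split example indices into k folds with a near-identical label balance,
    by dealing each label's indices out round-robin."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_label.values():
        for i, idx in enumerate(indices):
            folds[i % k].append(idx)
    return folds

# Each training run then uses 4 folds for training and the remaining one
# for validation, with a single fixed test set shared across all runs.
```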


4. Results & Evaluation
In our evaluation, we achieved a higher accuracy than other results reported in the
literature. However, all other test results suggest that our models are flawed despite this
higher accuracy. This is described in more detail below.

4.1. Classifier Accuracy
The transformer-only classifiers had a performance similar to Wang’s previous attempts that
utilise statements alone [2], showing they are at least as effective at classification as previous
deep learning models (Table 1).
   All of our Transformer+FcNN models exceeded the accuracy results of Kirilin & Strube (2018)
and Liu et al (2019), in spite of these studies using more data. This vindicates our decision to
avoid using the textual metadata. Furthermore, our BERT model performs better than Liu et al’s
system despite having a far simpler architecture (Figure 1). All transformers produced similar
accuracy scores: a bigger or better-trained transformer only marginally improves fake news
classification.

4.2. Reputation Bias
When classifying truthful statements from liars and lies from honest speakers, the FcNN displays
a clear bias caused by its reliance on reputation, as visible in the confusion matrix in
Table 2.

4.3. Effect of Data Quality on Training
The fact that the transformers did not manage to properly fine-tune on Shuffled-LIAR indicates
that the models correlate some features with the labels, and that no such correlating feature
occurs in randomised text. However, when trained on the less noisy Cleaned-LIAR, the performance of




Table 1
Comparing to previous work: model, data used and accuracy score achieved on LIAR.
                                                                           Data Used                     Accuracy
 Study                     Model/Architecture              Statement   Metadata   Reputation   External   Score
 Wang 2017 [2]             SVM                                 +                                          25.5%
                           CNNs + LSTM                         +                                          27.0%
                           CNNs + LSTM                         +          +          +                    27.4%
 Long et al 2017 [15]      LSTM + Attention                    +                                          25.5%
                           LSTM + Attention                    +          +          +                    41.5%
 Karimi et al 2018 [16]    CNN +LSTM                           +                                          29.1%
                           CNN + LSTM                          +          +          +           +        34.8%
 Pham 2018 [17]            Dual Attention                      +          +                               37.3%
                           Memory Attention Network            +                     +                    44.2%
 Kirilin &                 LSTM+Attention                      +          +                               41.5%
 Strube 2018 [7]           LSTM+Attention                      +          +          +                    45.7%
 Liu et al 2019 [8]        2 stage BERT_base + Attention       +                                          34.5%
                           2 stage BERT_base + Attention       +          +          +                    40.6%
 Ours                      BERT_base                           +                                          27.7%
 (transformer only)        RoBERTa_Large                       +                                          27.3%
                           ALBERT_Large_V2                     +                                          28.2%
 Ours                      BERT_base + FcNN                    +                     +                    48.0%
                           RoBERTa_Large + FcNN                +                     +                    47.9%
                           ALBERT_Large_V2 + FcNN              +                     +                    48.6%


Table 2
Confusion matrix showing how reputation scores alone bias truths from liars and lies from honest
speakers
               Actual                                         Predicted
                label          Pants on fire    Fake   Mostly Fake Half True    Mostly True    TRUE
             Pants                  0             2         0           0           23           0
             Fake                   0             2         0           0           94           0
             Mostly Fake            0             2         0           0           65           0
             Half True              0             0         0           0            1           0
             Mostly True            0            22         0           0            0           0
             TRUE                   0            14         0           0            0           0


the transformers without FcNN was generally poorer (Table 3). RoBERTa is the exception in this
case, since the cleaned set resulted in marginally better performance.
   This unexpected result raised the question of whether the models are really modelling veracity.
A test to this effect was carried out, as described below.

4.4. Is Veracity Being Modelled?
Consider the following true statement that was classified correctly:

                      “One out of every four homeless people on our streets is a veteran.”




Table 3
Accuracy scores on data of different quality.
                                                       Accuracy
                         Classifier       LIAR    Cleaned-LIAR Shuffled-LIAR
                     FcNN only           44.58%       44.78%          44.58%
                     BERT-base Only      27.67%       25.98%          20.73%
                     BERT+FcNN           48.17%       48.84%          45.36%
                     RoBERTa Only        27.28%       27.81%          20.81%
                     RoBERTa+FcNN        47.31%       49.40%          46.45%
                     ALBERT Only         28.22%       25.59%          20.34%
                     ALBERT+FcNN         48.56%       47.81%          44.66%


Table 4
Variance & Overall Mean Square Error on 5 classification tasks.
                  Transformer           Dataset       Mean Variance   Overall MSE
                  BERT-base              SST-5            0.12            0.79
                  RoBERTa-Large          SST-5            0.11            0.54
                  BERT-base           Cleaned-LIAR        0.48            2.96
                  RoBERTa-Large       Cleaned-LIAR        0.61            3.07
                  BERT-base               LIAR            0.64            3.13
                  RoBERTa-Large           LIAR            0.64            3.03


  A change in the fine-grained classification of the statement is expected if any of the following
changes is made:

    • Negation of the statement: “One out of every four homeless people on our streets
      is not a veteran.”
    • Reducing the probability of the statement: “One out of every four homeless people on our
      streets is a friendly veteran.”
    • Contradiction: “One out of every four homeless people on our streets is not homeless.”

  Nevertheless, all such modifications are still classified as fully true, showing that the models
are not modelling deception or veracity, thus making them vulnerable to adversarial attacks.

4.5. The Ill-conditioned Property
Under gradual changes in the input data, fake news classification results varied by more than
4 times as much as sentiment analysis on SST-5 (Table 4). The Mean Square Error (MSE) of each
run measures how far the classifications deviate from their target labels; taking an overall MSE
across the 5 runs, fake news classification shows considerable changes in output (Table 4).
  By the definition of ill-conditioned problems [12], all these are a strong indication that fake
news classification of short statements using transformers is an ill-conditioned problem.
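The two instability measures reported in Table 4 can be computed as sketched below, assuming per-statement population variance of the predicted labels across runs, and squared error of every prediction against its target label. The paper does not spell out the exact formulas, so this is one plausible reading:

```python
def instability_metrics(runs, targets):
    """Mean per-statement variance of predicted labels across training runs,
    and the overall MSE of all predictions against the target labels.
    `runs` is a list of prediction lists, one per training run."""
    n_runs, n = len(runs), len(targets)
    mean_variance = 0.0
    for j in range(n):
        preds = [run[j] for run in runs]
        mu = sum(preds) / n_runs
        mean_variance += sum((p - mu) ** 2 for p in preds) / n_runs
    mean_variance /= n
    overall_mse = sum((run[j] - targets[j]) ** 2
                      for run in runs for j in range(n)) / (n_runs * n)
    return mean_variance, overall_mse
```

A low mean variance (as with SST-5) indicates that retraining on slightly different folds barely changes the predictions; the much larger values on LIAR are what the section interprets as ill-conditioning.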




4.6. Is NLP-based Fake-news Classification Ill-posed?
Factors supporting the case that NLP-based, fake news classification is an ill-posed problem
include:

   1. There appears to be no indicator of truthfulness or deception within LIAR’s statements
      unless one has knowledge of the real world. Sentiment Analysis by contrast, can be based
      on the presence of certain words or expressions.
   2. Feature Based detection does not generalise over domains [18].
   3. Khan et al 2019, observed that “the performance of models is not dataset invariant” [19].
   4. Accurate but non-explainable models are not necessarily reliable. Assuming so is an
      ‘affirming the consequent’ fallacy³.
   5. The models produced by this study and at least another previous one (Fakebox) are not
      modelling veracity [20].
   6. Psychology shows that people lie differently. Even the same person’s indicators of
      deception change over time within the same interview and are influenced by numerous
      factors [21, 22].
   7. The models in this study are at least ill-conditioned, as shown [13].

   All of these point to the likelihood that there cannot be a model that maps a string of text to
truth levels without knowledge of the world. This likelihood is strong for models trained
on LIAR, and is demonstrated for our models despite their relatively higher accuracy.


5. Conclusions & Future Work
5.1. Conclusions
Our best models achieve a higher accuracy on the LIAR dataset utilising the spoken statements
and the speakers’ reputation alone, outperforming methods that used more data, more
complex models, or both. BERT and its variants can be leveraged to classify short statements
more accurately.
   Bigger and better trained transformers yielded only a marginal improvement over the smaller
BERT-base transformer. In our case, although ALBERT-Large did perform better than BERT-base,
fake news classification accuracy did not scale in proportion to the transformer size.
   The most important insights resulted from testing beyond accuracy scores. Flaws were found
and these led to questioning the whole idea of language-based classification of content according
to deception or veracity. Issues were also identified with the LIAR dataset and these flaws were
used to test the effect that data quality has on the models’ ability to learn the task. Specifically,
we show that although the models’ accuracy on LIAR is better than random, the language
transformers’ contribution to the classification was generally poorer when trained and tested
on cleaner data.


    ³ Good models give a high accuracy. These models give a high accuracy; therefore, they are good. This is a
logical fallacy known as Affirming the Consequent.



   Furthermore, when compared with sentiment classification, fake news classification appears
to be an unstable problem. We put forward arguments that suggest that purely NLP-based,
fake-news classification on short statements, such as those found in LIAR, is not robust since it
presents traits of ill-posed and ill-conditioned problems.
   A simple test indicates that these models are not really modelling deception or veracity and
are thus vulnerable to adversarial attacks.
   The models herein, while improving on previous accuracies, were thus proven unreliable for
classifying arbitrary claims. The biggest contributor to the higher score was the reputation
score, which was shown to bias the models.

5.2. Future Work
We used only the text and the speaker’s reputation in our tests, achieving a better score. An
ablation study could also be done to analyse the impact of each attribute on the result.
   Future studies can attempt similar investigations with the use of constructed features like
part-of-speech tagging or dependency parsing together with those utilised internally by the
transformer.
It would also be interesting to establish and standardise a variety of tests and metrics to assess
the quality of a fake news classifier by testing behaviour rather than mere accuracy scores: the
ability to truly model veracity, stability (whether it is well-conditioned or not), the ability
to generalise over domains, and tests for bias. The ability to truly model veracity or deception
deserves particular attention in future work, since it determines a classifier’s robustness against
adversarial attacks.

5.3. Recommendations
Researchers need to be aware of the flaws in the LIAR dataset.
   Future studies are recommended to treat purely NLP-based fake news detection as an ill-posed
problem, especially those utilising arbitrary or non-explainable features. Using knowledge
graphs to store knowledge about the real world is one potential way to regularise this ill-posed
problem.
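To make the knowledge-graph idea concrete, the minimal sketch below (with invented triples and a deliberately simplified matching rule; real systems would need entity linking and relation extraction first) shows how a claim's factual core could be checked against stored world knowledge rather than against linguistic features alone.

```python
# Illustrative only: a tiny knowledge graph of (subject, relation,
# object) triples, and a lookup that labels a claim triple as
# supported, contradicted, or unverifiable.

KNOWLEDGE_GRAPH = {
    ("malta", "capital", "valletta"),
    ("water", "boils_at_celsius", "100"),
}

def verify_triple(subject: str, relation: str, obj: str) -> str:
    """Label a claim triple against the knowledge graph."""
    if (subject, relation, obj) in KNOWLEDGE_GRAPH:
        return "supported"
    # Same subject and relation but a different object contradicts
    # the stored fact.
    if any(s == subject and r == relation for s, r, _ in KNOWLEDGE_GRAPH):
        return "contradicted"
    return "unverifiable"

print(verify_triple("malta", "capital", "valletta"))  # supported
print(verify_triple("malta", "capital", "mdina"))     # contradicted
```

Grounding decisions in such external facts constrains the solution space, which is precisely the kind of regularisation an ill-posed problem requires.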
   Lastly, our models stand as examples of why analysis of a model’s behaviour, rather than
mere accuracy, should be the judge of how good the model is. Going forward, we believe this to
be essential for mitigating the fake news problem effectively.


References
 [1] A. Coleman, ‘Hundreds dead’ because of Covid-19 misinformation, 2020. URL: https:
     //www.bbc.com/news/world-53755067.
 [2] W. Y. Wang, “Liar, Liar Pants on Fire”: A new benchmark dataset for fake news detection, in:
     Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics
     (Volume 2: Short Papers), ACL, Vancouver, Canada, 2017, pp. 422–426.




 [3] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional
     transformers for language understanding, in: Proceedings of NAACL-HLT 2019, Minneapolis,
     USA, 2019, pp. 4171–4186.
 [4] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer,
     V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, CoRR abs/1907.11692
     (2019).
 [5] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, R. Soricut, ALBERT: A lite BERT for
     self-supervised learning of language representations, 2020. arXiv:1909.11942.
 [6] R. Zafarani, X. Zhou, K. Shu, H. Liu, Fake News Research: Fundamental Theories, Detection
     Strategies & Open Problems, 2019. URL: https://www.fake-news-tutorial.com.
 [7] A. Kirilin, M. Strube, Exploiting a speaker’s credibility to detect fake news, in: Proceedings
     of the Data Science, Journalism & Media workshop at KDD (DSJM18), 2018.
 [8] C. Liu, X. Wu, M. Yu, G. Li, J. Jiang, W. Huang, X. Lu, A two-stage model based on BERT
     for short fake news detection, in: C. Douligeris, D. Karagiannis, D. Apostolou (Eds.),
     Knowledge Science, Engineering and Management, 2019, pp. 172–183.
 [9] P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching word vectors with subword
     information, Transactions of the Association for Computational Linguistics 5 (2017)
     135–146.
[10] J. Hadamard, Sur les problèmes aux derivées partielles et leur signification physique,
     Princeton University Bulletin (1902) 49–52.
[11] P. Yee, S. Haykin, Pattern classification as an ill-posed, inverse problem: a regularization
     approach, in: 1993 IEEE International Conference on Acoustics, Speech, and Signal
     Processing, volume 1, 1993, pp. 597–600. doi:10.1109/ICASSP.1993.319189.
[12] S. I. Kabanikhin, Definitions and examples of inverse and ill-posed problems, Journal of
     Inverse and Ill-posed Problems 16 (2008) 317–357.
[13] M. Mifsud, “To Trust a LIAR”: Does machine learning really classify fine-grained, fake
     news statements? (Bachelor’s dissertation), 2020. URL: https://www.um.edu.mt/library/
     oar/handle/123456789/76880.
[14] R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Ng, C. Potts, Recursive deep
     models for semantic compositionality over a sentiment treebank, in: Proceedings of the
     2013 Conference on Empirical Methods in Natural Language Processing, ACL, Seattle,
     Washington, USA, 2013, pp. 1631–1642.
[15] Y. Long, Q. Lu, R. Xiang, M. Li, C.-R. Huang, Fake news detection through multi-perspective
     speaker profiles, in: Proceedings of the Eighth International Joint Conference on Natural
     Language Processing (Volume 2: Short Papers), Taipei, Taiwan, 2017, pp. 252–256.
[16] H. Karimi, P. Roy, S. Saba-Sadiya, J. Tang, Multi-source multi-class fake news detection,
     in: Proceedings of the 27th International Conference on Computational Linguistics, ACL,
     Santa Fe, New Mexico, USA, 2018, pp. 1546–1557.
[17] T. T. Pham, A study on deep learning for fake news detection, 2018. URL: https://core.ac.
     uk/download/pdf/156904536.pdf.
[18] T. Gröndahl, N. Asokan, Text analysis in adversarial settings: Does deception leave a
     stylistic trace?, 2019. arXiv:1902.08939.
[19] J. Y. Khan, M. T. I. Khondaker, A. Iqbal, S. Afroz, A benchmark study on machine learning
     methods for fake news detection, CoRR abs/1905.04749 (2019).



[20] Z. Zhou, H. Guan, M. Bhat, J. Hsu, Fake news detection via NLP is vulnerable to adver-
     sarial attacks, Proceedings of the 11th International Conference on Agents and Artificial
     Intelligence (ICAART 2019) (2019).
[21] D. B. Buller, J. K. Burgoon, Interpersonal deception theory, Communication Theory 6
     (1996) 203–242.
[22] J. K. Burgoon, D. B. Buller, C. H. White, W. Afifi, A. L. S. Buslig, The role of conversational
     involvement in deceptive interpersonal interactions, Personality and Social Psychology
     Bulletin 25 (1999) 669–686.



