<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Suspicious Sentence Detection and Claim Verification in the COVID-19 Domain</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Elitsa Pankovska</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Konstantin Schulz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Georg Rehm</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DFKI GmbH</institution>
          ,
          <addr-line>Berlin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>TU Berlin</institution>
          ,
          <addr-line>Berlin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The processing, identification and fact checking of online information has received a lot of attention recently. One of the challenges is that scandalous or “blown up” news tend to become viral, even when coming from unreliable sources. Particularly during a global pandemic, it is crucial to find eficient ways of determining the credibility of information. Fact-checking initiatives such as Snopes, FactCheck.org etc., perform manual claim validation but they are unable to cover all suspicious claims that can be found online - they focus mainly on the ones that have gone viral. Similarly, for the general user it is also impossible to fact-check every single statement on a specific topic. While a lot of research has been carried out in both claim verification and fact-check-worthiness, little work has been done so far on the detection and extraction of dubious claims, combined with fact-checking them using external knowledge bases, especially in the COVID-19 domain. Our approach involves a two-step claim verification procedure consisting of a fake news detection task in the form of binary sequence classification and fact-checking using the Google Fact Check Tools. We primarily work on medium-sized documents in the English language. Our prototype is able to recognize, on a higher level, the nature of fake news, even hidden in a text that seems credible at first glance. This way we can alert the reader that a document contains suspicious statements, even if no already validated similar claims exist. For more popular claims, however, multiple results are found and displayed. We manage to achieve an  1 score of 98.03% and an accuracy of 98.1% in the binary fake news detection task using a fine-tuned DistilBERT model.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Sentence classification</kwd>
        <kwd>Fact checking</kwd>
        <kwd>Language models</kwd>
        <kwd>COVID-19</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>With the help of the internet and social media, getting information on practically any topic has
become a rather simple task. However, the sheer amount of online content makes it impossible
to have any control over its quality and reliability. One of the internet’s most significant
advantages, the freedom of speech, has turned into one of its most crucial challenges: anyone can
write anything about any topic without having to take any responsibility for the impact their
content may have on readers. False information exist in various ways, including
misinformation and disinformation. A person sharing misinformation was misinformed themselves (i. e.,
they believe what they are saying is true). In the case of disinformation, however, someone is
spreading false information with the intention of misleading others. Online misinformation,
often compared to online propaganda, has become a major problem. Many people tend to rely
on online sources to get information on current events or topics without keeping in mind that
not everything read online is objective. On top of this development, traditional media often
make use of or republish online content without fact-checking its origin.</p>
      <p>This development has become even more concerning with the COVID-19 pandemic. Due
to the uncertainty that comes with the disease, it has become easier than ever to spread false
claims and mislead large groups of people. While claims made by prominent public figures are
usually validated swiftly, it is impossible to fact-check arbitrary statements on any topic made
online. Manual fact-checking is nowadays performed by more than 300 oficial initiatives 1, but
they still cannot cover everything. For this reason, it is crucial to find a more eficient way of
performing fact-checking or at least to recognize possibly false news hidden in sources that
appear to be trustworthy at first glance. On the other side, it is ineficient to fact-check every
single sentence in an article, especially if it does not contain potentially suspicious claims. Hence,
a two-step approach of estimating whether a sentence needs to be fact-checked, followed by
its validation, can be a step towards automated fact-checking of whole text documents.</p>
      <p>Our goal is the development of a tool that first attempts the identification of dubious
sentences in small- to medium-sized English language documents on the topic of COVID-19,
followed by a second step of fact-checking the claim, i. e., running the claim against external
knowledge bases. We focus on the detection of false news based solely on the text of the
content. False news detection is approached as a binary classification task. Claim validation is
performed using a trusted knowledge base. This subsequent step covers a broader spectrum
by ordering claims on a range from at least 1-5, instead of merely as “true” or “false”. This work
can help users to determine the credibility of online information and it can increase the
productivity of fact-checkers by sparing them the work of validating claims, which were already
fact-checked.</p>
      <p>Section 2 defines key terms and provides an overview of the state of the art. Section 3
describes the methodology and decompose the challenge into individual steps. Section 4 describes
the implementation and Section 5 explains the experimental setup and data sets. Section 6
provides an evaluation of our approach and Section 7 concludes the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Below we define key terms (Section 2.1), review more general previous research (Section 2.2)
as well as related work concentrating on the COVID-19 domain (Section 2.3).</p>
      <sec id="sec-2-1">
        <title>2.1. Topic-specific Definitions</title>
        <p>
          According to the Cambridge English dictionary, fake news can be defined as “false stories that
appear to be news, spread on the internet or using other media, usually created to influence
po1https://reporterslab.org/fact-checking-count-tops-300-for-the-first-time/
litical views or as a joke”.2 Fake news can be further separated into diferent categories. Wardle
[
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] defines seven types of mis- and disinformation, the “mildest” case being (harmless) satire
or parody. Further on the scale are false connection, misleading content, false context, and
imposter content. The final two categories are manipulated and fabricated content. Both have
the intention to deceive, with the latter one being completely false. As Wardle states, people
tend to share content on social media without double-checking, so even the more “harmless”
types of fake news spread quickly and have the potential to fool. We consider fake news to be
text sequences of all types mentioned above. A suspicious claim or statement contains possibly
false, or misleading information, or in general, is proved not to be entirely true. Fact checking
the veracity of claims, statements or sentences in a text (document) includes the assessment of
their average veracity using the ratings of existing relevant claims on a scale of 1 to 5.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. State of the Art</title>
        <p>In the following, we review the current state of the art in the area of automated fake news
detection and automated claim verification.</p>
        <sec id="sec-2-2-1">
          <title>2.2.1. Fake News Detection</title>
          <p>
            There are three types of features used to detect purely textual fake news: content features,
propagation features, and source features [
            <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
            ]. Linguistic features, both lexical and syntactic,
fall under the umbrella of content features, see, e. g., [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ]. In contrast, approaches focusing
on the source/user features analyze a source’s reliability, previous behavior, etc.
Propagationbased research relies on the assumption that fake news has a diferent propagation process,
including that fake news spreads faster and reaches more people [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ]. Current approaches [6, 7,
8] work on combining the diferent groups of features, the relations and connections between
them [9] or crowd-sourced information to achieve better results, see, e. g., [10, 11].
          </p>
          <p>
            Collins et al. [12] give an overview of the type of fake news and the diferent detection
techniques (human fact-checkers, crowd-sourcing, ML, NLP, combinations thereof). However,
they mostly cover outdated approaches, i. e., Gradient Boosted Decision Trees and Naive Bayes,
and do not go in-depth with state-of-the-art ML algorithms and models. Vishwakarma and Jain
[
            <xref ref-type="bibr" rid="ref3">3</xref>
            ], Li et al. [13], Gundapu and Mamidi [14] show that pre-trained language models (such as
BERT) achieve better performance in fake news detection than traditional NLP algorithms, e. g.,
Support Vector Machines, Random Forests, or Naive Bayes. They can detect stylistic diferences
much better due to their learned understanding of language, achieved during pre-training [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ].
Fact-checking initiatives, such as Fullfact, also strive towards automating their workflows by
identifying fact-check-worthy claims using their own fine-tuned BERT model. 3
          </p>
          <p>For fake news detection, most state-of-the-art approaches focus on the article level and use
groups of features. For full articles, the author feature is important but we do not have access
to it in our experiments. Moreover, our goal is to detect fake news on a sentence level by
just using content features. A major challenge for fake news detection on a sentence level is
that widely accepted benchmark datasets do not exist. Existing datasets, such as CREDBANK
2https://dictionary.cambridge.org/dictionary/english/fake-news
3https://fullfact.org/about/automated/
[15], LIAR (created from fact-checks from Politifact) [16] and FakeNewsNet [17], although
suitable for general fake news detection, are outdated and do not include data from the past year
and consequently, do not prove useful for our use case. Moreover, they traditionally either
include full articles or social media posts (most often tweets) and rely on crowd-sourcing. The
tool that is most similar to our approach (it identifies claims, skips over irrelevant statements,
and fact-checks individual claims), is ClaimBuster.4 It relies on human data-labeling
contributions and has been trained using a large dataset of sentences from political debates in the
US from 1960-2016 [18]. They experiment with diferent “claim spotting” models, including
SVM, bi-directional LSTMs, and Transformers but “claim spotting” and “claim checking” are
two separate tools; plus, “claim spotting” is not yet optimized for COVID-19 related texts.</p>
        </sec>
        <sec id="sec-2-2-2">
          <title>2.2.2. Automated Claim Verification</title>
          <p>Verifying claims is a time-consuming process, which relies on manual research and
information assessment. Previous work in this area is rather limited. The claim verification process
cannot be automated fully when it comes to more than a simple information lookup. Still,
knowledge bases, through which fact-checkers can at least find already validated claims, are
a big step in the right direction. Chernyavskiy et al. [19] propose a system, which finds
relevant Wikipedia articles, extracts relevant sentences, and determines if they support or refute a
given claim. Other approaches use ClaimReview markup, which allows for easy tagging of
factchecks. It is used by all fact-checking initiatives, making it possible to assemble a collaborative
knowledge base, accessible through the Google Fact Check Tools API (see Section 3.2). Other
approaches include evidence retrieval using existing datasets, which can, similarly to fake news
detection, be done using Transformer models [20]. Thorne et al. [21] use TF-IDF features to
ifnd the k-nearest documents and select sentences using TF-IDF similarity. Such
implementations, however, cannot be a long-term solution since datasets need to be updated and models
trained constantly. A considerable contribution towards COVID-19 knowledge retrieval are
knowledge graphs (KGs), which use, e. g., the CORD-19 dataset. Wise et al. [22] present a KG
with diferent relations between scientific articles on COVID-19. Domingo-Fernández et al.
[23] provide an even more complex COVID-19 KG covering ten entity and 9484 relation types.
Similar KGs can facilitate verified information retrieval for claim validation.</p>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Approaches in the COVID-19 Domain</title>
        <p>In terms of applying such approaches to the COVID-19 domain, there are already a number
of published results. Manual claim validation is constantly being performed by fact-checking
initiatives, e. g., Snopes, Politifact, etc. Below, however, we only discuss automatic approaches.</p>
        <p>Wadden et al. [24] concentrate on scientific claims, they use CORD-19 as a source of evidence.
Given a claim and CORD-19 abstracts, they identify abstracts that support or refute the claim.
Instead of scientific articles, we focus on more general claims including, say, statements from
politicians or social media captions, as long as they fall under the umbrella of COVID-19.</p>
        <p>Vijjali et al. [25] propose a model, which takes a claim and explanations and determines
the probability of the claim being in entailment with the explanations. A first model fetches
relevant explanations and is trained on sentence entailment. A second model identifies the
veracity of a claim. Both models are trained on sequence classification and evaluated with
Transformer architectures [26, 27, 28]. The approach shows the dominance of Transformer
models, but the final model is not applicable in real-life scenarios because it cannot distinguish
fake news from genuine news but merely considers explanations.</p>
        <p>Barrón-Cedeño et al. [29] give an overview of the third edition of the CheckThat! lab, which
was hold under the umbrella of CLEF 2020. The first task focuses on the importance of
estimating the fact-check-worthiness of a claim. The dataset consists of COVID-19-related tweets,
annotated manually into two classes (fact-check-worthy or not), based on the answers of two
questions [29]. The best solutions use Transformer models (RoBERTa, BERT).</p>
        <p>Das et al. [30] research the impact of using an ensemble approach rather than a single model
for detecting COVID-19 related fake news. The results show that an ensemble of Transformer
models, combined with soft voting (averaging out the prediction probabilities of multiple
models for each class), delivers state-of-the-art results.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>The brief overview of the architecture we developed includes descriptions of the process pipeline
and the tools and models used (Section 3.1). Section 3.2 presents the claim validation and the
knowledge bases we used.</p>
      <sec id="sec-3-1">
        <title>3.1. Process Pipeline</title>
        <p>We process text in multiple steps. First, it is parsed into sentences, where each one is
classiifed into one of two categories: a sentence, which contains possibly dubious claims related to
COVID-19, or a regular one. For all regular sentences, the process ends here. For the remaining
ones, we perform two additional steps: claim extraction and claim validation. The first one is
done with an existing tool, while the latter is done by searching for similar claims in a trusted
knowledge base. Finally, all results returned are analyzed and classified into six categories.</p>
        <p>The first step of our fact-checking task is also our main focus, i. e., training a sentence
classification model, which can detect sentences containing potentially suspicious claims or fake
news. We experimented with a number of techniques including diferent types of
Transformerbased language models [31], which exhibit state-of-the-art performance in multiple NLP tasks
including BERT [26], RoBERTa [32], DistilBERT [33] and SciBERT [34], which was trained on
a random sample of 1.14 million papers from Semantic Scholar. SciBERT outperforms BERT
on tasks in the scientific domain, which our topic also partially overlaps with. Wadden et al.
[24] report SciBERT to be among the best performing models when it comes to experiments
using claims concerning COVID-19.</p>
        <p>The subsequent step of sentence parsing and claim extraction is performed using spaCy,
which supports tokenization, part-of-speech-tagging, etc.5 We opt for a rather simple
approach when it comes to claim extraction. Various processors can be added to the default
spaCy pipeline, including a sentencizer to split text into sentences, which is the first step in
our process pipeline. Punctuation removal can be performed using POS tagging. Stopword
removal is supported out of the box through its set of stopwords, which can also be manually
modified. In addition to spaCy, we also conducted experiments with Stanza [ 35].</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Claim Verification</title>
        <p>Automatic claim verification can be done in various ways, but mainly by searching for similar
claims, which are already validated. This search can be done by using external knowledge
graphs or knowledge bases or by using a dataset of validated claims. This means that the claim
verification always relies on third-party data or services. We choose the Google Fact Check
Tools for this step due to their reliability and up-to-date information.</p>
        <p>ClaimReview is a type of structured markup, based on Schema.org, that fact-checkers can
use to enrich their articles for search engines and social media platforms.6 Each ClaimReview
object includes multiple properties. Of interest to us is the reviewRating property, which is of
type Rating7 with a ratingValue that can be textual or numerical. The convention, however, is a
numerical value between 1 and the number stated as “bestRating”. In the case of ClaimReview,
a value of 1 is equivalent to “False”, while 5 means “True”.</p>
        <p>Google’s Fact Check Tools have three main functions. The Google Fact Check Explorer is like
the traditional Google Search, with the diference that a user searches for fact checks (validated
claims). Results have a claimant, text of the claim and data about the review, including a textual
rating. Second, users can add ClaimReview Markup to their site using the Markup Tool in order
for their reviews to pop up as results in, e. g., the Fact Check Explorer. Of interest to us is the
Claim Search API, which developers can use via HTTP GET requests to query the same set of
Fact Check results available via the Fact Check Explorer.8 The request body includes the query
text (i. e., the claim), language code, pageSize and maxAgeDays. By limiting the maximum
age of a claim review, we somewhat prevent fetching already outdated results. This is crucial
especially in the COVID-19 domain, where the scientific community has repeatedly changed
its conclusions about the efectiveness of certain protective measures, the efects/difusion of
COVID-19, etc.</p>
        <p>The response body includes an array of fact-checked claims, where each one includes the
claim text, the claimant, a claim date and the claim review with a textual rating. One crucial
disadvantage of this JSON structure concerns the fact that there is only a textual rating provided
rather than a numerical one, i. e., the textual ratings from multiple fact-checking sources cannot
be compared, which calls for a meaningful mapping. With a numerical rating, we can easily
compute an average from all relevant results, in order to have the final rating of our initial
claim.</p>
        <p>6https://schema.org/ClaimReview
7https://schema.org/Rating
8https://toolbox.google.com/factcheck/apis</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Implementation</title>
      <p>Below we present the architecture of our implementation and describe the individual steps.9
Although our use-case addresses the COVID-19 domain, we believe that the following
implementation is also applicable to other areas, such as political disinformation, climate change
disinformation, etc.</p>
      <sec id="sec-4-1">
        <title>4.1. Overall Architecture</title>
        <p>The system takes a text as input and performs various processing steps. First, it is parsed
into sentences using the spaCy library. Second, each sentence is classified as either suspicious
or regular using a pre-trained Transformer model, fine-tuned using our dataset. Third, for
all dubious sentences we perform claim extraction by simplifying the sentence using simple
preprocessing, again using spaCy. Fourth, the simplified sentence is used as input for the
Google Claim Search API. Fifth, from its results, textual ratings are mapped to numerical ones,
and an average is taken to determine the veracity of the claim. The process is visualized using
the Streamlit library and a custom Vue frontend component that presents the text to the user,
with all suspicious sentences highlighted in colours corresponding to their veracity.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Fake News Detection</title>
        <p>For all Transformer models (BERT, DistilBERT, SciBERT, RoBERTa) we use a standard
HuggingFace implementation,10 also making use of Auto Classes.11 We tokenize sentences, with
padding and truncation enabled and a maximum sequence length of 512, with the main training
arguments being two epochs, train batch size 8, eval batch size 64, 2000 steps.</p>
        <p>Additionally, we calculate the following metrics in each evaluation step: accuracy,
crossentropy-loss, precision, and recall, as defined in the sklearn.metrics module. 12 Finally, we use
the Trainer class, which takes our training arguments, the train and validation datasets, the
model initializing function, and our compute_metrics function. Once initialized, we run the
train function to complete the fine-tuning procedure. The exact parameters, training time,
evaluation results, etc. for all fine-tuning procedures are recorded in our W&amp;B projects.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Claim Extraction</title>
        <p>For the claim extraction step we simplify the initial sentence. We decided against extracting
claims on the discourse level and focus upon individual sentences. First, each suspicious
sentence is tokenized and all tokens of type “punctuation” are removed. Second, we remove all
stop words from the sentence using spaCy’s default set of stopwords. This results in an array
of keywords (i. e., the claim), which we use to query Google’s Fact Check Tool.
9https://github.com/elip06/covid19-fact-checking provides the full code of our implementation.
10https://huggingface.co/transformers/custom_datasets.html
11https://huggingface.co/transformers/model_doc/auto.html
12https://scikit-learn.org/stable/modules/model_evaluation.html</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Claim Verification</title>
        <p>We limit the results from the Google Fact Check Tool API to the English language only. An
example query may look like this:
To determine the veracity of a claim, we need to map the heterogeneous textual ratings of a
claim to numerical ones since the API returns query results from diferent sources13 that use
their own way of expressing or encoding the result of the verification step. We map as follows:
• 1: “false”, “four pinocchios”, “inaccurate”, “miscaptioned”, “misattributed”, “scam”
• 2: “mostly false”, “three pinocchios”, “misleading”
• 3: “mixture”, “two pinocchios”, “biased”, “cherry-picking”, “not the whole story”, “exaggerates”
• 4: “mostly true”, “half true”, “one pinocchio”
• 5: “true”, “accurate”, “unbiased”, “correct”</p>
        <p>With this approach, we combine the ratings of all relevant claims to calculate an average
value and determine the veracity of the original claim, which is why we put some terms that
are not exactly equivalent under the same umbrella (e. g., “true” and “unbiased”). It must be
noted, though, that there is no standard or convention with regard to this mapping approach.</p>
        <p>We use the default maximum number of ten results for each search query since the further
one goes through the list of results, the more irrelevant and outdated they become (similarly
to the standard Google Search). Additionally, the results for many of the claims are repetitive
since the most prominent fact-checking initiatives fact-check the same claims.</p>
        <p>We realise that this manner of computing numerical ratings is not perfectly accurate. The
required fields for Google’s Fact Check products are only the fact checked article, the claim
reviewed, and a rating text. While there is the option to also add a numerical rating, also
indicating the best and worst possible ratings, they are present neither in the Fact Check Explorer
nor in the Fact Check Tools API.</p>
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Deployment and Visualization</title>
        <p>We implement an additional UI, which has two main functions: firstly, we make use of it for the
qualitative evaluation of both steps of the system, by feeding text documents and looking at the
ifnal result. Moreover, the tool can be used by fact-checkers with no technical knowledge to
easily check the credibility of texts and find suspicious claims and whether they have already
been verified. For the UI prototype of our approach we use the Streamlit 14 framework so that
we can develop the application in Python, also making use of Vue and Vuetify.15</p>
        <p>The user can either upload a Word or plain text file or type text into a text field. The text
is parsed into sentences, which are then classified. Once the claim extraction and queries are
completed, the UI receives the following data as input: an array of all sentences and an array of
13Including Snopes, Washington Post, Health Feedback etc.
14https://streamlit.io
15https://vuejs.org, https://vuetifyjs.com
labels/claims. The UI shows the text and an explaination of what the diferent highlight colours
mean. When the user hovers over a suspicious sentence, for which relevant claims exist, those
claims and their ratings are displayed (Figure 1).</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experimental Setup</title>
      <sec id="sec-5-1">
        <title>5.1. Datasets</title>
        <p>Many diferent datasets and publications exist for the wider COVID-19 domain. The primary
purpose of the data is to fine-tune the pre-trained language models for the sequence
classification but a dataset of whole documents in which individual sentences are labeled as “suspicious”
does not exist, which is why we combine diferent datasets to achieve this structure. Initially,
we used suspicious sentences taken from the archives of fact-checking websites, such as Snopes
and Politifact, and regular ones from scholarly articles on COVID-19. While this dataset
performed well during cross-validation, it showed weaknesses in real-life scenarios by labeling
a significant amount of regular sentences as suspicious. Consequently, we use an additional
dataset to increase the model’s accuracy. A brief description of all used datasets follows. The
ifrst three are part of the initial dataset, on which multiple Transformer models are fine-tuned
and compared. The last one is added later and used for fine-tuning all models one more time.
1. COVID-19 Open Research Dataset (CORD-19) consists of more than 500,000 scholarly
articles, including over 200,000 with full text, about COVID-19 and related coronaviruses
[36]. We use only a small portion, i. e., the abstracts of the most recent articles at the
time of data collection (the first week of January 2021). We use CORD-19 as the source
for non-suspicious sentences.
2. FakeCovid: A Multilingual Cross-domain Fact Check News Dataset for COVID-19 consists
of articles in 40 languages but the majority have titles in English [37]. These articles
have been fact-checked by reliable online sources. We use FakeCovid as the source for
suspicious sentences in the initial dataset.
3. CoAID (Covid-19 heAlthcare mIsinformation Dataset) includes fake news on websites and
social media, along with users’ social engagement about such news [38]. We only use
part of it since many of the claims are also present in FakeCovid.
4. COVID19 Fake News Detection in English consists of real and fake news on COVID-19
[30]. We improve our models using this dataset as it introduces a diferent type of
regular sentences. Unlike the ones from scientific articles, the non-suspicious sentences are
merely real news on the topic of COVID-19, which do not necessarily have the same
structure or use scientific terms as the other ones do but are still just as trustworthy, i. e.,
the models can learn the specifics of fake news, rather than the diferences between a
scientific and non-scientific text.</p>
        <p>All datasets are preprocessed in accordance with their specific structure and parsed into
sentences using Stanza. We also performed additional cleaning (all sentences which either
contain no verb or consist of less than four words are removed). The two sets of genuine and
dubious sentences contain 7908 samples each. We merge together the two sets and split them
into train (80%), and test sets (20%) using Scikit-learn’s train_test_split function. The fourth
dataset was added due to the unsatisfactory initial results, arriving at a final train dataset of
9426 (49.4%) suspicious samples and 9646 (50.6%) non-suspicious ones as well as a final test
dataset of 2562 (48.3%) suspicious samples and 2742 (51.7%) non-suspicious ones.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Hyperparameter Search</title>
        <p>The choice of hyperparameters can have a major influence on the machine learning models’
performance [39]. For this reason, it is crucial to use a scalable hyperparameter optimization
framework. Our choice is Ray Tune [40], due to its possibility of early termination of bad runs
and intelligent choice of parameters within the defined space. Our search for hyperparameters
is documented in a Weights &amp; Biases project.16 Experiments are conducted using the following
hyperparameters:
• Learning Rate: between 1e-5 and 1e-3
• Number of Training Epochs: between 1 and 4
• Seed: between 1 and 42
• Per Device Train Batch Size: 8 or 16
• Warmup Steps: between 0 and 1000
• Weight Decay: between 1e-6 and 0.1</p>
        <p>Most of the hyperparameters are an obvious choice, except for the random seed. However,
Dodge et al. [41] found that by checking many diferent random seeds they obtained significant
16https://wandb.ai/elip06/distilbert-hyperparameter-tuning
improvements on the performance of BERT and some seeds are statistically more likely to
diverge. An explanation for this influence is that the random seeds directly determine the
values for weights when the neural network is initialized. Some of the seeds may lead to a
more beneficial starting point for the following learning phase, compared to others that put
the network very close to a local minimum.</p>
        <p>In total, 20 random combinations are trained and evaluated at the end of each epoch using
the DistilBERT for Sequence Classification architecture. The best model is picked taking into
account the accuracy, precision, recall,  1 and cross-entropy loss.</p>
        <p>Figure 2 shows the combination of hyperparameters and achieved accuracy in each run. The
models with the worst performance are the purple ones with an accuracy of slightly over 50%.
The one with the second biggest learning rate has an accuracy of 86.5%, while all others have
similar results: between 95% and 98%.
• Highest accuracy,  1 score and recall
• Second smallest cross-entropy loss
• Sixth highest precision
Considering the above metrics we choose our final model with the following rankings:
The importance of each of the hyperparameters with respect to the accuracy can be seen in
Figure 3, calculated by training a random forest classifier with the hyperparameters as inputs
and the metric (in our case, accuracy and cross-entropy loss) as the target output and reporting
the feature importance values for the random forest classifier. A negative correlation is colored
in red, while a positive one is colored in green.</p>
        <p>The learning rate has the biggest influence on the accuracy, and they have a negative
correlation. The random seed has a decent level of importance, though not as high as the learning
rate. The correlation between learning rate and loss is even more pronounced than that
between learning rate and accuracy, and naturally, in the opposite direction. The only metric
with an opposite correlation is the weight decay, but its importance with regard to both
accuracy and loss is very low. All other metrics can be neglected.</p>
        <p>An interesting observation is that this best model does not ofer a big increase in accuracy
compared to the out-of-the-box model, fine-tuned using the default parameters of the
Train(a) Accuracy
(b) Cross-entropy loss
ingArguments class from Huggingface.17 Despite that, it is clear that especially the learning
rate plays a significant role in the model’s performance.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Evaluation</title>
      <sec id="sec-6-1">
        <title>6.1. Sentence Classification</title>
        <p>We perform hyperparameter optimisation using DistilBERT (see Section 4). Below, we present
the results from all models and explain why we chose DistilBERT for our further experiments.</p>
        <sec id="sec-6-1-1">
          <title>6.1.1. Model Comparison</title>
          <p>We compare four Transformer models, which are fine-tuned using the same parameters, using
the HuggingFace TrainingArguments defaults. We use a batch size of 8 per device for training
and 64 for evaluation, which is performed at the end of each epoch. The models are fine-tuned
for two epochs. We take into account the metrics already mentioned in Section 5, i. e., precision,
recall,  1 score, accuracy and cross-entropy-loss.</p>
          <p>Although RoBERTa has the highest performance in the GLUE benchmark, BERT achieves
the highest accuracy and  1 score and the lowest cross-entropy-loss in our case, see Table 1.</p>
          <p>Although the diferences are minimal, SciBERT (our domain-specific model) and RoBERTa,
which were possible favorites, deliver slightly worse results than DistilBERT and BERT.</p>
          <p>17https://huggingface.co/transformers/main_classes/trainer.html</p>
        </sec>
        <sec id="sec-6-1-2">
          <title>6.1.2. Cross-validation</title>
          <p>We perform 10-fold cross-validation on the two best performing models (BERT, DistilBERT).
Here, we only use our train dataset. In each of the ten iterations, 30% of the data is chosen
randomly for testing only. Although BERT achieves higher accuracy in the simple comparison,
the results during cross-validation are much closer, see Table 2.</p>
          <p>Although BERT still has higher accuracy, this time the diference is much smaller (0.02%),
and DistilBERT has a significantly lower cross-entropy loss and higher precision. Moreover,
DistilBERT is fine-tuned at least twice as fast as BERT. For these reasons, we choose to
further experiment with diferent hyperparameters for fine-tuning DistilBERT and pick the best
configuration as the final model for the prototype UI. We manage to increase the accuracy of
DistilBERT to 98.1%, the  1 score to 98.03%, and minimize cross-entropy-loss to 0.08619. Such
a high performance is possible, because the presented dataset simplifies the problem
considerably: the suspicious sentences are mostly article headlines or individual statements from
public figures. Consequently, they difer more from normal sentences and non-scandalous
article headlines than more carefully hidden fake news would.</p>
        </sec>
        <sec id="sec-6-1-3">
          <title>6.1.3. Qualitative Evaluation</title>
          <p>As a first step of the qualitative evaluation, we compare our implementation to a similar tool –
ClaimBuster, already mentioned in Section 2. In contrast to our approach, users have to set a
threshold themselves between 0 and 1 to determine at which point claims are considered
unreliable. We notice that many of the suspicious sentences are still not marked as such even with
a rather low threshold of 0.3. Also, checking a claim is done via a separate search bar, which
does not allow for fact-checking multiple claims at once. In comparison, our implementation
not only performs these two steps automatically and for a whole text, but also with a higher
accuracy. Moreover, it starts with binary classification, which is then extended to five classes
after the claim validation step.</p>
          <p>Since the evaluation dataset is not the perfect match for our use case, i. e., the fake news are
often headlines, photo captions, etc., rather than sentences found in a full text, we perform a
qualitative evaluation using a small number of articles. We use the “Global Social Responses
to Covid-19 Web Archive”18 and a couple of articles from reliable sources. The web archive
includes social responses to the Covid-19 pandemic, as represented in local news, NGOs, art, blogs
and social media. We focus mainly on articles from local news in English. The archive does
not contain North American or West European sources, where the majority of fact-checkers
are based. For this reason, although some sentences from local news are labeled as suspicious,
no relevant claims can be found for them. This uncovers the problem that suspicious
statements from regional news are often not fact-checked at all, or at least not annotated using
ClaimReview, and as a result, cannot be found using Google’s tool.</p>
          <p>Another observation is that hardly any sentences in articles from trusted sources, e. g., in
the article “Coronavirus outbreak and kids” from Harvard Health Publishing19, are labeled as
suspicious. This is the expected behavior of the model.</p>
          <p>With such complex model architectures, however, it is hard to determine which specifics of
the data prove to have the biggest influence on the final predictions. We look at a couple of
the misclassified examples (Table 3). A human annotator would possibly also classify the real
example as dubious. However, the false negative is probably caused by the scientific terms
used. This is the case for most of the wrongly classified examples, which stresses the point
that a priority should be avoiding false negatives.
There is no one in New Zealand receiving hospital-level care
for COVID-19.</p>
          <p>Even discharged patients could be a long-term asymptomatic
carriers.
regular</p>
          <p>suspicious
suspicious
regular</p>
          <p>In the hopes of discovering specific terms which can distinguish fake news from real ones,
we calculate the word frequency in our validation dataset. The 20 most common words in each
group of sentences can be seen in Figure 4.</p>
          <p>Sentences from real news or scientific articles tend to prefer the term “COVID-19” instead
of “coronavirus”. Due to the nature of the suspicious sentences, typical words are “video”,
“lockdown”, “facebook”, etc. Also, we notice that verbs, such as “shows”, “says” and “claim”,
are only present in the suspicious sentences, whereas the verbs among the top 20 of the regular
19https://www.health.harvard.edu/diseases-and-conditions/coronavirus-outbreak-and-kids
sentences are “states” and “reported”. Altogether, the language in the non-suspicious sentences
seems to be more factual and scientific, with more formal terms.</p>
        </sec>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Claim Extraction</title>
        <p>With regard to the evaluation of the claim extraction procedure we cannot measure the success
of claim validation in a mathematical sense since the Google Claim Search API does not expect
the claim to be structured data. While we can limit our results by choosing a language, a
factchecking website, etc., the claim itself is simply a text string. For this reason, we briefly measure
the success of claim extraction/validation by the number of fetched relevant results. We search
for claims relevant to each sentence in the validation dataset using Google’s Fact Check Tools:
once by using the full20 sentence and once with the simplified sentence (without punctuation
and stopwords). Surprisingly, there are slightly more cases in which no relevant results are
found when removing the stopwords but choosing the more suitable approach depends on the
type of sentence: in the case of, e. g., statements from or about known figures, full sentences
deliver the same, if not better, results. However, when it comes to more general claims or more
complex sentences, only keeping the keywords works better. Since the Google Claim Search
API seems to work well with exact terms/structures, we decided not to perform additional
processing, such as lemmatization or stemming.</p>
        <p>Our approach, however, is not optimal for the following reason: we test using individual
sentences, which are usually either a news title or a standalone statement, so they are not part
of a text. This means that we do not have to deal with the problem of context, pronouns, etc.
In these cases, removing them proves useful for finding more relevant results. Nevertheless,
we are aware that future work should include a more sophisticated way of extracting claims.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion and Future Work</title>
      <p>In terms of tackling the misinformation challenge, the approach we propose includes two
separate steps: the identification of suspicious sentences (i. e., fact-check-worthiness estimation)
followed by claim validation using trusted knowledge bases, both components can also be used
separately. In comparison, existing approaches usually only implement one or the other, the
middle ground between the two, i. e., claim extraction, is usually missing. Moreover, in this
combined way, we can allow, at least to a certain degree, for some error in the first step, which
can be corrected in the second, and we also introduce some nuance since each claim is
categorised into one of five separate classes.</p>
      <p>Our initial experiments, using datasets from diferent origin, proved not to adapt well in
reallife cases. Using scientific articles as a source of non-suspicious sentences caused the problem
that a large amount of non-scientific sentences were falsely labeled as suspicious and therefore
unnecessarily fact-checked. One solution was bringing more variety to this part of the dataset
by using real news. With this new dataset, curated using multiple sources of suspicious and
non-suspicious sentences, much better results could be achieved.</p>
      <p>20Here, we still remove punctuation since it often causes invalid queries.</p>
      <p>As for claim extraction, a simple approach proved to be suficient to find similar claims
using the Google Fact Check Tools API. A problematic aspect was the deviation of the format of
Google’s claim search results from the standard ClaimReview markup. While a numerical
rating is defined, only a textual one is present in what the tool provides. Moreover, fact-checkers
are not required to stick to a certain format of their textual ratings, as even full sentences are
allowed. As a result, we came up with a mapping from textual to numerical ratings.</p>
      <p>All in all, our tool can help journalists and other people that rely on extensive fact-checking.
It is fully automatic, easily accessible using an intuitive graphical interface and comes with a
fast AI-based fake news detector. Thus, suspicious content can be detected in real time and
validated against high-quality external databases. This procedure leads to a more productive
fact-checking workflow, both in terms of speed and accuracy.</p>
      <p>There are multiple directions in which our work can be taken further. They include more
research on the performance of the models on full articles, Twitter threads, etc., extending
the current datasets by adding data from full texts (instead of only titles and headlines),
testing out diferent sequence classification models (including ensembles), and claim extraction
approaches. Additionally, we plan to carry out a user study with journalists to assess if the
developed prototype is helpful and if it makes them more eficient.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>The research presented in this article was partially funded by the German Federal Ministry
of Education and Research (BMBF) through the projects QURATOR (https://qurator.ai; grant
no. 03WKDA1A) and PANQURA (http://qurator.ai/panqura; grant no. 03COV03E).
[6] G. Bhatt, A. Sharma, S. Sharma, A. Nagpal, B. Raman, A. Mittal, Combining neural,
statistical and external features for fake news stance identification, in: Companion
Proceedings of the The Web Conference 2018, WWW ’18, International World Wide Web
Conferences Steering Committee, Republic and Canton of Geneva, CHE, 2018, p. 1353–1357.</p>
      <p>URL: https://doi.org/10.1145/3184558.3191577. doi:10.1145/3184558.3191577.
[7] V.-H. Nguyen, K. Sugiyama, P. Nakov, M.-Y. Kan, Fang, Proceedings of the 29th ACM
International Conference on Information and Knowledge Management (2020). URL: http:
//dx.doi.org/10.1145/3340531.3412046. doi:10.1145/3340531.3412046.
[8] K. Schulz, J. Rauenbusch, J. Fillies, L. Rutenburg, D. Karvelas, G. Rehm, User Experience
Design for Automatic Credibility Assessment of News Content About COVID-19, in: HCI
International 2022 – Late Breaking Papers, Springer, Virtual, 2022. Forthcoming.
[9] A. Srivastava, G. Rehm, J. M. Schneider, DFKI-DKT at SemEval-2017 Task 8: Rumour
Detection and Classification Using Cascading Heuristics, in: Proceedings of the 11th
International Workshop on Semantic Evaluation (SemEval-2017), Association for
Computational Linguistics, Vancouver, Canada, 2017, pp. 486–490. URL: https://www.aclweb.
org/anthology/S17-2085.pdf.
[10] G. Rehm, An Infrastructure for Empowering Internet Users to handle Fake News and
other Online Media Phenomena, in: G. Rehm, T. Declerck (Eds.), Language Technologies
for the Challenges of the Digital Age: 27th International Conference, GSCL 2017, Berlin,
Germany, September 13-14, 2017, Proceedings, number 10713 in Lecture Notes in
Artificial Intelligence (LNAI), Gesellschaft für Sprachtechnologie und Computerlinguistik e.V.,
Springer, Cham, Switzerland, 2018, pp. 216–231. URL: https://link.springer.com/content/
pdf/10.1007%2F978-3-319-73706-5_19.pdf, 13/14 September 2017.
[11] G. Rehm, J. M. Schneider, P. Bourgonje, Automatic and Manual Web Annotations in an
Infrastructure to handle Fake News and other Online Media Phenomena, in: N. Calzolari,
K. Choukri, C. Cieri, T. Declerck, S. Goggi, K. Hasida, H. Isahara, B. Maegaard, J.
Mariani, H. Mazo, A. Moreno, J. Odijk, S. Piperidis, T. Tokunaga (Eds.), Proceedings of the
11th Language Resources and Evaluation Conference (LREC 2018), European Language
Resources Association (ELRA), Miyazaki, Japan, 2018, pp. 2416–2422.
[12] B. Collins, D. T. Hoang, N. T. Nguyen, D. Hwang, Fake news types and detection models on
social media a state-of-the-art survey, in: P. Sitek, M. Pietranik, M. Krótkiewicz, C. Srinilta
(Eds.), Intelligent Information and Database Systems, Springer Singapore, Singapore, 2020,
pp. 562–573.
[13] X. Li, Y. Xia, X. Long, Z. Li, S. Li, Exploring text-transformers in aaai 2021 shared task:
Covid-19 fake news detection in english, in: T. Chakraborty, K. Shu, H. R. Bernard, H. Liu,
M. S. Akhtar (Eds.), Combating Online Hostile Posts in Regional Languages during
Emergency Situation, Springer International Publishing, Cham, 2021, pp. 106–115.
[14] S. Gundapu, R. Mamidi, Transformer based automatic covid-19 fake news detection
system, 2021. arXiv:2101.00180.
[15] T. Mitra, E. Gilbert, Credbank: A large-scale social media corpus with associated
credibility annotations, in: Ninth International AAAI Conference on Web and Social Media, The
AAAI Press, Palo Alto, California, 2015, pp. 258–267.
[16] W. Y. Wang, “liar, liar pants on fire”: A new benchmark dataset for fake news
detection, in: Proceedings of the 55th Annual Meeting of the Association for
Computational Linguistics (Volume 2: Short Papers), Association for Computational
Linguistics, Vancouver, Canada, 2017, pp. 422–426. URL: https://aclanthology.org/P17-2067.
doi:10.18653/v1/P17- 2067.
[17] K. Shu, D. Mahudeswaran, S. Wang, D. Lee, H. Liu, Fakenewsnet: A data repository with
news content, social context, and spatiotemporal information for studying fake news
on social media, Big Data 8 (2020) 171–188. URL: https://doi.org/10.1089/big.2020.0062.
doi:10.1089/big.2020.0062. arXiv:https://doi.org/10.1089/big.2020.0062,
pMID: 32491943.
[18] F. Arslan, N. Hassan, C. Li, M. Tremayne, A benchmark dataset of check-worthy factual
claims, in: Proceedings of the International AAAI Conference on Web and Social Media,
volume 14, The AAAI Press, Online, 2020, pp. 821–829.
[19] A. Chernyavskiy, D. Ilvovsky, P. Nakov, Whatthewikifact: Fact-checking claims against
wikipedia, in: CIKM ’21: Proceedings of the 30th ACM International Conference on
Information &amp; Knowledge Management, Association for Computing Machinery, New York,
NY, USA, 2021.
[20] A. Soleimani, C. Monz, M. Worring, Bert for evidence retrieval and claim verification,
in: J. M. Jose, E. Yilmaz, J. Magalhães, P. Castells, N. Ferro, M. J. Silva, F. Martins (Eds.),
Advances in Information Retrieval, Springer International Publishing, Cham, 2020, pp.
359–366.
[21] J. Thorne, A. Vlachos, C. Christodoulopoulos, A. Mittal, FEVER: a large-scale dataset for
fact extraction and VERification, in: Proceedings of the 2018 Conference of the North
American Chapter of the Association for Computational Linguistics: Human Language
Technologies, Volume 1 (Long Papers), Association for Computational Linguistics, New
Orleans, Louisiana, 2018, pp. 809–819. URL: https://aclanthology.org/N18-1074. doi:10.
18653/v1/N18- 1074.
[22] C. Wise, V. N. Ioannidis, M. R. Calvo, X. Song, G. Price, N. Kulkarni, R. Brand, P. Bhatia,
G. Karypis, Covid-19 knowledge graph: Accelerating information retrieval and discovery
for scientific literature, 2020. arXiv:2007.12731.
[23] D. Domingo-Fernández, S. Baksi, B. Schultz, Y. Gadiya, R. Karki, T. Raschka, C.
Ebeling, M. Hofmann-Apitius, A. T. Kodamullil, COVID-19 Knowledge Graph: a
computable, multi-modal, cause-and-efect knowledge model of COVID-19 pathophysiology,
Bioinformatics 37 (2020) 1332–1334. URL: https://doi.org/10.1093/bioinformatics/btaa834.
doi:10.1093/bioinformatics/btaa834.
[24] D. Wadden, S. Lin, K. Lo, L. L. Wang, M. van Zuylen, A. Cohan, H. Hajishirzi, Fact or
ifction: Verifying scientific claims, in: Proceedings of the 2020 Conference on
Empirical Methods in Natural Language Processing (EMNLP), Association for Computational
Linguistics, Online, 2020, pp. 7534–7550. URL: https://aclanthology.org/2020.emnlp-main.
609. doi:10.18653/v1/2020.emnlp- main.609.
[25] R. Vijjali, P. Potluri, S. Kumar, S. Teki, Two stage transformer model for covid-19 fake
news detection and fact checking, in: G. Da San Martino, C. Brew, G. L. Ciampaglia,
A. Feldman, C. Leberknight, P. Nakov (Eds.), Proceedings of the 3rd NLP4IF Workshop on
NLP for Internet Freedom: Censorship, Disinformation, and Propaganda, International
Committee on Computational Linguistics (ICCL), Barcelona, Spain (Online), 2020, pp. 1–
10.
[26] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional
transformers for language understanding, in: Proceedings of the 2019 Conference of the
North American Chapter of the Association for Computational Linguistics: Human
Language Technologies, Volume 1 (Long and Short Papers), Association for Computational
Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/
N19-1423. doi:10.18653/v1/N19- 1423.
[27] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, R. Soricut, ALBERT: A lite BERT
for self-supervised learning of language representations, in: 8th International
Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020,
OpenReview.net, Addis Ababa, Ethiopia, 2020. URL: https://openreview.net/forum?id=
H1eA7AEtvS.
[28] Z. Sun, H. Yu, X. Song, R. Liu, Y. Yang, D. Zhou, MobileBERT: a compact
taskagnostic BERT for resource-limited devices, in: Proceedings of the 58th Annual
Meeting of the Association for Computational Linguistics, Association for Computational
Linguistics, Online, 2020, pp. 2158–2170. URL: https://aclanthology.org/2020.acl-main.195.
doi:10.18653/v1/2020.acl- main.195.
[29] A. Barrón-Cedeño, T. Elsayed, P. Nakov, G. Da San Martino, M. Hasanain, R. Suwaileh,
F. Haouari, N. Babulkov, B. Hamdan, A. Nikolov, S. Shaar, Z. S. Ali, Overview of checkthat!
2020: Automatic identification and verification of claims in social media, in: A.
Arampatzis, E. Kanoulas, T. Tsikrika, S. Vrochidis, H. Joho, C. Lioma, C. Eickhof, A. Névéol,
L. Cappellato, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and
Interaction, Springer International Publishing, Cham, 2020, pp. 215–236.
[30] S. D. Das, A. Basak, S. Dutta, A heuristic-driven ensemble framework for covid-19 fake
news detection, in: T. Chakraborty, K. Shu, H. R. Bernard, H. Liu, M. S. Akhtar (Eds.),
Combating Online Hostile Posts in Regional Languages during Emergency Situation, Springer
International Publishing, Cham, 2021, pp. 164–176.
[31] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser,
I. Polosukhin, Attention is all you need, in: I. Guyon, U. V. Luxburg, S.
Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett (Eds.), Advances in
neural information processing systems, volume 30, Curran Associates, Inc., Red Hook,
NY, USA, 2017, pp. 5998–6008. URL: https://proceedings.neurips.cc/paper/2017/file/
3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
[32] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V.
Stoyanov, Roberta: A robustly optimized bert pretraining approach, ArXiv abs/1907.11692
(2019).
[33] V. Sanh, L. Debut, J. Chaumond, T. Wolf, Distilbert, a distilled version of bert: smaller,
faster, cheaper and lighter, in: Proceedings 2019 Fifth Workshop on Energy Eficient
Machine Learning and Cognitive Computing - NeurIPS Edition (EMC2-NIPS), IEEE
Computer Society, Los Alamitos, CA, USA, 2019.
[34] I. Beltagy, K. Lo, A. Cohan, SciBERT: A pretrained language model for scientific text, in:
Proceedings of the 2019 Conference on Empirical Methods in Natural Language
Processing and the 9th International Joint Conference on Natural Language Processing
(EMNLPIJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 3615–
3620. URL: https://aclanthology.org/D19-1371. doi:10.18653/v1/D19- 1371.
[35] P. Qi, Y. Zhang, Y. Zhang, J. Bolton, C. D. Manning, Stanza: A Python natural
language processing toolkit for many human languages, in: Proceedings of the 58th
Annual Meeting of the Association for Computational Linguistics: System
Demonstrations, Association for Computational Linguistics, Online, 2020, pp. 101–108. URL: https:
//aclanthology.org/2020.acl-demos.14. doi:10.18653/v1/2020.acl- demos.14.
[36] L. L. Wang, K. Lo, Y. Chandrasekhar, R. Reas, J. Yang, D. Burdick, D. Eide, K. Funk, Y.
Katsis, R. M. Kinney, Y. Li, Z. Liu, W. Merrill, P. Mooney, D. A. Murdick, D. Rishi, J. Sheehan,
Z. Shen, B. Stilson, A. D. Wade, K. Wang, N. X. R. Wang, C. Wilhelm, B. Xie, D. M.
Raymond, D. S. Weld, O. Etzioni, S. Kohlmeier, CORD-19: The COVID-19 open research
dataset, in: Proceedings of the 1st Workshop on NLP for COVID-19 at ACL 2020,
Association for Computational Linguistics, Online, 2020. URL: https://aclanthology.org/2020.
nlpcovid19-acl.1.
[37] G. K. Shahi, D. Nandini, FakeCovid – a multilingual cross-domain fact check news
dataset for covid-19, in: Workshop Proceedings of the 14th International AAAI
Conference on Web and Social Media, The AAAI Press, Atlanta, Georgia, USA, 2020. URL:
http://workshop-proceedings.icwsm.org/pdf/2020_14.pdf.
[38] L. Cui, D. Lee, Coaid: Covid-19 healthcare misinformation dataset, 2020.</p>
      <p>arXiv:2006.00885.
[39] M. Claesen, B. D. Moor, Hyperparameter search in machine learning, ArXiv
abs/1502.02127 (2015).
[40] R. Liaw, E. Liang, R. Nishihara, P. Moritz, J. E. Gonzalez, I. Stoica, Tune: A research
platform for distributed model selection and training, arXiv preprint arXiv:1807.05118
(2018).
[41] J. Dodge, G. Ilharco, R. Schwartz, A. Farhadi, H. Hajishirzi, N. Smith, Fine-tuning
pretrained language models: Weight initializations, data orders, and early stopping, 2020.
arXiv:2002.06305.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Wardle</surname>
          </string-name>
          ,
          <article-title>Fake news</article-title>
          . It's complicated., https://firstdraftnews.com/ fake-news-complicated/,
          <year>2017</year>
          . First Draft News.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>W.</given-names>
            <surname>Antoun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Baly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Achour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hussein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hajj</surname>
          </string-name>
          ,
          <article-title>State of the art models for fake news detection tasks</article-title>
          , in: 2020 IEEE International Conference on Informatics, IoT, and
          <article-title>Enabling Technologies (ICIoT)</article-title>
          , IEEE, Doha,Qatar,
          <year>2020</year>
          , pp.
          <fpage>519</fpage>
          -
          <lpage>524</lpage>
          . doi:
          <volume>10</volume>
          .1109/ICIoT48696.
          <year>2020</year>
          .
          <volume>9089487</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D. K.</given-names>
            <surname>Vishwakarma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <article-title>Recent state-of-the-art of fake news detection: A review</article-title>
          ,
          <source>in: 2020 International Conference for Emerging Technology (INCET)</source>
          , IEEE, Belgaum,India,
          <year>2020</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          . doi:
          <volume>10</volume>
          .1109/INCET49848.
          <year>2020</year>
          .
          <volume>9153985</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P.</given-names>
            <surname>Bourgonje</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Schneider</surname>
          </string-name>
          ,
          <string-name>
            <surname>G.</surname>
          </string-name>
          <article-title>Rehm, From Clickbait to Fake News Detection: An Approach based on Detecting the Stance of Headlines to Articles</article-title>
          , in: O.
          <string-name>
            <surname>Popescu</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          Strapparava (Eds.),
          <source>Proceedings of the Second Workshop on Natural Language Processing meets Journalism - EMNLP 2017 Workshop (NLPMJ</source>
          <year>2017</year>
          ), Copenhagen, Denmark,
          <year>2017</year>
          , pp.
          <fpage>84</fpage>
          -
          <lpage>89</lpage>
          . URL: http://www.aclweb.org/anthology/W/W17/W17-4215.pdf,
          <volume>7</volume>
          <fpage>September</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Vosoughi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Roy</surname>
          </string-name>
          ,
          <string-name>
            <surname>S. Aral,</surname>
          </string-name>
          <article-title>The spread of true and false news online</article-title>
          ,
          <source>Science</source>
          <volume>359</volume>
          (
          <year>2018</year>
          )
          <fpage>1146</fpage>
          -
          <lpage>1151</lpage>
          . URL: https://science.sciencemag.org/content/359/6380/1146. doi:
          <volume>10</volume>
          .1126/science.aap9559.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>