<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Paragraph-level Multi-task Learning Model for Scientific Fact-Verification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Xiangci Li</string-name>
          <email>lixiangci8@gmail.com</email>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gully Burns</string-name>
          <email>gully.burns@chanzuckerburg.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nanyun Peng</string-name>
          <email>violetpeng@cs.ucla.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Chan Zuckerberg Initiative</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of California Los Angeles</institution>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Texas at Dallas</institution>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Work performed at Information Sciences Institute, Viterbi School of Engineering, University of Southern California</institution>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Even for domain experts, it is a non-trivial task to verify a scientific claim by providing supporting or refuting evidence rationales. The situation worsens as misinformation proliferates on social media and news websites, manually or programmatically, at every moment. As a result, an automatic fact-verification tool becomes crucial for combating the spread of misinformation. In this work, we propose a novel, paragraph-level, multi-task learning model for the SCIFACT task by directly computing a sequence of contextualized sentence embeddings from a BERT model and jointly training the model on rationale selection and stance prediction.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Many seemingly convincing rumors such as “Most humans
only use 10 percent of their brain” are widely spread, but
ordinary people are not able to rigorously verify them by
searching for scientific literature. In fact, it is not a trivial
task to verify a scientific claim by providing supporting or
refuting evidence rationales, even for domain experts. The
situation worsens as misinformation proliferates on
social media and news websites, manually or programmatically,
at every moment. As a result, an automatic fact-verification
tool becomes more and more crucial for combating the
spread of misinformation.</p>
      <p>
        The existing fact-verification tasks usually consist of three
sub-tasks: document retrieval, rationale sentence extraction,
and fact-verification. However, due to the nature of scientific
literature that requires domain knowledge, it is challenging
to collect a large-scale scientific fact-verification dataset, and
further, to perform fact-verification under a low-resource
setting with limited training data.
        <xref ref-type="bibr" rid="ref33">Wadden et al. (2020)</xref>
        collected a scientific claim-verification dataset, SCIFACT, and
proposed a scientific claim-verification task: given a
scientific claim, find evidence sentences that support or refute the
claim in a corpus of scientific paper abstracts.
        <xref ref-type="bibr" rid="ref33">Wadden et al.
(2020)</xref>
        also proposed a simple, pipeline-based,
sentence-level model, VERISCI, as a baseline solution based on
        <xref ref-type="bibr" rid="ref12">DeYoung et al. (2019)</xref>
        .
      </p>
      <p>VERISCI is a pipeline model that runs modules for
abstract retrieval, rationale sentence selection, and stance
prediction sequentially, and thus the error generated from an
upstream module may propagate to the downstream
modules. To overcome this drawback, we hypothesize that a
module jointly optimized on multiple sub-tasks may
mitigate the error-propagation problem to improve the overall
performance. In addition, we observe that a complete set
of rationale sentences usually contains multiple inter-related
sentences from the same paragraph. Therefore, we propose
a novel, paragraph-level, multi-task learning model for the SCIFACT task.</p>
      <sec id="sec-1-1">
        <title>Overview and Contributions</title>
        <p>
          In this work, we employ compact paragraph encoding, a
novel strategy of computing sentence representations using
BERT-family models. We directly feed an entire paragraph
as a single sequence to BERT, so that the encoded sentence
representations are already contextualized on the neighbor
sentences by taking advantage of the attention mechanisms
in BERT. In addition, we jointly train the modules for
rationale selection and stance prediction as multi-task learning
          <xref ref-type="bibr" rid="ref7">(Caruana 1997)</xref>
          by leveraging the confidence score of
rationale selection as the attention weight of the stance prediction
module. Furthermore, we compare two methods of transfer
learning that mitigate the low-resource issue: pre-training
and domain adaptation
          <xref ref-type="bibr" rid="ref10 ref22">(Peng and Dredze 2017)</xref>
          . Our
experiments show that:
• The compact paragraph encoding method is beneficial
over separately computing sentence embeddings.
• With negative sampling, the joint training of rationale
selection and stance prediction is beneficial over the pipeline
solution.
        </p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2 SCIFACT Task Formulation</title>
      <p>
        Given a scientific claim c and a corpus of scientific
paper abstracts A, the SCIFACT
        <xref ref-type="bibr" rid="ref33">(Wadden et al. 2020)</xref>
        task retrieves all abstracts E(c) that either SUPPORT
or REFUTE c. Specifically, the stance prediction (a.k.a.
label prediction) task classifies each abstract a ∈ A
into y(c, a) ∈ {SUPPORTS, REFUTES, NOINFO} with
respect to each claim c; the rationale selection (a.k.a.
sentence selection) task retrieves all rationale sentences
S(c, a) = {s_1(c, a), ..., s_l(c, a)} of each a that SUPPORT
or REFUTE c. The performance of both tasks
is evaluated with the F1 measure at both abstract-level and
sentence-level, as defined by
        <xref ref-type="bibr" rid="ref33">Wadden et al. (2020)</xref>
        , where
{SUPPORTS, REFUTES} are considered the positive
labels and NOINFO the negative label for stance prediction.
      </p>
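<p>For concreteness, the sentence-level metric can be sketched as follows. This is an illustrative simplification with hypothetical set-valued inputs; the official evaluation of Wadden et al. (2020) additionally requires the predicted stance label to be correct for the Selection+Label score.</p>

```python
def sentence_f1(gold, pred):
    """Micro-averaged sentence-level precision/recall/F1 over claims.

    gold, pred: dict mapping claim_id -> set of (doc_id, sent_idx)
    rationale identifiers. A predicted sentence counts as correct
    only if it is a gold rationale for the same claim.
    """
    tp = fp = fn = 0
    for claim_id in gold.keys() | pred.keys():
        g = gold.get(claim_id, set())
        p = pred.get(claim_id, set())
        tp += len(g & p)   # correctly selected rationale sentences
        fp += len(p - g)   # selected but not gold
        fn += len(g - p)   # gold but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```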
    </sec>
    <sec id="sec-3">
      <title>3 Approach</title>
      <p>
        We formulate the SCIFACT task
        <xref ref-type="bibr" rid="ref33">(Wadden et al. 2020)</xref>
        as a
sentence-level sequence-tagging problem. We first apply an
abstract retrieval module to filter out negative candidate
abstracts that do not contain sufficient information with respect
to each given claim. Then we propose a novel model for
joint rationale selection and stance prediction using
multi-task learning
        <xref ref-type="bibr" rid="ref7">(Caruana 1997)</xref>
        .
      </p>
      <sec id="sec-3-1">
        <title>3.1 Abstract Retrieval</title>
        <p>
          In contrast to the TF-IDF similarity used by Wadden et al.
(2020), we leverage the BioSentVec
          <xref ref-type="bibr" rid="ref20 ref38 ref8">(Chen, Peng, and Lu 2019)</xref>
          embedding, which is the biomedical version of Sent2Vec
          <xref ref-type="bibr" rid="ref21 ref36 ref37">(Pagliardini, Gupta, and Jaggi 2018)</xref>
          , for a fast and scalable
sentence-level similarity computation. We first compute the
BioSentVec
          <xref ref-type="bibr" rid="ref20 ref38 ref8">(Chen, Peng, and Lu 2019)</xref>
          embedding of each
abstract in the corpus by treating the concatenation of each
title and abstract as a single sentence. Then for each given
claim, we compute the cosine similarities of the claim
embedding against the pre-computed abstract embeddings, and
choose the top k_retrieval most similar abstracts as the candidate
abstracts for the next module.
        </p>
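<p>The retrieval step thus reduces to a top-k cosine-similarity search over pre-computed embeddings. A minimal NumPy sketch with stand-in vectors (in our system, the vectors come from the pretrained BioSentVec model):</p>

```python
import numpy as np

def top_k_abstracts(claim_vec, abstract_vecs, k):
    """Return indices of the k abstracts most cosine-similar to the claim.

    claim_vec: (d,) embedding of the claim.
    abstract_vecs: (n, d) pre-computed embeddings, one per abstract
    (title and abstract concatenated and embedded as one "sentence").
    """
    a = abstract_vecs / np.linalg.norm(abstract_vecs, axis=1, keepdims=True)
    c = claim_vec / np.linalg.norm(claim_vec)
    sims = a @ c                   # cosine similarities, shape (n,)
    return np.argsort(-sims)[:k]   # indices of the top-k most similar
```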
        <sec id="sec-3-1-1">
          <title>3.2 Joint Rationale Selection and Stance Prediction Model</title>
          <p>
Compact Paragraph Encoding. A common usage of
BERT-family models
            <xref ref-type="bibr" rid="ref11 ref17">(Devlin et al. 2018; Liu et al. 2019)</xref>
            for
sentence-level sequence tagging is to compute each sentence
embedding in a paragraph in separate batches. Since each batch is
independent, such a method leaves the contextualization of the
sentences to the subsequent modules. Instead, we propose
a novel method of encoding paragraphs by directly feeding
the concatenation of the claim c and the whole paragraph
P to a BERT model as a single sequence Seq. By
separating each sentence s_i with the BERT model’s [SEP]
token, we fully leverage the multi-head attention
            <xref ref-type="bibr" rid="ref28">(Vaswani
et al. 2017)</xref>
            within the BERT model to compute
contextualized word representations h_Seq with respect to the claim
sentence and the whole paragraph.
          </p>
          <p>c = [cw_1, cw_2, ..., cw_n]
s_i = [w_1, w_2, ..., w_m]
P = [s_1, s_2, ..., s_l]
Seq = [c [SEP] s_1 [SEP] s_2 [SEP] ... [SEP] s_l] (1)
h_Seq = BERT(Seq) ∈ R^(len(Seq) × d_BERT) (2)
h_Seq = [h_CLS, h_cw1, ..., h_cwn, h_SEP, h_w1, ..., h_wm, h_SEP, ...] (3)</p>
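<p>The construction of Seq can be sketched as follows, with token strings standing in for real wordpiece IDs; the recorded spans are later used to pool each sentence’s contextualized vectors (a real tokenizer would also prepend [CLS]):</p>

```python
def build_sequence(claim_tokens, paragraph_sentences):
    """Concatenate claim and paragraph into one sequence,
    Seq = [claim, [SEP], s1, [SEP], s2, ...], recording the token
    span of each sentence so its contextualized vectors can be
    pooled afterwards.
    """
    seq = list(claim_tokens)
    spans = []                 # (start, end) token span per sentence
    for sent in paragraph_sentences:
        seq.append("[SEP]")    # sentence boundary marker
        start = len(seq)
        seq.extend(sent)
        spans.append((start, len(seq)))
    return seq, spans
```

Feeding the whole sequence to BERT at once is what lets the attention layers contextualize every sentence on its neighbors and on the claim.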
        </sec>
        <sec id="sec-3-1-3">
          <title>Sentence Representations via Word-level Attention</title>
          <p>Next, we apply a weighted sum to the contextualized word
representations of each sentence to compute the
sentence representation h_si. The weights are obtained by
applying a self-attention SelfAttn_word, a two-layer
multi-layer perceptron, on the word representations in the scope of
each sentence, as separated by the [SEP] tokens.

h_si = SelfAttn_word([h_SEP, h_w1, ..., h_wm]) ∈ R^d_BERT</p>
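<p>The word-level attention pooling can be sketched in NumPy as follows (the parameter matrices W1, b1, w2, b2 are hypothetical stand-ins for the trained two-layer perceptron):</p>

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def self_attn_pool(h_words, W1, b1, w2, b2):
    """Pool word vectors (m, d) into one sentence vector (d,).

    A two-layer MLP scores each word; a softmax over the scores
    gives the attention weights; the sentence representation is
    the weighted sum of the word vectors.
    """
    scores = np.tanh(h_words @ W1 + b1) @ w2 + b2   # (m,) word scores
    alpha = softmax(scores)                          # attention weights
    return alpha @ h_words                           # (d,) sentence vector
```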
        </sec>
        <sec id="sec-3-1-4">
          <title>Dynamic Rationale Representations</title>
          <p>We use a two-layer
multi-layer perceptron MLP_rationale to compute the
rationale score and use the softmax function to compute the
probability of each candidate sentence being a rationale
sentence (p_r) or not (p_not_r) with respect to the claim sentence c.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>Rationale Selection</title>
        <p>Then we only feed the selected rationale sentences r into the next stance
prediction module.</p>
        <p>p_not_r_i, p_r_i = softmax(MLP_rationale(h_si)) ∈ (0, 1)
h_ri ← h_si if p_not_r_i &lt; p_r_i</p>
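<p>The rationale gating step then reduces to a row-wise softmax and a boolean mask, sketched below with illustrative raw scores in place of the MLP_rationale outputs:</p>

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def select_rationales(h_sents, scores):
    """Keep only the sentences predicted as rationales.

    h_sents: (l, d) sentence representations.
    scores: (l, 2) raw MLP_rationale outputs, columns = [not_r, r].
    Returns the kept representations and the boolean mask.
    """
    p = softmax(scores)          # (l, 2) probabilities per sentence
    keep = p[:, 1] > p[:, 0]     # keep when p_r > p_not_r
    return h_sents[keep], keep
```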
      </sec>
      <sec id="sec-3-3">
        <title>Stance Prediction</title>
        <p>
          We use two variants for stance prediction: a simple sentence-level attention and the Kernel Graph
Attention Network (KGAT)
          <xref ref-type="bibr" rid="ref18">(Liu et al. 2020)</xref>
          .
• Simple Attention. We apply another weighted
summation to the predicted rationale sentence representations
h_ri to compute the whole paragraph’s rationale
representation h_r, where the attention weights are obtained by
applying another self-attention, SelfAttn_sentence, to the
rationale sentence representations. Finally, we apply
another two-layer multi-layer perceptron MLP_stance and
the softmax function to compute the probability of the
paragraph serving the role of {SUPPORTS, REFUTES,
NOINFO} with respect to the claim c.

h_r = SelfAttn_sentence([h_r1, h_r2, ..., h_rl]) ∈ R^d_BERT
p_stance = softmax(MLP_stance(h_r)) ∈ (0, 1)^3 (4)
• Kernel Graph Attention Network.
          <xref ref-type="bibr" rid="ref18">Liu et al. (2020)</xref>
          proposed KGAT as a stance prediction module for their
pipeline solution on the FEVER (Thorne et al. 2018) task.
In addition to the Graph Attention Network (Veličković
et al. 2017), which applies attention mechanisms to each
word pair and sentence pair in the input paragraph, KGAT
applies a kernel pooling mechanism
          <xref ref-type="bibr" rid="ref35">(Xiong et al. 2017)</xref>
          to
extract better features for stance prediction. We integrate
KGAT (Liu et al. 2020) into our multi-task learning model
for stance prediction on SCIFACT
          <xref ref-type="bibr" rid="ref33">(Wadden et al. 2020)</xref>
          .
        </p>
      </sec>
      <sec id="sec-3-6">
        <title>KGAT Module</title>
        <p>The KGAT module takes the word representations
of the claim h_c and the predicted rationale sentence
representations h_R as inputs, and outputs the probability of
the paragraph serving the role of {SUPPORTS, REFUTES,
NOINFO} with respect to the claim c.</p>
        <p>h_c = [h_CLS, h_cw1, ..., h_cwn]
h_Ri = [h_SEP, h_rw1, ..., h_rwm] where p_not_r_i &lt; p_r_i
h_R = [h_R1, h_R2, ..., h_Rl]
p_stance = KGAT(h_c, h_R) ∈ (0, 1)^3 (5)</p>
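<p>The simple-attention variant of stance prediction can be sketched as follows (a single scoring vector stands in for the two-layer self-attention MLP; KGAT replaces this pooling with kernel-based attention over word and sentence pairs):</p>

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def predict_stance(h_rationales, attn_w, W_stance, b_stance):
    """Simple-attention stance prediction over rationale sentences.

    h_rationales: (r, d) predicted rationale representations
    (the dummy sentence when no real rationale is selected).
    attn_w: (d,) scoring vector for the sentence-level attention.
    W_stance: (d, 3), b_stance: (3,) for the 3-way classifier.
    Returns probabilities over (SUPPORTS, REFUTES, NOINFO).
    """
    alpha = softmax(h_rationales @ attn_w)    # sentence attention weights
    h_para = alpha @ h_rationales             # paragraph rationale rep
    return softmax(h_para @ W_stance + b_stance)
```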
        <sec id="sec-3-6-1">
          <title>Multi-task Learning</title>
          <p>
            We train our model on rationale selection and stance prediction using a multi-task learning
approach
            <xref ref-type="bibr" rid="ref7">(Caruana 1997)</xref>
            . We use cross-entropy loss as the
training objective for both tasks. We introduce a coefficient λ
to adjust the proportion of the two loss values L_rationale and
L_stance in the joint loss L:
L = λ · L_rationale + L_stance
          </p>
        </sec>
      </sec>
      <sec id="sec-3-7">
        <title>Scheduled Sampling</title>
        <p>
          Because the stance prediction
module takes the predicted rationale sentences as the input,
errors in rationale selection may propagate to the stance
prediction module, especially during the early stage of
training. To mitigate this issue, we apply scheduled sampling
          <xref ref-type="bibr" rid="ref6">(Bengio et al. 2015)</xref>
, which starts by feeding the ground-truth
rationale sentences to the stance prediction module
and gradually increases the proportion of predicted
rationale sentences, until eventually all input sentences are
predicted rationale sentences. We use a sine function to
compute the probability of sampling predicted rationale
sentences, p_sample, as a function of training progress:
progress = (current epoch − 1) / (total epoch − 1) (6)
p_sample = sin((π/2) × progress) (7)
        </p>
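<p>The sampling schedule can be sketched as:</p>

```python
import math
import random

def scheduled_sample(current_epoch, total_epoch, rng=random.random):
    """Decide whether to feed predicted (True) or ground-truth (False)
    rationales to the stance module at this training step.

    p_sample rises from 0 at epoch 1 to 1 at the final epoch along a
    sine curve, so early training mostly sees gold rationales.
    """
    progress = (current_epoch - 1) / (total_epoch - 1)
    p_sample = math.sin(math.pi / 2 * progress)
    return rng() < p_sample
```

At epoch 1 the stance module always receives gold rationales; by the final epoch it always receives the model’s own predictions.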
        <sec id="sec-3-7-1">
          <title>Negative Sampling and Down-sampling</title>
          <p>Although the abstract retrieval module filters out the majority of the
negative candidate abstracts, the false-positive rate remains
inevitably high in order to ensure the retrieval of most of the
positive abstracts. As a result, the input to the joint
prediction model is highly biased towards negative samples.
Therefore, in addition to the positive samples from the SCIFACT
dataset (Wadden et al. 2020), we perform negative
sampling (Mikolov et al. 2013) to sample the top k_train most
similar negative abstracts using our abstract retrieval module, as
an augmented dataset for training and validation, to increase
the downstream model’s tolerance of false-positive abstracts.
Furthermore, in order to increase the diversity of the dataset,
we augment the dataset by down-sampling sentences within each paragraph.</p>
        </sec>
      </sec>
      <sec id="sec-3-11">
        <title>FEVER Pre-training</title>
        <p>
          As Wadden et al. (2020) proposed,
due to the similar task structure of FEVER (Thorne et al.
2018) and SCIFACT
          <xref ref-type="bibr" rid="ref33">(Wadden et al. 2020)</xref>
          , we first pre-train
our model on the FEVER dataset, then fine-tune it on the
SCIFACT dataset by partially re-initializing the rationale selection
and stance prediction attention modules.
        </p>
      </sec>
      <sec id="sec-3-13">
        <title>Domain Adaptation</title>
        <p>
          Instead of pre-training, we also explore domain adaptation
          <xref ref-type="bibr" rid="ref10 ref22">(Peng and Dredze 2017)</xref>
          from
FEVER (Thorne et al. 2018) to SCIFACT
          <xref ref-type="bibr" rid="ref33">(Wadden et al.
2020)</xref>
          . We use shared representations for the compact
paragraph encoding and word-level attention, while using
domain-specific representations for the rationale selection
and stance prediction modules.
        </p>
      </sec>
      <sec id="sec-3-14">
        <title>Hyper-parameter Settings</title>
        <p>[Table 1 residue: the explored hyper-parameters include k_retrieval, k_FEVER, k_train, dropout, learning rate, BERT learning rate, and batch size; the explored values are not recoverable from the extraction.]</p>
      </sec>
      <sec id="sec-3-15">
        <title>Dummy Rationale Sentence</title>
        <p>We dynamically feed only
the predicted rationale sentence representations to the stance
prediction module. To address the special case when an
abstract contains no rationale sentences, we prepend a fixed
dummy sentence (e.g., “@”) whose rationale label is always 0
to the beginning of each paragraph. When the stance
prediction module has no actual rationale sentence to take
as input, we feed it with the representation of the dummy
sentence and expect the module to predict NOINFO.</p>
      </sec>
      <sec id="sec-3-16">
        <title>Post Processing</title>
        <p>To prevent inconsistency between the
outputs of rationale selection and stance prediction, we
enforce the predicted stance to be NOINFO if no rationale
sentence is proposed.</p>
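<p>A minimal sketch of this consistency rule, with hypothetical string labels:</p>

```python
def postprocess(stance, rationale_ids):
    """If no rationale sentence was proposed, force the predicted
    stance to NOINFO so the two outputs cannot contradict."""
    return ("NOINFO" if not rationale_ids else stance), rationale_ids
```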
      </sec>
      <sec id="sec-3-17">
        <title>Hyper-parameters</title>
        <p>Table 1 lists the hyper-parameters
used for training the Paragraph-Joint model in Table 4,
where k_FEVER refers to the number of negative samples
retrieved from FEVER (Thorne et al. 2018) for model
pre-training.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4 Experiments</title>
      <sec id="sec-4-1">
        <title>4.1 SCIFACT Dataset</title>
        <p>SCIFACT (Wadden et al. 2020) is a small dataset whose corpus contains 5183 abstracts. There are 1409 claims, including 809 in the training set, 300 in the development set, and 300 in the test set.</p>
        <p>1. https://github.com/jacklxc/ParagraphJointModel</p>
        <sec id="sec-4-1-3">
          <title>4.2 Abstract Retrieval</title>
          <p>[Residue of results tables: rows for Paragraph-Pipeline, Paragraph-Joint, Paragraph-Joint KGAT, VERT5ERINI*, and a sentence-level baseline, with sentence-level Selection-Only and Selection+Label precision, recall, and F1 columns; the individual scores are not recoverable from the extraction.]</p>
          <p>
            Table 2 compares the performance of abstract retrieval
modules using TF-IDF and BioSentVec
            <xref ref-type="bibr" rid="ref20 ref38 ref8">(Chen, Peng, and
Lu 2019)</xref>
            . As Table 2 indicates, the overall difference
between these two methods is small.
            <xref ref-type="bibr" rid="ref33">Wadden et al. (2020)</xref>
chose k_retrieval = 3 to maximize the F1 score of the abstract
retrieval module, while we choose a larger k_retrieval to
pursue higher recall, in order to retrieve more positive
abstracts for the downstream models.
          </p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.3 Baseline Models</title>
        <sec id="sec-4-2-1">
          <title>VERISCI</title>
          <p>
            Along with the SCIFACT task and dataset,
            <xref ref-type="bibr" rid="ref33">Wadden et al. (2020)</xref>
            proposed VERISCI, a sentence-level,
pipeline-based solution. After retrieving the top similar
abstracts for each claim with the TF-IDF vectorization method,
they applied a sentence-level “BERT to BERT” model
            <xref ref-type="bibr" rid="ref12">DeYoung et al. (2019)</xref>
            to extract rationales, sentence by sentence,
with a BERT model, and they predicted the stance with
another BERT model using the concatenation of the extracted
rationale sentences.
            <xref ref-type="bibr" rid="ref33">Wadden et al. (2020)</xref>
            used RoBERTa-large
            <xref ref-type="bibr" rid="ref17">(Liu et al. 2019)</xref>
            as their BERT model and pre-trained their
stance prediction module on the FEVER dataset (Thorne
et al. 2018).
          </p>
        </sec>
        <sec id="sec-4-2-2">
          <title>VERT5ERINI</title>
          <p>
            Very recently, Pradeep et al. (2020) proposed a strong model, VERT5ERINI, based on T5
            <xref ref-type="bibr" rid="ref26">(Raffel
et al. 2019)</xref>
            . They applied T5 for all three steps of the
SCIFACT task in a sentence-level, pipeline fashion. Because of
the known significant performance gap between
RoBERTa-large
            <xref ref-type="bibr" rid="ref17">(Liu et al. 2019)</xref>
            , which we use, and T5 (Raffel et al. 2019;
Pradeep et al. 2020), we only use VERT5ERINI as a reference (marked with *).
          </p>
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>4.4 Model Performances and Ablation Studies</title>
        <sec id="sec-4-3-1">
          <title>Setup</title>
          <p>We experiment on the oracle task, which performs rationale
selection and stance prediction given the oracle abstracts
(Table 3), and the open task, which performs the full task
of abstract retrieval, rationale selection, and stance
prediction (Table 4). We tune our models based on the
sentencelevel, final development set performance (Selection+Label).</p>
        </sec>
          <p>The test labels are not released by Wadden et al. (2020). Unless explicitly stated, all models are pre-trained on FEVER (Thorne et al. 2018).</p>
      </sec>
      <sec id="sec-4-4">
        <title>Paragraph-level Model vs. Sentence-level Model</title>
        <p>
          We compare our paragraph-level pipeline model against
VERISCI
          <xref ref-type="bibr" rid="ref33">(Wadden et al. 2020)</xref>
          , a
sentence-level solution, on the oracle task. As Table 3 shows, our
paragraph-level pipeline model (Paragraph-Pipeline)
outperforms VERISCI, particularly on rationale selection. This
suggests the benefit of computing the contextualized
sentence representations using the compact paragraph
encoding over individual sentence representations.
        </p>
      </sec>
      <sec id="sec-4-5">
        <title>Joint Model vs. Pipeline Model</title>
        <p>Although our joint
model does not show benefits over the pipeline model
on the oracle task (Table 3), the benefit emerges on the
open task. Along with negative sampling, which greatly
increases the models’ tolerance of false-positive
abstracts, the Paragraph-Joint model shows its benefit over the
Paragraph-Pipeline model. The small difference between the
Paragraph-Joint model and the same model with TF-IDF
abstract retrieval (Paragraph-Joint TF-IDF) shows that
the performance improvement is mainly attributable to the
joint training rather than to replacing TF-IDF similarity with
BioSentVec embedding similarity in abstract retrieval.</p>
        <sec id="sec-4-5-4">
          <title>Pre-training vs. Domain Adaptation</title>
          <p>
            We also compare
two methods of transfer learning from FEVER (Thorne et al.
2018) to SCIFACT
            <xref ref-type="bibr" rid="ref33">(Wadden et al. 2020)</xref>
            . Table 4 shows that
the effect of pre-training (Paragraph-Joint) or domain
adaptation
            <xref ref-type="bibr" rid="ref10 ref22">(Peng and Dredze 2017)</xref>
            (Paragraph-Joint DA) is
similar. Both are effective transfer learning methods, as they
significantly outperform the same model that is only trained
on SCIFACT (Paragraph-Joint SCIFACT-only).
          </p>
        </sec>
      </sec>
      <sec id="sec-4-6">
        <title>KGAT vs. Simple Attention as Stance Prediction Module</title>
        <sec id="sec-4-6-1">
          <title>Discussion</title>
          <p>
            We expected a significant performance improvement from applying the strong stance prediction model KGAT
            <xref ref-type="bibr" rid="ref18">(Liu et al.
2020)</xref>
            , but the actual improvement is limited. This is likely
due to the strong regularization of KGAT that under-fits the
training data.
          </p>
          <p>Test-set Performance on the SCIFACT Leaderboard. As of
the latest update of this paper, our Paragraph-Joint model
trained on the combination of the SCIFACT training and
development sets achieved first place on the SCIFACT
leaderboard.2 We obtain a test sentence-level F1 score
(Selection+Label) of 60.9% and a test abstract-level F1 score
(Label+Rationale) of 67.2%.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5 Related Work</title>
      <p>
        Fact-verification has been widely studied. There are many
datasets available on various domains
        <xref ref-type="bibr" rid="ref1 ref13 ref20 ref23 ref23 ref32 ref34 ref5 ref8">(Vlachos and Riedel
2014; Ferreira and Vlachos 2016; Popat et al. 2017; Wang
2017; Derczynski et al. 2017; Popat et al. 2017; Atanasova
2018; Baly et al. 2018; Chen et al. 2019; Hanselowski et al.
2019)</xref>
        , among which the most influential is the FEVER
shared task (Thorne et al. 2018), which aims to develop
systems that check the veracity of human-generated claims
by extracting evidence from Wikipedia. Most existing
systems
        <xref ref-type="bibr" rid="ref20 ref38 ref8">(Nie, Chen, and Bansal 2019)</xref>
        leverage a three-step
pipeline approach, building a module for each step:
document retrieval, rationale selection, and fact verification.
Many of them focus on the claim verification step
        <xref ref-type="bibr" rid="ref18">(Zhou
et al. 2019; Liu et al. 2020)</xref>
        , such as KGAT
        <xref ref-type="bibr" rid="ref18">(Liu et al. 2020)</xref>
        ,
one of the top models on the FEVER leaderboard. On the other
hand, there are some attempts on jointly optimizing rationale
selection and stance prediction. TwoWingOS
        <xref ref-type="bibr" rid="ref36 ref37">(Yin and Roth
2018)</xref>
        leverages attentive CNN
        <xref ref-type="bibr" rid="ref36 ref37">(Yin and Schütze 2018)</xref>
        to
inter-wire the two modules, while
        <xref ref-type="bibr" rid="ref15">Hidey et al. (2020)</xref>
        used a
single pointer network
        <xref ref-type="bibr" rid="ref30 ref6">(Vinyals, Fortunato, and Jaitly 2015)</xref>
        for
both sub-tasks. We propose another variation that directly
links the two modules by a dynamic attention mechanism.
(Footnote 2: https://leaderboard.allenai.org/scifact/submissions/public, as of February 12, 2021.)
      </p>
      <sec id="sec-5-2">
        <title>Applying FEVER Systems to SCIFACT</title>
        <p>
          Because SCIFACT (Wadden et al. 2020) is a scientific version of FEVER (Thorne et al. 2018), systems designed for
FEVER can be applied to SCIFACT in principle. However,
as a fact-verification task in the scientific domain, the SCIFACT task
inherits the common issue of lacking sufficient data,
which can be mitigated with transfer learning by
leveraging language models and introducing external datasets. The
baseline model by
          <xref ref-type="bibr" rid="ref33">Wadden et al. (2020)</xref>
          leverages
RoBERTa-large
          <xref ref-type="bibr" rid="ref17">(Liu et al. 2019)</xref>
fine-tuned on the FEVER dataset (Thorne
et al. 2018), while VERT5ERINI
          <xref ref-type="bibr" rid="ref25">(Pradeep et al. 2020)</xref>
          leverages T5
          <xref ref-type="bibr" rid="ref26">(Raffel et al. 2019)</xref>
and is fine-tuned on the MS MARCO
dataset
          <xref ref-type="bibr" rid="ref3">(Bajaj et al. 2016)</xref>
          . In this work, in addition to
fine-tuning RoBERTa-large on FEVER, we also explore domain
adaptation
          <xref ref-type="bibr" rid="ref10 ref22">(Peng and Dredze 2017)</xref>
          to mitigate the low-resource
issue.
        </p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6 Conclusion</title>
      <p>In this work, we propose a novel paragraph-level multi-task
learning model for the SCIFACT task. Experiments show that
(1) the compact paragraph encoding method is beneficial
over separately computing sentence embeddings, and (2) with
negative sampling, the joint training of rationale selection
and stance prediction is beneficial over the pipeline solution.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgement</title>
      <p>We thank the anonymous reviewers for their useful comments,
and Dr. Jessica Ouyang for her feedback. This work
is supported by a National Institutes of Health (NIH) R01
grant (LM012592). The views and conclusions of this paper
are those of the authors and do not reflect the official policy
or position of NIH.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>Atanasova, P.; Màrquez, L.; Barrón-Cedeño, A.; Elsayed, T.; Suwaileh, R.; Zaghouani, W.; Kyuchukov, S.; Da San Martino, G.; and Nakov, P. 2018.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>Overview of the CLEF-2018 CheckThat! Lab on automatic identification and verification of political claims, Task 1: Check-worthiness. In Working Notes of the Conference and Labs of the Evaluation Forum, CLEF, volume 18.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>Bajaj, P.; Campos, D.; Craswell, N.; Deng, L.; Gao, J.; Liu, X.; Majumder, R.; McNamara, A.; Mitra, B.; Nguyen, T.; et al. 2016.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>MS MARCO: A human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>Baly, R.; Mohtarami, M.; Glass, J.; Màrquez, L.; Moschitti, A.; and Nakov, P. 2018. Integrating stance detection and fact checking in a unified corpus. arXiv preprint arXiv:1804.08012.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>Bengio, S.; Vinyals, O.; Jaitly, N.; and Shazeer, N. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, 1171–1179.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>Caruana, R. 1997. Multitask learning. Machine Learning 28(1): 41–75.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>Chen, Q.; Peng, Y.; and Lu, Z. 2019. BioSentVec: creating sentence embeddings for biomedical texts. In 2019 IEEE International Conference on Healthcare Informatics (ICHI), 1–5. IEEE.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>2019. Seeing things from a different angle: Discovering diverse perspectives about claims. arXiv preprint arXiv:1906.03538.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>W. S.; and Zubiaga, A. 2017. SemEval-2017 Task 8: RumourEval: Determining rumour veracity and support for rumours. arXiv preprint arXiv:1704.05972.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>DeYoung, J.; Jain, S.; Rajani, N. F.; Lehman, E.; Xiong, C.; Socher, R.; and Wallace, B. C. 2019. ERASER: A benchmark to evaluate rationalized NLP models. arXiv preprint arXiv:1911.03429.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>Ferreira, W.; and Vlachos, A. 2016. Emergent: a novel data-set for stance classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 1163–1168.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>2019. A Richly Annotated Corpus for Different Tasks in Automated Fact-Checking. arXiv preprint arXiv:1911.01214.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>Hidey, C.; Chakrabarty, T.; Alhindi, T.; Varia, S.; Krstovski, K.; Diab, M.; and Muresan, S. 2020. DeSePtion: Dual Sequence Prediction and Adversarial Examples for Improved Fact-Checking.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>arXiv preprint arXiv:2004.12864.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; and Stoyanov, V. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>Liu, Z.; Xiong, C.; Sun, M.; and Liu, Z. 2020. Fine-grained fact verification with kernel graph attention network. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 7342–7351.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, 3111–3119.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>Nie, Y.; Chen, H.; and Bansal, M. 2019. Combining fact extraction and verification with neural semantic matching networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 6859–6866.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>Pagliardini, M.; Gupta, P.; and Jaggi, M. 2018. Unsupervised learning of sentence embeddings using compositional n-gram features. In NAACL 2018 – Conference of the North American Chapter of the Association for Computational Linguistics.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>Peng, N.; and Dredze, M. 2017. Multi-task multi-domain representation learning for sequence tagging. In Proceedings of the 2nd Workshop on Representation Learning for NLP.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>Popat, K.; Mukherjee, S.; Strötgen, J.; and Weikum, G. 2017.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>Where the truth lies: Explaining the credibility of emerging claims on the web and social media. In Proceedings of the 26th International Conference on World Wide Web Companion, 1003–1012.</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>Pradeep, R.; Ma, X.; Nogueira, R.; and Lin, J. 2020. Scientific claim verification with VERT5ERINI. arXiv preprint arXiv:2010.11930.</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; and Liu, P. J. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683.</mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>2018. FEVER: a large-scale dataset for fact extraction and verification. arXiv preprint arXiv:1803.05355.</mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008.</mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò, P.; and Bengio, Y. 2017. Graph attention networks. arXiv preprint arXiv:1710.10903.</mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>Vinyals, O.; Fortunato, M.; and Jaitly, N. 2015. Pointer networks.</mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>In Advances in Neural Information Processing Systems, 2692–2700.</mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>Vlachos, A.; and Riedel, S. 2014. Fact checking: Task definition and dataset construction. In Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science, 18–22.</mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>Wadden, D.; Lo, K.; Wang, L. L.; Lin, S.; van Zuylen, M.; Cohan, A.; and Hajishirzi, H. 2020. Fact or Fiction: Verifying Scientific Claims. In EMNLP.</mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>Wang, W. Y. 2017. “Liar, liar pants on fire”: A new benchmark dataset for fake news detection. arXiv preprint arXiv:1705.00648.</mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>Xiong, C.; Dai, Z.; Callan, J.; Liu, Z.; and Power, R. 2017. End-to-end neural ad-hoc ranking with kernel pooling. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, 55–64.</mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>Yin, W.; and Roth, D. 2018. TwoWingOS: A two-wing optimization strategy for evidential claim verification. arXiv preprint arXiv:1808.03465.</mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>Yin, W.; and Schütze, H. 2018. Attentive convolution: Equipping CNNs with RNN-style attention mechanisms. Transactions of the Association for Computational Linguistics 6: 687–702.</mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>2019. GEAR: Graph-based evidence aggregating and reasoning for fact verification. arXiv preprint arXiv:1908.01843.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>