=Paper=
{{Paper
|id=Vol-2838/Hatua
|storemode=property
|title=On the Feasibility of Using GANs for Claim Verification- Experiments and Analysis
|pdfUrl=https://ceur-ws.org/Vol-2838/paper4.pdf
|volume=Vol-2838
|authors=Amartya Hatua,Arjun Mukherjee,Rakesh M. Verma
|dblpUrl=https://dblp.org/rec/conf/ecir/HatuaMV21
}}
==On the Feasibility of Using GANs for Claim Verification- Experiments and Analysis==
On the Feasibility of Using GANs for Claim Verification- Experiments and Analysis Amartya Hatuaa , Arjun Mukherjeea and Rakesh M. Vermaa a University of Houston, 4800 Calhoun Rd, Houston, TX 77004 Abstract The research on fact checking and claim verification has been explored using the Fact Extraction and VERification (FEVER) dataset. To supplement this research a Generative Adversarial Network (GAN) based model is used for fact checking on the FEVER dataset. The GAN based model generates synthetic data in an extended feature space of the FEVER dataset and gives leverage to new features. This synthet- ically generated data is further classified using positive-unlabeled (PU) learning considering supported facts as positive class and are added to the existing training dataset. Bidirectional Encoder Representa- tions from Transformers (BERT) based encoding technique is used for both original and newly generated data to get the text’s underline context. Due to the Information Gain in the synthetically generated data features, better performance is achieved in the fact checking and claim verification task. A thorough analysis of the model selection is done by comparing the GAN based model with BERT based classifier and other standard classifiers. Keywords Fact checking, GAN, BERT, positive-unlabeled learning 1. Introduction Fake news and misleading information are becoming a widespread phenomenon in our daily lives. Sometimes fake news is designed in such innovative ways, that it becomes difficult to separate the fake news from facts. To check the validity of such facts, other available resources are often used. To solve this problem, research on fact checking and claim verification is gaining a lot of attention. In most of the earlier research, this problem is treated as a classification task based on the patterns of the language [1, 2, 3], or the type of sources of the facts [4]. Sometimes external resources are used [5, 6] to supplement the classification task. Fake claims and fake news often exhibit similar patterns and features. Earlier research attempted to find and use those patterns and features to perform fact checking tasks. In this research, we propose new features for fact checking and claim verification, and an attempt has been made to determine whether these new features help in fact checking tasks. To generate such synthetic features, generative models are used. In this research, a GAN [7] based model is used to create synthetic data, which helps to add new features to the existing dataset. The new features show an information gain which leads the model to produce better results. The class label of the newly generated data is given by the PU learning [8] method, where supported ROMCIR 2021: Workshop on REDUCING ONLINE MISINFORMATION THROUGH CREDIBLE INFORMATION RETRIEVAL, April 01, 2021, Lucca, Tuscany, Italy (ONLINE EVENT) " ahatua@central.uh.edu (A. Hatua); arjun@uh.edu (A. Mukherjee); rmverma@cs.uh.edu (R. M. Verma) © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) claims or the true claims are considered as the positive class. The synthetically generated data is added to the existing training data so that the training feature space gains more features. BERT [9] plays a significant part in this experiment, as BERT is used for encoding of the training dataset. It helps to capture the underline context and the relationships between claims and evidence pairs. The diagrammatic representation of the proposed model is presented in Fig. 1. 1.1. Data We use FEVER dataset for our fact checking and claim verification experiments. The FEVER dataset is open-source, and the research community is actively working on this dataset, so the FEVER dataset is selected for this research [10, 11, 12]. For every claim in the dataset, the evidence for the given claim and its class is given. Two of such (claim, evidence) pairs are presented in Table 1. In the FEVER dataset, the third category of data is also provided where the class of the claim is not mentioned as there is not enough information provided for or against the claim. Table 1 Examples of claim verification Claim: Tetris has sold millions of physical copies. Evidence: It was announced that Tetris has sold more than 170 million copies, approximately 70 physical copies and ... Label: True Claim: Andy Roddick lost 5 Master Series between 2002 and 2010. Evidence: Roddick was ranked in the top 10 for nine consecutive years between 2002 and 2010, and won five Masters Series in that period. Label: False Each data point in this dataset has three main elements: claim, evidence and label. For every claim there is one or multiple evidence from Wikipedia. The class label describes whether the evidence supports / refutes / do not provide enough information for the given claim. FEVER published different versions of the datasets. In this research we used FEVER 1.0 and FEVER 2.0 for training, validation and testing. FEVER 1.0 training dataset has 80,035 Supported claims, 29,775 Refuted claims, and 35,639 NotEnoughInfo claims. The FEVER 1.0 validation set and test set have 3,333 Support claims, 3,333 Refute claims, and 3,333 NotEnoughInfo claims respectively. FEVER 2.0 has 391 Support claims, 396 Refute claims, and 387 NotEnoughInfo claims respectively. This research explores the possibilities of improvements in fact checking and claim verification tasks by adding extra features from synthetically generated data. Generating new data using GAN, which leads to improvement of the fact checking result, is novel to our knowledge. We also compare our result with prior research of Yang et al. [8] and other standard models like Long Short Term Memory (LSTM) [13], Convolution Neural Network (CNN) [14], Graph Neural Network (GCN) [15], Naive Bayes classifier [16], Support Vector Machine (SVM) [17], Random forest [18], Stochastic Gradient Boosting (SGB) [19]. Figure 1: Block diagram of the GAN based model 2. Related Work Significant work has been done on fact checking with most of it focusing on the text’s linguistic patterns, the source of the fact, and occasionally some external information used as a supplement for a given claim. Internal features. Throughout this research, we find that linguistic features are the most important and widely used features for this task. For example, in [1], H. Rashkin et al. analytically characterized the language of fake political news and determined political news’s truthfulness. A similar type of study is done by Ramy Baly et al. in [2] on multiple news resources. Apart from linguistic features and patterns; sentiment, mood, and other psychological factors can help to identify fake claims or news. In [3], Pérez-Rosas et al. showed a method of fake news detection using psycholinguistic features of the news. Using Linguistic Inquiry and Word Count software (LIWC), they extracted essential words in text that are part of psycholinguistic categories. These words are then used to identify fake news. Addition of external sources or metadata. Some research shows that, other than the internal features of the text of the fact/news, external resources and meta data can play a vital role in identifying the truthfulness of a fact or claim. One such approach is in [4], where the meta-data is combined with the data to achieve a significant improvement in fake news detection. Furthermore, external sources include additional information about the news, user interaction, public opinion, etc., and helps in assessment of news or claims [5]. Moreover, previous work also proposed a method to find the truthfulness of news by collecting information from multiple related sources [6, 20]. These sources can either support or refute each other. In [6] Ravali et al. proposed a novel method of modeling the correlations between different sources of news and applied that in determining the truthfulness of the news. Similarly, Jeff Pasternack et al. introduced a generalized fact-finding framework in [20], which incorporates uncertainty in the information extraction of claims from documents, attributes of the sources, the degree of similarity among claims, and the degree of certainty expressed by the sources as additional information into the fact-finding process. For fact checking on inconsistent sources and information, Liang Ge et al. [21] proposed a two-step procedure. It calculates the degree of information consistency and identifies the underlined common reason for the inconsistency and calculates a consistent score for each item. Similarly, Q. Li et al. [22] proposed an optimization framework in which truths and reliable sources are considered as two sets of unknown variables and the framework aims to minimize the deviation between the truths and the multi-source observations. A generalized algorithm called TruthFinder is proposed in [23], which utilizes the information of different related web sites to perform fact checking. PU Learning. In recent works on fact checking and claim verification, Yang et al. [8] proposed a GAN based PU learning technique on the FEVER dataset for claim verification task. This work is used as a baseline for this research, and their results are compared in this research. For our research we used FEVER as it is an open-source dataset for fact checking and claim verification. FEVER dataset contains evidence taken from Wikipedia pages, and the claims are constructed by crowdsourcing [24]. Pipeline. In some of the earlier fact checking research using the FEVER dataset, researchers followed a pipeline used in the base model [24]. The pipeline consists of identifying relevant wiki articles, extracting the appropriate supporting sentences, and determining the truthfulness of the claim. In prior research, document selection is made using different techniques. Some of the important techniques used by the Wikipedia API are the DrQA framework for document detection; token matching techniques and the AllenNLP framework. The second phase of the pipeline, i.e., sentence selection is done using TF-IDF based method, sequence matching neural network, and some ranking based methods. The third and final step of the task, i.e., classification is done using TF-IDF based approach in the base model. Neural network based models, different natural language inference models, and deep learning based models are also used later. 3. Model As presented in Figure 1, the model’s pipeline consists of three central units: i) GAN, ii) PU Learning, iii) Classification unit. In this experiment the Leak GAN [25] model is used; for PU learning, a bagging based method is used with Random forest [26]. The BERT [9] encoding based classifier is used for the classification in the final step. A brief description of each of the units is given below. 3.1. Leak GAN A GAN [7] model consists of a generator (𝐺) unit and a discriminator (𝐷) unit. The generator unit generates synthetic data while the discriminator distinguishes between true data and synthetic data. The goal of the generator unit is to generate data that are very similar to the original data so that the discriminator cannot identify them as synthetic data. The optimization of GAN is done by 𝐷 and 𝐺 alternatively via a min-max game and 𝑝(𝑧) denotes a simple distribution, such as 𝒩 (0, 1). min max 𝑉 (𝐷, 𝐺) = E𝑥∼𝑝𝑑𝑎𝑡𝑎(𝑥) [𝑙𝑜𝑔𝐷(𝑥)] + E𝑧∼𝑝𝑧(𝑧) [𝑙𝑜𝑔(1 − 𝐷(𝐺(𝑧)))] (1) 𝐺 𝐷 Generation of long sentences using GAN is a challenging task, Leak GAN [25] is specially designed to generate long sentences and to produce good results on standard natural language processing based tasks [25]. Hence Leak GAN is used in this research. Leak GAN follows the standard adversarial training principle but in the standard method the scalar guiding signal to the generator unit becomes relatively less informative when GAN attempts to generate a long sentence. Leak GAN overcomes this problem because the discriminator unit leaks information of its own high-level extracted features to the generative unit for further guidance, which eventually helps Leak GAN to generate long sentences. A hierarchical reinforcement learning (RL) architecture [27] is used in the generator unit to incorporate the leaked information from the discriminator unit. Although Leak GAN produces excellent results due to its hierarchical RL based architecture, it takes a long time to generate sentences. Another GAN model, LaTextGAN [28] is also used in this research to compare the results with Leak GAN. As LaTexGAN is not an RL based model, it converges faster than Leak GAN. 3.2. PU Bagging Once the synthetic data is generated using the Leak GAN model, the next task is to label these data using the PU Learning method. PU bagging is one such PU learning method. In the first step, a training set is created combining all positive data points (here the supported claims in FEVER dataset) with a random sample from the unlabeled points, with replacement. In the second step, the dataset is used as a “bootstrap” sample to train a classifier considering unlabeled data points belonging to the negative class. In the next step, the classifier is used to classify the unlabeled data points that were not included in the random sample or the OOB (“out of bag”) points and record their assigned class. These three steps are repeated many times and finally each unlabeled point is labeled with the class it was assigned to the maximum number of times. While assigning the class label of the generated data, only ‘SUPPORTED’ and ‘REFUTED’ class labels are assigned, ‘NOT ENOUGH INFORMATION’ class label is ignored. 3.3. BERT In this research BERT plays a significant role as an encoder and a classifier. BERT Huggingface [29] pretrained model is used to encode the training and test datasets. It ensures the consistency of the underlying semantic context and corresponding relations between claim, evidence and classes. The [SEP] token is used to separate the claim and the evidence. Some claims have multiple supporting statements, in such cases multiple claim evidence pairs are created. For a particular claim, its corresponding evidence is concatenated separately. For example, there is a data point with the following claim (𝐶), evidence (𝐸), and label (𝐿) : [𝐶, 𝐸 < 𝑒1 , 𝑒2 , 𝑒3 >, 𝐿]. The input data format to the BERT model will be: 𝑥 = [< 𝐶; 𝑒1 , 𝐿 >, < 𝐶; 𝑒2 , 𝐿 >, < 𝐶; 𝑒3 , 𝐿 >]. 4. Experimental Setup 4.1. Experiments GAN Based Models: In this experiment, two different GAN based models are used to generate claims synthetically. One of the GAN based models (Leak GAN) uses reinforcement learning (RL), while the other model (LaTextGAN) does not use RL based models. Using each GAN based model, a total of 10,000 synthetic data points (claims) are generated. The length of the sentences generated by Leak GAN is longer than LaTextGAN. On average, the length of the generated sentences by LeakGAN is 20. Whenever LaTextGAN generates a long sentence, it exhibits a tendency to repeat some words multiple times. This problem is not observed in the synthetic data generated by the LeakGAN. PU Learning with Bagging using Random forest: PU Learning with Bagging using Ran- domforest: The unlabeled synthetic data is labeled using the PU Learning technique. In the PU learning technique, a Random forest classifier is used for the initial step, and a bagging approach is followed to label the synthetic data in the final step. BERT Transformer: Huggingface BERT pretrained transformer is used as tokenizer for the training, validation and testing dataset. The vocabulary size of the pretrained model is 30522 and the size of the hidden layer is 768. Later the pretuned model is fine tuned to classify the claims. BERT is used as an encoder for training (original and sythetic data), validation and testing datasets. Classifiers: Other than BERT based classifiers some of the standard machine learning and deep learning classifiers are also used for the classification task. These classifiers are: GCN, LSTM, CNN, SVM, Random forest, Naive Bayes, SGB. The deep learning classifiers like GCN, LSTM and CNN are implemented. For GCN pointwise mutual information between words is calculated to generate the graph. To implement the CNN five kernels of sizes 2, 3, 4, 5 and 6 are used. For LSTM, the input data is encoded using GloVe [30]. The learning rate and batch size for these 3 models are 0.001, 64, respectively. In machine learning models, Categorical Naive Bayes classifier, SVM with RBF kernel, and Random forest is implemented. The Random forest is equipped with 1000 trees and entropy is used as supported criteria for the information gain. The SGB model utilizes hinge loss and L2 penalty. The deep learning models are implemented using PyTorch [31], and the Scikit learn library [32] is used for machine learning models. Evidence Sentence Selection: The evidence for the synthetically generated sentences are selcted from the Wikipedia database [10] using cosine similarity [33]. In this case we have selected one evidence for every synthetically generated sentence. Table 2 Examples of synthetically Leak GAN generated claims and respective evidence from the FEVER dataset Claim: Colantoni Entertainment Singing dealt Wisin densely expertise Crooks Carthaginians toxoplasmosis Dextroamphetamine 313,000 1204 orphanages Illuminate Protestant Hackers Gupta 1917. Evidence: Andrea Colantoni, quarter-finalist in Men’s Low-Kick at WAKO World Championships 2007 Belgrade-67 kg. Label: False Claim: Jackson’s singing citing blast Austrian Coppola direct 100 gruelling The screams Tick Mycroft FX Bacsinszky Orci MacShayne Castlevania Unkrich. Evidence: “All in Your Name” is a song written and performed by Barry Gibb and Michael Jackson Label: True 5. Results The result of the above experiments on FEVER 1.0 and FEVER 2.0 is presented and discussed in this section. The precision, recall, and F1 score for all the models are reported in this section. All the results are compared with the previous research by Yang et al. The Leak GAN based model generates 10,000 data points, which are added to the initial training data. PU Learning Table 3 Examples of synthetically LaTextGAN generated claims and respective evidence from the FEVER dataset Claim: Steven Angele certified Kroll, burglary presidential Texas Lactobacillales Pont finding jumped knight population, Switzerland, person person person person. Evidence: The 1877 Stevens Ducks football team represented Stevens Institute of Technology in the 1877 college football season. Label: False Claim: Lionel messi reached breakdown now Barcelona ““Bonet”” philosophical Afghanistan Championship, adherents hook abandoned Kentuck Kentuck Kentuck Kentuck Kentuck Kentuck Kentuck Evidence: Lionel Messi scored the winning goal in the fifth minute of the second half of extra time, securing Barcelona’s record sixth trophy for the 2009 calendar year. Label: True is applied to these 10,000 synthetically generated data, and 6,823 data points are classified as Supported claims, and the rest 3,177 data points are classified as Refuted claims, and we ignore the NotEnoughInfo class. Hence, the new training dataset has 86,858 Supported claims, 32,952 Refuted claims, and 35,639 NotEnoughInfo claims. The test and validation dataset has 3,333 Supported claims, 3,333 Refuted claims, and 3,333 NotEngoughInfo claims. Training data for both FEVER 1.0 and FEVER 2.0 is the same; the test data different. All the experiments are repeated five times. Table 4 Result of FEVER 1.0 FEVER 1.0 Dataset Classifiers Precision Recall F1 Score BERT Classifier 0.45 ± 0.011 0.44 ± 0.010 0.44 ± 0.009 Leak GAN Based Classifier 0.65 ± 0.003 0.64 ± 0.006 0.63 ± 0.003 LaTextGAN Based Classifier 0.41 ± 0.008 0.36 ± 0.016 0.30 ± 0.009 Graph Convolutional Network 0.45 ± 0.015 0.44 ± 0.013 0.44 ± 0.013 SVM 0.53 ± 0.013 0.42 ± 0.013 0.38 ± 0.013 Naive Bayes 0.41 ± 0.016 0.34 ± 0.014 0.24 ± 0.015 Random forest 0.33 ± 0.011 0.33 ± 0.010 0.28 ± 0.011 SGD 0.31 ± 0.023 0.22 ± 0.022 0.27 ± 0.023 LSTM 0.45 ± 0.003 0.42 ± 0.004 0.004 ± 0.004 CNN 0.46 ± 0.012 0.44 ± 0.011 0.43 ± 0.012 Yang et al. result 0.61 0.58 0.60 It can be observed in Table 4 and Table 5 that the performance of the Leak GAN based model is better than all other models on both the datasets. The F1 mean scores for the two datasets are 0.63 and 0.51. The Leak GAN based model performed better than the earlier published results and the results of the other standard classifiers. We have implemented two GAN based models, LaTextGAN and Leak GAN for synthetic data generation. Analysis is done on the synthetically generated data from both the GAN models. Table 5 Result of FEVER 2.0 FEVER 2.0 Dataset Classifiers Precision Recall F1 Score BERT Classifier 0.46 ± 0.013 0.44 ± 0.014 0.44 ± 0.013 Leak GAN Based Classifier 0.52 ± 0.023 0.51 ± 0.019 0.51 ± 0.021 LaTextGAN Based Classifier 0.42 ± 0.02 0.39 ± 0.019 0.39 ± 0.019 Graph Convolutional Network 0.43 ± 0.023 0.39 ± 0.013 0.37 ± 0.016 SVM 0.40 ± 0.019 0.37 ± 0.022 0.35 ± 0.019 Naive Bayes 0.33 ± 0.030 0.22 ± 0.023 0.27 ± 0.025 Random forest 0.33 ± 0.014 0.26 ± 0.017 0.29 ± 0.015 SGD 0.30 ± 0.025 0.22 ± 0.029 0.26 ± 0.027 LSTM 0.43 ± 0.028 0.40 ± 0.039 0.39 ± 0.032 CNN 0.41 ± 0.021 0.38 ± 0.011 0.37 ± 0.018 Figure 2: t-SNE Plot of Original & Synthetic Data Figure 3: t-SNE Plot of Original & Synthetic Data by LaTextGAN by Leak GAN In Figure 2 and Figure 3, the distribution of the original data and newly generated data (using LaTextGAN and Leak GAN) can be observed. To plot this graph, the dataset is encoded using BERT and t-SNE algorithm is used for visualization. The t-SNE plot is done using perplexity value 30, number of iterations 1000, and learning rate 200. In this visualization, 10,000 randomly selected data points are used from both the original and the synthetic datasets. To understand the statistical significance of the synthetically generated dataset, paired t-Test is also performed on the same 20,000 data points. Here we considered the null hypothesis as there is no difference in the distribution between the two datasets. The paired t-Test produces a t-value of 1.811, and the respective p-value is 0.072 for the data generated by the Leak GAN based model. The result of the paired t-Test supports the null hypothesis with a p-value of 0.072. Moreover, the new features added by the synthetic data helps in information gain of 0.020 bits. On the other hand, the paired t-Test using the synthetic data from LaTextGAN produces a t-value of -3.37, and the respective p-value is 0.000737. The information gain is 0.008 bits. The information gain and result of t-Test shows that the synthetic data generated by Leak GAN has similar distribution to the original data and gives more information to the entire training dataset than the dataset generated using LaTextGAN. In Table 2, it can be observed that the performance of the Leak GAN based model is better than all other models on both the datasets. The reason behind the better performance of the Leak GAN based classifier model is its synthetic data generation capability. All the analytical results discussed here helps us to draw this conclusion. Figure 4: F1 Score of SVM Classifier on Figure 5: F1 Score of BERT Classifier on gradually increasing dataset gradually increasing dataset To confirm the synthetic data’s effectiveness on the improvement of the F1 score, we per- formed an empirical analysis with SVM and BERT classifier. To arrive at this analysis, a subset of the training dataset (original + synthetic) is used for training, and the test dataset provided by FEVER 1.0 is used for the testing. 10,000 original training data points are randomly selected for this experiment, where the ratio of the classes is kept the same as the entire dataset. Initially, the percentage of the synthetic data is 5% of the training data, and in the next 10 steps, the volume of synthetic data is gradually increased to 50% of the total training data. BERT based classifiers and SVM are applied to this subset of the training dataset, and F1 scores are recorded and plotted in Figure 4 and Figure 5. It can be observed from Figure 4 and Figure 5 that as the percentage of the synthetic dataset is increasing in the subset of the training dataset the F1 score is also increasing. This proves enhancement in effectiveness of classifiers due to addition of synthetic data with the original data. 6. Conclusion This research attempts to employ the effectiveness of the synthetic data generation capability of the GAN. We proposed a GAN based model with PU bagging for fact checking and claim verification. The model is capable of generating synthetic data, which eventually helps the fact checking task. This research also discusses the distribution of the newly generated data and its statistical significance toward information gain and classification results. The entire research is carried out on FEVER 1.0 and FEVER 2.0 datasets, and the result of the proposed model is compared with several standard classifier’s results and previous results. In the future, a similar set of experiments can be carried out on other publicly available standard datasets to test this proposed model’s effectiveness. Acknowledgments Research was supported in part by grants NSF 1838147, NSF DGE 1433817 and ARO W911NF- 20-1-0254. Verma is the founder of Everest Cyber Security and Analytics, Inc. The views and conclusions contained in this document are those of the authors and not of the sponsors. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein. References [1] H. Rashkin, E. Choi, J. Y. Jang, S. Volkova, Y. Choi, Truth of varying shades: Analyzing language in fake news and political fact-checking (2017) 2931–2937. [2] R. Baly, G. Karadzhov, D. Alexandrov, J. Glass, P. Nakov, Predicting factuality of reporting and bias of news media sources, arXiv preprint arXiv:1810.01765 (2018). [3] V. Pérez-Rosas, B. Kleinberg, A. Lefevre, R. Mihalcea, Automatic detection of fake news, arXiv preprint arXiv:1708.07104 (2017). [4] W. Y. Wang, "liar, liar pants on fire": A new benchmark dataset for fake news detection, arXiv preprint arXiv:1705.00648 (2017). [5] K. Popat, S. Mukherjee, J. Strötgen, G. Weikum, Where the truth lies: Explaining the credibility of emerging claims on the web and social media (2017) 1003–1012. [6] R. Pochampally, A. Das Sarma, X. L. Dong, A. Meliou, D. Srivastava, Fusing data with correlations (2014) 433–444. [7] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, Advances in neural information processing systems 27 (2014) 2672–2680. [8] F. Yang, E. Dragut, A. Mukherjee, Claim verification under positive unlabeled learning, IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM) (2020). [9] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018). [10] J. Thorne, A. Vlachos, O. Cocarascu, C. Christodoulopoulos, A. Mittal, The fact extraction and verification (fever) shared task, arXiv preprint arXiv:1811.10971 (2018). [11] J. Thorne, A. Vlachos, Adversarial attacks against fact extraction and verification, arXiv preprint arXiv:1903.05543 (2019). [12] J. Thorne, A. Vlachos, O. Cocarascu, C. Christodoulopoulos, A. Mittal, The fever2. 0 shared task (2019) 1–6. [13] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural computation 9 (1997) 1735–1780. [14] S. Lawrence, C. L. Giles, A. C. Tsoi, A. D. Back, Face recognition: A convolutional neural- network approach, IEEE transactions on neural networks 8 (1997) 98–113. [15] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, G. Monfardini, The graph neural network model, IEEE Transactions on Neural Networks 20 (2008) 61–80. [16] D. D. Lewis, Naive (bayes) at forty: The independence assumption in information retrieval (1998) 4–15. [17] H. Drucker, C. J. Burges, L. Kaufman, A. Smola, V. Vapnik, Support vector regression machines, Advances in neural information processing systems 9 (1996) 155–161. [18] M. Pal, Random forest classifier for remote sensing classification, International journal of remote sensing 26 (2005) 217–222. [19] J. H. Friedman, Stochastic gradient boosting, Computational statistics & data analysis 38 (2002) 367–378. [20] J. Pasternack, D. Roth, Making better informed trust decisions with generalized fact-finding (2011). [21] L. Ge, J. Gao, X. Li, A. Zhang, Multi-source deep learning for information trustworthiness estimation (2013) 766–774. [22] Q. Li, Y. Li, J. Gao, B. Zhao, W. Fan, J. Han, Resolving conflicts in heterogeneous data by truth discovery and source reliability estimation (2014) 1187–1198. [23] M. Wan, X. Chen, L. Kaplan, J. Han, J. Gao, B. Zhao, From truth discovery to trustworthy opinion discovery: An uncertainty-aware quantitative modeling approach (2016) 1885– 1894. [24] J. Thorne, A. Vlachos, C. Christodoulopoulos, A. Mittal, Fever: a large-scale dataset for fact extraction and verification, arXiv preprint arXiv:1803.05355 (2018). [25] J. Guo, S. Lu, H. Cai, W. Zhang, Y. Yu, J. Wang, Long text generation via adversarial training with leaked information, arXiv preprint arXiv:1709.08624 (2017). [26] C. Li, X.-L. Hua, Towards positive unlabeled learning for parallel data mining: a random forest framework (2014) 573–587. [27] A. S. Vezhnevets, S. Osindero, T. Schaul, N. Heess, M. Jaderberg, D. Silver, K. Kavukcuoglu, Feudal networks for hierarchical reinforcement learning, arXiv preprint arXiv:1703.01161 (2017). [28] D. Donahue, A. Rumshisky, Adversarial text generation without reinforcement learning, arXiv preprint arXiv:1810.06640 (2018). [29] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, A. M. Rush, Transformers: State-of-the-art nat- ural language processing (2020) 38–45. URL: https://www.aclweb.org/anthology/2020. emnlp-demos.6. [30] J. Pennington, R. Socher, C. D. Manning, Glove: Global vectors for word representation (2014) 1532–1543. [31] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, S. Chintala, Pytorch: An imperative style, high-performance deep learning library, in: H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, R. Garnett (Eds.), Advances in Neural Information Processing Sys- tems 32, Curran Associates, Inc., 2019, pp. 8024–8035. URL: http://papers.neurips.cc/paper/ 9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf. [32] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830. [33] A. Huang, et al., Similarity measures for text document clustering 4 (2008) 9–56.