=Paper=
{{Paper
|id=Vol-2936/paper-40
|storemode=property
|title=University of Regensburg at CheckThat! 2021: Exploring Text Summarization for Fake News Detection
|pdfUrl=https://ceur-ws.org/Vol-2936/paper-40.pdf
|volume=Vol-2936
|authors=Philipp Hartl,Udo Kruschwitz
|dblpUrl=https://dblp.org/rec/conf/clef/HartlK21
}}
==University of Regensburg at CheckThat! 2021: Exploring Text Summarization for Fake News Detection==
Philipp Hartl, Udo Kruschwitz
University of Regensburg, Universitätsstraße 31, 93053 Regensburg, Germany

Abstract

We present our submission to the CLEF 2021 CheckThat! challenge. More specifically, we took part in Task 3a, Multi-class fake news detection of news articles. The conceptual idea of our work is that (a) transformer-based approaches represent a strong foundation for a broad range of NLP tasks including fake news detection, and that (b) compressing the original input documents into some form of automatically generated summary before classifying them is a promising approach. The official results indicate that this is indeed an interesting direction to explore. They also confirm that oversampling to address the class imbalance was effective in further improving the results. We also note that both abstractive and extractive summarization approaches score considerably better when we do not apply hyperparameter tuning, suggesting that the small scale of the test collection leads to overfitting.

Keywords: Fake News Detection, Text Summarization, Abstractive / Extractive Summarization, CLEF, BERT

1. Introduction

Fake news, misinformation and disinformation are by no means recent phenomena; they have been around since classical antiquity, when manipulated information was used to discredit political opponents or alter the course of battles [1]. What did change over time, though, was the scale and extent of the problem: initially, dissemination happened verbally, but the invention of the printing press marked a major milestone, as easy access to and distribution of information, combined with increasing literacy, enabled more people to consume and create information. The advent of social media with its freedom to publish marks the birth of yet another era altogether [2]. The term fake news has been particularly prevalent in the mainstream media since the 2016 US election, when a large amount of intentionally false news was spread through social media during the campaign [3]. These platforms operate with a non-restrictive content policy by design and provide various means of automation, which eases the spread of mis- and disinformation. Combined with their enormous user bases (e.g. Facebook with 2.8 billion active users in December 2020, see https://investor.fb.com/investor-news/press-release-details/2021/Facebook-Reports-Fourth-Quarter-and-Full-Year-2020-Results/default.aspx), information is able to reach many people in a very short period of time. In an age of information pollution (irrelevant, redundant, unsolicited and low-value information [4]) it is therefore important to (semi-)automatically identify such claims and minimize their harm – in particular as humans appear not to be very skilled at identifying disinformation, with typical recognition rates only slightly better than chance [5].
CheckThat! Lab [6] is an evaluation campaign which is part of the 2021 Cross-Language Evaluation Forum (CLEF) conference and contains three tasks related to fact-checking or fake news detection, with two subtasks each. Our team participated in this year’s Task 3a, whose goal is to create a system that identifies fake news in a multi-class scenario. We built four models based on fine-tuned BERT [7], a highly popular bidirectional transformer architecture, combined with abstractive and extractive summarization technologies [8, 9]. Our best submitted model (abstractive summarization) was ranked 8th among all 25 participating teams in the lab for this task. Post-hoc runs reveal, though, that the same runs without hyperparameter tuning lead to substantially improved results (placing our best run 3rd in the ranked list). In this paper, we describe our participation in Task 3a at CLEF 2021 in detail.

2. Related Work

Traditionally, fake news detection is modelled as a classification problem, but often with varying numbers of classes. While datasets like FakeNewsNet [10], MM-COVID [11] or ReCOVery [12] provide only two labels and hence treat fake news detection as binary classification, there are also several datasets with multiple labels, such as FEVER [13], NELA-GT-2019 [14] or the dataset provided by the organizers of this task (see Section 3). Unfortunately, generating comprehensive datasets still takes a lot of work, as the ground-truth labels often need to be assigned by, e.g., journalists or domain experts.

Fake news detection systems typically adopt one of three general approaches or a combination of them. The most commonly used approach is based on the news content, which can be either linguistic, auditory (e.g., attached voice recordings) or visual (e.g., images or videos) [15]. This is based on the assumption that real and fabricated statements differ in content style and quality [16]. It is therefore possible to successfully differentiate claims solely based on their content, with either hand-engineered features [17] or deep learning methods [18]. However, approaches which only focus on the news content might miss valuable context information. Hence, feedback-based solutions target secondary information such as user engagements [19] and dissemination networks [20]. These approaches are often used in combination with content-based methods to increase performance [21]. While contextual information can be useful when available, it is often not or only partially available (as reflected by common benchmark collections for fake news detection [22, 13]). While both methods discussed above are limited to a snapshot of features present at the time of training, intervention-based methods try to dynamically interpret real-time dissemination data. These are arguably the least common approaches at the moment because they are difficult to evaluate [23]. When used, they try to intervene in the spread of fake news, e.g., by injecting true news into social networks [24] or through user intervention [25, 26].

In this work we use a solely content-based approach, simply because the dataset provided for this challenge has no additional context data. Additionally, gathering such context data was explicitly forbidden, as described in Section 4, so we decided to focus on a text-based solution.

3. Task Description

This year there have been a total of three CheckThat! tasks with two subtasks each [6]. We participated in Task 3a: Multi-class fake news detection of news articles, which is part of Task 3: Fake News Detection. The goal is, “given the text of a news article, determine whether the main claim made in the article is true, partially true, false, or other”. The data used in this task is only available in English. As this task is designed as a four-class classification problem, the official evaluation metric introduced by the organizers is the F1-macro score, i.e. the mean of the class-wise F1 scores:

$$F_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \qquad (1)$$

$$F_1^{\mathrm{macro}} = \frac{1}{N} \sum_{i=1}^{N} F_{1,i} \qquad (2)$$

where N is the number of classes and F_{1,i} is the F1 score of class i.
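As a purely illustrative sketch of how the official metric can be computed (the paper does not state which implementation was used, so scikit-learn and the toy label arrays below are assumptions):

```python
from sklearn.metrics import f1_score

# Hypothetical gold labels and predictions for the four classes
# (0 = true, 1 = false, 2 = partially false, 3 = other).
y_true = [1, 1, 2, 2, 0, 3, 1, 2]
y_pred = [1, 0, 2, 1, 0, 3, 1, 2]

# Equation (2): the unweighted mean of the per-class F1 scores from equation (1).
print(f1_score(y_true, y_pred, average="macro"))
```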
Up to five runs were permitted for each team. We submitted three competitive configurations and one baseline run to compare against our own approaches. Further details on all tasks can be found in the task overview [6].

4. Dataset

As this work is part of this year’s CLEF CheckThat! Lab [6] Task 3a, we used a modified version of the dataset by Shahi [27] provided by the organizers. The dataset has four different classes to predict, as defined in [28]. The distribution of each class in the provided training and test data can be seen in Table 1. The dataset was given in .csv format with four columns:
• public_id — unique identifier of the news article
• title — title/heading of the news article
• text — text content of the news article
• our rating — class of the news article (either false, partially false, true or other)

Table 1: Dataset statistics

Dataset    False   Partially False   True   Other
Training   486     235               153    76
Test       113     141               69     41

The training set contains 950 data points, including the 50 sample data points released before the two batches of data. The provided test set contains 364 data points without labels; we received the ground-truth labels separately after the competition had finished (see Table 1). Each group had to submit a .csv file with their predictions on Codalab (https://competitions.codalab.org/competitions/31238). Additionally, through a data sharing agreement, it was forbidden to identify individuals and the original entries on the fact-checking websites. We therefore refrained from retrieving this information, although it would have been useful for classification purposes, as demonstrated on a similar task [17].

5. Methodology

In the following section we provide an overview of how we prepared the data, the models we used, and the training and evaluation process. Everything has been implemented in Python and is available on GitHub (https://github.com/phHartl/CheckThatLab_2021).

5.1. Data preparation

We started our preprocessing by converting all labels to numeric values: 0 for true, 1 for false, 2 for partially false and 3 for other. As seen in Table 1, the four classes are not equally distributed. We therefore applied random oversampling of all classes except the majority class using the imbalanced-learn package [29], with the aim of training a better classifier. Additionally, we generated abstractive and extractive summaries (we did this offline, as in particular the generation of abstractive summaries was time-consuming). Before feeding the texts into our models we also tokenized and normalized them.
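A minimal sketch of this preparation step, assuming pandas and imbalanced-learn are used as described above; the file name is hypothetical, while the column names follow the dataset description in Section 4:

```python
import pandas as pd
from imblearn.over_sampling import RandomOverSampler

label_map = {"true": 0, "false": 1, "partially false": 2, "other": 3}

train = pd.read_csv("task3a_train.csv")  # hypothetical file name
train["label"] = train["our rating"].str.lower().map(label_map)

# RandomOverSampler duplicates minority-class samples until every class
# matches the majority class. It expects a 2-D feature matrix, so we
# resample row indices and use them to duplicate the corresponding articles.
ros = RandomOverSampler(random_state=42)
resampled_idx, _ = ros.fit_resample(train.index.to_numpy().reshape(-1, 1), train["label"])
train_balanced = train.iloc[resampled_idx.ravel()].reset_index(drop=True)
```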
5.2. Model architecture

All models used are fine-tuned variants of Google’s BERT [7] and use the bert-base-uncased implementation provided by Wolf et al. [30] in conjunction with a linear layer on top to predict the output. We chose BERT because it has already shown good performance in various text classification tasks [31] as well as in fake news detection [32]. Due to limited computational resources we could not use a more sophisticated BERT-style model like RoBERTa [33].

One of the main drawbacks of BERT-based models is the maximum sequence length each model is able to process, which is 512 tokens (word pieces) for BERT. Unfortunately, fake news articles are often longer than this limit [34]. In the provided dataset the mean token length is 806, with at least 55% of the texts exceeding the 512-token limit. As these values are calculated with NLTK [35] and word pieces do not exactly match tokens, the real ratio is even higher (all other token counts reported are calculated in the same way). By default, BERT-based models simply truncate the text to the desired input length (or apply padding if it is too short). This leads to the loss of potentially important information in the input text. To circumvent this issue we propose three different solutions, all aimed at compressing the original text:
• Modified hierarchical transformer representation
• Extractive summarization
• Abstractive summarization

Hierarchical transformer representations have been introduced by Pappagari et al. [36]. In their work they suggest splitting the input text into smaller text segments with overlapping parts (stride) to represent the structure of the text. In our model we split the text into parts of 500 tokens with a stride length of 50. After getting the BERT embeddings for each text segment we calculated the dimension-wise mean representation and fed this into BERT; the output of BERT is then used to classify the input text. Mean embeddings have been used successfully before by Mulyar et al. [37].
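The segmentation and pooling step can be sketched as follows. This is an illustrative reading of the scheme described above (500-token segments with a stride of 50, dimension-wise mean over segment embeddings), not the authors' exact code, and it only produces the pooled representation that is subsequently passed on for classification:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def mean_segment_embedding(text: str, seg_len: int = 500, stride: int = 50) -> torch.Tensor:
    # Tokenize into overlapping segments: consecutive segments share
    # `stride` word pieces, and each segment holds at most `seg_len` of them.
    enc = tokenizer(
        text,
        max_length=seg_len,
        stride=stride,
        truncation=True,
        padding="max_length",
        return_overflowing_tokens=True,
        return_tensors="pt",
    )
    with torch.no_grad():
        out = encoder(input_ids=enc["input_ids"], attention_mask=enc["attention_mask"])
    # Take the [CLS] embedding of every segment and average dimension-wise.
    return out.last_hidden_state[:, 0, :].mean(dim=0)
```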
Another possible solution is to use automatic summarization to obtain a more condensed text representation. Deep learning models such as BART [8], XLNet [38] or ALBERT [39] perform exceptionally well on benchmarks like SQuAD [40] or ELI5 [41], sometimes even surpassing humans. These models are able to reduce the text length by a significant amount if desired, which is ideal for the initial problem with BERT. In our work we use the extractive summarization method implemented by Miller [42]. Note that while this method is also based on BERT, it has no maximum sequence length. To ensure better summarization quality while keeping the running time reasonable, we activated co-reference handling (for better contextualization) and used distilBERT [43] as the underlying model. In contrast to [9], we are interested in long sequences and not only the first two sentences for classification. After manually inspecting different configurations we settled on a summarization ratio of 0.40.

Apart from the extractive approach we also implemented an abstractive technique based on BART. This model is particularly well suited for text generation, outperforming similar models on benchmarks such as SQuAD 1.1 [8]. The Huggingface transformers library [30] provides an easy way to use BART models for sequence generation. Because of the repetitive nature of greedy and beam search [44, 45], we used Top-K [46] and Top-p sampling [47] for our summaries. The exact model we used is sshleifer/distilbart-cnn-12-6 (https://huggingface.co/sshleifer/distilbart-cnn-12-6), a smaller BART model trained on a news summarization dataset by Hermann et al. [48]. In our final configuration we sampled from the 100 (Top-K) most likely words with a cumulative probability (Top-p) of 95%. Like BERT, BART also has a maximum sequence length, in this case 1024 tokens. Therefore, if the input text was longer than 1000 tokens we used our first approach to ensure that all parts of the text are taken into consideration when it is summarized. We also aimed for a summarization ratio of roughly 40% for better comparability with the extractive approach. However, as both approaches are not deterministic, this cannot always be guaranteed (also, as noted, both approaches take quite a while to execute, so we saved the results to files once generated). Additionally, due to the late release of the dataset we could not try out many configurations, but instead had to rely on suggested configurations.
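A sketch of how such a summary can be generated with the transformers library, using the model and sampling configuration named above (Top-K of 100, Top-p of 0.95); the output length and truncation settings are illustrative assumptions, not the authors' exact configuration:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "sshleifer/distilbart-cnn-12-6"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def abstractive_summary(text: str, max_length: int = 300) -> str:
    # BART accepts at most 1024 input tokens; the paper compresses longer
    # articles with the hierarchical approach first, here we simply truncate.
    inputs = tokenizer(text, truncation=True, max_length=1024, return_tensors="pt")
    summary_ids = model.generate(
        **inputs,
        do_sample=True,   # sample instead of greedy/beam search
        top_k=100,        # keep only the 100 most likely tokens per step
        top_p=0.95,       # nucleus sampling with 95% cumulative probability
        max_length=max_length,
    )
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)
```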
Finally, the submitted models all use the hierarchical text representation (even when using text summaries). There is one model for each type of input text, i.e. no summary, extractive summary or abstractive summary. We also submitted a run without oversampling for better comparability.

5.3. Experimental setup

For training, we represented each input as [CLS] + title + [SEP] + text, where text is either the original text or one of the two summaries produced, [CLS] is the classification token and [SEP] is the token separating the two sequences. We use an 80/20 training/validation split and optimize hyperparameters based on the loss on the validation set, with the same initial random state and split for all configurations to allow for a fair comparison. We used a batch size of 8, an initial learning rate of 5e-5, a weight decay of 0.01 with 500 warm-up steps, and three training epochs with the AdamW optimizer [49]. Everything was trained on a single RTX 2080 Ti with 11 GB VRAM using the Huggingface library.
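A condensed sketch of this setup; the paper only states that the Huggingface library was used, so the Trainer API, the dataset objects and the output directory below are assumptions:

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=4)

def encode(title: str, text: str):
    # Encoding the pair yields [CLS] title [SEP] text ..., truncated to 512 word pieces.
    return tokenizer(title, text, truncation=True, max_length=512, padding="max_length")

args = TrainingArguments(
    output_dir="checkthat_task3a",     # illustrative output path
    per_device_train_batch_size=8,
    learning_rate=5e-5,
    weight_decay=0.01,
    warmup_steps=500,
    num_train_epochs=3,                # the Trainer uses AdamW by default
)
# train_ds / dev_ds stand for the (hypothetical) encoded 80/20 split:
# trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=dev_ds)
# trainer.train()
```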
6. Results

We report three sets of results: (a) the official results for all four of our runs and, for comparison, results obtained on (b) the development set as well as (c) the test set without hyperparameter tuning (not submitted to the challenge).

First of all, in Table 2 we present the official results as returned to us by the shared task organizers. The best value for each metric is marked with an asterisk (*). Recall that the hierarchical transformer representation is applied to the source text in all of our runs, i.e. the term "original texts" refers to text that has been created this way but without subsequently applying abstractive or extractive summarization.

Table 2: Official results

Model                             Accuracy   Precision   Recall   F1-macro
BERT w/o oversampling             0.387      0.636*      0.300    0.25570
BERT w/ original texts            0.432      0.409       0.402*   0.40413
BERT w/ extractive summaries      0.370      0.549       0.362    0.32986
BERT w/ abstractive summaries     0.438*     0.476       0.385    0.40415

Notes: Because of the extremely close values in Table 2 we report additional fractional digits for F1-macro. The value for BERT w/ extractive summaries has been calculated with the official evaluation script afterwards, as there was a problem when uploading the file.

To contextualise the official results better (and also because at this point we do not have official baseline results to compare against), we also report the results on the validation set (see Table 3). The configuration is the same as described in Section 5.3 but without hyperparameter tuning (using an 80/20 split of the training data).

Table 3: Performance on the validation/dev set without hyperparameter tuning

Model                             Accuracy   Precision   Recall   F1-macro
BERT w/o fine-tuning              0.421      0.379       0.370    0.356
BERT w/o oversampling             0.584      0.525       0.371    0.329
BERT w/ original texts            0.511      0.378       0.379    0.369
BERT w/ extractive summaries      0.568      0.498       0.463    0.459
BERT w/ abstractive summaries     0.542      0.362       0.397    0.376

Table 4 follows the same configuration, but has been calculated once the test set was available and does not use hyperparameter tuning either (using all training data and evaluating on the test data).

Table 4: Performance on the test set without hyperparameter tuning

Model                             Accuracy   Precision   Recall   F1-macro
BERT w/o fine-tuning              0.251      0.328       0.315    0.251
BERT w/o oversampling             0.379      0.419       0.355    0.333
BERT w/ original texts            0.472      0.487       0.481    0.465
BERT w/ extractive summaries      0.531      0.525       0.523    0.508
BERT w/ abstractive summaries     0.489      0.509       0.450    0.459

7. Discussion

First of all, we observe that the non-fine-tuned model and the model trained without oversampling the minority classes perform worst in all setups. This is in line with expectations. It gets more complicated, however, when comparing the other models. The official runs suggest that BERT w/ abstractive summaries wins overall by a tiny margin, but is essentially on par with BERT w/ original texts (i.e. the original articles hierarchically transformed but without applying summarization). Given that this placed us 8th of 25 submissions, and the fact that abstractive summarization is becoming more and more competitive, we see this as a clear signal that our general conceptual idea is a promising one. Looking at the official results for BERT w/ extractive summaries and BERT w/o oversampling, both models are still reasonably well placed in the rankings: they would have ranked 16th and 18th, respectively, showing how well a vanilla BERT is pre-trained already.

Looking beyond the official results, we observe some wide variation of scores, though. While BERT w/ extractive summaries performs better than the other approaches when not using hyperparameter tuning (see Table 4), it scores considerably worse when hyperparameter tuning is in place (Table 2). In fact, not applying hyperparameter tuning would rank the system in 3rd position in the ranked list of 25 runs, with an F1-macro of 0.508. This seems to be an indication of overfitting. The validation set in general seems to be not well suited to learn from, as all tuned models perform better when applied to the test dataset directly (this is also the case when the training set is exactly the same). All this raises some concerns about the size, robustness and generalisability of the test collection. This is by no means a novel finding, and some researchers go as far as to call the current (commonly applied) NLP evaluation approach broken [50]. We conclude that we will have to test our methodology on a wide range of additional collections to gain a better understanding of its strengths and weaknesses. One last point to note: there seems to be little difference in performance between BERT w/ original texts and BERT w/ abstractive summaries. Interestingly, the respective models achieve very similar performance independently of the dataset and experimental setup used.

7.1. Limitations

Due to the nature of such challenges there was not much time to try different experimental setups. Abstractive summary generation in particular has a lot of different parameters to work with; unfortunately, a single iteration takes about half a day of computing time on our system. While we always tried to use the recommended configurations when possible, we could only use BERT with a maximum batch size of 8. It would have been interesting to see whether batch sizes of 16 or greater make a significant difference in performance; previous work on parameter tuning of BERT suggests this [51]. While BERT itself is a very capable model, an approach using an even stronger model like RoBERTa [33] or XLNet [38] could outperform it, as demonstrated in their respective papers on other NLP tasks. The substantial difference in performance between the official results (Table 2) and our reruns on the test set (Table 4) indicates that the chosen experimental setup might not have been ideal for this task, or that the datasets were simply too small. While hyperparameter tuning is often useful, in this case we achieve better results without it. However, this could also be due to the validation/dev set we used: as seen in Table 3, all models perform worse there than on the actual test set, which indicates an unfortunate seed for the validation split we optimized on. Also, the summarization ratio of 0.40 was picked quite arbitrarily, which might or might not restrict the full potential of the summaries.

7.2. Future Work

In future work it would certainly be interesting to explore more configurations and applications of automatic summarization. We believe summarization has the potential to enable better transferable knowledge. This could be useful for a variety of classification tasks, as many models often only work well in a certain domain. It would therefore be interesting to train models on automatic summaries and compare their performance across different domains, with summarization working as a kind of "normalization" technique. We expect summarization of texts to limit overfitting in the future. With the results of Table 4 in mind, we hypothesize that there is still a lot of room for improvement. We plan to apply our approaches to more datasets in the future and to optimize the tuning further.

8. Conclusions

We presented an approach to fake news detection that is based on the powerful paradigm of transformer-based embeddings and utilises text summarization as the main text transformation step before classifying a document. The results suggest that this is indeed a worthwhile direction of work, and in future work we plan to explore it further. We note that using oversampling has a strong positive effect on system performance. We also observed that the performance obtained on different datasets, and with and without hyperparameter tuning, varied substantially. One way forward is to apply our framework to larger datasets to see how robust extractive and abstractive summarisation are for the task at hand.

Acknowledgements

This work was supported by the project COURAGE: A Social Media Companion Safeguarding and Educating Students, funded by the Volkswagen Foundation, grant number 95564.

References

[1] J. M. Burkhardt, History of Fake News, Library Technology Reports 53 (2017) 5–9.
[2] R. Baeza-Yates, B. Ribeiro-Neto (Eds.), Modern Information Retrieval, 2nd ed., Addison-Wesley, 2010.
[3] H. Allcott, M. Gentzkow, Social media and fake news in the 2016 election, Journal of Economic Perspectives 31 (2017) 211–36.
[4] L. Orman, Fighting Information Pollution with Decision Support Systems, Journal of Management Information Systems 1 (1984) 64–71. URL: https://doi.org/10.1080/07421222.1984.11517704. doi:10.1080/07421222.1984.11517704. Publisher: Routledge.
[5] V. L. Rubin, On deception and deception detection: Content analysis of computer-mediated stated beliefs, Proceedings of the American Society for Information Science and Technology 47 (2010) 1–10. Publisher: Wiley Online Library.
[6] P. Nakov, G. D. S. Martino, T. Elsayed, A. Barrón-Cedeño, R. Míguez, S. Shaar, F. Alam, F. Haouari, M. Hasanain, N. Babulkov, A. Nikolov, G. K. Shahi, J. M. Struß, T. Mandl, The CLEF-2021 CheckThat! lab on detecting check-worthy claims, previously fact-checked claims, and fake news, in: D. Hiemstra, M. Moens, J. Mothe, R. Perego, M. Potthast, F. Sebastiani (Eds.), Advances in Information Retrieval - 43rd European Conference on IR Research, ECIR 2021, Virtual Event, March 28 - April 1, 2021, Proceedings, Part II, volume 12657 of Lecture Notes in Computer Science, Springer, 2021, pp. 639–649. URL: https://doi.org/10.1007/978-3-030-72240-1_75. doi:10.1007/978-3-030-72240-1_75.
[7] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://www.aclweb.org/anthology/N19-1423. doi:10.18653/v1/N19-1423.
[8] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, L. Zettlemoyer, BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 7871–7880. URL: https://www.aclweb.org/anthology/2020.acl-main.703. doi:10.18653/v1/2020.acl-main.703.
[9] Q. Li, W. Zhou, Connecting the Dots Between Fact Verification and Fake News Detection, in: Proceedings of the 28th International Conference on Computational Linguistics, International Committee on Computational Linguistics, Barcelona, Spain (Online), 2020, pp. 1820–1825. URL: https://www.aclweb.org/anthology/2020.coling-main.165. doi:10.18653/v1/2020.coling-main.165.
[10] K. Shu, D. Mahudeswaran, S. Wang, D. Lee, H. Liu, FakeNewsNet: A Data Repository with News Content, Social Context, and Spatiotemporal Information for Studying Fake News on Social Media, Big Data 8 (2020) 171–188. URL: https://www.liebertpub.com/doi/abs/10.1089/big.2020.0062. doi:10.1089/big.2020.0062. Publisher: Mary Ann Liebert, Inc.
[11] Y. Li, B. Jiang, K. Shu, H. Liu, MM-COVID: A Multilingual and Multidimensional Data Repository for Combating COVID-19 Fake News, arXiv:2011.04088 [cs] (2020). URL: http://arxiv.org/abs/2011.04088.
[12] X. Zhou, A. Mulay, E. Ferrara, R. Zafarani, ReCOVery: A Multimodal Repository for COVID-19 News Credibility Research, in: Proceedings of the 29th ACM International Conference on Information & Knowledge Management, Association for Computing Machinery, New York, NY, USA, 2020, pp. 3205–3212. URL: https://doi.org/10.1145/3340531.3412880.
[13] J. Thorne, A. Vlachos, C. Christodoulopoulos, A. Mittal, FEVER: a Large-scale Dataset for Fact Extraction and VERification, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), Association for Computational Linguistics, New Orleans, Louisiana, 2018, pp. 809–819. URL: https://www.aclweb.org/anthology/N18-1074. doi:10.18653/v1/N18-1074.
[14] M. Gruppi, B. D. Horne, S. Adalı, NELA-GT-2019: A Large Multi-Labelled News Dataset for The Study of Misinformation in News Articles, arXiv:2003.08444 [cs] (2020). URL: http://arxiv.org/abs/2003.08444.
[15] X. Zhou, J. Wu, R. Zafarani, SAFE: Similarity-Aware Multi-modal Fake News Detection, in: H. W. Lauw, R. C.-W. Wong, A. Ntoulas, E.-P. Lim, S.-K. Ng, S. J. Pan (Eds.), Advances in Knowledge Discovery and Data Mining, Lecture Notes in Computer Science, Springer International Publishing, Cham, 2020, pp. 354–367. doi:10.1007/978-3-030-47436-2_27.
[16] U. Undeutsch, Beurteilung der Glaubhaftigkeit von Aussagen, Handbuch der Psychologie 11 (1967) 26–181.
[17] C. Yuan, Q. Ma, W. Zhou, J. Han, S. Hu, Early Detection of Fake News by Utilizing the Credibility of News, Publishers, and Users based on Weakly Supervised Learning, in: Proceedings of the 28th International Conference on Computational Linguistics, International Committee on Computational Linguistics, Barcelona, Spain (Online), 2020, pp. 5444–5454. URL: https://www.aclweb.org/anthology/2020.coling-main.475. doi:10.18653/v1/2020.coling-main.475.
[18] L. Cui, S. Wang, D. Lee, SAME: sentiment-aware multi-modal embedding for detecting fake news, in: Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM ’19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 41–48. URL: https://doi.org/10.1145/3341161.3342894. doi:10.1145/3341161.3342894.
[19] K. Shu, X. Zhou, S. Wang, R. Zafarani, H. Liu, The role of user profiles for fake news detection, in: Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM ’19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 436–439. URL: https://doi.org/10.1145/3341161.3342927. doi:10.1145/3341161.3342927.
[20] K. Shu, D. Mahudeswaran, S. Wang, H. Liu, Hierarchical Propagation Networks for Fake News Detection: Investigation and Exploitation, Proceedings of the International AAAI Conference on Web and Social Media 14 (2020) 626–637. URL: https://ojs.aaai.org/index.php/ICWSM/article/view/7329.
[21] K. Shu, L. Cui, S. Wang, D. Lee, H. Liu, dEFEND: Explainable Fake News Detection, in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, ACM, Anchorage, AK, USA, 2019, pp. 395–405. URL: https://dl.acm.org/doi/10.1145/3292500.3330935. doi:10.1145/3292500.3330935.
[22] W. Y. Wang, “Liar, Liar Pants on Fire”: A New Benchmark Dataset for Fake News Detection, in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Association for Computational Linguistics, Vancouver, Canada, 2017, pp. 422–426. URL: https://www.aclweb.org/anthology/P17-2067. doi:10.18653/v1/P17-2067.
[23] K. Sharma, F. Qian, H. Jiang, N. Ruchansky, M. Zhang, Y. Liu, Combating Fake News: A Survey on Identification and Mitigation Techniques, arXiv:1901.06437 [cs, stat] (2019). URL: http://arxiv.org/abs/1901.06437.
[24] M. Farajtabar, J. Yang, X. Ye, H. Xu, R. Trivedi, E. Khalil, S. Li, L. Song, H. Zha, Fake News Mitigation via Point Process Based Intervention, arXiv:1703.07823 [cs] (2017). URL: http://arxiv.org/abs/1703.07823.
[25] Y. Papanastasiou, Fake News Propagation and Detection: A Sequential Model, Management Science 66 (2020) 1826–1846. URL: https://pubsonline.informs.org/doi/10.1287/mnsc.2019.3295. doi:10.1287/mnsc.2019.3295. Publisher: INFORMS.
[26] J. Kim, B. Tabibian, A. Oh, B. Schölkopf, M. Gomez-Rodriguez, Leveraging the Crowd to Detect and Reduce the Spread of Fake News and Misinformation, in: Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, WSDM ’18, Association for Computing Machinery, New York, NY, USA, 2018, pp. 324–332. URL: https://doi.org/10.1145/3159652.3159734. doi:10.1145/3159652.3159734.
[27] G. K. Shahi, Amused: An annotation framework of multi-modal social media data, arXiv preprint arXiv:2010.00502 (2020).
[28] G. K. Shahi, A. Dirkson, T. A. Majchrzak, An exploratory study of COVID-19 misinformation on Twitter, Online Social Networks and Media 22 (2021) 100104.
[29] G. Lemaître, F. Nogueira, C. K. Aridas, Imbalanced-learn: A Python toolbox to tackle the curse of imbalanced datasets in machine learning, The Journal of Machine Learning Research 18 (2017) 559–563. Publisher: JMLR.org.
[30] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, HuggingFace’s Transformers: State-of-the-art natural language processing, arXiv preprint arXiv:1910.03771 (2019).
[31] M. Ostendorff, P. Bourgonje, M. Berger, J. Moreno-Schneider, G. Rehm, B. Gipp, Enriching BERT with Knowledge Graph Embeddings for Document Classification, arXiv:1909.08402 [cs] (2019). URL: http://arxiv.org/abs/1909.08402.
[32] J. Ding, Y. Hu, H. Chang, BERT-Based Mental Model, a Better Fake News Detector, in: Proceedings of the 2020 6th International Conference on Computing and Artificial Intelligence, Association for Computing Machinery, New York, NY, USA, 2020, pp. 396–400. URL: https://doi.org/10.1145/3404555.3404607.
[33] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint arXiv:1907.11692 (2019).
[34] W. Souma, I. Vodenska, H. Aoyama, Enhanced news sentiment analysis using deep learning methods, Journal of Computational Social Science 2 (2019) 33–46. Publisher: Springer.
[35] E. Loper, S. Bird, NLTK: The natural language toolkit, arXiv preprint cs/0205028 (2002).
[36] R. Pappagari, P. Żelasko, J. Villalba, Y. Carmiel, N. Dehak, Hierarchical Transformers for Long Document Classification, arXiv:1910.10781 [cs, stat] (2019). URL: http://arxiv.org/abs/1910.10781.
[37] A. Mulyar, E. Schumacher, M. Rouhizadeh, M. Dredze, Phenotyping of Clinical Notes with Improved Document Classification Models Using Contextualized Neural Language Models, arXiv:1910.13664 [cs] (2020). URL: http://arxiv.org/abs/1910.13664.
[38] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, Q. V. Le, XLNet: Generalized autoregressive pretraining for language understanding, arXiv preprint arXiv:1906.08237 (2019).
[39] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, R. Soricut, ALBERT: A lite BERT for self-supervised learning of language representations, arXiv preprint arXiv:1909.11942 (2019).
[40] P. Rajpurkar, J. Zhang, K. Lopyrev, P. Liang, SQuAD: 100,000+ questions for machine comprehension of text, arXiv preprint arXiv:1606.05250 (2016).
[41] A. Fan, Y. Jernite, E. Perez, D. Grangier, J. Weston, M. Auli, ELI5: Long form question answering, arXiv preprint arXiv:1907.09190 (2019).
[42] D. Miller, Leveraging BERT for extractive text summarization on lectures, arXiv preprint arXiv:1906.04165 (2019).
[43] V. Sanh, L. Debut, J. Chaumond, T. Wolf, DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, arXiv:1910.01108 [cs] (2020). URL: http://arxiv.org/abs/1910.01108.
[44] A. K. Vijayakumar, M. Cogswell, R. R. Selvaraju, Q. Sun, S. Lee, D. Crandall, D. Batra, Diverse beam search: Decoding diverse solutions from neural sequence models, arXiv preprint arXiv:1610.02424 (2016).
[45] L. Shao, S. Gouws, D. Britz, A. Goldie, B. Strope, R. Kurzweil, Generating high-quality and informative conversation responses with sequence-to-sequence models, arXiv preprint arXiv:1701.03185 (2017).
[46] A. Fan, M. Lewis, Y. Dauphin, Hierarchical neural story generation, arXiv preprint arXiv:1805.04833 (2018).
[47] A. Holtzman, J. Buys, L. Du, M. Forbes, Y. Choi, The curious case of neural text degeneration, arXiv preprint arXiv:1904.09751 (2019).
[48] K. M. Hermann, T. Kočiský, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, P. Blunsom, Teaching machines to read and comprehend, arXiv preprint arXiv:1506.03340 (2015).
[49] I. Loshchilov, F. Hutter, Decoupled weight decay regularization, arXiv preprint arXiv:1711.05101 (2017).
[50] S. R. Bowman, G. E. Dahl, What will it take to fix benchmarking in natural language understanding?, in: K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tür, I. Beltagy, S. Bethard, R. Cotterell, T. Chakraborty, Y. Zhou (Eds.), Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, Association for Computational Linguistics, 2021, pp. 4843–4855. URL: https://www.aclweb.org/anthology/2021.naacl-main.385/.
[51] M. Guderlei, M. Aßenmacher, Evaluating Unsupervised Representation Learning for Detecting Stances of Fake News, in: Proceedings of the 28th International Conference on Computational Linguistics, International Committee on Computational Linguistics, Barcelona, Spain (Online), 2020, pp. 6339–6349. URL: https://www.aclweb.org/anthology/2020.coling-main.558. doi:10.18653/v1/2020.coling-main.558.