=Paper=
{{Paper
|id=Vol-3180/paper-60
|storemode=property
|title=ur-iw-hnt at CheckThat! 2022: Cross-lingual Text Summarization for Fake News Detection
|pdfUrl=https://ceur-ws.org/Vol-3180/paper-60.pdf
|volume=Vol-3180
|authors=Hoai Nam Tran,Udo Kruschwitz
|dblpUrl=https://dblp.org/rec/conf/clef/TranK22
}}
==ur-iw-hnt at CheckThat! 2022: Cross-lingual Text Summarization for Fake News Detection==
Hoai Nam Tran, Udo Kruschwitz
Information Science, University of Regensburg, Germany
Hoai-Nam.Tran@student.ur.de (H. N. Tran), Udo.Kruschwitz@ur.de (U. Kruschwitz, ORCID 0000-0002-5503-0341)

CLEF 2022: Conference and Labs of the Evaluation Forum, September 5–8, 2022, Bologna, Italy

Abstract

We describe our submission to the CLEF CheckThat! 2022 challenge. We contributed to Tasks 3A and 3B – multiclass fake news classification in English and German, respectively. Our approach incorporates extractive and abstractive summarization techniques by utilizing fine-tuned DistilBART and T5-3B. For cross-linguality, we use automatic machine translation to improve model inference. Our submitted run for Task 3B was the official winner according to both F1 and accuracy, with a fair margin over the second-placed team. For Task 3A, we describe the wide range of models we experimented with. While only one submission per team was permitted, we also describe the non-submitted setup that tops the leaderboard performance on this task.

Keywords: Fake news detection, BART, T5, extractive summarization, abstractive summarization, translation

1. Introduction

The distribution of fake news is not a new problem, but due to its scale it has become an urgent social and political issue [1]. Here we understand fake news to be intentionally and verifiably false information created with the purpose of deceiving its reader [2]. Task 3 of the CLEF 2022 CheckThat! Shared Task [3] focuses on multiclass fake news classification with English (3A) and German (3B) test sets, reusing the previous year's dataset for training. This is the task to which we contributed.

The critical motivations for our work are as follows. Transformer-based models have become the basis for most state-of-the-art NLP applications, including a wide range of classification tasks, e.g., [4]. However, transformer models impose restrictions on the input size, which is why we are inspired by the finding that automatic summarization, as a step towards cutting down long documents, has been demonstrated to help identify fake news [5]. Finally, there are indications that automatic machine translation can help improve text classification, e.g., in our own work [6].

In this paper, we use transformer models for summarization and multiclass classification. Additionally, we use automatic machine translation for the German subtask to improve model inference. We conducted several experiments, but only one submission was allowed; in the official leaderboard we ended up being ranked 1st in Task 3B and 9th in Task 3A. Here we also discuss our non-submitted approaches, for which post-competition results demonstrate that they would have been ranked 1st in both 3A and 3B with a substantial margin over the leaderboard's best performers. This paper describes our experiments in more detail. To encourage reproducibility of experimental work, we make all models available via Hugging Face (https://huggingface.co/hntran/CLEF_2022_CheckThatLab_Task3) and additional data via GitHub (https://github.com/HN-Tran/CLEF_2022_CheckThatLab_Task3).

2. Related Work

We briefly sketch related work here and focus on what directly inspired us for this paper.
Fake news detection is an unresolved task for which Transformer models with self-attention [7] such as BERT [4], BART [8], and T5 [9] are actively utilized. Since fake news can appear in any language, multilingual models like XLM-RoBERTa [10] apply their cross-lingual abilities to several tasks and benchmarks. One such dataset is FakeCovid, consisting of fact-checked articles from 92 fact-checking websites [11]. Shahi et al. [12] conducted an exploratory study of COVID-19 misinformation on Twitter to define the four classes used in the dataset of the current Shared Task [3] and of the previous edition [13]. To collect high-quality data, Shahi [14] proposes a semi-automatic framework in which both machines and humans are involved in order to mitigate the annotation workload. In last year's Shared Task, Hartl and Kruschwitz used the same DistilBART model we adopt here for summarization (though we use it for extractive rather than abstractive summarization) [15]. In later work, they refined this approach to achieve state-of-the-art performance for the task of fake news detection on a common reference benchmark collection [5].

3. Dataset

The dataset has been annotated with four labels: "true", "false", "partially false", and "other". As indicated in Table 1, the label distribution of the released dataset is rather imbalanced. The training set is the same as in last year's Task 3A [13]. The difference is the newly released test sets, consisting of 612 English data points for Task 3A and 586 German data points for Task 3B. Both test sets contain substantially more "true" labels than "partially false" labels, while the training and development sets contain more "partially false" than "true" labels. As shown in Table 2, the dataset contains some very long texts in both the title and text fields. One challenge of this Shared Task is therefore to deal with this length, especially when it goes beyond the standard token limits of typical transformer models.

Table 1: Dataset statistics

| Labels          | Training Set | Development Set | Test Set 3A | Test Set 3B |
|-----------------|--------------|-----------------|-------------|-------------|
| True            | 142          | 69              | 210         | 243         |
| False           | 465          | 113             | 315         | 191         |
| Partially False | 217          | 141             | 56          | 97          |
| Other           | 76           | 41              | 31          | 55          |
| All             | 900          | 364             | 612         | 586         |

Table 2: Token length distribution

| Variable | Statistic | Training Set | Development Set | Test Set 3A | Test Set 3B |
|----------|-----------|--------------|-----------------|-------------|-------------|
| Title    | Median    | 70           | 66              | 73          | 67          |
| Title    | Mean      | 286          | 171             | 78          | 71          |
| Title    | Minimum   | 3            | 3               | 11          | 3           |
| Title    | Maximum   | 9960         | 8092            | 200         | 234         |
| Text     | Median    | 3035         | 3115            | 3655        | 4009        |
| Text     | Mean      | 4167         | 4498            | 6052        | 5617        |
| Text     | Minimum   | 18           | 25              | 289         | 507         |
| Text     | Maximum   | 32767        | 44359           | 100000      | 45309       |

4. Methodology

4.1. Summarization

For the summarization task, we use two particular models for two different approaches: DistilBART-CNN-12-6 (https://huggingface.co/sshleifer/distilbart-cnn-12-6) for extractive summarization and T5-3B (https://huggingface.co/t5-3b) for abstractive summarization. Extraction-based summarization selects the most representative words and sentences of the given document or text input. In contrast, abstraction-based summarization generates shorter text, possibly consisting of new sentences, that captures the main notion of the source text. In the library we use [16] (see Section 4.4), the output embeddings of the chosen model are extracted during inference and clustered with k-means [17]. The elbow method is used to determine the optimal value of k [18]. While it is also possible to restrict the amount of output text to a fixed number of sentences or a fixed ratio, we prefer the optimal cluster of sentences in order to avoid losing any potentially relevant sentences.
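The following sketch illustrates the general idea of embedding-based extractive summarization with k-means and the elbow criterion. It is not the code used in our experiments: TF-IDF vectors stand in for the DistilBART output embeddings produced by the Bert Extractive Summarizer library, and the function names are illustrative.

```python
# Hedged sketch of extraction-based summarization via embedding clustering and
# a simple elbow heuristic. TF-IDF vectors stand in for transformer embeddings.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def pick_k_by_elbow(X: np.ndarray, k_max: int = 10) -> int:
    """Choose k where the inertia curve bends most sharply (elbow heuristic)."""
    k_max = min(k_max, len(X))
    inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
                for k in range(1, k_max + 1)]
    if len(inertias) < 3:
        return 1
    curvature = np.diff(inertias, 2)   # second differences of the inertia curve
    return int(np.argmax(curvature)) + 2  # +2 maps the index back to a k value

def extractive_summary(sentences: list[str]) -> list[str]:
    # Embed the sentences (stand-in embedder), cluster them, and keep the
    # sentence closest to each centroid, preserving the original order.
    X = TfidfVectorizer().fit_transform(sentences).toarray()
    k = pick_k_by_elbow(X)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    picked = {int(np.argmin(np.linalg.norm(X - c, axis=1)))
              for c in km.cluster_centers_}
    return [sentences[i] for i in sorted(picked)]

doc = ["Sentence one about the claim.", "A second, related sentence.",
       "An unrelated aside.", "A closing sentence repeating the claim."]
print(" ".join(extractive_summary(doc)))
```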
Our chosen extractive model, DistilBART-CNN-12-6, is based on BART [8] with distillation [19] and is fine-tuned on the CNN/DailyMail dataset [20]. In contrast to extraction, we use the three-billion-parameter version of T5 [9] to generate shorter sentences, using the same task prefix ("summarize:") that was used in the pre-training process for the CNN/DailyMail dataset [20]. While the default input token length limit is 512, the model uses relative positional embeddings, which allows it to process much longer text at the cost of higher computing resources such as memory consumption [21, 9]. We also apply some optional, light pre-processing steps to retain token information that would otherwise be lost and to provide cleaner input for further processing. To address the token length problem mentioned in Section 3, we combine the title with the text for summarization in the following order: "title" + "." + "text". We refer the interested reader to the project repository for further details.

4.2. Multi-Class Classification

The dataset has four different classes (see Table 1) with an imbalanced distribution. To classify them, we use BERT-Base (uncased) [4] as our baseline model and include three large models: BERT-Large (uncased) [4], XLM-RoBERTa-Large [10], and T5-3B [9]. The default fine-tuning process consists of tokenization, splitting the data into training, development, and dev-test sets, and the actual training. For the cross-lingual task, the automatic machine translation step comes first; inference on the test set is the final step, both for the submission and, after the ground-truth labels were released, for assessing how well our fine-tuned models perform. The main metric is macro-F1 since the dataset is imbalanced; specifically, the macro-averaged F1 score in Equation 1 is calculated [22], where P_x and R_x denote precision and recall for class x and n is the number of classes:

\mathcal{F}_1 = \frac{1}{n} \sum_{x} \mathrm{F1}_x = \frac{1}{n} \sum_{x} \frac{2 P_x R_x}{P_x + R_x} \qquad (1)

4.3. Machine Translation

For the cross-lingual task (3B), we use the free Google Translate service to translate the whole German test set into English for inference. In our previous work, this has been shown to be effective [6]. Given the scale of translation data Google utilizes, this is an obvious choice. Since Google Translate has an internal character limit, we only take the first 5000 tokens for translation. After the automatic machine translation, we repeat the summarization step on the newly created data and start the inference process.

4.4. Experimental Setup

All of our experiments are conducted on a single RTX A6000 with 48 GB of VRAM. We use the SimpleTransformers library (https://simpletransformers.ai/) for the T5 model and the Hugging Face Transformers library (https://huggingface.co/transformers) for all other transformer models. For the summarization task, we use the Bert Extractive Summarizer library (https://github.com/dmmiller612/bert-extractive-summarizer) [16], and for machine translation we use the deep-translator library (https://github.com/nidhaloff/deep-translator) in combination with the free public Google Translate service (https://translate.google.com/).

5. Experiments

First, we conduct a stratified split of the development set with the standard 80:20 ratio to obtain a dev-test set for choosing our submission.
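The following scikit-learn sketch makes this split and the macro-F1 metric from Equation 1 concrete; the toy data frame and the baseline predictions are illustrative placeholders, not our actual data handling code.

```python
# Hedged sketch: stratified 80:20 split of the development set and the
# macro-F1 metric (Equation 1) using scikit-learn. The inline toy data is a
# placeholder for the CheckThat! Task 3 development set.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

labels = ["true"] * 5 + ["false"] * 8 + ["partially false"] * 4 + ["other"] * 3
dev_df = pd.DataFrame({
    "text": [f"document {i}" for i in range(len(labels))],
    "label": labels,
})

# Stratification keeps the (imbalanced) label proportions in both parts.
dev_train, dev_test = train_test_split(
    dev_df, test_size=0.2, stratify=dev_df["label"], random_state=42
)

# Macro-F1: the unweighted mean of the per-class F1 scores, here applied to a
# trivial majority-class baseline on the dev-test part.
majority_baseline = ["false"] * len(dev_test)
print(f1_score(dev_test["label"], majority_baseline,
               average="macro", zero_division=0))
```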
Then, in our experiments, we make five runs of each model with the default hyperparameters, applying the following changes for all models (except for T5-3B):

- Maximum steps: 705
- Learning rate: 1e-5
- Maximum sequence length: 256
- Batch size: 32
- Warmup ratio: 0.1
- Weight decay: 0.01

We save a checkpoint every 50 steps and keep the model with the highest macro-F1 score (a hedged sketch of this configuration in a standard training setup is shown after the discussion below). For the T5 model, we make one single run with the following changes from the defaults:

- Maximum epochs: 200
- Maximum sequence length: 256
- Batch size: 4
- Early stopping metric: macro-F1
- Early stopping delta: 0.01
- Early stopping patience: 5

Table 3 shows that the T5-3B classifier with DistilBART-CNN-12-6 as the extractive summarizer is the best overall model for both tasks, with 39.54% (best in 3A) and 29.58% (second-highest in 3B), respectively. Our submission (marked with * in Table 3) is the first run of the extraction-based BERT-Large model; it took 1st place on the 3B leaderboard and is the third-highest performer among our experiments on Test 3B. The best-performing model on 3B is XLM-RoBERTa-Large with T5-3B as the abstractive summarizer, reaching a macro-F1 score of 30.06%. For 3A, the best abstractive classification model is BERT-Large with 36.48%; for 3B, the best extractive classification model is T5-3B with 29.58%. All results are macro-F1 scores (see Equation 1).

Table 3: Experimental runs conducted for Tasks 3A and 3B (actual submission marked with *); all values are macro-F1 scores (%).

| Summarization Model | Classification Model | Run | Dev | Dev-Test | Test 3A | Test 3B |
|---------------------|----------------------|-----|-------|----------|---------|---------|
| DistilBART-CNN-12-6 (extractive) | BERT-Base | 1 | 47.59 | 48.19 | 28.39 | 23.81 |
| DistilBART-CNN-12-6 (extractive) | BERT-Base | 2 | 50.02 | 44.27 | 28.48 | 27.22 |
| DistilBART-CNN-12-6 (extractive) | BERT-Base | 3 | 48.16 | 41.41 | 30.20 | 25.94 |
| DistilBART-CNN-12-6 (extractive) | BERT-Base | 4 | 49.97 | 41.47 | 29.88 | 26.21 |
| DistilBART-CNN-12-6 (extractive) | BERT-Base | 5 | 48.04 | 47.06 | 32.23 | 25.98 |
| DistilBART-CNN-12-6 (extractive) | BERT-Large | 1* | 52.40 | 52.18 | 28.33 | 28.99 |
| DistilBART-CNN-12-6 (extractive) | BERT-Large | 2 | 46.43 | 39.96 | 26.87 | 19.46 |
| DistilBART-CNN-12-6 (extractive) | BERT-Large | 3 | 48.77 | 52.78 | 30.70 | 28.69 |
| DistilBART-CNN-12-6 (extractive) | BERT-Large | 4 | 49.21 | 48.44 | 32.31 | 25.32 |
| DistilBART-CNN-12-6 (extractive) | BERT-Large | 5 | 53.25 | 51.85 | 30.19 | 20.46 |
| DistilBART-CNN-12-6 (extractive) | XLM-R-Large | 1 | 50.53 | 41.04 | 30.42 | 27.40 |
| DistilBART-CNN-12-6 (extractive) | XLM-R-Large | 2 | 50.93 | 44.54 | 33.11 | 28.01 |
| DistilBART-CNN-12-6 (extractive) | XLM-R-Large | 3 | 49.08 | 48.56 | 30.82 | 26.09 |
| DistilBART-CNN-12-6 (extractive) | XLM-R-Large | 4 | 50.80 | 43.99 | 28.23 | 21.94 |
| DistilBART-CNN-12-6 (extractive) | XLM-R-Large | 5 | 50.95 | 40.29 | 32.47 | 23.34 |
| DistilBART-CNN-12-6 (extractive) | T5-3B | 1 | 48.05 | 46.52 | 39.54 | 29.58 |
| T5-3B (abstractive) | BERT-Base | 1 | 54.05 | 45.76 | 35.41 | 27.14 |
| T5-3B (abstractive) | BERT-Base | 2 | 48.12 | 44.73 | 31.88 | 28.03 |
| T5-3B (abstractive) | BERT-Base | 3 | 50.02 | 40.58 | 33.73 | 25.84 |
| T5-3B (abstractive) | BERT-Base | 4 | 49.58 | 47.21 | 31.86 | 23.91 |
| T5-3B (abstractive) | BERT-Base | 5 | 48.89 | 40.29 | 31.13 | 24.18 |
| T5-3B (abstractive) | BERT-Large | 1 | 56.33 | 51.15 | 28.89 | 21.34 |
| T5-3B (abstractive) | BERT-Large | 2 | 45.85 | 37.87 | 32.88 | 23.43 |
| T5-3B (abstractive) | BERT-Large | 3 | 55.08 | 46.80 | 35.24 | 28.33 |
| T5-3B (abstractive) | BERT-Large | 4 | 52.15 | 47.08 | 36.48 | 27.01 |
| T5-3B (abstractive) | BERT-Large | 5 | 51.32 | 46.91 | 30.56 | 21.77 |
| T5-3B (abstractive) | XLM-R-Large | 1 | 51.54 | 44.81 | 31.66 | 28.99 |
| T5-3B (abstractive) | XLM-R-Large | 2 | 49.36 | 42.84 | 35.63 | 30.06 |
| T5-3B (abstractive) | XLM-R-Large | 3 | 49.73 | 44.91 | 35.67 | 27.82 |
| T5-3B (abstractive) | XLM-R-Large | 4 | 50.59 | 44.79 | 36.01 | 26.86 |
| T5-3B (abstractive) | XLM-R-Large | 5 | 51.78 | 40.25 | 35.29 | 28.09 |
| T5-3B (abstractive) | T5-3B | 1 | 52.08 | 43.82 | 29.72 | 23.72 |

6. Discussion

We observe that the variation in performance between the five runs of each model is high (up to 9.53 percentage points for extractive BERT-Large on Test 3B), an indication of overfitting. Possible explanations are imprecise hyperparameter choices or the substantial discrepancy between the different parts of the dataset. Furthermore, the abstractive T5-3B test result might also be caused by overfitting. While our submission scores above the 50% mark on both the development and dev-test sets and shows the smallest difference between the two (only 0.22 percentage points), seemingly stable results between the development set and the dev-test set do not guarantee a good score on the test set. Interestingly, the extractive T5-3B scores below 50% on both the development and dev-test sets, yet it is the best overall performer on the test sets. Abstractive summarization generally gives the BERT models higher macro-F1 results than extractive summarization. Nevertheless, the results of both summarization techniques are similar, and thus both approaches remain viable.
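For reference, the configuration listed in Section 5 for the non-T5 models roughly corresponds to the following Hugging Face Trainer setup. This is a hedged sketch under stated assumptions: the model checkpoint, the tiny inline dataset, and the column names are placeholders, and our actual runs were carried out with the libraries described in Section 4.4 rather than with this exact code.

```python
# Hedged sketch: fine-tuning a 4-class classifier with the hyperparameters from
# Section 5 (705 max steps, lr 1e-5, max length 256, batch size 32, warmup
# ratio 0.1, weight decay 0.01), evaluating every 50 steps and keeping the
# checkpoint with the highest macro-F1. The inline data is a placeholder.
import numpy as np
from datasets import Dataset
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "bert-large-uncased"  # bert-base-uncased for the baseline runs
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=4)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256,
                     padding="max_length")

# Placeholder data; the real experiments use the CheckThat! Task 3 splits with
# the labels {true, false, partially false, other} mapped to 0..3.
toy = {"text": ["claim one", "claim two", "claim three", "claim four"],
       "label": [0, 1, 2, 3]}
train_dataset = Dataset.from_dict(toy).map(tokenize, batched=True)
eval_dataset = Dataset.from_dict(toy).map(tokenize, batched=True)

def compute_metrics(eval_pred):
    preds = np.argmax(eval_pred.predictions, axis=-1)
    return {"macro_f1": f1_score(eval_pred.label_ids, preds,
                                 average="macro", zero_division=0)}

args = TrainingArguments(
    output_dir="checkthat-task3",
    max_steps=705,
    learning_rate=1e-5,
    per_device_train_batch_size=32,
    warmup_ratio=0.1,
    weight_decay=0.01,
    evaluation_strategy="steps",
    eval_steps=50,
    save_steps=50,
    load_best_model_at_end=True,       # keep the best checkpoint ...
    metric_for_best_model="macro_f1",  # ... according to macro-F1
    greater_is_better=True,
)

trainer = Trainer(model=model, args=args, train_dataset=train_dataset,
                  eval_dataset=eval_dataset, compute_metrics=compute_metrics)
trainer.train()
```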
6.1. Limitations

Because of time constraints, we did not implement ensembling strategies in our experiments, although there is plenty of scope, as ensembles have been demonstrated to offer substantial gains over individual classifiers, e.g., [5, 23]. For the same reason, only one summarization model per technique and a limited number of transformer models for the classification task were feasible. The chosen ratio for the stratified split might leave too few data points for the dev-test set, so a different ratio such as 50:50 would have been an alternative.

6.2. Future Work

In the future, it would be interesting to see how well larger models like XLM-R XL/XXL [24] or other models like Big Bird [25] or PEGASUS [26] perform. For the cross-lingual task, using multilingual models without the need for machine translation is another option to experiment with. Alternatively, text generation models like T5 [9] and BART [8] can also be used for machine translation.

7. Conclusion

We described our family of approaches to the task of multiclass fake news classification for English and German. At their core, they use fine-tuned transformer architectures and incorporate extractive and abstractive summarization (to be able to deal with long input documents). For the cross-lingual task, we also incorporate automatic machine translation. The results demonstrate that both summarization techniques and automatic machine translation are competitive. In particular, for the cross-lingual setting, we observe a large margin between our winning submission and the entries further down the leaderboard. Our analysis also shows that large language models perform best if overfitting can be avoided.

Acknowledgments

This work was supported by the project COURAGE: A Social Media Companion Safeguarding and Educating Students, funded by the Volkswagen Foundation (grant number 95564).

References

[1] P. Nakov, D. P. A. Corney, M. Hasanain, F. Alam, T. Elsayed, A. Barrón-Cedeño, P. Papotti, S. Shaar, G. D. S. Martino, Automated fact-checking for assisting human fact-checkers, in: Z. Zhou (Ed.), Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI 2021, Virtual Event / Montreal, Canada, 19–27 August 2021, ijcai.org, 2021, pp. 4551–4558.
[2] H. Allcott, M. Gentzkow, Social media and fake news in the 2016 election, Journal of Economic Perspectives 31 (2017) 211–236.
[3] J. Köhler, G. K. Shahi, J. M. Struß, M. Wiegand, M. Siegel, T. Mandl, M. Schütz, Overview of the CLEF-2022 CheckThat! lab task 3 on fake news detection, in: Working Notes of CLEF 2022 – Conference and Labs of the Evaluation Forum, CLEF 2022, Bologna, Italy, 2022.
[4] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.
[5] P. Hartl, U. Kruschwitz, Applying automatic text summarization for fake news detection, in: Proceedings of the 13th Language Resources and Evaluation Conference (LREC 2022), 2022, pp. 2702–2713.
[6] H. N. Tran, U. Kruschwitz, ur-iw-hnt at GermEval 2021: An ensembling strategy with multiple BERT models, in: Proceedings of the GermEval 2021 Shared Task on the Identification of Toxic, Engaging, and Fact-Claiming Comments, Association for Computational Linguistics, Duesseldorf, Germany, 2021, pp. 83–87. URL: https://aclanthology.org/2021.germeval-1.12.
[7] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, in: Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, Curran Associates Inc., USA, 2017, pp. 6000–6010. URL: http://dl.acm.org/citation.cfm?id=3295222.3295349.
[8] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, L. Zettlemoyer, BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 7871–7880. URL: https://aclanthology.org/2020.acl-main.703. doi:10.18653/v1/2020.acl-main.703.
[9] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, Journal of Machine Learning Research 21 (2020) 1–67. URL: http://jmlr.org/papers/v21/20-074.html.
[10] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 8440–8451. URL: https://aclanthology.org/2020.acl-main.747. doi:10.18653/v1/2020.acl-main.747.
[11] G. K. Shahi, D. Nandini, FakeCovid – A multilingual cross-domain fact check news dataset for COVID-19, in: Workshop Proceedings of the 14th International AAAI Conference on Web and Social Media, 2020. URL: http://workshop-proceedings.icwsm.org/pdf/2020_14.pdf.
[12] G. K. Shahi, A. Dirkson, T. A. Majchrzak, An exploratory study of COVID-19 misinformation on Twitter, Online Social Networks and Media 22 (2021) 100104.
[13] G. K. Shahi, J. M. Struß, T. Mandl, Overview of the CLEF-2021 CheckThat! lab task 3 on fake news detection, Working Notes of CLEF (2021).
[14] G. K. Shahi, AMUSED: An annotation framework of multi-modal social media data, arXiv preprint arXiv:2010.00502 (2020).
[15] P. Hartl, U. Kruschwitz, University of Regensburg at CheckThat! 2021: Exploring text summarization for fake news detection, in: CLEF (Working Notes), volume 2936 of CEUR Workshop Proceedings, CEUR-WS.org, 2021, pp. 508–519.
[16] D. Miller, Leveraging BERT for extractive text summarization on lectures, arXiv preprint arXiv:1906.04165 (2019).
[17] J. MacQueen, et al., Some methods for classification and analysis of multivariate observations, in: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, Oakland, CA, USA, 1967, pp. 281–297.
[18] T. M. Kodinariya, P. R. Makwana, Review on determining number of cluster in k-means clustering, International Journal 1 (2013) 90–95.
[19] S. Shleifer, A. M. Rush, Pre-trained summarization distillation, arXiv preprint arXiv:2010.13002 (2020).
[20] K. M. Hermann, T. Kocisky, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, P. Blunsom, Teaching machines to read and comprehend, in: C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, R. Garnett (Eds.), Advances in Neural Information Processing Systems, volume 28, Curran Associates, Inc., 2015. URL: https://proceedings.neurips.cc/paper/2015/file/afdec7005cc9f14302cd0474fd0f3c96-Paper.pdf.
[21] P. Shaw, J. Uszkoreit, A. Vaswani, Self-attention with relative position representations, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), Association for Computational Linguistics, New Orleans, Louisiana, 2018, pp. 464–468. URL: https://aclanthology.org/N18-2074. doi:10.18653/v1/N18-2074.
[22] J. Opitz, S. Burst, Macro F1 and macro F1, arXiv preprint arXiv:1911.03347 (2019).
[23] S. Zimmerman, U. Kruschwitz, C. Fox, Improving hate speech detection with deep learning ensembles, in: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 2018, pp. 2546–2553.
[24] N. Goyal, J. Du, M. Ott, G. Anantharaman, A. Conneau, Larger-scale transformers for multilingual masked language modeling, in: Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021), Association for Computational Linguistics, Online, 2021, pp. 29–33. URL: https://aclanthology.org/2021.repl4nlp-1.4. doi:10.18653/v1/2021.repl4nlp-1.4.
[25] M. Zaheer, G. Guruganesh, K. A. Dubey, J. Ainslie, C. Alberti, S. Ontanon, P. Pham, A. Ravula, Q. Wang, L. Yang, A. Ahmed, Big Bird: Transformers for longer sequences, in: H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, H. Lin (Eds.), Advances in Neural Information Processing Systems, volume 33, Curran Associates, Inc., 2020, pp. 17283–17297. URL: https://proceedings.neurips.cc/paper/2020/file/c8512d142a2d849725f31a9a7a361ab9-Paper.pdf.
[26] J. Zhang, Y. Zhao, M. Saleh, P. Liu, PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization, in: H. D. III, A. Singh (Eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, PMLR, 2020, pp. 11328–11339. URL: https://proceedings.mlr.press/v119/zhang20ae.html.