Bilingual Propaganda Detection in Diplomats’ Tweets Using Language Models and Linguistic Features

Arkadiusz Modzelewski1,2,*,†, Paweł Golik† and Adam Wierzbicki1

1 Polish-Japanese Academy of Information Technology, Poland
2 University of Padua, Italy

Abstract
Our study presents an approach to the shared task of propaganda identification and characterization at DIPROMATS 2024, hosted by the Iberian Languages Evaluation Forum. As the DSHacker team, we participated in the propaganda detection task, which comprised three subtasks, each with varying levels of detail in identifying propaganda types. The first subtask required binary identification of propaganda in tweets authored in either English or Spanish by diplomats and authorities from major powers. The second subtask focused on a coarse-grained classification of propaganda, while the third subtask demanded a fine-grained approach to identifying specific propaganda techniques. To tackle these challenges, we fine-tuned different BERT-based pre-trained models, including the XLM-RoBERTa model, and achieved remarkable success. Our system secured first place across all language categories, including monolingual and bilingual approaches, for the second and third subtasks. Moreover, we attained high rankings in the binary propaganda classification. Our research also delves into the potential of detecting propaganda using Large Language Models with a few-shot prompting approach. We conducted experiments with two GPT models, including the recently released GPT-4o by OpenAI. Furthermore, we investigated the effectiveness of linguistic features and traditional machine learning models in propaganda detection. Overall, our study highlights our system’s exceptional performance and provides valuable insights into the capabilities of modern language models and machine learning techniques in identifying propaganda.

Keywords
Propaganda, XLM-RoBERTa, GPT-4o, GPT-3.5, Few-shot Prompting, Linguistic Features

1. Introduction

1.1.
Problem Overview

In the digital age, online news often uses different propaganda techniques. Propaganda, as defined by Sparkes-Vian [1], is an evolving set of methods and mechanisms that facilitate the propagation of ideas and actions. It employs rhetorical techniques to improve replication, making it a powerful tool for influencing public opinion. Propaganda is not false or immoral by its nature. Its ethical implications depend on the political, social, and technological context. Propaganda is most effective when it goes unnoticed, subtly altering readers’ opinions without their awareness [2]. Therefore, detecting propaganda remains vital but also challenging to implement.

Nowadays, information spreads from many online sources. Platforms like X (formerly known as Twitter) have become vital places for sharing news and opinions. However, they have also become channels for spreading propaganda, which can influence people’s thoughts and actions. As a result, it is crucial to detect propaganda, as it affects public discourse and people’s decisions. DIPROMATS 2024, organized as part of the Iberian Languages Evaluation Forum 2024 (IberLEF) [3], aims to spread knowledge and research on detecting propaganda. In this study, we present our experiments and the final systems that we used to detect propaganda in the task organized by DIPROMATS 2024.

IberLEF 2024, September 2024, Valladolid, Spain
* Corresponding author.
† These authors contributed equally.
Email: arkadiusz.modzelewski@pja.edu.pl (A. Modzelewski); golik.pawel@gmail.com (P. Golik); adamw@pja.edu.pl (A. Wierzbicki)
Web: https://amodzelewski.com/ (A. Modzelewski); http://adamwierzbicki.info/ (A. Wierzbicki)
ORCID: 0009-0003-1169-831X (A. Modzelewski); 0009-0003-1254-6879 (P. Golik); 0000-0003-0075-7030 (A. Wierzbicki)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

1.2.
Task Description

DIPROMATS 2024 introduces a shared task focused on the automatic detection and characterization of propaganda techniques and narratives used by diplomats from major powers. In our experiments, we focused on the propaganda detection task, which includes the three subtasks listed below [4]:

1. Subtask 1a: Propaganda identification
2. Subtask 1b: Propaganda characterization, coarse-grained
3. Subtask 1c: Propaganda characterization, fine-grained

Propaganda identification. Participants must develop an automatic system to determine whether a tweet contains propaganda. This is a binary classification task: for each instance, a short tweet text, we aim to predict one of two labels, "false" or "true", indicating the presence of propaganda [4].

Propaganda characterization, coarse-grained. Systems must determine which of the four categories each tweet belongs to: Not propagandistic, Appeal to commonality, Discrediting the opponent, and Loaded language. Each tweet can be assigned to one or more categories, making it a multiclass, multilabel classification task [4].

Propaganda characterization, fine-grained. Systems must classify each tweet according to the specific propaganda techniques it contains. There is one negative class and seven positive classes: Flag Waving, Ad Populum/Ad Antiquitatem, Name Calling/Labeling, Undiplomatic Assertiveness/Whataboutism, Appeal to Fear, Doubt, and Loaded Language [4]. This task is also a multiclass, multilabel classification problem.

2. Related Work

The identification of propaganda in social media and web articles has gained significant attention in recent years due to the increasing influence of online information on public opinion and political discourse [5]. Barrón-Cedeño et al. [6] proposed a model to automatically assess the level of propagandistic content on the article level.
On the other hand, other approaches have introduced more fine-grained detection of propaganda techniques [7, 8] and analyzed the spread of propaganda on the X platform [9, 10].

Research on propaganda detection significantly intersects with persuasion detection due to the numerous shared characteristics and techniques [8]. In recent years, multiple workshops have been organized to advance the development of technologies aimed at identifying persuasion techniques [11, 12, 13, 14]. The most recent workshop (SemEval-2023 Task 3) focused on identifying 23 specific persuasion techniques in online news on the paragraph level and in a multilingual setup [14]. Systems proposed during SemEval-2023 were mainly based on multilingual BERT models, such as mBERT or XLM-RoBERTa [15, 16, 17].

The detection of propaganda was also the research focus of the previous edition of the shared task, DIPROMATS 2023 [18]. Casavantes et al. [19] utilized BERTweet [20] and RoBERTuito [21] and aimed to improve the detection of propagandistic tweets by combining the text of tweets with contextual attributes such as their geographical origin, type of message, and emotions. The UniLeon-UniBO team utilized transfer learning between different propaganda detection tasks [22]. Another two systems focused on employing data augmentation to improve performance in propaganda detection [23, 24]. Moreover, the best-performing system in binary propaganda classification in English was based on cascades of language models, adopting GPT-J as the backbone model [25].

Table 1
Number of tweets and authorities in Spanish and English datasets.

                  Spanish                  English
Region            Tweets   Authorities     Tweets   Authorities
China             2,997    25              3,022    106
Russia            1,391    22              2,690    114
European Union    2,465    48              2,916    186
United States     2,738    40              3,114    216
Total             9,591    135             12,012   619

Table 2
Summary of datasets with average character and word counts. ALL refers to the data before our split.
Language   Dataset   Size   Positive class %   Avg. char. count   Avg. word count
English    TRAIN     7146   23.19%             255.50 ± 53.98     46.05 ± 10.98
English    VALID     1262   24.88%             255.83 ± 54.31     46.22 ± 11.03
English    ALL       8408   23.44%             255.55 ± 54.03     46.07 ± 10.99
Spanish    TRAIN     5202   19.30%             255.93 ± 54.26     44.40 ± 10.32
Spanish    VALID     918    20.92%             255.30 ± 53.67     44.32 ± 9.97
Spanish    ALL       6120   19.54%             255.83 ± 54.17     44.39 ± 10.29

3. Dataset

The dataset includes tweets in both Spanish and English authored by diplomats representing China, Russia, the United States, and the European Union. These tweets come from official government accounts, embassies, ambassadors, consuls, and other diplomatic profiles. The tweets were collected using the Twitter API for Academic Research and were posted between January 1, 2020, and March 11, 2021. The data contains features such as tweet ID, text, country, annotated labels, and a creation time stamp. Table 1 summarizes the presence of diplomatic authorities in the dataset [4].

The task authors split the original data into training and test sets based on time. They chose a date for each dataset that divides positive tweets into a 70/30 proportion. The 70% subset, consisting of the oldest tweets, became the training set, while the 30% subset, containing the newest tweets, became the unseen test set utilized for final systems’ scores [4].

4. Our Approach

4.1. Data Preparation

Our experiments focused solely on using the tweet text and gold labels, disregarding any other columns in the dataset. In our model-building process, we included a phase for optimizing hyperparameters. To facilitate this, we divided the training data further into a new training subset and a validation subset with a ratio of 85/15. Table 2 shows the characteristics of the datasets.

4.2. Fine-tuning BERT-based Models

Our approach relied on fine-tuning pretrained BERT-based models using the labeled dataset.
Fine-tuning allows the model to learn the nuances and patterns relevant to our task while retaining the general language understanding from its initial training. We employed both monolingual and multilingual pretrained models loaded from the HuggingFace repository:

1. ENGLISH (ROB-EN) - FacebookAI/roberta-large - the language model (355M parameters) trained on English data in a self-supervised fashion [26].
2. SPANISH (ROB-ES) - PlanTL-GOB-ES/roberta-large-bne - based on the RoBERTa large model pretrained using a large Spanish dataset, with 570GB of Spanish texts [27].
3. BILINGUAL (XLM-BI) - FacebookAI/xlm-roberta-large - pre-trained on 2.5TB of filtered CommonCrawl data containing 100 languages [28].

We began with hyperparameter optimization for all the pretrained models we tested. This involved fitting N models on the training subset with gold labels and using the remaining labeled data for validation. N refers here to the number of different combinations of hyperparameter values. We selected the best model based on the F1 score from the validation data, and this model was used for the final submission. Monolingual models were fine-tuned exclusively on tweets in a single language, while multilingual models were fine-tuned on a combination of English and Spanish tweets. Please refer to our Appendix A, which presents optimal hyperparameters of each fine-tuned model.

5. Results

5.1. Leaderboard Performance

In DIPROMATS 2024 Task 1, the Information Contrast Model (ICM) score determines the best propaganda categorization model, addressing the classes’ hierarchical nature [29]. In presenting our results, it’s important to clarify that models with the same names (e.g., XLM-BI) are not the same across different subtasks. For instance, the XLM-BI model in subtask 1a was fine-tuned specifically on subtask 1a data, while the XLM-BI model in subtask 1b was fine-tuned on subtask 1b data.
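The model selection step described in Section 4.2 (one training run per hyperparameter combination, keeping the run with the best validation F1) can be sketched as follows. This is a minimal illustration under stated assumptions, not our actual training code: the names `f1_binary`, `select_best`, and `fit_and_predict` are hypothetical, the grid passed in is up to the caller, and the binary F1 shown here only matches the subtask 1a setting.

```python
from itertools import product
from typing import Callable, Dict, List, Sequence, Tuple


def f1_binary(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 score of the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def select_best(
    grid: Dict[str, List[float]],
    fit_and_predict: Callable[[Dict[str, float]], Sequence[int]],
    y_valid: Sequence[int],
) -> Tuple[Dict[str, float], float]:
    """Try every combination in `grid` and keep the best validation F1.

    `fit_and_predict` is a hypothetical callback: it should fine-tune a
    model with the given hyperparameters on the training subset and
    return its predictions for the validation subset.
    """
    best_params: Dict[str, float] = {}
    best_f1 = -1.0
    names = list(grid)
    # The number of iterations of this loop is N, the number of combinations.
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = f1_binary(y_valid, fit_and_predict(params))
        if score > best_f1:
            best_params, best_f1 = params, score
    return best_params, best_f1
```

In practice, `fit_and_predict` would wrap a fine-tuning run of one of the pretrained models listed in Section 4.2; the callback keeps the selection logic independent of the training framework.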
In subtask 1a, our best model for the English language, ROB-EN fine-tuned on English tweets, secured 5th place on the English leaderboard. Our bilingual XLM-BI model won the Spanish leaderboard and obtained 4th place on the bilingual leaderboard. In subtask 1b, the XLM-BI model earned us 1st place in all language categories. We also achieved first place on all language leaderboards in subtask 1c. For English, the top results were achieved by the ROB-EN model, while for the other leaderboards, the XLM-BI model prevailed. Tables 3, 4, and 5 show our final results on all subtasks.

Table 3
Our final results from the DIPROMATS 2024 Task 1a official leaderboard (LB).

Language    Model    ICM      F1 score   LB rank   Winner ICM   GOLD ICM
English     ROB-EN   0.2012   0.6865     5         0.2123       0.6604
Spanish     XLM-BI   0.2187   0.7097     1         0.2187       0.6014
Bilingual   XLM-BI   0.4978   0.6896     4         0.2048       0.6323

Table 4
Our final results from the DIPROMATS 2024 Task 1b official leaderboard (LB).

Language    Model    ICM       F1 macro   LB rank   Winner ICM   GOLD ICM
English     XLM-BI   0.0312    0.6219     1         0.0312       0.7014
Spanish     XLM-BI   -0.1148   0.4204     1         -0.1148      0.7535
Bilingual   XLM-BI   -0.0074   0.6029     1         -0.0074      0.6692

5.2. Results Discussion

In this workshop, our DSHacker team won in all categories for multiclass multilabel classification. Additionally, we secured a strong position in the binary classification subtask. Our XLM-BI model consistently emerged as the top solution among final submissions. However, in certain situations, monolingual models like ROB-EN can perform better than multilingual approaches or yield comparable results.

Table 5
Our final results from the DIPROMATS 2024 Task 1c official leaderboard (LB).

Language    Model    ICM       F1 macro   LB rank   Winner ICM   GOLD ICM
English     ROB-EN   -0.0311   0.4655     1         -0.0311      0.7883
Spanish     XLM-BI   -0.0917   0.518      1         -0.0917      0.6140
Bilingual   XLM-BI   -0.0074   0.4611     1         -0.0074      0.7874

6. Few-shot Prompting with GPT Models

Another technique we explored is few-shot prompting using GPT models.
In few-shot prompting, the prompt includes a brief description of the task followed by a few input-output pairs demonstrating the desired behavior. This technique allows the model to infer the patterns and rules of the task from the limited examples and generate appropriate outputs for new inputs. We applied this approach only in the subtask 1a binary classification setting.

Our experiments included OpenAI’s gpt-4o (GPT-4o) and gpt-3.5-turbo-1106 (GPT-3.5) generative models. We implemented the few-shot prompting technique using the OpenAI Chat Completions API. Each prediction request sent to the GPT model consisted of a list of messages presented to the model. Each message contains a role and a content attribute. There are three roles available:

1. The system message helps set the behavior of the model (assistant) by providing it with context and guidelines.
2. User messages can provide exemplary requests for the assistant. In our case, these are exemplary requests to evaluate whether a provided text contains propaganda.
3. Assistant messages indicate the expected output of the assistant.

Due to time constraints, we could not submit results produced by GPT models. However, we conducted post-deadline experiments and evaluated these models on the validation part of the data for binary classification from subtask 1a.

In our experiments, the prompt starts with a system message that clarifies the task (see Listing 1). This is followed by alternating pairs of user and assistant messages, one pair for each few-shot example: the user message asks whether the example’s content contains propaganda, and the corresponding assistant message provides the gold label for the example, either ’Yes’ or ’No’ (see Listing 2). The final message following the pairs is one user message with the actual text to be classified by the model (see Listing 3). For each instance to be classified, we included four few-shot examples from the training dataset, two of them containing propaganda.
The chosen few-shot examples were consistent within a given language. The prompt templates remained consistent for both the GPT-4o and GPT-3.5 experiments.

6.1. Results on Validation Datasets

Table 6 shows the subtask 1a F1 scores of our models on the English and Spanish validation datasets. GPT-3.5, using few-shot prompting, had moderate scores of 0.5193 for English and 0.5354 for Spanish. GPT-4o improved on these, with scores of 0.5665 for English and 0.6622 for Spanish. The multilingual model, XLM-RoBERTa-large (XLM-BI), performed better, scoring 0.7440 for English and the top Spanish score of 0.7907. The monolingual RoBERTa-large model for English (ROB-EN) achieved the highest English score of 0.7692, while its Spanish counterpart (ROB-ES) scored 0.7684. Overall, fine-tuned BERT-based models outperformed GPT-based models, with multilingual and monolingual models showing similar performance.

Table 6
Results of experiments obtained on the validation dataset.

Language   Model     F1 Score   |   Language   Model     F1 Score
English    GPT-3.5   0.5193     |   Spanish    GPT-3.5   0.5354
           GPT-4o    0.5665     |              GPT-4o    0.6622
           XLM-BI    0.7440     |              XLM-BI    0.7907
           ROB-EN    0.7692     |              ROB-ES    0.7684

7. Linguistic Features for Propaganda Detection

In this section, we explore the use of StyloMetrix¹ vectors for the subtask 1a propaganda detection. StyloMetrix was successfully utilized in persuasion detection in Polish [17]. The study of persuasion detection significantly intersects with the study of propaganda detection due to the numerous similarities they share [30]. We therefore explore the usage of StyloMetrix for propaganda detection in English only, as StyloMetrix currently does not support Spanish.

With StyloMetrix, we can create text representations that are interpretable, normalized, and reproducible [31]. By translating various aspects of linguistic features into numeric values, StyloMetrix vectors can be utilized as input for machine learning classifiers [31].
StyloMetrix quantifies many linguistic features, such as the 17 metrics created using the HurtLex lexicon. HurtLex² is a comprehensive lexicon encompassing offensive, aggressive, and hateful words [32]. HurtLex categorizes these words into 17 distinct groups, ranging from ethnic slurs to derogatory terms related to physical and cognitive disabilities and words associated with moral and behavioral defects [32].

In our experiments, we employ StyloMetrix vectors to predict propaganda in English tweets. The StyloMetrix vectors for English encompass a comprehensive set of 196 metrics, categorized into several groups: Detailed grammatical forms, General grammar forms, Detailed lexical forms, Additional lexical items, Parts of speech, Social media, Syntactic forms, General text statistics [31]. We utilize these text representations as features for training classical machine learning models, specifically XGBoost, LightGBM, and Logistic Regression. The models are trained on our training dataset and tested using validation data to evaluate their performance.

7.1. Results on Validation Datasets

Table 7 presents the results of our experiments with classical machine learning models and linguistic features. Among the classical models, LightGBM performs the best with an F1 score of 0.7663, followed closely by XGBoost at 0.7594. Logistic Regression trails behind with a score of 0.7120, suggesting that more complex models like LightGBM and XGBoost are better at capturing the nuances of the dataset. Table 6 in the previous section shows that the highest F1 score on the English validation dataset is achieved by the ROB-EN model, but surprisingly it is followed closely by LightGBM. Traditional machine learning models like LightGBM and XGBoost still perform robustly, showing that with the well-engineered features offered by StyloMetrix vectors, they can compete closely with advanced language models in this specific task of binary propaganda detection.
On the other hand, this result may suggest that we should perform more comprehensive hyperparameter tuning for BERT-based models.

Table 7
Results of experiments with linguistic features and classical machine learning models obtained on the English validation dataset.

Model                 F1 Score
XGBoost               0.7594
LightGBM              0.7663
Logistic Regression   0.7120

¹ https://github.com/ZILiAT-NASK/StyloMetrix/tree/main
² https://github.com/valeriobasile/hurtlex

8. Conclusions

As the DSHacker team, we explored various techniques for propaganda detection across multiple languages and subtasks, employing state-of-the-art pretrained BERT-based models, few-shot prompting with GPT models, and classical machine learning algorithms utilizing StyloMetrix linguistic features.

Our fine-tuned BERT-based models demonstrated strong performance in the DIPROMATS 2024 Task 1 competition. In summary, we secured 1st position in 7 out of 9 categories. We won in all categories for multiclass multilabel classification and in the subtask 1a binary classification of Spanish tweets. The multilingual XLM-BI model consistently delivered top results, especially in multilingual and Spanish tasks. Monolingual models like ROB-EN also showed competitive performance, particularly for English tasks, indicating that language-specific models can sometimes outperform multilingual counterparts.

Few-shot prompting with GPT models yielded moderate performance on binary propaganda classification. While GPT-4o beat GPT-3.5, both were still outperformed by fine-tuned BERT-based models.

Classical machine learning models like LightGBM and XGBoost, combined with well-engineered linguistic features from StyloMetrix, performed well on the binary task of propaganda detection. LightGBM, in particular, achieved an F1 score close to that of the best BERT-based model on the English validation dataset, highlighting the potential of classical models with rich feature sets on this specific task of propaganda detection.

References

[1] C.
Sparkes-Vian, Digital propaganda: The tyranny of ignorance, Critical sociology 45 (2019) 393–409.
[2] A. Barrón-Cedeño, I. Jaradat, G. Da San Martino, P. Nakov, Proppy: Organizing the news based on their propagandistic content, Information Processing & Management 56 (2019) 1849–1864. URL: https://www.sciencedirect.com/science/article/pii/S0306457318306058. doi:10.1016/j.ipm.2019.03.005.
[3] L. Chiruzzo, S. M. Jiménez-Zafra, F. Rangel, Overview of IberLEF 2024: Natural Language Processing Challenges for Spanish and other Iberian Languages, in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2024), co-located with the 40th Conference of the Spanish Society for Natural Language Processing (SEPLN 2024), CEUR-WS.org, 2024.
[4] P. Moral, J. Fraile, G. Marco, A. Peñas, J. Gonzalo, Overview of DIPROMATS 2024: Detection, characterization and tracking of propaganda in messages from diplomats and authorities of world powers, Procesamiento del Lenguaje Natural 73 (2024).
[5] G. D. S. Martino, S. Cresci, A. Barrón-Cedeño, S. Yu, R. Di Pietro, P. Nakov, A survey on computational propaganda detection, arXiv preprint arXiv:2007.08024 (2020).
[6] A. Barrón-Cedeño, I. Jaradat, G. Da San Martino, P. Nakov, Proppy: Organizing the news based on their propagandistic content, Information Processing & Management 56 (2019) 1849–1864.
[7] G. Da San Martino, Y. Seunghak, A. Barrón-Cedeño, R. Petrov, P. Nakov, et al., Fine-grained analysis of propaganda in news article, in: Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), Association for Computational Linguistics, 2019, pp. 5636–5646.
[8] J. Piskorski, N. Stefanovitch, N. Nikolaidis, G. Da San Martino, P. Nakov, Multilingual multifaceted understanding of online news in terms of genre, framing, and persuasion techniques, in: A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 3001–3022. URL: https://aclanthology.org/2023.acl-long.169. doi:10.18653/v1/2023.acl-long.169.
[9] K. Hristakieva, S. Cresci, G. Da San Martino, M. Conti, P. Nakov, The spread of propaganda by coordinated communities on social media, in: Proceedings of the 14th ACM Web Science Conference 2022, 2022, pp. 191–201.
[10] P. Vijayaraghavan, S. Vosoughi, TWEETSPIN: Fine-grained propaganda detection in social media using multi-view representations, in: M. Carpuat, M.-C. de Marneffe, I. V. Meza Ruiz (Eds.), Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Seattle, United States, 2022, pp. 3433–3448. URL: https://aclanthology.org/2022.naacl-main.251. doi:10.18653/v1/2022.naacl-main.251.
[11] G. Martino, A. Barrón-Cedeño, H. Wachsmuth, R. Petrov, P. Nakov, Semeval-2020 task 11: Detection of propaganda techniques in news articles, arXiv preprint arXiv:2009.02696 (2020).
[12] D. Dimitrov, B. B. Ali, S. Shaar, F. Alam, F. Silvestri, H. Firooz, P. Nakov, G. D. S. Martino, Semeval-2021 task 6: Detection of persuasion techniques in texts and images, arXiv preprint arXiv:2105.09284 (2021).
[13] M. Hasanain, F. Alam, H. Mubarak, S. Abdaljalil, W. Zaghouani, P. Nakov, G. Da San Martino, A. Freihat, ArAIEval shared task: Persuasion techniques and disinformation detection in Arabic text, in: H. Sawaf, S. El-Beltagy, W. Zaghouani, W. Magdy, A. Abdelali, N. Tomeh, I. Abu Farha, N. Habash, S. Khalifa, A. Keleg, H. Haddad, I. Zitouni, K. Mrini, R. Almatham (Eds.), Proceedings of ArabicNLP 2023, Association for Computational Linguistics, Singapore (Hybrid), 2023, pp. 483–493. URL: https://aclanthology.org/2023.arabicnlp-1.44. doi:10.18653/v1/2023.arabicnlp-1.44.
[14] J. Piskorski, N. Stefanovitch, G. Da San Martino, P. Nakov, Semeval-2023 task 3: Detecting the category, the framing, and the persuasion techniques in online news in a multi-lingual setup, in: Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023), 2023, pp. 2343–2361.
[15] A. Pauli, R. Sarabia, L. Derczynski, I. Assent, Teamampa at semeval-2023 task 3: Exploring multilabel and multilingual roberta models for persuasion and framing detection, in: Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023), 2023, pp. 847–855.
[16] T. Hromadka, T. Smolen, T. Remis, B. Pecher, I. Srba, Kinitveraai at semeval-2023 task 3: Simple yet powerful multilingual fine-tuning for persuasion techniques detection, arXiv preprint arXiv:2304.11924 (2023).
[17] A. Modzelewski, W. Sosnowski, M. Wilczynska, A. Wierzbicki, Dshacker at semeval-2023 task 3: Genres and persuasion techniques detection with multilingual data augmentation through machine translation and text generation, in: Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023), 2023, pp. 1582–1591.
[18] P. Moral, G. Marco, J. Gonzalo, J. Carrillo-de Albornoz, I. Gonzalo-Verdugo, Overview of dipromats 2023: automatic detection and characterization of propaganda techniques in messages from diplomats and authorities of world powers, Procesamiento del lenguaje natural 71 (2023) 397–407.
[19] M. Casavantes, M. Montes-y Gómez, D. I. Hernández-Farías, L. C. González, A. Barrón-Cedeño, Propaltl at dipromats: Incorporating contextual features with bert’s auxiliary input for propaganda detection on tweets (2023).
[20] D. Q. Nguyen, T. Vu, A. T. Nguyen, Bertweet: A pre-trained language model for english tweets, arXiv preprint arXiv:2005.10200 (2020).
[21] J. M. Pérez, D. A. Furman, L. A. Alemany, F. Luque, Robertuito: a pre-trained language model for social media text in spanish, arXiv preprint arXiv:2111.09453 (2021).
[22] F. Jáñez-Martino, A. Barrón-Cedeño, Unileon-unibo at iberlef 2023 task dipromats: Roberta-based models to climb up the propaganda tree in english and spanish (2023).
[23] V. Ahuir, L. F. Hurtado, F. García-Granada, E. Sanchis, Elirf-vrain at dipromats 2023: Cross-lingual data augmentation for propaganda detection (2023).
[24] F.-J. Rodrigo-Ginés, J. Carrillo-de Albornoz, L. Plaza, Hierarchical modeling for propaganda detection: Leveraging media bias and propaganda detection datasets (2023).
[25] L. Tian, X. Zhang, M. M.-H. Kim, J. Biggs, Efficient text-based propaganda detection via language model cascades (2023).
[26] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, Roberta: A robustly optimized BERT pretraining approach, CoRR abs/1907.11692 (2019). URL: http://arxiv.org/abs/1907.11692. arXiv:1907.11692.
[27] A. G. Fandiño, J. A. Estapé, M. Pàmies, J. L. Palao, J. S. Ocampo, C. P. Carrino, C. A. Oller, C. R. Penagos, A. G. Agirre, M. Villegas, Maria: Spanish language models, Procesamiento del Lenguaje Natural 68 (2022). URL: https://upcommons.upc.edu/handle/2117/367156#.YyMTB4X9A-0.mendeley. doi:10.26342/2022-68-3.
[28] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, CoRR abs/1911.02116 (2019). URL: http://arxiv.org/abs/1911.02116. arXiv:1911.02116.
[29] E. Amigo, A. Delgado, Evaluating extreme hierarchical multi-label classification, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 5809–5819.
[30] J. Piskorski, N. Stefanovitch, N. Nikolaidis, G. Da San Martino, P.
Nakov, Multilingual multifaceted understanding of online news in terms of genre, framing and persuasion techniques (2023).
[31] I. Okulska, D. Stetsenko, A. Kołos, A. Karlińska, K. Głąbińska, A. Nowakowski, Stylometrix: An open-source multilingual tool for representing stylometric vectors, arXiv preprint arXiv:2309.12810 (2023).
[32] E. Bassignana, V. Basile, V. Patti, et al., Hurtlex: A multilingual lexicon of words to hurt, in: CEUR Workshop proceedings, volume 2253, CEUR-WS, 2018, pp. 1–6.

A. Optimal Hyperparameter Values

This appendix includes the optimal hyperparameter values for our best models.

Table 8
Optimal hyperparameter values used in our models. Legend: lr - learning_rate; bs - batch_size; nte - num_train_epochs; ws - warmup_steps; wd - weight_decay

Subtask   Model    lr          bs   nte   ws    wd
1a        ROB-EN   1 × 10^-5   16   5     200   1 × 10^-2
1a        ROB-ES   2 × 10^-5   16   5     200   2 × 10^-2
1a        XLM-BI   1 × 10^-5   16   5     200   1 × 10^-3
1b        ROB-EN   1 × 10^-5   16   5     200   2 × 10^-3
1b        ROB-ES   3 × 10^-5   16   5     200   1 × 10^-2
1b        XLM-BI   1 × 10^-5   16   5     200   2 × 10^-2
1c        ROB-EN   1 × 10^-5   16   5     200   2 × 10^-2
1c        ROB-ES   2 × 10^-5   16   5     200   5 × 10^-2
1c        XLM-BI   1 × 10^-5   16   5     200   2 × 10^-2

B. Few-shot Prompting Templates

In this appendix, we present the prompt messages included with each text classification request. For brevity, the prompts are provided only in English.

    {
        "role": "system",
        "content": """You are an assistant who detects propaganda, manipulation and persuasion techniques.
    You know the definition of propaganda very well: Propaganda is the deliberate systematic attempt to shape perceptions and manipulate cognitions and direct behavior to achieve a response to further the desired intent of the propagandist.
    """
    },

Listing 1: Used initial system prompt.

    # EXAMPLE 1
    {
        "role": "user",
        "content": f"""Answer the question whether or not the text contains propaganda. Answer using only a single word Yes or No.
    TEXT: Today, I reflect on the great honor of serving the American people this past year, and look forward to continuing to advance a diplomacy true to our core values and emboldened by U.S. leadership that may turn our greatest challenges into our greatest triumphs. Happy New Year!
    """
    },
    {
        "role": "assistant",
        "content": "Yes"
    },
    # EXAMPLE 2
    {
        "role": "user",
        "content": f"""Answer the question whether or not the text contains propaganda. Answer using only a single word Yes or No.
    TEXT: The Islamic Republic of #Iran has fundamentally failed the Iranian people, and I am convinced that the Iranian people know that. And you’ve seen President @realDonaldTrump make very clear we will continue to support the Iranian people.
    """
    },
    {
        "role": "assistant",
        "content": "No"
    },
    # EXAMPLE 3
    {
        "role": "user",
        "content": f"""Answer the question whether or not the text contains propaganda. Answer using only a single word Yes or No.
    TEXT: The Chinese government’s decision to explore its own virtual currency is already monumental, and if it ultimately moves forward it will be a global game changer. The future of global currencies may very well rest firmly in China’s hands.
    """
    },
    {
        "role": "assistant",
        "content": "Yes"
    },
    # EXAMPLE 4
    {
        "role": "user",
        "content": f"""Answer the question whether or not the text contains propaganda. Answer using only a single word Yes or No.
    TEXT: Over the past few years, the #US has repeatedly blocked @UN Security Council’s statements condemning attacks on other countries’ embassies. The US missile strike in Baghdad will only result in escalating tensions in the region - #Zakharova
    """
    },
    {
        "role": "assistant",
        "content": "No"
    }

Listing 2: Used pairs of user and assistant prompts.

    {
        "role": "user",
        "content": f"""Answer the question whether or not the text contains propaganda. Answer using only a single word Yes or No.
    TEXT: {text}
    """
    }

Listing 3: Used final user prompt.
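The three listings above can be combined into a single message list, as sketched below. This is a minimal illustration rather than the exact code used in our experiments: the names `SYSTEM_PROMPT`, `USER_TEMPLATE`, `build_messages`, and `parse_answer`, as well as the `(text, label)` pair format for the few-shot examples, are assumptions introduced for the sketch.

```python
from typing import Dict, List, Sequence, Tuple

# System prompt text as given in Listing 1.
SYSTEM_PROMPT = (
    "You are an assistant who detects propaganda, manipulation and persuasion techniques.\n"
    "You know the definition of propaganda very well: Propaganda is the deliberate "
    "systematic attempt to shape perceptions and manipulate cognitions and direct "
    "behavior to achieve a response to further the desired intent of the propagandist.\n"
)

# User prompt template as given in Listings 2 and 3.
USER_TEMPLATE = (
    "Answer the question whether or not the text contains propaganda. "
    "Answer using only a single word Yes or No.\n"
    "TEXT: {text}\n"
)


def build_messages(
    examples: Sequence[Tuple[str, bool]], text: str
) -> List[Dict[str, str]]:
    """Build the message list: system prompt, few-shot pairs, final query.

    `examples` holds (tweet_text, contains_propaganda) pairs; this input
    format is an assumption made for the sketch.
    """
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    for example_text, contains_propaganda in examples:
        # One user/assistant pair per few-shot example (Listing 2).
        messages.append(
            {"role": "user", "content": USER_TEMPLATE.format(text=example_text)}
        )
        messages.append(
            {"role": "assistant", "content": "Yes" if contains_propaganda else "No"}
        )
    # Final user message with the tweet to classify (Listing 3).
    messages.append({"role": "user", "content": USER_TEMPLATE.format(text=text)})
    return messages


def parse_answer(reply: str) -> bool:
    """Map a 'Yes'/'No' reply to a boolean label (hypothetical helper)."""
    return reply.strip().lower().startswith("yes")
```

With four few-shot examples, `build_messages` returns ten messages: one system message, four user/assistant pairs, and the final user message. The resulting list is what would be passed as the `messages` argument of a Chat Completions request; the request itself is omitted here because it depends on the client setup.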