UMUTEAM at DIPROMATS 2024: Feature Integration for Detecting Fine-grained Propaganda and Narrative

José Antonio García-Díaz1, Ronghao Pan1, Andreu Rodilla Lázaro2, Camilo Cristancho2 and Rafael Valencia-García1

1 Facultad de Informática, Universidad de Murcia, Campus de Espinardo, 30100, Spain
2 Dep. de Ciència Política, Dret Constitucional i Filosofia del Dret, Universitat de Barcelona

IberLEF 2024, September 2024, Valladolid, Spain
* Corresponding author.
† These authors contributed equally.
joseantonio.garcia8@um.es (J. A. García-Díaz); ronghao.pan@um.es (R. Pan); rodilla.lazaro@ub.edu (A. R. Lázaro); camilo.cristancho@ub.edu (C. Cristancho); valencia@um.es (R. Valencia-García)
ORCID: 0000-0002-3651-2660 (J. A. García-Díaz); 0009-0008-7317-7145 (R. Pan); 0000-0002-1381-7664 (A. R. Lázaro); 0000-0003-1794-4457 (C. Cristancho); 0000-0003-2457-1791 (R. Valencia-García)

Abstract
These notes describe our participation in the second edition of the DIPROMATS shared task, held at IberLEF 2024. This edition repeats the fine-grained detection of propaganda techniques in politics and adds a new subtask on narrative detection, a multi-class, multi-label classification problem in which a set of predefined narratives of international actors must be identified using few-shot learning. Both tasks are multilingual. For Task 1, we propose an approach similar to the one used in the previous edition, combining linguistic features and sentence embeddings through ensemble learning and knowledge integration, obtaining our best result with knowledge integration. For Task 2, we evaluate TuLu and Zephyr, but our results fall below the proposed baseline based on Mixtral 8x7B.

Keywords
Propaganda Identification, Feature Engineering, Transformers, Feature Integration, Few-shot Learning, Natural Language Processing

1. Introduction

As defined in [1], propaganda encompasses a continuously evolving set of techniques and mechanisms designed to facilitate the dissemination of ideas and actions. To aid its spread, propaganda often relies on rhetorical devices; the analysis of these techniques is detailed in [2]. Propaganda is usually perceived as persuasive communication that relies on manipulative practices, and it has historically been associated with totalitarian regimes. This characterization implies a negative connotation of propaganda as a threat to the principles of public debate [3]. However, persuasion is a central component of political debate in democratic contexts. As such, propaganda can also be considered a legitimate political communication strategy used by political actors in their everyday interactions and in their appeals to their followers and adversaries. Identifying the elements that make up propaganda in the political context is therefore crucial to understanding the extent of its legitimate use within a democratic rationale.

Communication practices based on microtargeting in political campaigns are probably the most prominent case. Cases such as the Brexit referendum exemplify propagandistic interference in democratic decision-making processes [4]. However, these are only prominent cases at the extreme end of communication strategies that take place across multiple political arenas, such as diplomatic communication. Everyday political interaction is based on narratives that emphasize group identities and exploit emotional rhetoric. Narratives convey political messages and worldviews that express particular political positions and distance their proponents from opposing perspectives and political opponents. Political actors thus seek to control the narrative in order to shape political processes according to their own interests and strategic intentions [5].
Dissecting the constitutive elements in the communication strategies of political actors is crucial to describing the relationships between them.

The second edition of the DIPROMATS challenge [6], held at IberLEF 2024 [7], focuses on identifying propaganda techniques by analyzing the language used by official authorities on social networks. To this end, the organizers have kept the setup of the previous edition [8], with a dataset of micro-posting messages from Twitter in Spanish and English from diplomatic profiles of China, Russia, the United States and the European Union. Specifically, two subtasks are proposed. The first subtask is a binary classification in which participants have to decide whether a text contains propaganda. The second subtask is to categorize the techniques used to spread propaganda, in two ways: a multi-class approach and a multi-label approach. In this edition, the organizers have added a novel multi-label classification task focused on narrative identification. Narratives are at the heart of propaganda because they consist of sequences of events, linked by cause and effect, that are selected and judged to be significant for a particular audience [9]. Because narratives reduce complex political processes and political values to simple descriptions, they are central to defining sociopolitical realities and to the way in which individuals understand politics [10]. One of the most important issues is the power of narratives to promote, or otherwise undermine, trust in the social relations that underpin everyday compromise and representation in political systems.

Our team participated in both tasks, achieving our best results in Task 1 using feature integration based on knowledge integration, and evaluating two Large Language Models (LLMs), TuLu and Zephyr, for Task 2. However, our results fall below the baseline based on Mixtral 8x7B, and only one other participant submitted results for Task 2. The limited number of participants in Task 2 therefore prevents us from making a broader comparison of our results.

2. Dataset

According to the organizers, the DIPROMATS dataset consists of Spanish and English tweets by diplomats from four world powers, namely China, Russia, the United States and the European Union. The authorities are government accounts, embassies, ambassadors and other diplomatic profiles. The collected tweets were published between January 1, 2020 and March 11, 2021, the last day coinciding with the first anniversary of the declaration of the COVID-19 pandemic. Specifically, the Spanish dataset contains 9,591 tweets and the English dataset contains 12,012 tweets from 619 agencies. The data was split with a 70/30 ratio using a temporal criterion, where the training split contains the oldest tweets and the test split the newest.
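A minimal sketch of this temporal criterion is shown below. It assumes the tweets are available as a CSV file with a date column; the file name and column names are illustrative placeholders, not the layout of the official distribution.

```python
import pandas as pd

# "dipromats_2024_en.csv" and the "date" column are illustrative placeholders.
df = pd.read_csv("dipromats_2024_en.csv", parse_dates=["date"])
df = df.sort_values("date")  # oldest tweets first

# 70/30 temporal split: the oldest 70% for training, the newest 30% for test.
cutoff = int(len(df) * 0.7)
train_df = df.iloc[:cutoff]
test_df = df.iloc[cutoff:]
```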
The tweets were labeled following criteria similar to those of the previous edition, using the taxonomy proposed in [2] but removing some of the techniques used in that edition. In this edition, the organizers keep (1) Appeal to Commonality, which includes Ad Populum and Flag Waving; (2) Discrediting the Opponent, with Name Calling, Appeal to Fear, Undiplomatic Assertiveness, and Doubt; and (3) Loaded Language, which refers to the use of hyperbolic language, metaphors and expressions with strong emotional implications. It is worth noting that the main category Appeal to Authority has been removed.

Table 1 shows the statistics of the Spanish and English partitions of the DIPROMATS 2024 task. As can be seen, the dataset is very unbalanced. Furthermore, there are no instances of documents marked as appeal to false authority in the English partition or as bandwagoning in the Spanish partition.

Table 1
Dataset statistics for the Spanish and English partitions of the DIPROMATS 2024 shared task. Categories: (1) Appeal to Commonality, (2) Discrediting the Opponent, (3) Loaded Language.

                                         Spanish              English
Category  Label                       train   val  total   train   val  total
1         ad populum                     40    19     59      56    16     72
1         flag waving                   181    53    234     429   116    545
2         doubt                          19     8     27      55    19     76
2         fear appeals                   44    17     61      39    18     57
2         name calling                   64    26     90     164    49    213
2         undiplomatic assertiveness    447   122    430     536   145    681
3         loaded language               302    87    389     723   190    913
          total                        1097   332           2002   553

For Task 2, the dataset is a subset of the Task 1 dataset. Since Task 2 is a few-shot learning task, the organizers include the narrative descriptions along with two or three examples for training.

3. Methodology

3.1. Task 1. Propaganda identification and characterization

As in the previous edition, we focus on the third subtask of Task 1, since treating it as a multi-label problem also solves the subtasks of binary propaganda identification (subtask 1) and coarse-grained propaganda characterization (subtask 2). This strategy reduces the number of models we need to train, thus saving time and effort.

Our proposal for solving Task 1 is based on the feature integration of linguistic features (LFs) and sentence embeddings from several state-of-the-art LLMs. For the LFs, we rely on UMUTextStats [11], while for the sentence embeddings we rely on feature extraction from the fine-tuned models BETO [12], MarIA [13], mDeBERTa, Twitter XLM [14], and RoBERTalex [15] for Spanish; and BERT [16], XLM, RoBERTa [17], mDeBERTa, Twitter XLM [14], and Legal BERT [18] for English. Compared with our 2023 proposal, we removed from our pipeline the lightweight models ALBERT (and ALBETO) and DistilBERT (and DistilBETO), as well as BERTIN and multilingual BERT. In fact, the only model added is RoBERTalex, a Spanish LLM trained on the Spanish Legal Domain Corpora, with a total of 8.9 GB of text.

For each LLM, we obtain its sentence embeddings, since a fixed representation of the data simplifies the task of combining the LLM with the linguistic features. To identify the best configuration for each LLM, we train 10 models per LLM for Spanish and English, evaluating different learning rates, numbers of training epochs, batch sizes, warm-up steps and weight decay values. This step is done using RayTune [19] with Distributed Asynchronous Hyperparameter Optimisation (HyperOptSearch), using the Tree of Parzen Estimators (TPE) algorithm [20] and the ASHA scheduler (because it favors parallelism); a sketch of this search is given at the end of this subsection.

Table 2 shows the best configuration found for each LLM for Spanish and English for subtask 3. It can be observed that most models require a larger number of training epochs, between 4 and 5, with the only exception of BETO in Spanish. Regarding the batch size, almost all LLMs prefer smaller batch sizes (8), with the only exception of XLM in English. Regarding the warm-up steps, Spanish usually requires fewer steps than English.

Table 2
Hyperparameter tuning of each LLM before obtaining the sentence embeddings

LLM          learning rate  train epochs  batch size  warmup steps  weight decay
Spanish
BETO         4.7e-05        3             8           250           0.27
MARIA        2.1e-05        4             8           500           0.26
MDEBERTA     2.8e-05        5             8           250           0.079
ROBERTALEX   3.9e-05        5             8           500           0.3
XLMTWITTER   4.9e-05        5             8           500           0.19
English
BERT         3.4e-05        4             8           1000          0.032
LEGALBERT    4.7e-05        4             8           1000          0.11
MDEBERTA     3.1e-05        5             8           500           0.033
ROBERTA      3.8e-05        4             8           250           0.25
XLM          4.1e-05        4             16          1000          0.071
XLMTWITTER   2.9e-05        4             8           500           0.26

The next step in our pipeline is to obtain the contextual sentence embeddings from the classification token, as suggested in [21]. This fixed representation of each document in the corpus allows us to more easily apply feature combination strategies between the LLMs and with the LFs.
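The following minimal sketch illustrates this extraction step with the Hugging Face Transformers library; the checkpoint path is an illustrative placeholder for one of our fine-tuned models.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Illustrative placeholder for a fine-tuned checkpoint, not an actual path.
model_name = "finetuned/beto-dipromats-subtask3"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

@torch.no_grad()
def sentence_embeddings(texts):
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=128, return_tensors="pt")
    hidden = model(**batch).last_hidden_state  # (batch, seq_len, hidden_size)
    return hidden[:, 0, :]                     # embedding of the classification token

# Each document is mapped to a fixed-size vector, regardless of its length.
print(sentence_embeddings(["Example tweet to encode."]).shape)
```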
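The hyperparameter search mentioned above can be sketched with the Ray Tune backend of the Transformers Trainer. This is a sketch under stated assumptions: Ray 2.x with hyperopt installed, pre-tokenized train_dataset and val_dataset objects with multi-hot float label vectors, and illustrative checkpoint name, label count and search ranges.

```python
import numpy as np
from ray import tune
from ray.tune.schedulers import ASHAScheduler
from ray.tune.search.hyperopt import HyperOptSearch
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, Trainer,
                          TrainingArguments)

NUM_LABELS = 7  # one per fine-grained propaganda technique (illustrative)

def model_init():
    return AutoModelForSequenceClassification.from_pretrained(
        "dccuchile/bert-base-spanish-wwm-cased",  # BETO
        num_labels=NUM_LABELS,
        problem_type="multi_label_classification")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = (1.0 / (1.0 + np.exp(-logits)) >= 0.5).astype(int)  # sigmoid + threshold
    return {"f1": f1_score(labels, preds, average="macro", zero_division=0)}

# Search space over the five hyperparameters mentioned above (ranges illustrative).
def hp_space(trial):
    return {
        "learning_rate": tune.loguniform(1e-5, 5e-5),
        "num_train_epochs": tune.choice([3, 4, 5]),
        "per_device_train_batch_size": tune.choice([8, 16]),
        "warmup_steps": tune.choice([250, 500, 1000]),
        "weight_decay": tune.uniform(0.0, 0.3),
    }

trainer = Trainer(
    model_init=model_init,
    args=TrainingArguments(output_dir="hp_search", evaluation_strategy="epoch"),
    train_dataset=train_dataset,  # assumed to be tokenized beforehand
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
)

best_run = trainer.hyperparameter_search(
    hp_space=hp_space,
    compute_objective=lambda metrics: metrics["eval_f1"],
    direction="maximize",
    backend="ray",
    n_trials=10,  # 10 models per LLM, as described above
    search_alg=HyperOptSearch(metric="objective", mode="max"),   # TPE
    scheduler=ASHAScheduler(metric="objective", mode="max"),     # early stopping
)
print(best_run.hyperparameters)
```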
Once we have extracted the embeddings, we train a different neural network model for each LLM, using Keras and a multi-input neural network that combines the LFs with the embeddings. We call this strategy Knowledge Integration (KI); a sketch of the network is given at the end of this subsection. With Keras we also evaluate different network shapes, including the depth and width of the network, as well as the learning rate, batch size, and dropout mechanism.

The results of the hyperparameter optimization with Keras are shown in Table 3. In the case of the LFs, both models (Spanish and English) require a shallow neural network with one hidden layer, but with only 16 neurons for Spanish and 256 for English. This can be explained by the fact that the LFs extracted by UMUTextStats are mainly focused on Spanish, requiring a smaller number of neurons for the best evaluated parameters. Compared to the LFs, the sentence embedding models also required shallow but wider neural networks (one or two hidden layers with a larger number of neurons). For KI, both languages required two hidden layers with 512 neurons.

Table 3
Best hyperparameters for the deep learning models

feature-set  hidden layers  neurons  dropout  lr     batch size  activation
Spanish
LF           1              16       -        0.001  64          linear
BETO         2              37       0.2      0.001  64          tanh
MARIA        1              64       0.1      0.001  64          relu
MDEBERTA     1              512      -        0.001  64          linear
ROBERTALEX   1              256      0.2      0.001  32          relu
XLMTWITTER   2              128      0.3      0.001  64          sigmoid
KI           2              512      0.3      0.01   64          linear
English
LF           1              256      -        0.001  64          linear
BERT         2              512      -        0.001  64          sigmoid
LEGALBERT    2              512      -        0.001  64          tanh
MDEBERTA     1              128      -        0.001  64          tanh
ROBERTA      1              37       0.3      0.001  64          linear
XLMTWITTER   1              256      0.1      0.001  32          tanh
KI           2              512      0.2      0.001  64          linear

Apart from the KI strategy, we build ensemble learning models by combining the outputs of the models trained with the sentence embeddings of each LLM and with the LFs. Specifically, we evaluate three strategies for combining the outputs, sketched below: (1) highest probability, where we choose the maximum probability for each label; (2) the average of the probabilities of each model in the ensemble; and (3) the mode of each label in the predictions.
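A minimal NumPy sketch of the three combination strategies follows; the function name and the 0.5 decision threshold are our assumptions.

```python
import numpy as np

def combine(probs: np.ndarray, strategy: str, threshold: float = 0.5) -> np.ndarray:
    """Combine per-model probabilities of shape (n_models, n_docs, n_labels)
    into binary multi-label predictions of shape (n_docs, n_labels)."""
    if strategy == "highest":  # (1) maximum probability for each label
        return (probs.max(axis=0) >= threshold).astype(int)
    if strategy == "mean":     # (2) average of the probabilities
        return (probs.mean(axis=0) >= threshold).astype(int)
    if strategy == "mode":     # (3) mode of the binary predictions (majority vote)
        votes = (probs >= threshold).astype(int)
        return (votes.sum(axis=0) * 2 > probs.shape[0]).astype(int)
    raise ValueError(f"unknown strategy: {strategy}")
```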
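For the KI strategy itself, the following Keras sketch builds a multi-input network with the best Spanish configuration from Table 3 (two hidden layers of 512 neurons, dropout 0.3, learning rate 0.01, linear activation); the input dimensions are illustrative placeholders, not the exact sizes of our feature sets.

```python
from tensorflow import keras
from tensorflow.keras import layers

N_LF = 365      # number of linguistic features (illustrative)
N_EMB = 768     # sentence-embedding dimension (illustrative)
N_LABELS = 7    # fine-grained propaganda techniques

# One input per feature set: linguistic features and sentence embeddings.
lf_input = keras.Input(shape=(N_LF,), name="linguistic_features")
emb_input = keras.Input(shape=(N_EMB,), name="sentence_embeddings")

x = layers.Concatenate()([lf_input, emb_input])
for _ in range(2):  # two hidden layers of 512 neurons with linear activation
    x = layers.Dense(512, activation="linear")(x)
    x = layers.Dropout(0.3)(x)
outputs = layers.Dense(N_LABELS, activation="sigmoid")(x)  # multi-label head

model = keras.Model(inputs=[lf_input, emb_input], outputs=outputs)
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.01),
              loss="binary_crossentropy")
model.summary()
```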
3.2. Task 2. Automatic detection of narratives from diplomats of major powers

Task 2 is a multi-class and multi-label classification problem that aims to determine the narratives to which the tweets belong, given a set of predefined narratives for each international actor. For the implementation, the organizers provided us with the description of each narrative and some examples of English and Spanish tweets corresponding to each narrative. Since this is a multi-label problem, a tweet can be associated with one, several or none of the narratives.

For Task 2, we have used a few-shot learning approach with different LLMs to determine the narratives of the tweets. LLMs are mainly neural language models based on the Transformer architecture that contain tens to hundreds of billions of parameters and are pre-trained on massive text data. These models exhibit more robust language understanding and generation capabilities, as well as emergent capabilities not found in smaller-scale language models, such as in-context learning, instruction following, and multi-step reasoning. The in-context learning capability enables LLMs to generate more coherent and contextually relevant responses, making them suitable for interactive and conversational applications. In addition, this capability allows LLMs to quickly adapt to a new task by using examples in the input, without the need for retraining or model adaptation. Few-shot learning is a technique in which a model generalizes to new tasks using only a few training examples provided in the LLM prompt; in our case, these are the sets of predefined narratives of each international actor.

The models evaluated for Task 2 are Zephyr-7b-beta and Tulu-2-dpo-7b. Zephyr-7B-beta [22] is the second model in the Zephyr series of language models, designed to serve as useful assistants. It is a fine-tuned version of the Mistral-7B-v0.1 [23] model, trained with Direct Preference Optimization (DPO) on a mixture of public and synthetic datasets. Tulu-2-dpo-7b [24] is a language model developed as part of the Tulu series to act as a helpful assistant in various natural language processing tasks. This model is a fine-tuned version of the Llama 2 [25] (Llama-2-7b-hf) model and has also been trained with DPO.

For Zephyr-7b, prompts must be structured with designated fields: "System", "User", and "Assistant". The "System" field provides instructions or guidance to the model, the "User" field contains the user's intent and the item to be classified, and the "Assistant" field is the output indicator. The DPO fine-tuned iteration of the Tulu model (Tulu-7b-dpo) requires the model input to consist of two fields, "User" and "Assistant": the "User" field specifies the instructions and the instance to be classified, while the "Assistant" field acts as the output indicator. It is important to note that a new line must be added after each field, as this can significantly affect the quality of the generation.
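As an illustration, the sketch below builds such a prompt for Zephyr using the Transformers chat template, which renders the model-specific fields and the trailing newlines described above. The instruction wording and the narrative placeholders are ours, not the exact prompts we submitted.

```python
from transformers import pipeline

generator = pipeline("text-generation",
                     model="HuggingFaceH4/zephyr-7b-beta", device_map="auto")

# Placeholders: the predefined narratives of one actor plus their few-shot examples.
narratives = "N1: <description and examples> N2: <description and examples>"
tweet = "Tweet to be classified."

messages = [
    {"role": "system",
     "content": "You are a classifier. Given the narratives below, return the "
                "identifiers of all narratives expressed by the tweet, or 'none'.\n"
                + narratives},
    {"role": "user", "content": tweet},
]

# apply_chat_template renders the System, User and Assistant fields, each
# followed by a new line, and appends the output indicator for generation.
prompt = generator.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True)
output = generator(prompt, max_new_tokens=32, do_sample=False)
print(output[0]["generated_text"])  # the prompt plus the generated labels
```

For Tulu, the same approach applies, except that its template only exposes the "User" and "Assistant" fields, so the instructions are placed in the user turn.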
4. Results and discussion

4.1. Task 1

To evaluate the performance of the models, we use a custom validation split. The results for the third subtask, i.e., the multi-label task, are shown in Table 4 for Spanish (top) and English (bottom).

For Spanish, the best model is obtained with the KI strategy, with a macro F1-score of 54.467%, outperforming the individual models. This model achieved very good recall, but some ensemble learning models achieved better precision (the EL based on the mode) or better recall (the EL based on the highest probability). The LFs achieved limited results, although better in terms of recall and F1-score than in the previous edition. In the case of English, the best result is obtained with RoBERTa in isolation (F1-score of 58.083%), while the EL based on the highest probability obtained better recall but very limited precision.

The results obtained for English are more similar to our performance in the previous edition, where individual models achieved better performance than feature integration. The results for Spanish are in line with other work carried out by our research group [26, 27, 28, 29].

For the competition, we submitted a total of five runs: three for Task 1 and two for Task 2. All three runs for Task 1 were based on feature integration: one on the KI strategy and the other two on ensembles, one based on the highest probability and the other on the mode. Both runs for Task 2 were based on few-shot learning with TuLu and Zephyr.

For Task 1, we report the results of the official leaderboard. The metric used to compare the systems is ICM-Hard [30]. Table 5 shows the official leaderboard of the DIPROMATS 2024 shared task; we include only one run per competitor, as we believe this is the fairest comparison. Our best results were obtained with our third run, based on KI. We ranked 4th in the binary classification task, with an ICM-Hard of 0.1646. For the second and third subtasks, we ranked 3rd, with ICM-Hard scores of -0.0832 and -0.1383, respectively.

Next, Table 6 shows the results obtained for each run on the test set with the macro F1-score. These runs are based on feature integration: the first run uses ensemble learning based on the highest probability, the second uses ensemble learning based on the mode, and the third is based on KI. Reviewing the results per run, the best performing strategy is KI, which outperforms the rest in F1-score except in Subtask 3 (English). In general, for subtasks 2 and 3 the results for English were better than those for Spanish. A possible interpretation is that the Spanish dataset is more complex and its propaganda messages are harder to detect.

4.2. Task 2

Table 7 shows the results obtained with the LLMs on the test set for Task 2. Three metrics were used to evaluate the performance of the models: F1 Strict, F1 Lenient and F1 Avg. These metrics provide a comprehensive evaluation of the accuracy of the models in different aspects of the classification. The results are presented for the two languages, Spanish and English.
For Spanish, the baseline model scores highest on all metrics, followed by Zephyr and Tulu. For English, Zephyr outperforms both Tulu and the baseline on all metrics except F1 Lenient, where the baseline is higher.

Table 4
Results on the custom validation split of the DIPROMATS 2024 shared task for Spanish (top) and English (bottom), subtask 3

Spanish
feature-set    precision  recall  f1-score
LF             28.322     33.652  28.295
BETO           63.496     40.014  46.525
MARIA          68.851     37.047  45.644
MDEBERTA       44.130     24.817  26.280
ROBERTALEX     67.399     32.479  41.127
XLMTWITTER     61.669     27.030  33.935
KI             70.041     48.796  54.467
EL (HIGHEST)   33.078     56.158  39.965
EL (MEAN)      89.442     30.138  39.725
EL (MODE)      89.040     28.405  38.442

English
feature-set    precision  recall  f1-score
LF             17.425     31.776  20.841
BERT           63.383     51.708  56.391
LEGALBERT      61.449     50.508  54.710
MDEBERTA       65.206     48.096  53.076
ROBERTA        72.187     56.558  58.083
XLMTWITTER     64.222     44.151  50.501
KI             63.527     52.921  56.111
EL (HIGHEST)   34.882     76.522  46.104
EL (MEAN)      78.807     46.908  54.681
EL (MODE)      79.635     41.196  50.410

Table 5
Official leaderboard for the three subtasks of Task 1 of the DIPROMATS 2024 shared task (ICM-Hard; the team name for the last entry of Subtask 3 is missing in the source)

Subtask 1                   Subtask 2                   Subtask 3
Victor Vectors  0.2048      DSHacker        -0.0074     DSHacker        -0.1144
DSHacker        0.2018      Victor Vectors  -0.0425     Victor Vectors  -0.1144
PropaLTL        0.1667      UMUTEAM (03)    -0.0832     UMUTEAM (03)    -0.1383
UMUTEAM (03)    0.1646      UC3M-LCPM       -0.2205     —               -0.6844
UC3M-LCPM       0.1268

Table 6
Results per run on the test set with the macro F1-score for Task 1 of DIPROMATS 2024

       Subtask 1          Subtask 2          Subtask 3
run    Spanish  English   Spanish  English   Spanish  English
01     64.55    49.17     31.13    34.31     33.15    25.47
02     77.20    78.77     34.90    55.03     24.53    41.64
03     78.13    81.17     41.34    58.77     38.84    40.80

Table 7
Results obtained with the LLMs on the test set for Task 2 of DIPROMATS 2024

          Spanish                         English
run       F1 Strict  F1 Lenient  F1 Avg   F1 Strict  F1 Lenient  F1 Avg
Zephyr    0.3046     0.4427      0.3737   0.3149     0.5265      0.4207
TuLu      0.2729     0.3976      0.3353   0.2303     0.4441      0.3372
Baseline  0.3769     0.5278      0.4524   0.2875     0.5446      0.4161

In summary, the Mixtral-8x7B baseline performs better in Spanish, outperforming the other evaluated models, Zephyr and Tulu, on all metrics. However, its performance in English is worse than that of Zephyr, with an F1 Avg of 0.4161, which is 0.46 percentage points lower, despite being a much larger model.

5. Conclusions

In this paper we have presented our approach to the DIPROMATS 2024 shared task. We focused on propaganda characterization as a multi-label problem, since models trained for this task can also solve the propaganda identification subtask and the coarse-grained propaganda characterization subtask. Our approach evaluated linguistic features and sentence embeddings from several LLMs, including models specific to English and Spanish as well as multilingual models. We achieved competitive results in all tasks.

The task of identifying the constitutive elements of narratives, such as references to social symbols, loaded language, and emotional cues, is central to understanding the communication processes that shape political identities and attitudes. Group appeals, and their contrasts with outgroups or adversaries, contain value-charged claims and emotional rhetoric that have important effects on attitude formation processes and behavioral intentions. Evidence for the mobilizing effects of anger and hope, or conversely for the inaction produced by fear and doubt, confirms the need to correctly identify the constituent elements of political narratives.
The 2024 task is particularly relevant because narratives not only have the power to influence the climate of public opinion, but also benefit from inflamed political contexts to disseminate their messages effectively. Thus, examining how diplomats' narratives were structured in the context of the COVID-19 pandemic can shed light on how everyday political interactions are intimately connected to the context in which they are produced. In a media landscape dominated by fast-moving, profit-driven social media platforms and global media conglomerates, the ability to identify the structuring elements of political narratives that constitute propaganda or persuasive communication is a crucial matter. Identifying propaganda and the elements that constitute its underlying narratives allows us to understand the behavior of political actors and their media landscapes. This is a first, but very important, step in observing political communication processes that are central to democracy.

In future work, we plan to compare our Task 1 results with those of a model trained specifically for propaganda identification. In addition, our evidence suggests that models based on BERT and BETO outperform more sophisticated approaches that have been effective in other shared tasks; accordingly, we will provide a detailed error analysis for each propaganda technique. We also plan to compile audio and video of politicians and examine propaganda using audio features, similar to [31].

Acknowledgments

This work has been supported by the projects LaTe4PoliticES (PID2022-138099OB-I00), funded by MICIU/AEI/10.13039/501100011033 and the European Regional Development Fund (ERDF)-a way of making Europe; LT-SWM (TED2021-131167B-I00), funded by MICIU/AEI/10.13039/501100011033 and by the European Union NextGenerationEU/PRTR; and "Services based on language technologies for political microtargeting" (22252/PDC/23), funded by the Autonomous Community of the Region of Murcia through the Regional Support Program for the Transfer and Valorization of Knowledge and Scientific Entrepreneurship of the Seneca Foundation, Science and Technology Agency of the Region of Murcia. Mr. Ronghao Pan is supported by the Programa Investigo grant, funded by the Region of Murcia, the Spanish Ministry of Labour and Social Economy and the European Union - NextGenerationEU under the "Plan de Recuperación, Transformación y Resiliencia (PRTR)".

References

[1] C. Sparkes-Vian, Digital propaganda: The tyranny of ignorance, Critical Sociology 45 (2019) 393–409.
[2] G. Da San Martino, S. Yu, A. Barrón-Cedeño, R. Petrov, P. Nakov, Fine-grained analysis of propaganda in news article, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 5636–5646.
[3] A. Godber, G. Origgi, Telling Propaganda from Legitimate Political Persuasion, Episteme 20 (2023) 778–797.
[4] Q. Cassam, Bullshit, post-truth, and propaganda, Political Epistemology (2021) 49–63.
[5] S. Groth, Political narratives/narrations of the political: An introduction, Narrative Culture 6 (2019) 1–18.
[6] P. Moral, J. Fraile, G. Marco, A. Peñas, J. Gonzalo, Overview of DIPROMATS 2024: Detection, characterization and tracking of propaganda in messages from diplomats and authorities of world powers, Procesamiento del Lenguaje Natural 73 (2024).
[7] L. Chiruzzo, S. M. Jiménez-Zafra, F. Rangel, Overview of IberLEF 2024: Natural Language Processing Challenges for Spanish and other Iberian Languages, in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2024), co-located with the 40th Conference of the Spanish Society for Natural Language Processing (SEPLN 2024), CEUR-WS.org, 2024.
[8] P. Moral, G. Marco, J. Gonzalo, J. Carrillo-de-Albornoz, I. Gonzalo-Verdugo, Overview of DIPROMATS 2023: automatic detection and characterization of propaganda techniques in messages from diplomats and authorities of world powers, Procesamiento del Lenguaje Natural 71 (2023).
[9] C. K. Riessman, Narrative Methods for the Human Sciences, Sage, 2008.
[10] B. McLaughlin, J. A. Velez, J. A. Dunn, The political world within: How citizens process and experience political narratives, Annals of the International Communication Association 43 (2019) 156–172.
[11] J. A. García-Díaz, P. J. Vivancos-Vicente, A. Almela, R. Valencia-García, UMUTextStats: A linguistic feature extraction tool for Spanish, in: Proceedings of the Thirteenth Language Resources and Evaluation Conference, 2022, pp. 6035–6044.
[12] J. Cañete, G. Chaperon, R. Fuentes, J.-H. Ho, H. Kang, J. Pérez, Spanish pre-trained BERT model and evaluation data, PML4DC at ICLR 2020 (2020) 1–10.
[13] A. Gutiérrez-Fandiño, J. Armengol-Estapé, M. Pàmies, J. Llop-Palao, J. Silveira-Ocampo, C. P. Carrino, C. Armentano-Oller, C. Rodríguez-Penagos, A. Gonzalez-Agirre, M. Villegas, MarIA: Spanish Language Models, Procesamiento del Lenguaje Natural 68 (2022). URL: https://upcommons.upc.edu/handle/2117/367156. doi:10.26342/2022-68-3.
[14] F. Barbieri, L. Espinosa Anke, J. Camacho-Collados, XLM-T: Multilingual language models in Twitter for sentiment analysis and beyond, in: Proceedings of the Thirteenth Language Resources and Evaluation Conference, 2022, pp. 258–266.
[15] A. Gutiérrez-Fandiño, J. Armengol-Estapé, A. Gonzalez-Agirre, M. Villegas, Spanish Legalese Language Model and Corpora, 2021. arXiv:2110.12201.
[16] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.
[17] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A Robustly Optimized BERT Pretraining Approach, CoRR abs/1907.11692 (2019). URL: http://arxiv.org/abs/1907.11692. arXiv:1907.11692.
[18] I. Chalkidis, M. Fergadiotis, P. Malakasiotis, N. Aletras, I. Androutsopoulos, LEGAL-BERT: The Muppets straight out of Law School, in: Findings of the Association for Computational Linguistics: EMNLP 2020, 2020, pp. 2898–2904.
[19] R. Liaw, E. Liang, R. Nishihara, P. Moritz, J. E. Gonzalez, I. Stoica, Tune: A research platform for distributed model selection and training, arXiv preprint arXiv:1807.05118 (2018).
[20] J. Bergstra, R. Bardenet, Y. Bengio, B. Kégl, Algorithms for hyper-parameter optimization, Advances in Neural Information Processing Systems 24 (2011).
[21] N. Reimers, I. Gurevych, Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks, in: K. Inui, J. Jiang, V. Ng, X. Wan (Eds.),
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, Association for Computational Linguistics, 2019, pp. 3980–3990. URL: https://doi.org/10.18653/v1/D19-1410. doi:10.18653/v1/D19-1410.
[22] L. Tunstall, E. Beeching, N. Lambert, N. Rajani, K. Rasul, Y. Belkada, S. Huang, L. von Werra, C. Fourrier, N. Habib, N. Sarrazin, O. Sanseviero, A. M. Rush, T. Wolf, Zephyr: Direct Distillation of LM Alignment, 2023. arXiv:2310.16944.
[23] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. d. l. Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, et al., Mistral 7B, arXiv preprint arXiv:2310.06825 (2023).
[24] H. Ivison, Y. Wang, V. Pyatkin, N. Lambert, M. Peters, P. Dasigi, J. Jang, D. Wadden, N. A. Smith, I. Beltagy, H. Hajishirzi, Camels in a Changing Climate: Enhancing LM Adaptation with Tulu 2, 2023. arXiv:2311.10702.
[25] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al., Llama 2: Open foundation and fine-tuned chat models, arXiv preprint arXiv:2307.09288 (2023).
[26] J. A. García-Díaz, F. García-Sánchez, R. Valencia-García, Smart analysis of economics sentiment in Spanish based on linguistic features and transformers, IEEE Access 11 (2023) 14211–14224.
[27] J. A. García-Díaz, S. M. Jiménez-Zafra, M. A. García-Cumbreras, R. Valencia-García, Evaluating feature combination strategies for hate-speech detection in Spanish using linguistic features and transformers, Complex & Intelligent Systems (2022) 1–22.
[28] J. A. García-Díaz, R. Valencia-García, Compilation and evaluation of the Spanish SatiCorpus 2021 for satire identification using linguistic features and transformers, Complex & Intelligent Systems 8 (2022) 1723–1736.
[29] J. A. García-Díaz, G. Beydoun, R. Valencia-García, Evaluating Transformers and Linguistic Features integration for Author Profiling tasks in Spanish, Data & Knowledge Engineering 151 (2024) 102307.
[30] E. Amigó, A. Delgado, Evaluating Extreme Hierarchical Multi-label Classification, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 5809–5819. URL: https://aclanthology.org/2022.acl-long.399. doi:10.18653/v1/2022.acl-long.399.
[31] R. Pan, J. A. García-Díaz, M. Á. Rodríguez-García, R. Valencia-García, Spanish MEACorpus 2023: A multimodal speech–text corpus for emotion analysis in Spanish from natural environments, Computer Standards & Interfaces 90 (2024) 103856.