1. Introduction

J. A. García-Díaz); valencia@um.es (R. Valencia-García)

UMUTeam at Dipromats 2023: Propaganda Detection in Spanish and English Combining Linguistic Features with Contextual Sentence Embeddings

José Antonio García-Díaz

Rafael Valencia-García

Facultad de Informática, Universidad de Murcia, Campus de Espinardo, 30100, Spain

2023


These notes summarise the UMUTeam's contribution to the Dipromats shared task of IberLEF 2023, which deals with the fine-grained detection of propaganda techniques in the political domain, using texts written in English and Spanish. Our contribution is based on the combination of linguistic features and sentence embeddings extracted from several large language models, using ensemble learning and knowledge integration. We rank third in the binary classification subtask, first in the multi-classification subtask, and second in the multi-label classification subtask.

Keywords: Propaganda Identification, Feature Engineering, Transformers, Knowledge Integration, Ensemble Learning, Natural Language Processing

2. Dataset

According to the organisers of Dipromats 2023, the dataset is a cleaned and filtered version of more than one million tweets in different languages, collected between 1 January 2020 and 11 March 2021. The selected accounts belong to governments, embassies and consulates, among others. The dataset is divided into tweets written in Spanish and tweets written in English. The final dataset consists of 9,501 Spanish tweets and 14,747 English tweets. In addition, the organisers split the data into training and test sets using temporal criteria.

For labelling, the organisers of the Dipromats 2023 shared task used the taxonomy proposed in [2], extended with other techniques. They also grouped the techniques into four main categories: (1) appeal to commonality, which includes techniques related to patriotism based on fallacious reasoning and emotions; (2) discrediting the opponent, with techniques that show hostility towards the political opponent using fallacies and evoking negative emotions; (3) loaded language, which refers to the use of hyperbolic language, metaphors and expressions with strong emotional implications; and (4) appeal to authority, which includes appeals to false authority and bandwagoning, that is, the attempt to persuade the audience to take an action because someone else is taking the same action.

Table 1 shows the statistics of the Spanish and English partitions of the Dipromats 2023 task. As can be seen, the dataset is very unbalanced. Furthermore, there are no documents labelled as appeal to false authority in the English partition or as bandwagoning in the Spanish partition.

3. Methodology

We approach this problem separately for each language, as there are powerful LLMs available specifically for both Spanish and English. However, we only train models for the third subtask, the fine-grained propaganda characterisation, because its predictions also solve the first task, the binary propaganda identification. In this way, we reduce the number of models we need to train, thus saving time and resources.
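The reduction from fine-grained labels to the other subtasks can be sketched as follows. This is only an illustrative sketch: the technique names and the technique-to-category map are hypothetical and incomplete (only the two appeal-to-authority techniques are named in the task description above), and the function names are our own.

```python
from typing import Dict, List, Set

# Hypothetical and incomplete map from fine-grained techniques to the four
# main categories; used only to illustrate the idea.
TECHNIQUE_TO_CATEGORY: Dict[str, str] = {
    "flag waving": "appeal to commonality",
    "name calling": "discrediting the opponent",
    "loaded language": "loaded language",
    "appeal to false authority": "appeal to authority",
    "bandwagoning": "appeal to authority",
}


def is_propaganda(techniques: List[str]) -> bool:
    """Binary label: a tweet is propaganda if any technique was predicted."""
    return len(techniques) > 0


def coarse_categories(techniques: List[str]) -> Set[str]:
    """Coarse labels: the set of main categories of the predicted techniques."""
    return {TECHNIQUE_TO_CATEGORY[t] for t in techniques}
```

Under these assumptions, an empty prediction maps to the negative class, and a tweet predicted as both bandwagoning and appeal to false authority maps to the single category appeal to authority.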

In short, our methodology for solving this task is a typical machine learning pipeline that consists of cleaning the dataset, extracting features from the documents, and training and evaluating several neural network models with a custom split before submitting our final runs.

During the data cleaning phase, we convert numbers into fixed tokens to help the model generalise. For the same reason, we remove mentions and hyperlinks, among other elements specific to social networks. Our last step in the data cleaning stage is to find and expand acronyms and social-media jargon in the message text.
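A minimal sketch of such a cleaning step is shown below. It is not the exact procedure used in our pipeline: the regular expressions, the `[NUMBER]` placeholder and the one-entry acronym table are assumptions chosen for illustration.

```python
import re

# Assumed patterns; the real cleaning rules may differ.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")
SPACES_RE = re.compile(r"\s+")

# Hypothetical acronym table, for illustration only.
ACRONYMS = {"UN": "United Nations"}


def clean_tweet(text, acronyms=ACRONYMS):
    """Remove hyperlinks and mentions, expand acronyms, mask numbers."""
    text = URL_RE.sub(" ", text)       # drop hyperlinks first (they may contain digits)
    text = MENTION_RE.sub(" ", text)   # drop @mentions
    for short, expansion in acronyms.items():
        # Replace the acronym only when it appears as a whole word.
        text = re.sub(rf"\b{re.escape(short)}\b", expansion, text)
    text = NUMBER_RE.sub("[NUMBER]", text)  # fixed token for every number
    return SPACES_RE.sub(" ", text).strip()
```

For example, `clean_tweet("@user The UN sent 1,200 tents https://t.co/abc")` yields `"The United Nations sent [NUMBER] tents"`.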

In the feature extraction phase, we extract linguistic features (LFs) with UMUTextStats [4] and sentence embeddings from several LLMs after fine-tuning. It is worth mentioning that some of the LLMs are used exclusively for Spanish, for English, or for both, depending on whether they are pre-trained for a specific language or are multilingual. The LLMs evaluated are the following:

• BERT and BETO. Bidirectional Encoder Representations from Transformers (BERT) [5] is an LLM that uses the Transformer architecture to learn contextual word embeddings that better capture the semantics and relationships between words in a sentence. BETO [6] is the Spanish version of BERT, trained on data from Spanish wikis, subtitles and speeches from the Spanish parliament, among others.
• ALBERT and ALBETO. ALBERT is a lighter and more efficient version of BERT [7]. ALBERT uses parameter factorisation and cross-layer parameter sharing, allowing it to be more efficient in its use of computational resources without significantly compromising the performance of the model. There is also a version of ALBERT trained on Spanish data [8].
• DistilBERT [9] and DistilBETO [8]. These are versions of BERT and BETO based on distillation, which is another way of constructing lightweight LLMs.
• RoBERTa and MarIA. The RoBERTa architecture improves on BERT in the pre-training approach and some technical aspects [10]. MarIA [11] is a model based on the RoBERTa architecture, but trained on Spanish data.
• BERTIN. This is a Spanish LLM [12] based on the RoBERTa architecture. Unlike MarIA, BERTIN was trained on the Spanish part of the mC4 dataset during an event sponsored by Google Cloud.
• Multilingual BERT. This is a multilingual version of BERT [5]. It has the same architecture, but is trained on data from more than 100 languages.
• XLM. This is a multilingual LLM [13] that can transfer knowledge from one language to another, allowing models trained in one language to be used for tasks in other languages without the need for additional training.
• XLM-Twitter. This is an alternative version of XLM that is based on RoBERTa, but trained on almost 198 million tweets written in different languages [14].
• mDeBERTa. This is the multilingual version of DeBERTa, an LLM that uses disentangled attention. This LLM is currently in its third version [15].
• Legal-BERT. This is an English LLM based on BERT but trained for the legal domain [16]. Its training data comprises about 12 GB of texts on legislation, court cases and contracts extracted from public sources. It should be noted that this model is lighter than BERT.

For each LLM, we obtain its sentence embeddings, since a fixed representation of the data simplifies the task of combining the LLM with the linguistic features. To find the best configuration for each LLM, we train 10 models per LLM for Spanish and for English, evaluating different learning rates, training epochs, batch sizes, warm-up steps and weight decay. This step is carried out using RayTune [17] with Distributed Asynchronous Hyperparameter Optimisation (HyperOptSearch), the Tree of Parzen Estimators (TPE) algorithm [18] and the ASHA scheduler (because it favours parallelism). Table 2 shows the best configuration found for each LLM. It can be observed that most models require a relatively large number of training epochs, between 4 and 5, with a few exceptions (ALBETO in Spanish, and multilingual BERT and DeBERTa in English). Even Legal-BERT, the model whose pre-training texts are most closely related to this shared task, needed 5 epochs to obtain its best result in this experiment.

We then obtained the contextual sentence embeddings from the classification token, as suggested in [19]. This fixed representation of each document in the corpus allows us to more easily combine these embeddings with each other or with external features.

With these sentence embeddings we train a second set of neural network models, this time using Keras and simple feed-forward networks. This process allows a fair comparison with the LFs and the training of a multi-input neural network based on Knowledge Integration (KI), which combines all feature sets at once. In this stage we test different numbers of hidden layers and neurons arranged in different shapes, as well as the activation function between layers. The learning rate, batch size and dropout rate are also evaluated. The results are shown in Table 3. As these features were already fine-tuned in the previous step, most of the resulting architectures are simple, being shallow neural networks with one or two hidden layers at most. The most notable difference is the number of neurons: with Knowledge Integration we obtained 1024 neurons in one layer for Spanish, but only 16 for English.

Please note that the mBERT model is not included in the Spanish experiments due to a human error during our participation. As this model is not part of the KI strategy, we have decided not to include it in the results.

Table 3 (Spanish feature sets). Only the hidden-layers column is legible; the values for neurons, dropout, lr, batch size and activation are missing.

feature-set    hidden layers
LF             1
ALBETO         1
BERTIN         2
BETO           1
DistilBETO     1
MarIA          1
mDeBERTa       1
XLM            2
XLM-Twitter    2
KI             1

Finally, we build the ensemble learning models by combining the outputs of the models trained with the sentence embeddings of each LLM and with the LFs. We use three strategies to combine these outputs. The first is called highest probability, as we choose the maximum probability for each label. The second strategy averages the probabilities of the models in the ensemble, and the last strategy takes the mode of the predictions for each label.
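The three strategies can be sketched for a single document as follows. This is an illustrative sketch rather than our exact implementation: the decision threshold of 0.5, the resolution of ties in the mode towards the negative class, and the function names are all assumptions.

```python
from statistics import mean
from typing import List

# probs[m][l] is the probability that model m assigns to label l.
Probs = List[List[float]]


def highest_probability(probs: Probs, threshold: float = 0.5) -> List[int]:
    """Predict a label if the most confident model reaches the threshold."""
    return [int(max(col) >= threshold) for col in zip(*probs)]


def average_probability(probs: Probs, threshold: float = 0.5) -> List[int]:
    """Predict a label if the mean probability across models reaches the threshold."""
    return [int(mean(col) >= threshold) for col in zip(*probs)]


def mode_vote(probs: Probs, threshold: float = 0.5) -> List[int]:
    """Binarise each model first, then take the majority vote per label.

    Ties are resolved towards 0 (an assumption of this sketch).
    """
    votes = [[int(p >= threshold) for p in model] for model in probs]
    return [int(2 * sum(col) > len(col)) for col in zip(*votes)]


example = [
    [0.9, 0.2, 0.4],  # model 1
    [0.3, 0.1, 0.6],  # model 2
    [0.2, 0.3, 0.7],  # model 3
]
print(highest_probability(example))  # [1, 0, 1]
print(average_probability(example))  # [0, 0, 1]
print(mode_vote(example))            # [0, 0, 1]
```

In this sketch, highest probability predicts a label as soon as any single model is confident about it, so by construction it never predicts fewer labels than the other two strategies.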

4. Results and discussion

First, we report our results using our custom validation split. Note that these results focus on the multi-label task. The results are shown in Table 4 for Spanish (left) and English (right). It can be seen that the best results were obtained with individual models rather than with strategies for combining the results. This is the case of BETO for Spanish and ALBERT for English. This contrasts with other tasks, where the combination of features has usually yielded better performance, such as sentiment analysis [20], hate speech detection [21], satire identification [22] or author profiling [23]. As for the results obtained by the LFs, they are more limited in English. This is to be expected, since UMUTextStats focuses on Spanish, although the features based on stylometry and morphosyntax are also suitable for English. The ensemble learning strategies achieve competitive results. The highest probability strategy is usually the one that achieves the best recall together with good precision for this task.

For the Dipromats 2023 shared task, participants are allowed to submit 5 runs. We decided to send three runs based on the ensemble learning strategies, as they give very competitive results, and to reserve the remaining two runs as internal baselines, one based on the linguistic features (run 04) and another based on BETO and BERT (run 05).

Next, we report the results of the official leaderboard. The metric used to compare the systems is ICM-Hard [24]. According to the organisers, there is a baseline based on RoBERTa (MarIA for Spanish) for Task 1, and for Task 2 they trained the same models but, instead of training them in a multi-label fashion, they trained all the labels separately (including the negative class). However, we expect these baselines to be reported in the task overview, as the baseline results currently shown seem to come from simple heuristics based on the less frequent labels.

Table 5 shows the official leaderboard of the Dipromats 2023 shared task. We show only one run per participant, as we believe this gives the fairest leaderboard. In this sense, we are in second place in the binary classification task, with an ICM-Hard of 0.1165. We obtained this result with our third run, based on ensemble learning with the mode of predictions. The results obtained by all the participants are similar. They have an average macro F1 score of 77.108% and a standard deviation of 1.9 (F1 score results are not shown in this table). For the second task (multi-classification) we get the best result, with an ICM-Hard of -0.0037 with our third run and -0.005 with our first run. These results are followed by VRAIN-ELiRF (ICM-Hard of -0.0117). For the third task, the multi-label classification, the best result is achieved by the VRAIN-ELiRF team (ICM-Hard of -0.1232), followed by us (ICM-Hard of -0.1318). It should be noted that we obtained this result with our fifth run, based on BERT for English and BETO for Spanish.

Next, Table 6 shows the results obtained for each run on the test set. The first three runs (01, 02 and 03) are based on ensemble learning over the LLMs and the LFs, but use different strategies for combining the models. The first run is based on the average of the probabilities, the second run on the highest probability and the third run on the mode. We use the fourth and fifth runs as internal baselines. The fourth run uses the linguistic features alone, and the fifth run uses the features from BERT for the English texts and from BETO for the Spanish texts.

As noted when reviewing the rankings, the fifth run, based on a fine-tuned BERT and a fine-tuned BETO, outperforms the other runs. Looking at the results per run, ensemble learning based on averaging probabilities (run 01) and on the mode (run 03) achieve similar results. The second run, based on the highest probability, achieves limited results in Task 2, especially in the Spanish split. The fourth run, based on linguistic features only, gave the most limited results in all tasks. These results are not surprising, as LFs in isolation are not able to capture the same patterns as state-of-the-art LLMs. However, the fact that combining the features with ensemble learning does not always improve the results is noteworthy, and an error analysis should be carried out as further work.

5. Conclusions

In this paper we have presented our approach to solving the Dipromats 2023 shared task. We focus on propaganda characterisation in a multi-label way, as the models trained for this task can also solve the propaganda identification task and the propaganda characterisation task under a multi-classification approach. Our approach evaluates linguistic features and sentence embeddings from multiple LLMs, including models specific to English and Spanish as well as multilingual models. We achieve competitive results in all tasks and we are very satisfied with the results. We think that this is a very relevant task and we expect to participate in similar tasks in the future, as fighting propaganda and misinformation is a relevant challenge in our daily lives.

In terms of further work, we should compare our results in Task 1 with those of a model trained specifically for propaganda identification. In addition, it is noteworthy that the models based on BERT and BETO outperform more sophisticated approaches that have been effective in other shared tasks. Accordingly, we will perform a detailed error analysis for each propaganda technique.

Acknowledgments

This work is part of the research projects AIInFunds (PDC2021-121112-I00) and LT-SWM (TED2021-131167B-I00) funded by MCIN/AEI/10.13039/501100011033 and by the European Union NextGenerationEU/PRTR. This work is also part of the research project LaTe4PSP (PID2019-107652RB-I00/AEI/10.13039/501100011033) funded by MCIN/AEI/10.13039/501100011033.

References

[1] C. Sparkes-Vian, Digital propaganda: The tyranny of ignorance, Critical Sociology 45 (2019) 393–409.
[2] G. Da San Martino, S. Yu, A. Barrón-Cedeno, R. Petrov, P. Nakov, Fine-grained analysis of propaganda in news article, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 5636–5646.
[3] P. Moral, G. Marco, J. Gonzalo, J. Carrillo-de-Albornoz, I. Gonzalo-Verdugo, Overview of DIPROMATS 2023: automatic detection and characterization of propaganda techniques in messages from diplomats and authorities of world powers, Procesamiento del Lenguaje Natural 71 (2023).
[4] J. A. García-Díaz, P. J. Vivancos-Vicente, A. Almela, R. Valencia-García, UMUTextStats: A linguistic feature extraction tool for Spanish, in: Proceedings of the Thirteenth Language Resources and Evaluation Conference, 2022, pp. 6035–6044.
[5] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.
[6] J. Cañete, G. Chaperon, R. Fuentes, J.-H. Ho, H. Kang, J. Pérez, Spanish pre-trained BERT model and evaluation data, PML4DC at ICLR 2020 (2020) 1–10.
[7] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, R. Soricut, ALBERT: A lite BERT for self-supervised learning of language representations, CoRR abs/1909.11942 (2019). URL: http://arxiv.org/abs/1909.11942. arXiv:1909.11942.
[8] J. Cañete, S. Donoso, F. Bravo-Marquez, A. Carvallo, V. Araujo, ALBETO and DistilBETO: Lightweight Spanish language models, in: Proceedings of the Thirteenth Language Resources and Evaluation Conference, 2022, pp. 4291–4298.
[9] V. Sanh, L. Debut, J. Chaumond, T. Wolf, DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, arXiv preprint arXiv:1910.01108 (2019).
[10] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, CoRR abs/1907.11692 (2019). URL: http://arxiv.org/abs/1907.11692. arXiv:1907.11692.
[11] A. G. Fandiño, J. A. Estapé, M. Pàmies, J. L. Palao, J. S. Ocampo, C. P. Carrino, C. A. Oller, C. R. Penagos, A. G. Agirre, M. Villegas, MarIA: Spanish language models, Procesamiento del Lenguaje Natural 68 (2022). URL: https://upcommons.upc.edu/handle/2117/367156#.YyMTB4X9A-0.mendeley. doi:10.26342/2022-68-3.
[12] J. de la Rosa, E. G. Ponferrada, M. Romero, P. Villegas, P. González de Prado Salas, M. Grandury, BERTIN: efficient pre-training of a Spanish language model using perplexity sampling, Procesamiento del Lenguaje Natural 68 (2022) 13–23. URL: http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6403.
[13] A. Conneau, G. Lample, Cross-lingual language model pretraining, in: Proceedings of the 33rd International Conference on Neural Information Processing Systems, 2019, pp. 7059–7069.
[14] F. Barbieri, L. E. Anke, J. Camacho-Collados, XLM-T: Multilingual language models in Twitter for sentiment analysis and beyond, in: Proceedings of the Thirteenth Language Resources and Evaluation Conference, 2022, pp. 258–266.
[15] P. He, J. Gao, W. Chen, DeBERTaV3: Improving DeBERTa using ELECTRA-style pre-training with gradient-disentangled embedding sharing, arXiv preprint arXiv:2111.09543 (2021).
[16] I. Chalkidis, M. Fergadiotis, P. Malakasiotis, N. Aletras, I. Androutsopoulos, LEGAL-BERT: The muppets straight out of law school, in: Findings of the Association for Computational Linguistics: EMNLP 2020, 2020, pp. 2898–2904.
[17] R. Liaw, E. Liang, R. Nishihara, P. Moritz, J. E. Gonzalez, I. Stoica, Tune: A research platform for distributed model selection and training, arXiv preprint arXiv:1807.05118 (2018).
[18] J. Bergstra, R. Bardenet, Y. Bengio, B. Kégl, Algorithms for hyper-parameter optimization, Advances in Neural Information Processing Systems 24 (2011).
[19] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using siamese BERT-networks, in: K. Inui, J. Jiang, V. Ng, X. Wan (Eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, Association for Computational Linguistics, 2019, pp. 3980–3990. URL: https://doi.org/10.18653/v1/D19-1410. doi:10.18653/v1/D19-1410.
[20] J. A. García-Díaz, F. García-Sánchez, R. Valencia-García, Smart analysis of economics sentiment in Spanish based on linguistic features and transformers, IEEE Access 11 (2023) 14211–14224.
[21] J. A. García-Díaz, S. M. Jiménez-Zafra, M. A. García-Cumbreras, R. Valencia-García, Evaluating feature combination strategies for hate-speech detection in Spanish using linguistic features and transformers, Complex & Intelligent Systems (2022) 1–22.
[22] J. A. García-Díaz, R. Valencia-García, Compilation and evaluation of the Spanish SatiCorpus 2021 for satire identification using linguistic features and transformers, Complex & Intelligent Systems 8 (2022) 1723–1736.
[23] J. A. García-Díaz, R. Colomo-Palacios, R. Valencia-García, Psychographic traits identification based on political ideology: An author analysis study on Spanish politicians' tweets posted in 2020, Future Generation Computer Systems 130 (2022) 59–74.
[24] E. Amigo, A. Delgado, Evaluating extreme hierarchical multi-label classification, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 5809–5819. URL: https://aclanthology.org/2022.acl-long.399. doi:10.18653/v1/2022.acl-long.399.