Evaluation of Linguistic Features Separately or Combined with Transformers for Solving Automatic Text Classification Tasks in Spanish José Antonio García-Díaz1 1 Facultad de Informática, Universidad de Murcia, Campus de Espinardo, 30100 Murcia, Spain Abstract In this paper we describe the evaluation stage and analysis of the UMUTextStats tool for extracting linguistic features applied to text classification in several domains, including the identification of sexist or offensive comments, emotions, and a fine-grained analysis regarding what texts are funny and what mechanisms are involved to make them funny. These subtasks were organised by IberLEF and IberEval 2021 workshops. During the participation on these subtasks, the linguistic features were evaluated separately and combined with state-of-the-art transformers by means of ensembles and knowledge integration strategies, with the objective of achieve competitive results in all tasks. At the same time, we seek to improve our methods to obtain some interpretability of the results. In summary, our results suggest than the combination of different feature sets improves text classification tasks, especially when they are input in the same neural network. Keywords Text classification, Feature engineering, Natural Language Processing 1. Introduction In the past edition of the doctoral symposium organised by the thematic network PLN.net, we described the main objectives related to this doctoral thesis [1]. These objectives consist in the development of a set of linguistic features for Spanish and their inclusion in Natural Language Processing (NLP) tasks, such as forensics linguistic, author profiling, infodemiolgy [2], or misogyny identification [3] among others. We also described our participation in TASS 2020 [4] and MEX-A3T [5] shared tasks, and we described two NLPs tools developed, one for compiling and annotation corpora [6], and the other, inspired in LIWC [7], for extracting the linguistic features [3, 2]. In summary, our main hypothesis is that linguistic features are somehow high-order features than statistical features based on words and their relationships. Examples of these feature sets are n-grams or contextual and non-contextual embeddings. Moreover, we state that applying the linguistic features results in more reliable and interpretable models. Our previous participation in the symposium was very positive for us, as we received valuable Doctoral Symposium on Natural Language Processing from the PLN.net network 2021 (RED2018-102418-T), 19-20 October 2021, Baeza (Jaén), Spain. $ joseantonio.garcia8@um.es (J. A. García-Díaz)  0000-0002-3651-2660 (J. A. García-Díaz) © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) 1 feedback from our mentors. Specifically, they recommend us to focus on the interpretability of the machine-learning models. In addition, they note that the results achieved in the shared tasks [8, 9], in which we combined the linguistic features with non-contextual embeddings, were limited compared to the results of other participants. Therefore, this year we have focus on improving our pipeline by (1) evaluating other feature sets to combine and compare with the linguistic features; (2) evaluating techniques for combining the features, such as knowledge transfer, and ensemble learning; and (3) evaluating explainable deep-learning techniques. We have measured our progress by participating in four shared tasks of IberLEF 2021 [10], and one shared task from IberEval 2021. In addition, during this year, we have applied our methods for conducting author profiling and hate-speech detection, with two publications that are under review in scientific journals. 2. Validation After summarising the main hypotheses of this research, here we describe the validation process carried out, that consisted in the participation in the following shared tasks: EXIST-2021 (see Section 2.1), EmoEvalEs 2021 (see Section 2.2), Hahackathon 2021 and HaHa 2021 (see Section 2.3), and MeOffendEs 2021 (see Section 2.4). For each subtasks we include a summary of the main objectives, the methods evaluated as well as the main insights extracted for each one. 2.1. EXIST-2021. Sexist language identification The shared task EXIST-2021 [11] focuses on the identification and categorisation of sexist language written in Spanish and English. The organisers of this shared task compiled and annotated documents from several micro-blogging platforms. This shared task was divided into two subtasks: (1) a binary classification of sexism utterances, and (2) a multi-class identification of sexist traits, namely, ideological and inequality, stereotyping and dominance, objectification, sexual violence, and misogyny and non-sexual violence. We participated in both subtasks with a combination of the linguistic features and trans- formers. For this, we tackle each dataset independently and combine the results at the end. During our research, we evaluated different types of embeddings, including (1) pre-trained non contextual word embeddings from fastText, word2vec, and gloVe; (2) sentence non-contextual embeddings from fastText; and (3) contextual word embeddings based on transformers. In addition, we evaluate different neural network architectures, including multi-layer perceptrons, convolutional neural networks, and bidirectional recurrent neural networks. We combined each feature set with the functional API of Keras1 , in a knowledge integration fashion, entering each feature into separate hidden layers and combining them before predicting the result. Three runs were sent. One with the linguistic features, another combining the linguistic features with transformers, and another based in an ensemble of neural networks of linguistic features and contextual and non-contextual word and sentence embeddings. We achieved our best result in task 1, with an accuracy of 75.14% using the ensemble learning approach, and an accuracy of 61.70% for task 2, with the combination of the linguistic features and transformers. 1 https://keras.io/guides/functional_api/ (Last accessed: 2021-07-17) 2 These results were not far from the best results achieved in the official leader board, achieved by the team AI-UPV_1, with an accuracy of 78.04% in task 1, and an accuracy of 65.77% for task 2. We observed that our baseline, consisted in the linguistic features, achieved limited results. This result was the expected with the English dataset, but unexpected with the Spanish dataset, especially as the linguistic features provided promising results regarding misogyny identification [3]. The main differences between both datasets (the Spanish MisoCorpus 2020 and the EXIST- 2021) are the number annotators (3 for the MisoCorpus, 5 for EXIST-2021) and the fact that the annotators from EXIST-2021 followed the guidelines from two experts in gender issues. Besides, the Spanish MisoCorpus 2021 contains a large number of tweets from news sites that were labelled as neutral. To gain some interpretability, we extracted the information gain from the linguistic features. We observed that the linguistic features related to sexual issues and related to female social groups were discriminatory features for the identification of sexism. However, both features appeared less frequently in documents labelled as stereotyping and dominance. 2.2. EmoEvalEs 2021. Emotion Detection The EmoEvalES shared task [12] is focused on extracting the emotions expressed by users on social media, which it is challenging mainly due to the absence of prosodic features and facial expressions. This shared task consists in a multi-class classification for determining if a text contains one of the following classes: Anger, Disgust, Fear, Joy, Sadness, Surprise or Others. Like the other shared tasks in which we participated, we based our proposal in the combination of the linguistic features and transformers [13]. We achieved 6th position in the official leader board with an accuracy of 68.5990%, falling only 4.1667% below the best result. In this shared task, we achieve a significant improvement in our pipeline, as we were able to extract the contextual sentence embeddings from BETO [13]. For this, we extracted a fixed representation of 768-length vector from the [CLS] token, after fine-tuning the model with the EmoEvalEs dataset [14]. We observe than the performance of this approach was similar to the one achieved using HuggingFace’s Trainer2 . However, the fixed representation of the BERT embeddings provided to us two important benefits: they are easier to combine with other feature sets within the same neural network and the required time for training and performing inference is reduced. As we expected, regarding the interpretability of our results, we observed a strong correlation between lexicons related emotions with the labels. Lexicons containing sad expressions were strong related to documents annotated as sadness and disgust. anger with the psycho-linguistic process anger. Negative processes were also related to anger, disgust, fear, sadness, and surprise. 2.3. Hahackathon 2021 and HaHa 2021 Regarding humour, we have participated in two tasks regarding its identification, categori- sation, and evaluation. On the one hand, the HaHackathon 2021 shared task [15], proposed in IberEval’2021, focused on texts written in English, and HaHa 2021 [16], focused on Span- ish. Both shared tasks were divided into four subtasks each one. HaHackathon focused on 2 https://huggingface.co/transformers/main_classes/trainer.html(lastaccessed:2021-07-17) 3 determining if a text is funny or not (binary classification), how humorous it is (regression), and if its humour is controversial or not and how much (binary classification and regression, respectively). HaHa shared with HaHackathon the first two subtasks, but they included two new subtasks for determining what are the mechanisms to make a text funny and what are the targets of the joke, that were, respectively, a multi-classification and a multi-label tasks. In HaHackathon 2021 we achieved position 45, with a F1-score of 91.60% in the subtask 1a. A RMSE of 0.8847 for subtask 1b, achieving position 47. Position 14 in subtask 1c, with a F1-score of 57.22%. Finally, we achieved position 46 for subtask 2a, with a RMSE score of 0.8740. It is worth mentioning that, as HaHackathon 2021 was focused in English, we only use the subset of the linguistic features based on corpus statistics, such as the type/token ratio (TTR). In HaHa 2021, we achieved the 1st position in Funniness Score Prediction, the 8th position for humor classification subtask, and the 7th and the 3rd position for the subtasks of humour mechanism and target classification, respectively. For subtask 2 of HaHa 2021, in which we achieved the best result, we observed that stylometry is a relevant linguistic category. We also observe that interjections, verbs in third person, adverbs, augmentative suffixes, and proper nouns were also relevant features. It also caught our attention to find features related to the number of orthographic errors, as they can be committed on purpose as a humoristic device. 2.4. MeOffendES 2021 Finally, we participated in the MeOffendEs 2021 shared task [17], focused on the identification and categorisation of offensiveness, with datasets in European and Mexican Spanish extracted from different social media platforms. This shared task was divided into two subtasks (two subtasks per language variation). On the one hand, the European Spanish subtasks were based on multi-classification, discerning among (1) offensive texts whose target is a person; (2) offensive texts whose target to groups; (3) texts with inadequate language, but not necessary offensive; and (4) non offensive texts. The Mexican Spanish, on the other hand, were binary classification problems. Each linguistic variant included a subtask in which contextual features from the documents could be considered. In this case, apart from the linguistic features and transformers, we evaluate fine-grained negation features [18, 19, 20, 21] as a result of a collaboration with the Universidad de Jaén. All these features were combined with ensemble learning. Specifically, we evaluated ensembles based on the mode of the predictions, ensembles based on averaging the predictions of each neural network, ensembles based on the highest probability, and ensembles based on training regression machine learning model from the probabilities of the training split. We observed that the ensembles based on linear regression provided the best results whereas the ones based on the highest probability the best precision over the offensive class. Our official results were promising, as we ranked in the 2nd place in subtask 1 (F1-score of 87.8289%), 1st in subtask 2 (F1-score 87.8289%), 5th in subtask 3 (F1-score of 67.0588%), and 1st in subtask 4 (F1-score of 66.9449%). However, there were less participants in the subtasks that included the contextual features. Regarding the interpretability of the models, we observed in the Spanish dataset that negative psycho-linguistic processes were strong features to discern from non-offensive documents from the others, but that they were not good indicators to discern 4 among if the target is a person, a group or simply the use of inadequate language. 3. Conclusions and further work Since I am in the last year of my doctorate, and having previously participated in the previous version of this symposium, we have focus this study on the validation tasks. Specifically, we described our participation in five shared tasks regarding text classification in which we have achieved promising results. We have tried to follow the indications given by our mentors and we feel that their advises have helped us in a great extent. There is, however, a still a lot of room for improvement. For example, we are still focusing on the interpretability based on the linguistic features in isolation, but not in the context of the neural network. To solve this, we will evaluate the ensembles to analyse which features have the documents that are successfully classified correctly by the transformers and not by the linguistic features and vice versa. Moreover, we are adapting tools such as SHAP and LIME [22]. We are also focusing on improving the detection of figurative language [23] to apply to specific domains such as sarcasm, irony, and satire identification [24]. Acknowledgments This work was supported by the Spanish National Research Agency (AEI) through project LaTe4PSP (PID2019-107652RB-I00/ AEI / 10.13039/501100011033). In addition, José Antonio García-Díaz was supported by Banco Santander and the University of Murcia through the Doctorado industrial programme. References [1] J. A. Garcıa-Dıaz, Using linguistic features for improving automatic text classification tasks in spanish 2802 (2020). [2] J. A. García-Díaz, M. Cánovas-García, R. Valencia-García, Ontology-driven aspect-based sentiment analysis classification: An infodemiological case study regarding infectious diseases in latin america, Future Generation Computer Systems 112 (2020) 641–657. [3] J. A. García-Díaz, M. Cánovas-García, R. Colomo-Palacios, R. Valencia-García, Detect- ing misogyny in spanish tweets. an approach based on linguistics features and word embeddings, Future Generation Computer Systems 114 (2020) 506–518. [4] J. A. García-Díaz, Á. Almela, R. Valencia-García, Umuteam at tass 2020: Combining linguistic features and machine-learning models for sentiment classification, in: Notebook Papers of 2nd SEPLN Workshop on Iberian Languages Evaluation Forum (IberLEF), Malaga, Spain, 2020, pp. 187–196. [5] J. A. García-Díaz, R. Valencia-García, Umuteam at mex-a3t’2020: Detecting aggressive- ness with linguistic features and word embeddings, in: Notebook Papers of 2nd SEPLN Workshop on Iberian Languages Evaluation Forum (IberLEF), Malaga, Spain, 2020, pp. 287–292. 5 [6] J. A. García-Díaz, Á. Almela, G. Alcaraz-Mármol, R. Valencia-García, Umucorpusclassifier: Compilation and evaluation of linguistic corpus for natural language processing tasks, Procesamiento del Lenguaje Natural 65 (2020) 139–142. [7] Y. R. Tausczik, J. W. Pennebaker, The psychological meaning of words: Liwc and com- puterized text analysis methods, Journal of language and social psychology 29 (2010) 24–54. [8] M. García-Vega, M. C. Díaz-Galiano, M. Á. García-Cumbreras, F. M. P. del Arco, A. Montejo- Ráez, S. M. Jiménez-Zafra, E. M. Cámara, C. A. Aguilar, M. Antonio, S. Cabezudo, et al., Overview of tass 2020: introducing emotion detection (2020). [9] M. E. Aragón, H. J. Jarquín-Vásquez, M. Montes-Y-Gómez, H. J. Escalante, L. V. Pineda, H. Gómez-Adorno, J. P. Posadas-Durán, G. Bel-Enguix, Overview of mex-a3t at iberlef 2020: Fake news and aggressiveness analysis in mexican spanish., in: IberLEF@ SEPLN, 2020, pp. 222–235. [10] M. Montes, P. Rosso, J. Gonzalo, E. Aragón, R. Agerri, M. Á. Álvarez-Carmona, E. Ál- varez Mellado, J. Carrillo-de Albornoz, L. Chiruzzo, L. Freitas, H. Gómez Adorno, Y. Gutiér- rez, S. M. Jiménez Zafra, S. Lima, F. M. Plaza-de Arco, M. Taulé, Proceedings of the iberian languages evaluation forum (iberlef 2021), in: CEUR workshop, 2021. [11] F. Rodríguez-Sánchez, J. C. de Albornoz, L. Plaza, J. Gonzalo, P. Rosso, M. Comet, T. Donoso, Overview of exist 2021: sexism identification in social networks, Procesamiento del Lenguaje Natural 67 (2021). [12] F. M. Plaza-del-Arco, S. M. Jiménez-Zafra, A. Montejo-Ráez, M. D. Molina-González, L. A. Ureña-López, M. T. Martín-Valdivia, Overview of the EmoEvalEs task on emotion detection for Spanish at IberLEF 2021, Procesamiento del Lenguaje Natural 67 (2021). [13] J. Cañete, G. Chaperon, R. Fuentes, J. Pérez, Spanish pre-trained bert model and evaluation data, PML4DC at ICLR 2020 (2020). [14] N. Reimers, I. Gurevych, Sentence-bert: Sentence embeddings using siamese bert-networks, arXiv preprint arXiv:1908.10084 (2019). [15] J. Meaney, S. R. Wilson, L. Chiruzzo, A. Lopez, W. Magdy, Semeval 2021 task 7, hahackathon, detecting and rating humor and offense, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, 2021. [16] L. Chiruzzo, S. Castro, S. Góngora, A. Rosá, J. A. Meaney, R. Mihalcea, Overview of HAHA at IberLEF 2021: Detecting, Rating and Analyzing Humor in Spanish, Procesamiento del Lenguaje Natural 67 (2021). [17] F. M. Plaza-del-Arco, M. Casavantes, H. Escalante, M. T. Martín-Valdivia, A. Montejo- Ráez, M. Montes-y-Gómez, H. Jarquín-Vásquez, L. Villaseñor-Pineda, Overview of the MeOffendEs task on offensive text detection at IberLEF 2021, Procesamiento del Lenguaje Natural 67 (2021). [18] S. M. Jiménez-Zafra, Negation processing in spanish and its application to sentiment analysis, Procesamiento del Lenguaje Natural 66 (2021) 193–196. [19] S. M. Jiménez-Zafra, N. P. Cruz-Díaz, M. Taboada, M. T. Martín-Valdivia, Negation detection for sentiment analysis: A case study in spanish, Natural Language Engineering 27 (2021) 225–248. [20] S. M. Jiménez-Zafra, M. Taulé, M. T. Martín-Valdivia, L. A. Urena-López, M. A. Martí, 6 Sfu review sp-neg: a spanish corpus annotated with negation for sentiment analysis. a typology of negation patterns, Language Resources and Evaluation 52 (2018) 533–569. [21] S. M. Jiménez-Zafra, R. Morante, E. Blanco, M. T. M. Valdivia, L. A. U. Lopez, Detecting negation cues and scopes in spanish, in: Proceedings of The 12th Language Resources and Evaluation Conference, 2020, pp. 6902–6911. [22] Y. Rychener, X. Renard, D. Seddah, P. Frossard, M. Detyniecki, Sentence-based model agnostic NLP interpretability, CoRR abs/2012.13189 (2020). URL: https://arxiv.org/abs/2012. 13189. arXiv:2012.13189. [23] M. del Pilar Salas-Zárate, G. Alor-Hernández, J. L. Sánchez-Cervantes, M. A. Paredes- Valverde, J. L. García-Alcaraz, R. Valencia-García, Review of english literature on figurative language applied to social networks, Knowl. Inf. Syst. 62 (2020) 2105–2137. URL: https: //doi.org/10.1007/s10115-019-01425-3. doi:10.1007/s10115-019-01425-3. [24] M. del Pilar Salas-Zárate, M. A. Paredes-Valverde, M. Á. Rodríguez-García, R. Valencia- García, G. Alor-Hernández, Automatic detection of satire in twitter: A psycholinguistic- based approach, Knowl. Based Syst. 128 (2017) 20–33. URL: https://doi.org/10.1016/j.knosys. 2017.04.009. doi:10.1016/j.knosys.2017.04.009. 7