=Paper=
{{Paper
|id=Vol-2943/exist_paper19
|storemode=property
|title=UMUTeam at EXIST 2021: Sexist Language Identification based on Linguistic Features and Transformers in Spanish and English
|pdfUrl=https://ceur-ws.org/Vol-2943/exist_paper19.pdf
|volume=Vol-2943
|authors=José Antonio García-Díaz,Ricardo Colomo-Palacios,Rafael Valencia-García
|dblpUrl=https://dblp.org/rec/conf/sepln/Garcia-DiazPV21a
}}
==UMUTeam at EXIST 2021: Sexist Language Identification based on Linguistic Features and Transformers in Spanish and English==
José Antonio García-Díaz1[0000-0002-3651-2660], Ricardo Colomo-Palacios2[0000-0002-1555-9726], and Rafael Valencia-García1[0000-0003-2457-1791]

1 Facultad de Informática, Universidad de Murcia, Campus de Espinardo, 30100, Spain
{joseantonio.garcia8,valencia}@um.es
2 Faculty of Computer Sciences, Østfold University College, Halden, Norway
ricardo.colomo-palacios@hiof.no

Abstract. Sexism is a harmful behaviour that can make women feel worthless, promoting self-censorship and gender inequality. In the digital era, misogynists have found in social networks a place in which they can spread their oppressive discourse towards women. Although this particular form of oppressive speech is banned and punished on most social networks, its identification is quite challenging due to the large number of messages posted every day. Moreover, sexist comments can go unnoticed as condescending or friendly statements, which hinders their identification even for humans. With the aim of improving automatic sexism identification on social networks, we participated in EXIST 2021. This shared task involves the identification and categorisation of sexist language in Spanish and English documents compiled from micro-blogging platforms. Specifically, two tasks were proposed: one concerning the binary classification of sexist utterances and another regarding the multi-class identification of sexist traits. Our proposal for solving both tasks is grounded on the combination of linguistic features and state-of-the-art transformers by means of ensembles and multi-input neural networks. To address the multilingual setting, we tackled each language independently and merged the results at the end. Our best results were an accuracy of 75.14% in task 1 and 61.70% in task 2.
Keywords: Sexism Identification · Document Classification · Feature Engineering · Natural Language Processing

IberLEF 2021, September 2021, Málaga, Spain. Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

This manuscript describes the participation of the UMUTeam in the shared task EXIST 2021 [12], proposed at IberLEF 2021 [7] and focused on the identification and categorisation of sexist language. Despite all the benefits of the rise of Web 2.0 when it comes to reducing barriers to communication, misogynists have found in social networks a place in which they can spread their oppressive discourse towards women, making these networks an intimidating place. Sexism and misogyny are particular forms of discriminatory speech and conduct in which women are the victims of harassment. It is worth mentioning that the identification of sexist speech is more challenging than other forms of hate speech, as sexist and misogynistic messages can go unnoticed within funny or complacent comments. Therefore, some sexist messages, such as those with a condescending tone, are subtle to distinguish even for humans if we are not aware of them. Sexist behaviour includes stereotyping, ideological issues, or sexual violence [11].

The objective of the EXIST 2021 shared task is the identification and categorisation of sexist behaviour in a broad sense. Specifically, the organisers of the shared task proposed two challenges: on the one hand, a binary sexism classification task and, on the other, a multi-class sexism categorisation task consisting of a fine-grained classification of the messages identified as sexist into the following traits: (1) ideological and inequality, (2) stereotyping and dominance, (3) objectification, (4) sexual violence, and (5) misogyny and non-sexual violence.
To evaluate both tasks, participants were encouraged to build automatic classification systems and evaluate them on a corpus composed of messages from social networks written in Spanish and English.

Our participation is grounded on the usage of linguistic features combined with state-of-the-art transformers. As part of the doctoral thesis of one of the members of the UMUTeam, we are evaluating a tool for extracting linguistic features. This tool is inspired by LIWC [13] but designed for Spanish from scratch. Our hypothesis is that linguistic features can be combined with statistical features, such as n-grams or any form of embeddings, to improve the accuracy of the results while, at the same time, providing interpretable features. Therefore, one of our runs consisted of the linguistic features in isolation as a baseline; in another run we combined them with BERT, and in the last run we stacked an ensemble.

The remainder of this manuscript is organised as follows. First, in Section 2, some experiments and corpora regarding misogyny and sexist behaviour are discussed. Next, in Section 3, we give some insights regarding the corpus that was made available to the participants. Our pipeline is described in Section 4. In Section 5 we show the results achieved by our team and compare them with the best results achieved by the rest of the participants and the proposed baselines. Finally, conclusions and future research directions are presented in Section 6.

2 Background information

The identification of aggressive, hateful, and oppressive speech has been a recurrent task in NLP workshops. From our point of view, this trend is caused by two main factors: (1) these tasks are challenging, as they involve the identification of subjective information and figurative language, and (2) their automatic resolution benefits society.
We can find, therefore, previous shared tasks focused on hate speech based on racial, gender, religious, disability, and sexual-orientation traits. Due to the scope of this shared task, we focus on sexism and misogyny identification. To the best of our knowledge, the most relevant work regarding misogyny identification is the Automatic Misogyny Identification (AMI) task, proposed at [3, 4] with datasets of documents written in English, Spanish, and Italian and focused on misogyny identification and the categorisation of misogynistic traits. In both editions, the organisers note that misogyny categorisation remains a challenging problem. From a wider perspective, the SemEval shared task HatEval 2019 [1] focused on hate speech against immigrants and women. This task also adopted a multilingual perspective, with tweets written in Spanish and English. It consisted of two subtasks: a binary classification problem concerning hate-speech detection against immigrants and women, and a task focused on detecting whether the hate speech was directed at specific individuals or at wider groups, and whether the text contained aggressive behaviour that may not be linked to hate speech. The organisers also highlighted the challenge of identifying hate speech in micro-blogging texts.

Our research group has also focused on misogyny identification with the release of the MisoCorpus-2020 [6], focused on Spanish. The MisoCorpus-2020 contains three subsets, VARW, SELA, and DDSS, focused, respectively, on the identification of misogynistic messages towards female politicians and journalists, on cultural and background differences in misogyny between European Spanish and Latin American countries, and on tweets that contain specific misogynistic traits. In that work we evaluated the combination of linguistic features and sentence word embeddings from fastText.
The results showed that linguistic features regarding offensive language, grammatical gender, spelling mistakes, punctuation symbols, and jargon from social networks are effective for misogyny identification.

3 Corpus

The dataset is composed of documents written in Spanish and English, compiled between December 2020 and February 2021, that contain expressions used to underestimate the role of women in our society. The main data source is Twitter, but Gab was used to extend the dataset. Training and testing have a temporal separation, as tweets were assigned to each split based on time. According to the description of the task, a subset of the data was analysed in depth by two experts in gender issues. The resulting dataset contains 6977 tweets for training and 3386 tweets for testing; both were randomly selected from the 9000 and 4000 labelled sets to ensure class balancing for task 1. This dataset was enlarged with 492 gabs in English and 490 in Spanish from the uncensored social network Gab.com, following a similar procedure as described before. This set was included in the EXIST test set in order to measure the difference between social networks. Each document was labelled by five annotators and the final label was selected by majority vote; however, documents with 3-2 votes were manually reviewed by two experts of different gender. The reader can find more details regarding the dataset compilation in the overview of the task [12].

To evaluate the reliability of our proposal, we extracted a validation split consisting of 20% of the training dataset. Table 1 and Table 2 depict the label distributions we used for tasks 1 and 2, respectively. We can observe that the distribution of the sexism identification task is slightly imbalanced, with more documents labelled as non-sexist.
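The label-aggregation scheme described above (five annotators, majority vote, 3-2 splits sent to expert review) can be sketched as follows. This is an illustration of the rule as stated in the text, not the organisers' actual tooling; the function name and labels are our own.

```python
from collections import Counter

def aggregate_label(votes):
    """Aggregate five annotations by majority vote; a 3-2 split keeps the
    majority label but is flagged for manual expert review."""
    counts = Counter(votes)
    label, top = counts.most_common(1)[0]
    return label, top == 3  # with five binary votes, top == 3 means a 3-2 split

clear_label, review_a = aggregate_label(["sexist"] * 4 + ["non-sexist"])      # no review
tight_label, review_b = aggregate_label(["sexist"] * 3 + ["non-sexist"] * 2)  # flagged
```

A 4-1 or 5-0 vote is accepted directly, while the 3-2 case keeps the majority label but raises the review flag.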
For subtask 2, we can observe that three of the sexist traits, namely ideological inequality, stereotyping and dominance, and misogyny and non-sexual violence, have a similar number of documents, whereas the labels objectification and sexual violence have fewer examples.

Table 1. Dataset distribution for subtask 1: sexism identification.

Language  Label       Total  Train  Val
Spanish   non-sexist  1800   1446   354
Spanish   sexist      1636   1387   354
English   non-sexist  1800   1443   357
English   sexist      1636   1306   330

4 Methodology

Our proposal is grounded on the combination of linguistic features with state-of-the-art transformers. During our experimentation, we also evaluated word and sentence embeddings.

For the linguistic features we use UMUTextStats [5, 6]. This tool is inspired by LIWC [13] but designed for the Spanish language. Although LIWC has a Spanish version that has been evaluated in different domains, such as satire identification [10], it does not take into account specific Spanish linguistic phenomena that UMUTextStats does. Specifically, UMUTextStats handles a total of 365 linguistic variables classified in the following groups: (1) Phonetics, (2) Morphosyntax, (3) Correction and style, (4) Semantics, (5) Pragmatics and figurative language [9], (6) Stylometry, (7) Lexical, (8) Psycholinguistic processes, (9), and (10) Social media.

For the transformers we use BERT; specifically, the large cased version for documents written in English and BETO [2] for tweets written in Spanish.

Table 2. Dataset distribution for subtask 2: sexism categorisation.

Language  Label                         Total  Train  Val
Spanish   non-sexist                    1800   1446   354
Spanish   ideological-inequality        480    388    92
Spanish   stereotyping-dominance        443    357    86
Spanish   misogyny-non-sexual-violence  401    320    81
Spanish   objectification               244    193    54
Spanish   sexual-violence               173    129    44
English   non-sexist                    1800   1443   357
English   ideological-inequality        386    315    76
English   stereotyping-dominance        366    290    71
English   misogyny-non-sexual-violence  344    279    70
English   objectification               284    214    65
English   sexual-violence               256    208    48
In addition, we trained and evaluated models applying word and sentence embeddings from fastText, word2vec, and GloVe for the Spanish documents, and from fastText for the English documents. These features were fed to recurrent, convolutional, and vanilla neural networks. Multiple feature sets in the same neural network were also evaluated using the functional API of Keras. It is worth noting that other neural network architectures, such as convolutional neural networks, were evaluated because they provided good results in the past for sentiment analysis tasks [8].

Our strategy in this shared task consisted of dealing with the documents written in Spanish and English separately, splitting them into two datasets evaluated with different models and merging the results just before the final submission.

For the hyperparameter optimisation we proceeded as follows. We evaluated a total of 100 neural networks per feature set, in isolation and in combination, both for Spanish and English. As there was a slight imbalance among the classes, the best models were selected based on the weighted F1 score. The features evaluated were linguistic features (LF), sentence embeddings from fastText (SE), sentence embeddings from BERT (BE), and word embeddings (WE) from fastText for the English dataset and from fastText, GloVe, and word2vec for the Spanish dataset. The majority of the evaluated neural networks were multilayer perceptrons (MLP) with different numbers of hidden layers (between 1 and 8) and different numbers of neurons (8, 16, 48, 64, 128, 256) organised in different shapes, including funnel, rhombus, long funnel, brick, diamond, and triangle. In the case of WE, convolutional and recurrent networks were also evaluated, including Bidirectional Long Short-Term Memory (BiLSTM) and Bidirectional Gated Recurrent Unit (BiGRU).
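The multi-input idea above can be illustrated with a minimal NumPy stand-in (not the authors' actual Keras implementation): each feature set enters through its own branch, the branches are concatenated, and shared funnel-shaped layers produce the binary prediction. The dimensions are illustrative only (365 matches the number of UMUTextStats variables; 768 is a typical BERT sentence-embedding size), and all weights are random, untrained placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, w, b, act=np.tanh):
    """One fully connected layer: affine map followed by an activation."""
    return act(x @ w + b)

n = 4                      # a toy batch of documents
lf = rng.random((n, 365))  # linguistic features (UMUTextStats has 365 variables)
be = rng.random((n, 768))  # BERT sentence embeddings (768 dims is typical)

# Per-input branches project each feature set to a common size before merging
w_lf, b_lf = rng.standard_normal((365, 64)) * 0.05, np.zeros(64)
w_be, b_be = rng.standard_normal((768, 64)) * 0.05, np.zeros(64)
h = np.concatenate([dense(lf, w_lf, b_lf), dense(be, w_be, b_be)], axis=1)

# Shared funnel-shaped layers (128 -> 64 -> 1); sigmoid output for the binary label
w1, b1 = rng.standard_normal((128, 64)) * 0.05, np.zeros(64)
w2, b2 = rng.standard_normal((64, 1)) * 0.05, np.zeros(1)
prob = dense(dense(h, w1, b1), w2, b2, act=lambda z: 1 / (1 + np.exp(-z)))
```

In the real pipeline each branch would be a Keras `Input` and the concatenation a `Concatenate` layer, with the whole graph trained end to end.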
We used the functional API of Keras to feed multiple inputs into each neural network and thus also evaluated the combination of several features in the same network. For all tests, we tried different dropout rates to avoid overfitting (0, 0.1, 0.2, and 0.3) and several activation functions, including ReLU, sigmoid, tanh, SELU, and ELU. We also included an early-stopping mechanism. The results of each parameter set can be viewed at https://github.com/Smolky/exist-2021.

5 Results

We participated with three runs. The first one consisted of the linguistic features in isolation; as commented earlier, this run was used as a baseline to evaluate the thesis objectives of the doctoral student of the team. The second one consisted of the combination of linguistic features and a BERT-based model, and the third run consisted of an ensemble of neural networks composed of LF, SE, BE, WE, and BERT.

First, the results for the first task, concerning sexism identification, are depicted in Table 3.

Table 3. Comparison of the UMUTeam with the best three runs and the baselines for the sexism identification task.

Rank  Team                       Accuracy  F1-score
1     AI-UPV 1                   78.04     78.02
2     SINAI TL 1                 78.00     77.97
3     SINAI TL 2                 77.77     77.57
22    UMUTEAM 3                  75.14     75.14
27    UMUTEAM 2                  74.40     74.40
52    SVM TFIDF (baseline)       68.45     68.32
63    UMUTEAM 1                  59.66     59.64
66    Majority Class (baseline)  52.22     34.31

Regarding the official results for task 1 (sexism identification), we achieved our best result with the ensemble model (run 3), with an accuracy and F1-score of 75.14%, reaching position 22 in the official ranking. The overall best result was achieved by AI-UPV, with an accuracy of 78.04% and an F1-score of 78.02%. Our second run, based on BERT and LF, achieved 74.4% accuracy and F1-score, reaching position 27. Finally, our first run, based on the linguistic features in isolation, achieved position 63, with an accuracy of 59.66% and an F1-score of 59.64%.
Note that this result does not outperform the baseline consisting of a bag of words with TF-IDF scores. The poor reliability of the linguistic features in isolation is unsurprising, given that UMUTextStats is focused on the Spanish language. In fact, the results on the Spanish partition alone achieved an accuracy of 61.94%, whereas only 57.43% accuracy was obtained on the English partition. The feature selection stage for the English dataset mainly selected linguistic features based on corpus statistics, such as the length of the text or the Type-Token Ratio (TTR).

Figure 1 depicts the mutual information of the top-ranked LF for task 1 on the Spanish split, compared by label. We can observe that there are no important differences in linguistic features among the classes, except for lexical features related to sex and female groups, which appear mostly in sexist posts.

Fig. 1. Mutual information of the top-ten linguistic features for the Spanish split for task 1: sexism identification. (The chart compares, per label, features such as stylometry corpus inflesz, negative psycholinguistic processes, masculine gender words, interrogative determiners, social media replies, orthographic errors, and the lexical categories sex, social, and female.)

Second, the results of the second task (sexism categorisation) are depicted in Table 4. Our best result was achieved with the combination of BERT and LF, with an F1-score of 53.62%, reaching position 18 in the official ranking. Similarly to task 1, our best run is not far from the best result, achieved by AI-UPV with an F1-score of 57.87%. However, contrary to the first task, our third run, based on the ensemble model, achieved a lower result than our second run. The first run, consisting of the linguistic features, achieved lower results, falling below the baseline.
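The per-feature scores reported in Figures 1 and 2 are mutual information values. For a discretised feature and a discrete label, mutual information can be computed directly from the joint distribution; the sketch below uses toy data and is not the authors' actual pipeline (which presumably operates on the real feature matrix).

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """MI(X;Y) in bits for two equal-length sequences of discrete values."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))          # joint counts
    px, py = Counter(xs), Counter(ys)   # marginal counts
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

# Toy example: a feature perfectly aligned with the label vs. a constant one
labels   = [1, 1, 1, 0, 0, 0]   # 1 = sexist
feat_sex = [1, 1, 1, 0, 0, 0]   # e.g. presence of "lexical sex" terms
feat_con = [1, 1, 1, 1, 1, 1]   # uninformative constant feature

mi_sex = mutual_information(feat_sex, labels)  # 1 bit: fully informative
mi_con = mutual_information(feat_con, labels)  # 0 bits: no information
```

A fully informative binary feature on a balanced binary label yields 1 bit, while a constant feature yields 0, which is the ordering the figures visualise.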
In this task, however, it drew our attention that the reliability of the LF was similar for Spanish and English, achieving an F1-score of 28.0% for Spanish and 27.21% for English. Figure 2 depicts the mutual information of the linguistic features for task 2 on the Spanish split, compared by label. These results must be viewed with caution, due to the limited results of the linguistic features in isolation for the sexism categorisation task. We can observe that lexical words regarding sex appear less frequently in sexist messages categorised as stereotyping and dominance. Among the different sexist traits, negative statements (negative psycholinguistic processes) appear most frequently in documents labelled as misogynist but without sexual violence. The rest of the features do not show significant differences among the sexist traits.

Table 4. Comparison of the UMUTeam with the best three runs and the baselines for the sexism categorisation task.

Rank  Team                       Accuracy  F1-score
1     AI-UPV 1                   65.77     57.87
2     LHZ 1                      65.09     57.06
3     SINAI TL 1                 65.27     56.67
18    UMUTEAM 2                  61.70     53.62
23    UMUTEAM 3                  59.11     52.40
51    SVM TFIDF (baseline)       52.22     39.50
56    UMUTEAM 1                  29.05     28.12
62    Majority Class (baseline)  47.78     10.78

6 Conclusions

In this manuscript we have detailed the participation of the UMUTeam in the EXIST 2021 shared task with three runs that combined linguistic features with state-of-the-art transformers. We are very happy with the opportunity to participate in these tasks and we hope to repeat it in the future. We are aware that the results achieved by the LF in isolation are below reliable baselines such as n-grams weighted by TF-IDF. We are currently evaluating the labels of the test set in order to detect weaknesses and improve our pipeline. Moreover, we are currently implementing more advanced ensembles by training deep-learning models that learn from the predictions of each individual model.
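A stacked ensemble of the kind just mentioned, where a meta-model learns from the predictions of the individual models, can be sketched as follows. This is a toy illustration with a logistic-regression meta-learner and invented base-model probabilities, not the authors' implementation; it assumes scikit-learn is available.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical out-of-fold probabilities from three base models
# (e.g. LF-only, BERT+LF, word-embedding network) for eight documents
base_preds = np.array([
    [0.9, 0.8, 0.7], [0.8, 0.9, 0.6], [0.7, 0.7, 0.8], [0.6, 0.9, 0.9],
    [0.2, 0.1, 0.3], [0.1, 0.2, 0.2], [0.3, 0.2, 0.1], [0.2, 0.3, 0.4],
])
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])  # 1 = sexist

# The meta-learner learns how much to trust each base model
meta = LogisticRegression().fit(base_preds, labels)
prediction = meta.predict(np.array([[0.8, 0.7, 0.9]]))[0]  # unseen document
```

In practice the base predictions must come from held-out folds of the training data, otherwise the meta-learner overfits to the base models' training error.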
We have also learned another way to create fixed sentence embeddings from fine-tuned BERT models that are easier to combine with other kinds of features. Regarding future research directions, we observed that sexism categorisation was performed as a multi-class task, in which all labels are considered mutually exclusive. However, we consider that it would be interesting to evaluate this sexist speech as a multi-label task, although we are aware that this proposal would imply relabelling the dataset. Another research direction is to incorporate contextual features into the classification in order to provide context to the documents.

7 Acknowledgments

This work was supported by the Spanish National Research Agency (AEI) through project LaTe4PSP (PID2019-107652RB-I00/AEI/10.13039/501100011033). In addition, José Antonio García-Díaz has been supported by Banco Santander and the University of Murcia through the industrial doctorate programme.

Fig. 2. Mutual information of the top-ten linguistic features for the Spanish split for task 2: sexism categorisation. (The chart compares, per label, features such as lexical sex, stylometry corpus inflesz, feminine and masculine gender words, syllables per word, negative psycholinguistic processes, text length, orthographic errors, and suffixes.)

References

1. Basile, V., Bosco, C., Fersini, E., Debora, N., Patti, V., Pardo, F.M.R., Rosso, P., Sanguinetti, M., et al.: SemEval-2019 task 5: Multilingual detection of hate speech against immigrants and women in Twitter. In: 13th International Workshop on Semantic Evaluation, pp. 54–63. Association for Computational Linguistics (2019)
2. Cañete, J., Chaperon, G., Fuentes, R., Pérez, J.: Spanish pre-trained BERT model and evaluation data. PML4DC at ICLR 2020 (2020)
3. Fersini, E., Nozza, D., Rosso, P.: Overview of the EVALITA 2018 task on Automatic Misogyny Identification (AMI). EVALITA Evaluation of NLP and Speech Tools for Italian 12, 59 (2018)
4. Fersini, E., Rosso, P., Anzovino, M.: Overview of the task on Automatic Misogyny Identification at IberEval 2018. IberEval@SEPLN 2150, 214–228 (2018)
5. García-Díaz, J.A., Cánovas-García, M., Valencia-García, R.: Ontology-driven aspect-based sentiment analysis classification: An infodemiological case study regarding infectious diseases in Latin America. Future Generation Computer Systems 112, 614–657 (2020). https://doi.org/10.1016/j.future.2020.06.019
6. García-Díaz, J.A., Cánovas-García, M., Colomo-Palacios, R., Valencia-García, R.: Detecting misogyny in Spanish tweets. An approach based on linguistics features and word embeddings. Future Generation Computer Systems 114, 506–518 (2021). https://doi.org/10.1016/j.future.2020.08.032
7. Montes, M., Rosso, P., Gonzalo, J., Aragón, E., Agerri, R., Álvarez-Carmona, M.Á., Álvarez Mellado, E., Carrillo-de Albornoz, J., Chiruzzo, L., Freitas, L., Gómez Adorno, H., Gutiérrez, Y., Jiménez Zafra, S.M., Lima, S., Plaza-de Arco, F.M., Taulé, M.: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2021). In: CEUR Workshop Proceedings (2021)
8. Paredes-Valverde, M.A., Colomo-Palacios, R., Salas-Zárate, M.d.P., Valencia-García, R.: Sentiment analysis in Spanish for improvement of products and services: A deep learning approach. Scientific Programming 2017 (2017)
9. del Pilar Salas-Zárate, M., Alor-Hernández, G., Sánchez-Cervantes, J.L., Paredes-Valverde, M.A., García-Alcaraz, J.L., Valencia-García, R.: Review of English literature on figurative language applied to social networks. Knowledge and Information Systems 62(6), 2105–2137 (2020)
10. del Pilar Salas-Zárate, M., Paredes-Valverde, M.A., Rodriguez-García, M.Á., Valencia-García, R., Alor-Hernández, G.: Automatic detection of satire in Twitter: A psycholinguistic-based approach. Knowledge-Based Systems 128, 20–33 (2017)
11. Rodríguez-Sánchez, F., Carrillo-de Albornoz, J., Plaza, L.: Automatic classification of sexism in social networks: An empirical study on Twitter data. IEEE Access 8, 219563–219576 (2020)
12. Rodríguez-Sánchez, F., de Albornoz, J.C., Plaza, L., Gonzalo, J., Rosso, P., Comet, M., Donoso, T.: Overview of EXIST 2021: Sexism identification in social networks. Procesamiento del Lenguaje Natural 67(0) (2021)
13. Tausczik, Y.R., Pennebaker, J.W.: The psychological meaning of words: LIWC and computerized text analysis methods. Journal of Language and Social Psychology 29(1), 24–54 (2010)