Sexism Identification in Social Networks using a Multi-Task Learning System

Flor Miriam Plaza-del-Arco, M. Dolores Molina-González, L. Alfonso Ureña-López, and M. Teresa Martín-Valdivia

Department of Computer Science, Advanced Studies Center in ICT (CEATIC), Universidad de Jaén, Campus Las Lagunillas, 23071, Jaén, Spain
{fmplaza, mdmolina, laurena, maite}@ujaen.es

Abstract. This paper describes the participation of the SINAI-TL team in the sEXism Identification in Social neTworks shared task at IberLEF 2021. To accomplish the task, we follow a Multi-Task Learning approach in which multiple tasks related to sexism identification are learned in parallel using a shared representation. Specifically, we test the performance of combining different tasks related to sentiment analysis and offensive language detection. Our team ranked second in subtask 1 and third in subtask 2 among the participants, achieving 78% accuracy and 56.67% macro-averaged F1, respectively.

Keywords: Multi-Task Learning · BERT · Sentiment Analysis · Offensive Language

1 Introduction

Sexism is any discrimination against people on the basis of sex (or, as it is currently expressed, on the basis of gender). Sexism against women is a historically widespread cultural component whose principle is the supremacy of men over women in different areas of life, such as the workplace, politics, society, the family and even advertising. We find sexism in daily conversation, in the disregard for opinions expressed by women, in statements loaded with discriminatory ideology, and even embedded in hundreds of sayings and fixed expressions. This discrimination against women in society is still deeply rooted in communication, both oral and written, and it is increasingly reproduced on the Internet. Detecting online sexism can be difficult, as it may be expressed in very different forms, but it is necessary in order to design new equality policies and to encourage better behaviour in society.
IberLEF 2021, September 2021, Málaga, Spain. Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Many academic events and shared tasks have taken place in recent years related to misogyny identification [11, 10] or to Hate Speech (HS) detection against immigrants and women (HatEval) [4]. Few works have addressed sexism detection and, in particular, they have treated sexism as the detection of hate speech against women. But sexism comprises any form of oppression or prejudice against women, and may therefore be hostile (as in the case of misogyny) or subtle. Thus, sexism includes misogyny but is not limited to it [17].

In this paper, we present the system we developed as part of our participation in the sEXism Identification in Social neTworks shared task [17] at IberLEF 2021 [15], covering both subtasks. The first subtask consists of classifying whether or not a given text (tweet or gab) is sexist (i.e., it is sexist itself, describes a sexist situation or criticizes a sexist behaviour). Once a message has been classified as sexist, the second subtask aims to categorize it according to five types of sexism (ideological and inequality, stereotyping and dominance, objectification, sexual violence, and misogyny and non-sexual violence).

To accomplish the EXIST shared task, we propose a Multi-Task Learning (MTL) system that leverages affective and offensive knowledge to detect sexism, using a well-known Transformer-based model.

The rest of the paper is structured as follows. In Section 2 we describe the data used in our experiments. In Section 3, we present the proposed system for addressing the task. In Sections 4 and 5, we describe the experimental setup and results, respectively. Finally, conclusions and future work are presented in Section 6.
2 Corpora

To run our experiments, we used the English and Spanish datasets provided by the organizers of the sEXism Identification in Social neTworks (EXIST) shared task [17] at IberLEF 2021 [15]. The EXIST dataset incorporates any type of sexist expression or related phenomena, including descriptive or reported assertions where the sexist message is a report or a description of a sexist behaviour. Popular expressions and terms used to undervalue the role of women, including terms used in previous state-of-the-art approaches, both in English and Spanish, were extracted from various Twitter accounts and then analysed and filtered by two gender experts, Trinidad Donoso and Miriam Comet [19]. The final set contains more than 200 expressions that can be used in gendered contexts. Using this final set of sexism terms (94 seeds for Spanish and 91 seeds for English), tweets were extracted in both languages (over 800,000 tweets were downloaded). As a result, the collected dataset has 4,500 tweets per language for the training set and 2,000 tweets per language for the test set. Final labels were selected by majority vote among five crowdsourcing annotators, who followed the guidelines developed by Trinidad and Miriam; tweets with a 3-to-2 vote split were manually reviewed by two people with more than two years of experience analyzing sexist content in social networks. The final EXIST dataset consists of 6,977 tweets for training and 3,386 tweets for testing.

Moreover, we used in our experiments other corpora corresponding to tasks that could be related to sexism identification on Twitter, including polarity classification (InterTASS), emotion classification (EmoEvent), HS identification (HatEval), and aggressiveness detection (MEX-A3T).
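The label-selection rule described above (majority vote among five annotators, with 3-to-2 splits sent to manual review) can be sketched as follows. This is an illustrative reconstruction, not the organizers' code; the `aggregate` helper is hypothetical.

```python
# Sketch (hypothetical helper, not the EXIST organizers' code) of the
# described labeling rule: majority vote over five annotators, with
# 3-to-2 splits flagged for manual review.
from collections import Counter

def aggregate(votes):
    assert len(votes) == 5, "EXIST used five crowdsourcing annotators"
    label, count = Counter(votes).most_common(1)[0]
    needs_review = (count == 3)   # a 3-to-2 split triggers expert review
    return label, needs_review

print(aggregate(["sexist", "sexist", "sexist", "not", "not"]))  # ('sexist', True)
```

With five binary votes a tie is impossible, so the majority label is always well defined; only the narrow 3-to-2 case is escalated.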
The datasets are described below:

– International TASS Corpus (InterTASS) was released in 2017 [14] with Spanish tweets and updated in 2018 with texts written in three different variants of Spanish, from Spain, Costa Rica and Peru [13]. In 2019, InterTASS was enlarged with new texts written in two new Spanish variants, Uruguayan and Mexican [9], and it was finally completed with Chilean-Spanish tweets in 2020 [12]. The corpus released in 2019 is the one used in this paper. Each tweet was annotated by at least three annotators with its level of polarity, which could be labeled as positive, negative, neutral or none.

– EmoEvent [3] is a multilingual emotion dataset based on events that took place in April 2019. It focuses on tweets in the areas of entertainment, catastrophes, politics, global commemoration and global strikes. For the creation of the corpus, the authors collected Spanish and English tweets from the Twitter platform. Each tweet was then labeled with one of seven emotions: the six Ekman basic emotions plus a "neutral or other emotions" label. Focusing on the Spanish language, a total of 8,409 tweets were labeled by three Amazon Mechanical Turkers.

– HatEval [4], the HS dataset used in this paper, was provided by the organizers of SemEval 2019 Task 5. The task consisted of detecting hateful content in Twitter posts against two targets: women and immigrants. For the creation of the corpus, the data was collected over different time frames. The majority of tweets against women were derived from an earlier collection made in the context of two earlier challenges on misogynistic speech identification, whose collection phase ran from July 2017 to November 2017 [11, 10]. The remaining tweets were collected from July to September 2018. The dataset contains tweets composed of an identifier, the text of the tweet and the HS mark, which is 0 if the text is not hateful and 1 if the text is hateful speech against women or immigrants.
– MEX-A3T [2] was provided by the organizers of IberEval 2018: Authorship and aggressiveness analysis in Mexican Spanish tweets [1]. They built a corpus of tweets for aggressiveness detection from Mexican accounts, collected from August to November 2017. They selected a set of terms that served as seeds for extracting the tweets, using words classified as vulgar and non-colloquial in the Dictionary of Mexicanisms. The hashtags were related to sexism, homophobia, politics and discrimination. Using Mexico City as the center, they extracted all tweets posted within a radius of 500 km. Finally, the collected tweets were labeled by two people. The dataset contains tweets composed of an identifier, the text of the tweet, and the aggressiveness mark, which is 0 if the tweet is not aggressive and 1 if it is aggressive.

3 System overview

In this section, we describe the systems developed for the sEXism Identification in Social neTworks shared task at IberLEF 2021. We propose a Multi-Task Learning (MTL) system using the well-known Transformer-based model BERT, which has proven very successful in many natural language processing tasks. In the MTL model we integrate knowledge from different tasks related to sexism identification.

In the MTL scenario, the goal is to learn multiple tasks simultaneously instead of learning them separately, in order to improve performance on each task [6]. These tasks are usually related, although they may have different data or features. By sharing representations across related tasks, we can allow our model to generalize better to our original task. In this study, we used tasks related to the target task of sexism identification. These tasks include offensive language detection, polarity classification, and emotion classification, all sharing the same data source: Twitter.
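The shared-representation idea can be sketched in PyTorch as a single shared encoder with one small classification head per task. This is a minimal illustration rather than the authors' code: the real system shares BERT/BETO layers, whereas here a toy embedding plus Transformer encoder stands in so the example is self-contained, and the task names and label counts are hypothetical.

```python
# Minimal sketch of multi-task learning with a shared encoder
# (illustrative; the paper's system shares BERT/BETO layers instead
# of this toy encoder, and the task/label setup here is assumed).
import torch
import torch.nn as nn

class SharedEncoderMTL(nn.Module):
    def __init__(self, vocab_size, hidden, task_labels):
        super().__init__()
        # Shared layers: gradients from every task update these.
        self.embed = nn.Embedding(vocab_size, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # One task-specific linear head per task.
        self.heads = nn.ModuleDict(
            {task: nn.Linear(hidden, n) for task, n in task_labels.items()}
        )

    def forward(self, input_ids, task):
        h = self.encoder(self.embed(input_ids))
        return self.heads[task](h[:, 0])   # classify from the first ([CLS]-style) token

model = SharedEncoderMTL(vocab_size=1000, hidden=64,
                         task_labels={"sexism": 2, "polarity": 4})
optimizer = torch.optim.AdamW(model.parameters(), lr=4e-5)
loss_fn = nn.CrossEntropyLoss()

# Joint training: alternate mini-batches from the two tasks so the
# shared encoder learns from both (random tensors stand in for data).
for task, n_labels in [("sexism", 2), ("polarity", 4)]:
    x = torch.randint(0, 1000, (8, 16))        # 8 sequences of 16 token ids
    y = torch.randint(0, n_labels, (8,))
    loss = loss_fn(model(x, task), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(model(torch.randint(0, 1000, (2, 16)), "sexism").shape)
```

Because the encoder parameters are common to all heads, each task's loss regularizes the representation used by the others, which is the effect the paper relies on.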
The reason for incorporating polarity and emotion information to detect sexism is that sexist texts are usually emotional and express a negative emotion and polarity towards the recipient.

To develop the MTL system, we follow the most widely used MTL technique in neural networks, introduced by [6]: the hard parameter sharing approach. It consists of a single encoder that is shared and updated between all tasks, together with a few task-specific layers that specialize in each task [18]. The general architecture of the MTL model is shown in Figure 1. The shared layers are based on BERT [8]. Following [8], in the first step all the inputs are converted to WordPieces [20], and two additional tokens are added at the start ([CLS]) and end ([SEP]) of the input sequence, respectively. In the shared layers, the BERT model first converts the input sequence to a sequence of embedding vectors. This semantic representation is shared across all tasks. Then, on top of the shared BERT layers, a task-specific output head is created for each task and attached to the common sentence encoder. Finally, the layers are fine-tuned according to the given set of downstream tasks.

4 Experimental setup

4.1 Dataset preprocessing

We perform Twitter-specific data cleaning before feeding the texts to the models. The following steps to prepare the text for deep learning experiments were carried out using the ekphrasis module [5]:

– URLs, emails, user mentions, percentages, monetary amounts, time and date expressions, and phone numbers are normalized.
– Hashtags are unpacked and split into their constituent words.
– Elongated words and repeated characters in words are annotated and reduced.
– Emojis are converted to their aliases.

Fig. 1. Proposed MTL system for the EXIST task. [Diagram: task-specific linear heads (Task 1, Task 2, Task 3, …) on top of the shared encoder.]
4.2 System settings

All models were implemented using PyTorch, a high-performance deep learning library [16] based on the Torch library. The experiments were run on a single Tesla V100 32 GB GPU with 192 GB of RAM.

During the evaluation phase, we trained the model on the training and validation sets and then evaluated it on the test set provided by the organizers.

Regarding our participation, we submitted three runs using the proposed MTL-based system. The details and differences of the three settings we presented are described below.

– Run 1. In this setting, our goal is to leverage sentiment analysis to aid the classification of sexist texts. Our assumption is that sexist texts are associated with a negative polarity, so the shared knowledge can help to detect sexist texts more easily. To this end, we train the MTL model simultaneously on the polarity classification and sexism identification tasks. For the first task, we use the InterTASS dataset. Finally, we obtain the evaluation on the sexism test set.

– Run 2. In this setting, our goal is to leverage emotion analysis to aid the classification of sexist texts. Our assumption is that negative emotions such as anger, fear, sadness and disgust could be related to sexist texts while positive emotions are not. To this end, we train the MTL model simultaneously on the emotion classification and sexism identification tasks. For the first task, we use the EmoEvent dataset. Finally, we obtain the evaluation on the sexism test set.

– Run 3. In this setting, we train the model on the offensive language identification and sexism identification tasks. Our assumption is that sexism is associated with offensive language and sometimes with hate speech, so the knowledge shared among these tasks during training can benefit the sexism identification task. For the first task, we use two datasets (HatEval and MEX-A3T).
Finally, we obtain the evaluation on the sexism test set.

As the EXIST dataset is composed of English and Spanish texts, while training the MTL system we use two BERT-based models: the BERT base model (cased) trained on English texts and the BETO model [7] trained on Spanish texts. For the first subtask (sexism identification) we employ the following hyperparameters: a learning rate of 4e-05, a batch size of 8, a dropout probability of 0.01, the AdamW optimization algorithm, and a maximum of 2 epochs; for the second subtask (sexism categorization) the batch size was set to 16 and the number of epochs to 3.

5 Results

In this section we present the results obtained by the different runs we explored in both subtasks of the competition. To evaluate them we use the official competition metrics for subtask 1 and subtask 2: accuracy and macro-averaged F-measure, respectively. In addition, other measures employed in classification tasks, including Precision (P) and Recall (R), are computed.

The results of our participation in the EXIST task during the evaluation phase are shown in Table 1 (subtask 1) and Table 3 (subtask 2). In particular, we list the performance of the three submitted runs of the MTL model with the task combinations explained in Section 4.2. Analyzing the results of our three runs in subtasks 1 and 2, the best result is achieved by the combination of the sexism identification and polarity classification tasks (run 1), followed by run 3, which combines sexism identification and offensive language detection. In subtask 2, it is noticeable that run 2 (emotion classification along with sexism identification) decreases significantly compared to subtask 1.
A possible reason could be that subtask 2 aims to classify five different categories that are not strongly associated with emotions, whereas the knowledge transferred from polarity classification and offensive language detection helps to identify the different categories.

Finally, our results in the competition for both subtasks (Table 2 and Table 4) show the success of our proposed model, which achieved second place in the ranking for the first subtask and third place for the second subtask. The affective knowledge embedded in the representations computed by the encoder allows the MTL model to identify sexism more accurately by leveraging the affective nature of the instance.

Table 1. Results in subtask 1 on the test set of the EXIST shared task.

Run | Acc    | Precision | Recall | F-measure
1   | 0.7800 | 0.7796    | 0.7800 | 0.7797
2   | 0.7766 | 0.7761    | 0.7760 | 0.7761
3   | 0.7770 | 0.7779    | 0.7751 | 0.7757

Table 2. Ranking of participants' systems in subtask 1 of the EXIST shared task.

Ranking | Team           | Acc
1       | AI-UPV 1       | 0.7900
2       | SINAI TL 1     | 0.7800
3       | SINAI TL 3     | 0.7770
4       | SINAI TL 2     | 0.7766
31      | CIC 1          | 0.7278
66      | Majority Class | 0.5222

Table 3. Results in subtask 2 on the test set of the EXIST shared task.

Run | Acc    | Precision | Recall | F-measure
1   | 0.6527 | 0.5848    | 0.5527 | 0.5667
2   | 0.6049 | 0.6210    | 0.4082 | 0.4549
3   | 0.6497 | 0.5774    | 0.5518 | 0.5632

Table 4. Ranking of participants' systems in subtask 2 of the EXIST shared task.

Ranking | Team           | F1
1       | AI-UPV 1       | 0.5787
2       | LHZ 1          | 0.5706
3       | SINAI TL 1     | 0.5667
4       | SINAI TL 3     | 0.5632
41      | SINAI TL 2     | 0.4549
62      | Majority Class | 0.1078

6 Conclusion

This paper presents the participation of the SINAI-TL research group in the sEXism Identification in Social neTworks shared task at IberLEF 2021. Our proposal explores how knowledge transferred from tasks related to sexism identification (polarity classification, emotion classification and offensive language detection) may help in a text classification task like EXIST.
Experiments conducted show the efficacy of our proposed approach, which achieves convincing performance in both subtasks. In particular, polarity classification helps the MTL model to classify sexism more accurately by leveraging affective knowledge. As future work, we plan to develop a more complex model that incorporates other related tasks, such as irony or sarcasm detection, that could be beneficial for sexism identification.

Acknowledgement

This work has been partially supported by a grant from the European Regional Development Fund (FEDER), the LIVING-LANG project [RTI2018-094653-B-C21], and the Ministry of Science, Innovation and Universities (scholarship [FPI-PRE2019-089310]) of the Spanish Government.

References

1. Álvarez-Carmona, M.Á., Guzmán-Falcón, E., Montes-y-Gómez, M., Escalante, H.J., Villaseñor-Pineda, L., Reyes-Meza, V., Rico-Sulayes, A.: Overview of MEX-A3T at IberEval 2018: Authorship and aggressiveness analysis in Mexican Spanish tweets. In: Proceedings of the Third Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2018) co-located with the 34th Conference of the Spanish Society for Natural Language Processing (SEPLN 2018), Sevilla, Spain, September 18, 2018. CEUR Workshop Proceedings, vol. 2150, pp. 74–96. CEUR-WS.org (2018)

2. Aragón, M.E., Jarquín-Vásquez, H.J., Montes-y-Gómez, M., Escalante, H.J., Villaseñor-Pineda, L., Gómez-Adorno, H., Posadas-Durán, J.P., Bel-Enguix, G.: Overview of MEX-A3T at IberLEF 2020: Fake news and aggressiveness analysis in Mexican Spanish. In: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020) co-located with the 36th Conference of the Spanish Society for Natural Language Processing (SEPLN 2020), Málaga, Spain, September 23, 2020. CEUR Workshop Proceedings, vol. 2664, pp. 222–235. CEUR-WS.org (2020)

3. Plaza-del-Arco, F.M., Strapparava, C., Ureña-López, L.A., Martín-Valdivia, M.T.: EmoEvent: A multilingual emotion corpus based on different events.
In: Proceedings of the 12th Language Resources and Evaluation Conference. pp. 1492–1498. European Language Resources Association, Marseille, France (May 2020)

4. Basile, V., Bosco, C., Fersini, E., Nozza, D., Patti, V., Rangel Pardo, F.M., Rosso, P., Sanguinetti, M.: SemEval-2019 task 5: Multilingual detection of hate speech against immigrants and women in Twitter. In: Proceedings of the 13th International Workshop on Semantic Evaluation. pp. 54–63. Association for Computational Linguistics, Minneapolis, Minnesota, USA (Jun 2019). https://doi.org/10.18653/v1/S19-2007

5. Baziotis, C., Pelekis, N., Doulkeridis, C.: DataStories at SemEval-2017 task 4: Deep LSTM with attention for message-level and topic-based sentiment analysis. In: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017). pp. 747–754. Association for Computational Linguistics, Vancouver, Canada (August 2017)

6. Caruana, R.: Multitask learning. Machine Learning 28(1), 41–75 (1997)

7. Cañete, J., Chaperon, G., Fuentes, R., Ho, J.H., Kang, H., Pérez, J.: Spanish pre-trained BERT model and evaluation data. In: PML4DC at ICLR 2020 (2020)

8. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding (2018)

9. Díaz-Galiano, M.C., García-Vega, M., Casasola, E., Chiruzzo, L., García-Cumbreras, M.Á., Martínez-Cámara, E., Moctezuma, D., Montejo-Ráez, A., Sobrevilla-Cabezudo, M.A., Sadit-Tellez, E., et al.: Overview of TASS 2019: One more further for the global Spanish sentiment analysis corpus. In: IberLEF@SEPLN. pp. 550–560 (2019)

10. Fersini, E., Nozza, D., Rosso, P.: Overview of the EVALITA 2018 task on automatic misogyny identification (AMI). EVALITA Evaluation of NLP and Speech Tools for Italian 12, 59 (2018)

11. Fersini, E., Rosso, P., Anzovino, M.: Overview of the task on automatic misogyny identification at IberEval 2018.
In: Rosso, P., Gonzalo, J., Martínez, R., Montalvo, S., de Albornoz, J.C. (eds.) Proceedings of the Third Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2018). CEUR Workshop Proceedings, vol. 2150, pp. 214–228. CEUR-WS.org (2018)

12. García-Vega, M., Díaz-Galiano, M.C., García-Cumbreras, M.Á., Plaza-del-Arco, F.M., Montejo-Ráez, A., Jiménez-Zafra, S.M., Martínez-Cámara, E., et al.: Overview of TASS 2020: Introducing emotion detection. In: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020) co-located with the 36th Conference of the Spanish Society for Natural Language Processing (SEPLN 2020), Málaga, Spain, September 23, 2020. CEUR Workshop Proceedings, vol. 2664, pp. 163–170. CEUR-WS.org (2020)

13. Martínez-Cámara, E., Almeida-Cruz, Y., Díaz-Galiano, M.C., Estévez-Velarde, S., García-Cumbreras, M.Á., García-Vega, M., Gutiérrez, Y., Montejo-Ráez, A., Montoyo, A., Muñoz, R., Piad-Morffis, A., Villena-Román, J.: Overview of TASS 2018: Opinions, health and emotions. In: Proceedings of TASS 2018: Workshop on Semantic Analysis at SEPLN, TASS@SEPLN 2018. CEUR Workshop Proceedings, vol. 2172, pp. 13–27. CEUR-WS.org (2018)

14. Martínez-Cámara, E., Díaz-Galiano, M.C., García-Cumbreras, M.A., García-Vega, M., Villena-Román, J.: Overview of TASS 2017. Proceedings of TASS pp. 13–21 (2017)

15. Montes, M., Rosso, P., Gonzalo, J., Aragón, E., Agerri, R., Álvarez Carmona, M., Álvarez Mellado, E., Carrillo-de Albornoz, J., Chiruzzo, L., Freitas, L., Gómez Adorno, H., Gutiérrez, Y., Jiménez-Zafra, S.M., Lima, S., Plaza-del-Arco, F.M., Taulé, M. (eds.): Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2021) (2021)

16. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al.: PyTorch: An imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems. pp. 8026–8037 (2019)

17.
Rodríguez-Sánchez, F., de Albornoz, J.C., Plaza, L., Gonzalo, J., Rosso, P., Comet, M., Donoso, T.: Overview of EXIST 2021: Sexism identification in social networks. Procesamiento del Lenguaje Natural 67(0) (2021)

18. Ruder, S.: Neural transfer learning for natural language processing. Ph.D. thesis, NUI Galway (2019)

19. Vázquez, T.D., Catalán, Á.R.: Violencias de género en entornos virtuales [Gender violence in virtual environments]. Ediciones Octaedro (2018)

20. Wu, Y., Schuster, M., Chen, Z., Le, Q.V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., et al.: Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR abs/1609.08144 (2016)