MultiAzterTest@Exist-IberLEF 2021: Linguistically Motivated Sexism Identification

Kepa Bengoetxea [0000-0002-0289-6897] and Itziar Gonzalez-Dios [0000-0003-1048-5403]

Ixa Group, HiTZ center / University of the Basque Country (UPV/EHU)
{kepa.bengoetxea,itziar.gonzalezd}@ehu.eus

IberLEF 2021, September 2021, Málaga, Spain. Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. Identifying sexism in social networks is the focus of the EXIST-IberLEF 2021 shared task. By participating in this task, the aim of the MultiAzterTest team is to see whether linguistically motivated features can help in the detection of sexism. To that end, we present three approaches: i) an approach based on language models, ii) an approach based on linguistic and stylistic features combined with machine learning classifiers, and iii) an approach combining the previous two. The language model approach obtains the best results on the test data. However, the approaches that use linguistic and stylistic features offer more interpretability.

Keywords: Sexism detection · Exist-IberLEF · Language Models · Linguistic features

1 Introduction

Sexism is defined by the Oxford English Dictionary as "prejudice, stereotyping or discrimination, typically against women, on the basis of sex". Sexism, moreover, can be classified as indirect, sexual and physical [26], and is categorised in the EXIST-IberLEF shared task [25] as ideological and inequality, stereotyping and dominance, objectification, sexual violence, and misogyny and non-sexual violence.

The Natural Language Processing (NLP) community has focused on detecting hate speech [13, 22] and abusive and offensive language [7], among others, but also on related phenomena such as misogyny [20] or racism [27]. Sexism has also been addressed: Rodríguez-Sánchez et al. [24] experiment with user, network and text-based features, machine learning classifiers (logistic regression, support vector machines and random forests), deep recurrent neural networks (BiLSTM) and transformer-based language models (BERT) to detect it.

In this paper, we test MultiAzterTest-Social (MATS) on the task of detecting sexism in the context of the EXIST 2021 Shared Task [25], a shared task at IberLEF 2021 [1]. MATS is a version of the MultiAzterTest tool [4], which is the trilingual version of AzterTest [5]. MultiAzterTest and AzterTest are open-source NLP-based tools and web services for text stylometrics and readability assessment. In addition to the linguistic and stylistic features, MATS includes features to analyse social media texts inspired by Fersini et al. [11], among other improvements. By participating in this shared task, we want to see whether a tool that outperforms state-of-the-art results in readability assessment can be applied to other classification tasks, where texts are shorter and include more subjective information and colloquial, informal language. Linguistic features have been used to detect fake news [9] and text features have also been taken into account for sexism detection [24], but, to our knowledge, this is the first time that more than 150 features are taken into account for this task. The aim of using a linguistically motivated tool is to provide explanations for the predictions and to be able to analyse the linguistic characteristics of sexism.
This paper is structured as follows: in Section 2 we introduce MultiAzterTest-Social, in Section 3 we describe our approaches and the experimental set-up, in Section 4 we present the results, and in Section 5 we conclude and outline future work.

2 MultiAzterTest-Social

MultiAzterTest-Social is an improvement of MultiAzterTest [4]. MultiAzterTest analyses more than 125 linguistic and stylistic features in Basque (125 features), English (163 features) and Spanish (141 features). In the following, we briefly explain how MultiAzterTest works:

– Preprocessing: This step carries out all the analyses needed to process raw texts. This includes multilingual parsing (in our case with Stanza [23]), syllable splitting and stopword removal.
– Linguistic and stylistic profiling: Based on the previous text analysis, this step calculates the linguistic and stylistic features. These features are grouped into the following types: descriptive and raw features, lexical diversity, classical readability formulae, word frequencies, vocabulary knowledge, morphological information, syntax, semantic information, semantic overlap (semantic similarity), referential cohesion (overlaps) and logical cohesion (connectives). There are five types of indicators: absolute numbers, means, standard deviations, incidences and ratios.
– Classification: Based on the linguistic and stylistic features, a machine learning classifier is applied. This classifier varies depending on the task. In the case of readability assessment, for example, support vector machines seem to be the most adequate.

In order to analyse social media texts, we have added more features to MultiAzterTest and, this way, we have adapted it into MultiAzterTest-Social. Some of the new features for social media are inspired by the features presented by Fersini et al. [11] to profile misogynists. The advanced morpho-syntactic and named-entity features are based on other readability assessment works, e.g. [14]; we have created the sentiment analysis and abusive term features ourselves; and we have added more descriptive features (descriptive+). In the following, we explain the new features:

– Descriptive+: We analyse the number of words and sentences per tweet; the number of numerical expressions, their incidence per 1000 words and the ratio of numbers per tweet and per sentence; the number and incidence of each punctuation mark (colon, exclamation mark...); and the number and incidence of special characters.
– Advanced morpho-syntactic: We calculate the number and incidence per 1000 words of the types of determiners (definite, indefinite), adjectives (comparative, superlative), pronouns (by person and number), causal and intentional verbs and particles, and adverbial and prepositional phrases, as well as the ratios of causal/intentional particles to causal/intentional verbs.
– Named entities: Stanza's base version detects 4 named entity types (Person, Location, Organisation and Miscellaneous). We calculate the mean of all the entities per sentence and their incidence per 1000 words; the ratio of entities to nouns; and, for each entity type, its ratio to all the entities, its mean per sentence and its incidence per 1000 words.
– Social media: These features include the number and ratio of emojis per tweet and per sentence; the number and incidence of hashtags, mentions and stretched words; the ratio of each of them per sentence and per tweet; and the percentage of capital letters per sentence.
– Sentiment analysis: We calculate the average positive, negative, neutral and compound scores per sentence based on the VADER sentiment intensity analyser [16], the number of positive, negative and neutral emojis according to the emoji sentiment lexicon [19], and the average sentiment score per sentence.
– Abusive terms: We include features for profane words, abusive words and hurt words based on the following resources: i) Luis von Ahn's Research Group's Offensive/Profane Word List [2], ii) the Lexicon of Abusive Words [28], and iii) HurtLex, the multilingual lexicon of words to hurt [3]. Although HurtLex classifies the words into different categories, we take all of them together. We calculate the number, the incidence and the ratio per sentence of the profane, abusive and hurt words.
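To make the new feature groups more concrete, the following is a minimal sketch of how a few of the social media and sentiment features listed above (hashtag counts, mention counts, percentage of capital letters and VADER scores) could be computed for a single tweet. It is only an illustration under simplifying assumptions (naive whitespace tokenisation, a hypothetical function name) and not the actual MultiAzterTest-Social implementation.

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def social_media_features(tweet: str) -> dict:
    # Illustrative computation of a handful of MATS-style features.
    tokens = tweet.split()                       # naive whitespace tokenisation
    n_words = len(tokens)
    hashtags = [t for t in tokens if t.startswith("#")]
    mentions = [t for t in tokens if t.startswith("@")]
    letters = [c for c in tweet if c.isalpha()]
    capital_pct = 100.0 * sum(c.isupper() for c in letters) / len(letters) if letters else 0.0
    vader = analyzer.polarity_scores(tweet)      # keys: pos, neg, neu, compound
    return {
        "n_words": n_words,
        "n_hashtags": len(hashtags),
        "hashtag_incidence_per_1000_words": 1000.0 * len(hashtags) / n_words if n_words else 0.0,
        "n_mentions": len(mentions),
        "capital_letter_percentage": capital_pct,
        "vader_compound": vader["compound"],
        "vader_positive": vader["pos"],
        "vader_negative": vader["neg"],
    }

print(social_media_features("@user THIS is #sexism and it is NOT ok!!!"))

In the tool itself, such values are computed over Stanza's sentence segmentation and averaged per sentence and per tweet, as described above.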
In Table 1 we show the number of new features MATS analyses. Taking into account the features MultiAzterTest already calculates, in this work we have used 280 features for English and 244 for Spanish.

3 Approaches and Experimental Set-up

In this section we present the experiments carried out for Task 1: Sexism Identification. The dataset we have used has been provided by the organisers [25]. The results are calculated using accuracy, as the distribution between the sexist and non-sexist categories is balanced.

Table 1. The number of linguistic and stylistic features added in English and Spanish to analyse social media text

Feature type                English  Spanish
Descriptive+                     20       20
Advanced morpho-syntactic        42       42
Named entities                   15       15
Social media                     23       23
Sentiment Analysis                8        0
Abusive terms                     9        3
Total                           117      103

In our experiments, we have tested three approaches: i) a language model (LM), ii) the features of MultiAzterTest-Social together with a machine learning classifier (henceforth, MATS-Sexism) and iii) a combination of the LM and MATS-Sexism approaches.

3.1 Language Model Approach

The LM approach uses Bidirectional Encoder Representations from Transformers (BERT) [10], specifically the bert-base-uncased model, pre-trained on the BooksCorpus [30] and English Wikipedia, for English, and BETO (bert-base-spanish-wwm-uncased) [8] for Spanish. Both models are provided by HuggingFace [29]. We have decided to use this approach because BERT achieves state-of-the-art results on many NLP tasks.

This is our experimental setting: we have truncated all texts longer than 200 tokens and we have added two tokens to each input text to mark the beginning and the end of the sequence, [CLS] and [SEP] respectively. We have padded texts shorter than 200 tokens with zeros. We have not performed any text augmentation or pre-processing besides standard byte-pair encoding. We have used the PyTorch framework to create our model. On top of BERT, we have tested two sequential heads: i) a dropout layer to fight overfitting, with the dropout probability set to 0.1, followed by a linear layer and a sigmoid activation function, where the input dimension of the linear layer was 768 and the output 2 (equal to the number of classes); and ii) a linear layer, a ReLU activation function and a second linear layer, where the input dimension of the first linear layer was 768 and its output 50, and the input dimension of the second linear layer was 50 and its output 2 (equal to the number of classes). We have used the cross-entropy loss function for each of the outputs. We have trained the model in the Google Colaboratory environment. We have split the training data into 80% for training and 20% for validation. The training batch size was set to 32 and the model was trained for up to 10 epochs using early stopping. We have obtained the best result on the validation data after 4 epochs, setting the sequence length to 200 and the learning rate to 5e-5, with the linear-ReLU-linear head and the Adam optimizer [17].
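As a rough illustration of the second (linear-ReLU-linear) head described above, the following PyTorch sketch shows how such a classifier could be assembled on top of a pre-trained BERT encoder from HuggingFace. It is a minimal sketch under our own assumptions (it uses the pooled [CLS] representation and omits the training loop, early stopping and data handling), not our exact training code; the class name is hypothetical.

import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class BertSexismClassifier(nn.Module):
    # BERT encoder followed by a linear-ReLU-linear head (768 -> 50 -> 2).
    def __init__(self, model_name: str = "bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        self.head = nn.Sequential(
            nn.Linear(768, 50),
            nn.ReLU(),
            nn.Linear(50, 2),            # two classes: sexist / non-sexist
        )

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled = outputs.pooler_output   # [CLS]-based sequence representation
        return self.head(pooled)         # logits for the two classes

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertSexismClassifier()
batch = tokenizer(["example tweet"], padding="max_length", truncation=True,
                  max_length=200, return_tensors="pt")
logits = model(batch["input_ids"], batch["attention_mask"])
loss = nn.CrossEntropyLoss()(logits, torch.tensor([0]))  # 0 = non-sexist (toy label)

In training, such a model would be optimised with Adam at a learning rate of 5e-5 and the cross-entropy loss, as described above; the Spanish model would simply swap the model name for the BETO checkpoint.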
3.2 Approach Based on Linguistic Features and Machine Learning

The second approach, the MATS-Sexism approach, consists of the outputs of the tool plus a classical machine learning classifier. In order to find the most adequate classifier, we have tested the Sequential Minimal Optimization (SMO) [21], Random Forest (RF) [6] and Simple Logistic (SL) [18] classifiers. We have also carried out feature selection with the ten most predictive features according to the WEKA [15] InfoGain attribute evaluator (Table 3), and in the case of SMO we have also reduced the number of features to 125 and 75. All these preliminary experiments have been done with 10-fold cross-validation.

In Table 2 we present the results of the MATS-Sexism approach with different features and classifiers on the training data. As in readability assessment, SMO is the best classifier [4] and, therefore, SMO will be the classifier of MATS-Sexism. We also see, contrary to what happens in readability assessment, that feature selection and feature reduction are not competitive.

Table 2. MATS-Sexism results in the training data (run2).

Method                   English  Spanish
MATS-Sexism-SMO-All        68.36    65.32
MATS-Sexism-SMO-125        66.23    63.48
MATS-Sexism-SMO-75         66.00    61.98
MATS-Sexism-SMO-Top10      54.39    58.88
MATS-Sexism-RF-All         62.31    59.95
MATS-RF-Top10              54.45    53.94
MATS-Sexism-SL-All         63.82    61.59
MATS-Sexism-SL-Top10       55.24    59.08

Before continuing with the approaches, let us analyse the most predictive features presented in Table 3 from a linguistic and stylistic point of view. Four of the most predictive features for English are descriptive (word, lemma and syllable length), two are semantic similarity features, and there is one feature each from the social media group (the percentage of capital letters), sentiment analysis (the VADER compound score), classical readability formulae (the Flesch formula [12]) and word frequencies (the minimum word frequency per sentence). In the case of Spanish, it is remarkable that 4 features are related to rare words. This can imply i) that infrequent words have been used or ii) that the words were misspelled and therefore not correctly analysed. The importance of hashtags is also noticeable (4 features). The use of first person pronouns and of unclassified miscellaneous named entities also plays a role. Finally, 6 out of the 10 Spanish features were not in MultiAzterTest and come from the features added for social media. This shows the usefulness of the newly added features.

Table 3. Top 10 features according to InfoGain in English and Spanish.

English | Spanish
word length (std) | number of different rare words (incidence)
lemma length (std) | number of rare words (incidence)
percentage of capital letters | hashtag ratio per sentence
word length without stopwords (std) | hashtag ratio per tweet
VADER compound score per sentence | hashtag incidence (mean)
semantic similarity between adjacent sentences (mean) | rare distinct content words (mean)
number of syllables per word (std) | number of hashtags
semantic similarity between all possible pairs of sentences in a paragraph (mean) | rare content words (mean)
Flesch | number of first person pronouns
minimum word frequency per sentence | MISC named entities per sentence (mean)
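To give a sense of how this feature-based pipeline could be reproduced outside WEKA, the following scikit-learn sketch mirrors its general shape: a linear-kernel SVM standing in for WEKA's SMO, 10-fold cross-validation, and a top-10 feature selection step using mutual information as a rough analogue of the InfoGain evaluator. The random feature matrix is there only so the sketch runs; in our experiments the matrix would hold the MultiAzterTest-Social features, and the actual runs were done in WEKA, so this is an illustrative analogue rather than our exact set-up.

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# X: (n_tweets, n_features) matrix of MATS features; y: 0 = non-sexist, 1 = sexist.
# Random data is used here only so that the example is self-contained.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 280))   # 280 features, as for English in this work
y = rng.integers(0, 2, size=200)

# Linear-kernel SVM on all features (analogue of MATS-Sexism-SMO-All)
svm_all = make_pipeline(StandardScaler(), SVC(kernel="linear"))
print("All features:", cross_val_score(svm_all, X, y, cv=10, scoring="accuracy").mean())

# Top-10 features selected by mutual information (analogue of the Top10 runs)
svm_top10 = make_pipeline(StandardScaler(),
                          SelectKBest(mutual_info_classif, k=10),
                          SVC(kernel="linear"))
print("Top-10 features:", cross_val_score(svm_top10, X, y, cv=10, scoring="accuracy").mean())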
3.3 Combination Approach

The third approach is a combination of the results of the LM and MATS-Sexism. To combine the results, we have two options: i) label a tweet as sexist if at least one of the systems tags it as sexist, or ii) label a tweet as sexist only if both systems consider it sexist. We have decided to implement the second option (a tweet is labelled sexist only if LM and MATS-Sexism agree) in order to give more precision to our predictions (although the official evaluation metric is accuracy). We think that, in subjective tasks, striving for precision can avoid doing harm.

4 Results in Test Data

In this section we present the results on the test data as provided by the organisers. In total, 72 systems were evaluated in Task 1. In Table 4 we present the results of our three approaches together with the baseline results (TF-IDF + SVM), also provided by the organisers.

Table 4. Results in the test data

Lang.  Method       Accuracy  Precision  Recall  F1      Ranking
All    Baseline     0.6845    0.6943     0.6888  0.6832  51
All    LM           0.7740    0.7741     0.7727  0.7731  6
All    MATS-Sexism  0.5948    0.5983     0.5974  0.5944  64
All    Comb         0.6582    0.6951     0.6670  0.6482  60
EN     Baseline     0.6889    0.6934     0.691   0.6886  -
EN     LM           0.7717    0.7753     0.7683  0.7691  -
EN     MATS-Sexism  0.5996    0.6049     0.6032  0.5989  -
EN     Comb         0.6481    0.6789     0.6571  0.6398  -
ES     Baseline     0.6801    0.6972     0.6853  0.6766  -
ES     LM           0.7764    0.7764     0.7769  0.7763  -
ES     MATS-Sexism  0.5898    0.5920     0.5915  0.5897  -
ES     Comb         0.6685    0.7129     0.6770  0.6566  -

The LM approach obtains the best results in all the settings: both languages together, and English and Spanish on their own. The combination stays in the middle and the MATS-Sexism approach is the worst. Only the LM approach is above the baseline. It is remarkable that all the approaches perform similarly across settings: if we round the numbers, the accuracy of the LM is 0.77 both for the two languages together and for English and Spanish separately. The MATS-Sexism approach has an accuracy of 0.59-0.60. The combination shows more variation, from 0.65 to 0.67. In general, there is a difference of about 17 points between the LM and MATS-Sexism, of about 10 points between the LM and the combination, and of 5-8 points between the combination and MATS-Sexism.

Looking at the results, we see that approaches based on distributional information, such as language models or TF-IDF, are very effective when working with short texts, and features that take into account syntactic and discursive information may not be so helpful in these classification tasks. Indeed, they worsen the accuracy.

5 Conclusion and Future Work

In this paper we have presented the results of the MultiAzterTest team on the first task of the EXIST-IberLEF 2021 shared task. The aim of these experiments was to see if linguistically motivated features could identify sexism.
Looking at our results, we see that distributional approaches are very effective and linguistic features are not so important when classifying short texts. However, in the future, the outputs of the linguistically motivated approach can be used to interpret the characteristics of sexism. It would also be interesting to test these approaches on longer texts.

Acknowledgments

We acknowledge the following projects: DeepText (KK-2020/00088), DeepReading RTI2018-096846-B-C21 (MCIU/AEI/FEDER, UE), BigKnowledge for Text Mining (BBVA), and IXA taldea, A motako ikertalde finkatua (IT1343-19).

References

1. Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2021). CEUR Workshop Proceedings (2021)
2. von Ahn, L.: Offensive/profane word list. https://www.cs.cmu.edu/~biglou/resources/, accessed: 2021-05-14
3. Bassignana, E., Basile, V., Patti, V.: HurtLex: A multilingual lexicon of words to hurt. In: 5th Italian Conference on Computational Linguistics, CLiC-it 2018. vol. 2253, pp. 1–6. CEUR-WS (2018)
4. Bengoetxea, K., Gonzalez-Dios, I.: MultiAzterTest: a Multilingual Analyzer on Multiple Levels of Language for Readability Assessment. Manuscript from author (2021)
5. Bengoetxea, K., Gonzalez-Dios, I., Aguirregoitia, A.: AzterTest: Open Source Linguistic and Stylistic Analysis Tool. Procesamiento del Lenguaje Natural 64, 61–68 (2020)
6. Breiman, L.: Random forests. Machine Learning 45(1), 5–32 (2001)
7. Caselli, T., Basile, V., Mitrović, J., Kartoziya, I., Granitzer, M.: I feel offended, don't be abusive! Implicit/explicit messages in offensive and abusive language. In: Proceedings of the 12th Language Resources and Evaluation Conference. pp. 6193–6202 (2020)
8. Cañete, J., Chaperon, G., Fuentes, R., Ho, J.H., Kang, H., Pérez, J.: Spanish Pre-Trained BERT Model and Evaluation Data. In: PML4DC at ICLR 2020 (2020)
9. Choudhary, A., Arora, A.: Linguistic feature based learning model for fake news detection and classification. Expert Systems with Applications, 114171 (2020)
10. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). pp. 4171–4186 (2019)
11. Fersini, E., Nozza, D., Boifava, G.: Profiling Italian misogynist: An empirical study. In: Proceedings of the Workshop on Resources and Techniques for User and Author Profiling in Abusive Language. pp. 9–13 (2020)
12. Flesch, R.: A new readability yardstick. Journal of Applied Psychology 32(3), 221 (1948)
13. Fortuna, P., Soler, J., Wanner, L.: Toxic, hateful, offensive or abusive? What are we really classifying? An empirical analysis of hate speech datasets. In: Proceedings of the 12th Language Resources and Evaluation Conference. pp. 6786–6794 (2020)
14. Gonzalez-Dios, I., Aranzabe, M.J., Díaz de Ilarraza, A., Salaberri, H.: Simple or complex? Assessing the readability of Basque texts. In: Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers. pp. 334–344. DCU and ACL, Dublin, Ireland (Aug 2014), https://www.aclweb.org/anthology/C14-1033
15. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter 11(1), 10–18 (2009)
16. Hutto, C., Gilbert, E.: VADER: A parsimonious rule-based model for sentiment analysis of social media text. In: Proceedings of the International AAAI Conference on Web and Social Media. vol. 8 (2014)
17. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
18. Landwehr, N., Hall, M., Frank, E.: Logistic model trees. Machine Learning 59(1-2), 161–205 (2005)
19. Kralj Novak, P., Smailović, J., Sluban, B., Mozetič, I.: Sentiment of emojis. PLoS ONE 10(12), e0144296 (2015)
20. Pamungkas, E.W., Basile, V., Patti, V.: Misogyny Detection in Twitter: a Multilingual and Cross-domain study. Information Processing & Management 57(6), 102360 (2020)
21. Platt, J.: Fast training of support vector machines using sequential minimal optimization. In: Schoelkopf, B., Burges, C., Smola, A. (eds.) Advances in Kernel Methods - Support Vector Learning. MIT Press (1998), http://research.microsoft.com/~jplatt/smo.html
22. Poletto, F., Basile, V., Sanguinetti, M., Bosco, C., Patti, V.: Resources and benchmark corpora for hate speech detection: a systematic review. Language Resources and Evaluation, pp. 1–47 (2020)
23. Qi, P., Zhang, Y., Zhang, Y., Bolton, J., Manning, C.D.: Stanza: A Python natural language processing toolkit for many human languages. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations. pp. 101–108 (2020)
24. Rodríguez-Sánchez, F., Carrillo-de Albornoz, J., Plaza, L.: Automatic Classification of Sexism in Social Networks: An Empirical Study on Twitter Data. IEEE Access 8, 219563–219576 (2020)
25. Rodríguez-Sánchez, F., Carrillo-de Albornoz, J., Plaza, L., Gonzalo, J., Rosso, P., Comet, M., Donoso, T.: Overview of EXIST 2021: Sexism Identification in Social Networks. Procesamiento del Lenguaje Natural 67(0) (2021)
26. Sharifirad, S., Jacovi, A.: Learning and understanding different categories of sexism using convolutional neural network's filters. In: Proceedings of the 2019 Workshop on Widening NLP. pp. 21–23. Association for Computational Linguistics, Florence, Italy (Aug 2019), https://www.aclweb.org/anthology/W19-3609
27. Waseem, Z.: Are you a racist or am I seeing things? Annotator influence on hate speech detection on Twitter. In: Proceedings of the First Workshop on NLP and Computational Social Science. pp. 138–142 (2016)
28. Wiegand, M., Ruppenhofer, J., Schmidt, A., Greenberg, C.: Inducing a lexicon of abusive words - a feature-based approach. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). pp. 1046–1056 (2018)
29. Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., et al.: HuggingFace's Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019)
30. Zhu, Y., Kiros, R., Zemel, R., Salakhutdinov, R., Urtasun, R., Torralba, A., Fidler, S.: Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 19–27 (2015)