MDD @ AMI: Vanilla Classifiers for Misogyny Identification

Samer El Abassi (Faculty of Mathematics and Computer Science, University of Bucharest) samer.el-abassi@s.unibuc.ro
Sergiu Nisioi (Human Language Technologies Research Center, University of Bucharest) sergiu.nisioi@unibuc.ro

Copyright (c) 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

In this report, we present a set of vanilla classifiers that we used to identify misogynous and aggressive texts in Italian social media. Our analysis shows that simple classifiers with little feature engineering have a strong tendency to overfit and yield a strong bias on the test set. Additionally, we investigate the usefulness of function words, pronouns, and shallow-syntactic features to observe whether misogynous or aggressive texts have specific stylistic elements.

1 Introduction

This paper discusses our submission (team MDD) to the Evalita 2020 Automatic Misogyny Identification Shared Task (Fersini et al., 2020; Basile et al., 2020), Task A. Our methods consist of a set of simple vanilla classifiers that we employ to assess their effectiveness on the datasets provided by the organizers. The systems we submitted for evaluation use a logistic regression classifier with little hyperparameter tuning or feature engineering, trained on tf-idf features and average word embedding pooling. Previous reports on misogyny (Fersini et al., 2018b,a) and aggressiveness (Basile et al., 2019) detection indicate that support vector machines and logistic regression classifiers effectively identify these patterns in social media posts. Furthermore, vanilla classifiers with little feature engineering were successfully used for other shared tasks, such as identifying dialectal varieties (Ciobanu et al., 2016; Zampieri et al., 2017) or native language identification (Malmasi et al., 2017), where high scores were obtained by simple approaches using SVMs or logistic regression classifiers.

The classifiers we built achieved relatively good accuracy in our cross-validation tests; however, for this competition, the results obtained by our systems are not among the top-scoring ones and turn out to be a poor fit, with a significant tendency towards biased results.

In addition to describing our submissions, in this report we analyze the errors of our systems and bring into discussion several topic-independent features to: 1) test the effectiveness of part-of-speech n-grams, function words, and pronouns on the task of identifying misogynous and aggressive texts on social media, and 2) observe whether texts labeled as misogynous or aggressive have a particular bias towards certain grammatical structures.

2 System Description

At the basis of our submissions is a logistic regression classifier with the liblinear (Fan et al., 2008) solver, l2 penalty, and regularization constant C = 3, chosen based on different cross-validation iterations. In addition, we introduced a heuristic at prediction time in which we predict a text not to be aggressive if it was not categorized as misogynous.

The difference between our three submissions for Task A consists in the feature extraction process:

MDD.A.r.c.run1 is the logreg model trained on tf-idf of word n-grams, with n ranging from 1 to 5;

MDD.A.r.u.run2 is the logreg model trained on pre-trained GloVe Twitter embeddings of size 200, trained on 27 billion words (English model GloVe.twitter.27B.200d, https://nlp.stanford.edu/projects/GloVe/);

MDD.A.r.u.run3 is the logreg model using spaCy (Honnibal and Montani, 2017) FastText CBOW embeddings pre-trained on Wikipedia and OSCAR (Common Crawl) (model it_core_news_lg, version 2.3.0, https://spacy.io/models/it).
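For reference, a minimal sketch of the base classifier and the prediction-time heuristic is given below. It assumes scikit-learn; the helper names and the binary 0/1 label encoding are ours and not necessarily the exact submission code.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Base classifier shared by all runs: liblinear solver, l2 penalty, C = 3.
def make_classifier():
    return LogisticRegression(solver="liblinear", penalty="l2", C=3)

def predict_with_heuristic(X_train, y_mis, y_agg, X_test):
    # X_* are feature matrices (tf-idf or averaged embeddings);
    # y_mis and y_agg are binary (0/1) misogyny and aggressiveness labels.
    mis_clf = make_classifier().fit(X_train, y_mis)
    agg_clf = make_classifier().fit(X_train, y_agg)
    mis_pred = mis_clf.predict(X_test)
    agg_pred = agg_clf.predict(X_test)
    # Prediction-time heuristic: a text not predicted as misogynous
    # is never predicted as aggressive.
    agg_pred = np.where(mis_pred == 0, 0, agg_pred)
    return mis_pred, agg_pred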
The second run is trained on English GloVe embeddings that, surprisingly, contain representations for more than half of our Italian vocabulary, i.e., approximately 9,500 words out of the total vocabulary size of 15,000 in our data. The English GloVe embeddings cover code-switching, emojis, and basic Italian words. Despite this run having the lowest evaluation score of our submissions (0.666 macro F1), we believe it provides a decent estimation for identifying non-misogynous texts.

2.1 Feature Extraction

Our feature extraction processes for the submissions are simple. The first one uses the tf-idf vectorizer (Buitinck et al., 2013) on word n-grams, with n ranging from 1 to 5, to cover more word context. Tf-idf features were used for their ability to quantify the importance of an n-gram with respect to the entire corpus. The second feature set is based on pre-trained word representations: we look up the embedding of every word in the text and average them to obtain a single representation. For words not present in the embeddings, an array of zeros with the same dimensions was used.
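A minimal sketch of both feature extractors is shown below, assuming scikit-learn, a word_vectors dictionary mapping tokens to NumPy arrays (e.g., loaded from the GloVe files), and a naive whitespace tokenizer; the actual tokenization may differ.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# tf-idf over word n-grams with n ranging from 1 to 5.
def tfidf_features(train_texts, test_texts):
    vectorizer = TfidfVectorizer(ngram_range=(1, 5))
    return vectorizer.fit_transform(train_texts), vectorizer.transform(test_texts)

# Average-pooled word embeddings; out-of-vocabulary words contribute
# a zero vector of the same dimensionality.
def embedding_features(texts, word_vectors, dim=200):
    rows = []
    for text in texts:
        vectors = [word_vectors.get(w, np.zeros(dim)) for w in text.split()]
        rows.append(np.mean(vectors, axis=0) if vectors else np.zeros(dim))
    return np.vstack(rows)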
Preprocessing

Our submissions use raw, unprocessed texts, including tags and URLs. We have also experimented with different preprocessing and feature extraction steps for which we did not make any submission. We consider multiple approaches in this direction:

1. clean - changing the entire text to lowercase and removing hashtags and links;

2. nps - replacing the text with its noun phrases; these features contain the nouns and surrounding attributes that can highlight misogynous remarks;

3. fct words - classification based on function word occurrence; these words cover stylistic information of texts. We collected a list of Italian conjunctions, prepositions, connectors, etc. for this purpose;

4. POS n-grams - n-grams with n ranging from 1 to 5 over part-of-speech tags; these features would indicate a certain syntactic and stylistic pattern in misogynous or aggressive texts;

5. pronouns - n-grams with n ranging from 1 to 5 over the pronouns and pronoun properties from the texts; we observed an increased usage of second-person pronouns in aggressive expressions;

6. filter POS - n-grams over a filtered set of words and POS tags.

For POS tagging and noun phrase extraction, we use the default outputs from the Italian spaCy model, trained on the dataset provided by Bosco et al. (2014). In addition, for each word we use the tag that covers an entire set of morphological features, separated by whitespace; e.g., "Gender=Masc, Number=Sing, Person=2, PronType=Prs" becomes "Masc Sing 2 Prs".

We expect the noun phrases to be less effective at detecting aggressive behaviour because aggressiveness often involves verbal constructs and actions.
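The sequences for the POS, morphological-tag, and pronoun features can be obtained from spaCy roughly as follows; the snippet assumes a spaCy 3.x morph attribute and the it_core_news_lg model, and the helper name is ours. Noun phrases were taken from the parser output in a similar way (not shown).

import spacy

# Italian spaCy model used for tagging (assumed installed).
nlp = spacy.load("it_core_news_lg")

def shallow_sequences(text):
    doc = nlp(text)
    # Coarse part-of-speech sequence, later fed to an n-gram vectorizer.
    pos_seq = " ".join(tok.pos_ for tok in doc)
    # Flattened morphological tags: "Gender=Masc|Number=Sing|Person=2|PronType=Prs"
    # becomes "Masc Sing 2 Prs".
    morph_seq = " ".join(
        " ".join(feat.split("=")[1] for feat in str(tok.morph).split("|") if feat)
        for tok in doc if str(tok.morph)
    )
    # Pronoun tokens only, used for the "pronouns" feature set.
    pron_seq = " ".join(tok.text.lower() for tok in doc if tok.pos_ == "PRON")
    return pos_seq, morph_seq, pron_seq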
3 Results and Discussion

In this work, we only describe the submissions for Task A of the competition, which is a classification task for the identification of misogynous and aggressive texts. Task B measures the bias of such classifiers with respect to certain concepts. Our submissions for Task B are extracted from tf-idf representations of word n-grams and obtain the lowest scores in the competition.

Table 1 contains the submitted runs for Task A and the experiments we carried out to get a better understanding of the subtleties that misogynistic and/or aggressive tweets contain. The CV F1 columns contain the average F1 scores computed over 10-fold cross-validation repeated for ten iterations. Each cross-validation train-test split is stratified to preserve the proportions of misogynous and/or aggressive texts in both splits. The Test F1 columns are the results obtained on the gold standard test set. In the last column, we provide the macro F1 resulting from averaging the F1 scores of the aggressiveness and misogyny predictions.
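As a rough sketch of this evaluation protocol (assuming scikit-learn, with make_classifier as in the earlier sketch, and X, y standing for a feature matrix and a label vector):

import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Ten iterations of stratified 10-fold cross-validation; the reported
# CV F1 is the average over all folds and iterations.
def repeated_cv_f1(X, y, iterations=10):
    scores = []
    for seed in range(iterations):
        cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
        scores.extend(cross_val_score(make_classifier(), X, y, cv=cv, scoring="f1"))
    return float(np.mean(scores))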
Feature               Misogyny            Aggressiveness
                      CV F1   Test F1     CV F1   Test F1     F1 Macro
tf-idf, run1          0.883   0.71        0.8     0.652       0.681
glove, run2           0.818   0.717       0.741   0.616       0.666
spacy, run3           0.842   0.733       0.767   0.635       0.684
clean, tf-idf         0.881   0.706       0.791   0.669       0.688
clean, glove          0.847   0.722       0.766   0.618       0.67
clean, spacy          0.846   0.746       0.784   0.655       0.7
nps, clean, tf-idf    0.876   0.714       0.79    0.654       0.684
nps, clean, spacy     0.837   0.728       0.768   0.646       0.687
fct words             0.672   0.628       0.614   0.564       0.596
POS n-grams           0.754   0.573       0.723   0.607       0.59
pronouns              0.594   0.596       0.656   0.636       0.616
filter POS            0.832   0.731       0.765   0.657       0.694

Table 1: Cross-validation and test-set results of the logistic regression classifier with different feature extraction processes.

The submitted runs show that the tf-idf vectorizer from run1, although it scored better during the cross-validation stage, ended up being outperformed by the word embeddings extracted from spaCy (run3, 0.684 macro F1), being unable to generalize to the new texts. The second run (run2, 0.666 macro F1) uses the GloVe pre-trained embeddings for English. This result represents the biggest surprise of the three, since it did not use Italian embeddings. We observe that the English GloVe representations cover more than 60% of our Italian vocabulary.

Cleaned texts aid the classifier by a significant margin. In our experiments, removing tags and URLs led to a significant increase in macro scores for the same approaches applied to the cleaned texts. The best result we obtained so far (0.7 macro score) uses the Italian spaCy average vector representations extracted from clean texts.

Noun phrases extracted from each cleaned text do not indicate significant increases in the detection of misogynous or aggressive texts. Using these features yields scores comparable to the best of our methods, surpassing the classification attempts on uncleaned texts. This indicates that noun phrases alleviate the noise captured by the tf-idf vectorizer: the model was less prone to overfitting and, therefore, better able to adapt to the unseen data.

Function words are features with grammatical roles, consisting of conjunctions, prepositions, articles, etc., encompassing stylistic aspects of the texts. We tested the accuracy of a simple logistic regression using function words, and the results were higher than 50% by a non-trivial amount. This is a potential indicator that misogynistic and/or aggressive tweets have a slightly different syntax than those that do not fit into either of the two categories. Moreover, using the tf-idf vectorizer on plain function words achieved 0.628 F1 on the test set for misogyny identification, a result that is not at all negligible, given that these words do not encapsulate meaning.

POS n-grams are yet another set of features capable of capturing shallow syntactic constructs. Using this feature set, we observed a strong overfitting tendency in the cross-validation scenarios (average F1 of 0.754 for misogyny and 0.723 for aggressiveness), while on the gold test set the macro F1 score is 0.59. This is an indicator that certain syntactic patterns do occur in misogynistic and aggressive texts, weakly differentiating them from other types of texts. However, these features have little power to generalize to new samples.

Pronouns reveal the most interesting result, for two reasons: 1) the features did not overfit the data, as indicated by cross-validation F1 scores that are close to the actual scores on the gold test set; 2) aggressive texts can be differentiated using only pronouns, with an F1 score (0.636) that is comparable to more advanced methods using richer features such as embeddings (0.655, for the embeddings over clean texts) or the tf-idf vectorizer (0.669, for tf-idf over clean texts). Therefore, in terms of aggressiveness, it is clear that certain expressions using forms of second-person pronouns are typically used to construct call-out phrases or curse-word expressions. The most common pronoun observed in aggressive texts is ti - the second-person singular accusative of the pronoun tu ('you').

Filter POS takes into account the n-grams of words and POS tags extracted from the following categories: nouns, adverbs, adpositions, determiners, adjectives, verbs, pronouns, and auxiliary verbs. These features obtain the second-best result (0.694 F1 macro score) of all our attempts. Again, in this situation, we are facing a big difference between the cross-validation results and the results on the released test set.

4 Discussion

The results show that the vanilla feature extraction methods suffered from a non-trivial amount of overfitting. Despite the fact that we carried out a stratified 10-fold cross-validation over ten iterations, the average F1 scores obtained on the test set were considerably lower than the ones we obtained in our separate experiments. The cross-validation scores of these methods reached over 88%. In the cross-validation evaluation on the training set, tf-idf produced the best results; on the test set, embeddings proved to generalize better.

Preprocessing the texts by removing stopwords, hashtags, links, and other types of noise proved to be beneficial for the classifier. The best results were obtained by extracting average embeddings from clean texts. Overall, word embeddings were more consistent when comparing cross-validation results with the test results for misogyny detection.

At a shallow eye-check, we noticed in the test set several examples labeled as misogynous with no apparent reason: "troppo acida... non mangio yogurt" ('too sour... I don't eat yogurt'), "Impiccati" ('hang yourself'), "#nome?". We can only assume that the misogynistic character of these comments is given by the context in which they were posted. On the test set, it also appears that the majority of misogynistic comments are remarks on different body parts, most likely posted as comments to pictures. It is, therefore, difficult to assess the misogynistic character of a short text without having at hand the full multi-modal context: to whom it was posted, what kind of relation exists between the "commenter" and the "commentee", whether the tweet is a reply or a single post, and so on.

It is worth noting that most text classification papers mention or use BERT (Bidirectional Encoder Representations from Transformers), as it has proven to be one of the most accurate models when facing different types of data (Pamungkas et al., 2020). Other state-of-the-art methods are LSTMs (Long Short-Term Memory networks) and XLNet, the latter overtaking BERT on various tasks (Yang et al., 2019). A current issue with such methods and with word embeddings is that they transfer the human bias present in large corpora. This is becoming a bigger problem as AI filters are prevalent in today's society, and therefore discriminatory traits of the models become discriminatory real-world actions. For example, textual embeddings trained on Wikipedia data show discriminatory traits towards minorities, such as associating foreigners with criminals, homosexuality with corruption, men with aggression, and women with the idea of the loving wife (Papakyriakopoulos et al., 2020). Basta et al. (2019) find that word embeddings are more likely to be discriminatory and biased than their contextualized counterparts, implying that state-of-the-art methods are moving in the right direction. However, as the models are getting closer to understanding language, one cannot help but wonder whether this will have a negative impact on their bias if precautions aren't taken, as they might be overly impacted by the ubiquitous bias humans carry. Due to the widespread automatisation of daily tasks using machine learning models, mitigating prejudice becomes a responsibility of the developers, as it is crucial for obtaining equal opportunities and treatment of minorities.

5 Conclusions

Our results indicate that simple feature engineering and vanilla classifiers cannot distinguish misogynistic/aggressive tweets with reliable accuracy and that more research is needed to understand the important features for this task. However, the experiments imply a correlation between a text's syntax and its misogynistic/aggressive value. This suggests that text falling into either category (or perhaps even hate speech in general) has a slightly more recognisable grammatical pattern than text that does not. Whether it is the POS n-grams, the pronouns, or just the function words, the wording matters and is worth looking into for more advanced feature engineering.

References
Valerio Basile, Cristina Bosco, Elisabetta Fersini, Debora Nozza, Viviana Patti, Francisco Manuel Rangel Pardo, Paolo Rosso, Manuela Sanguinetti, et al. 2019. SemEval-2019 Task 5: Multilingual detection of hate speech against immigrants and women in Twitter. In 13th International Workshop on Semantic Evaluation, pages 54-63. Association for Computational Linguistics.

Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro. 2020. EVALITA 2020: Overview of the 7th evaluation campaign of natural language processing and speech tools for Italian. In Proceedings of Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.

Christine Basta, Marta Ruiz Costa-jussà, and Noe Casas. 2019. Evaluating the underlying gender bias in contextualized word embeddings. CoRR, abs/1904.08783.

Cristina Bosco, Felice Dell'Orletta, Simonetta Montemagni, Manuela Sanguinetti, and Maria Simi. 2014. The EVALITA 2014 dependency parsing task. In EVALITA 2014 Evaluation of NLP and Speech Tools for Italian, pages 1-8. Pisa University Press.

Lars Buitinck, Gilles Louppe, Mathieu Blondel, Fabian Pedregosa, Andreas Mueller, Olivier Grisel, Vlad Niculae, Peter Prettenhofer, Alexandre Gramfort, Jaques Grobler, Robert Layton, Jake VanderPlas, Arnaud Joly, Brian Holt, and Gaël Varoquaux. 2013. API design for machine learning software: experiences from the scikit-learn project. In ECML PKDD Workshop: Languages for Data Mining and Machine Learning, pages 108-122.

Alina Maria Ciobanu, Sergiu Nisioi, and Liviu P. Dinu. 2016. Vanilla classifiers for distinguishing between similar languages. In Proceedings of the VarDial Workshop.

Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871-1874.

Elisabetta Fersini, Debora Nozza, and Paolo Rosso. 2018a. Overview of the EVALITA 2018 task on automatic misogyny identification (AMI). EVALITA Evaluation of NLP and Speech Tools for Italian, 12:59.

Elisabetta Fersini, Paolo Rosso, and Maria Anzovino. 2018b. Overview of the task on automatic misogyny identification at IberEval 2018. IberEval@SEPLN, 2150:214-228.

Elisabetta Fersini, Debora Nozza, and Paolo Rosso. 2020. AMI @ EVALITA2020: Automatic misogyny identification. In Proceedings of the 7th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA 2020), Online. CEUR.org.

Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.

Shervin Malmasi, Keelan Evanini, Aoife Cahill, Joel R. Tetreault, Robert A. Pugh, Christopher Hamill, Diane Napolitano, and Yao Qian. 2017. A report on the 2017 native language identification shared task. In BEA@EMNLP, pages 62-75. Association for Computational Linguistics.

Endang Wahyu Pamungkas, Valerio Basile, and Viviana Patti. 2020. Misogyny detection in Twitter: a multilingual and cross-domain study. Information Processing & Management, 57(6):102360.

Orestis Papakyriakopoulos, Simon Hegelich, Juan Carlos Medina Serrano, and Fabienne Marco. 2020. Bias in word embeddings. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, FAT* '20, pages 446-457, New York, NY, USA. Association for Computing Machinery.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding.

Marcos Zampieri, Shervin Malmasi, Nikola Ljubešić, Preslav Nakov, Ahmed Ali, Jörg Tiedemann, Yves Scherrer, and Noëmi Aepli. 2017. Findings of the VarDial evaluation campaign 2017. In Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), pages 1-15, Valencia, Spain. Association for Computational Linguistics.