MDD @ AMI: Vanilla Classifiers for Misogyny Identification

Samer El Abassi (Faculty of Mathematics and Computer Science, University of Bucharest) samer.el-abassi@s.unibuc.ro
Sergiu Nisioi (Human Language Technologies Research Center, University of Bucharest) sergiu.nisioi@unibuc.ro

Copyright (c) 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

In this report, we present a set of vanilla classifiers that we used to identify misogynous and aggressive texts in Italian social media. Our analysis shows that simple classifiers with little feature engineering have a strong tendency to overfit and yield a strong bias on the test set. Additionally, we investigate the usefulness of function words, pronouns, and shallow-syntactic features to observe whether misogynous or aggressive texts have specific stylistic elements.

1 Introduction

This paper discusses our submission (team MDD) to the Evalita 2020 Automatic Misogyny Identification Shared Task (Fersini et al., 2020; Basile et al., 2020), Task A. Our methods consist of a set of simple vanilla classifiers that we employ to assess their effectiveness on the datasets provided by the organizers. The systems we submitted for evaluation use a logistic regression classifier with little hyperparameter tuning or feature engineering, trained on tf-idf features and average word embedding pooling. Previous reports on misogyny (Fersini et al., 2018b,a) and aggressiveness (Basile et al., 2019) detection indicate that support vector machines and logistic regression classifiers effectively identify these patterns in social media posts. Furthermore, vanilla classifiers with little feature engineering were successfully used for other shared tasks, such as identifying dialectal varieties (Ciobanu et al., 2016; Zampieri et al., 2017) or native language identification (Malmasi et al., 2017), where high scores were obtained by simple approaches using SVMs or logistic regression classifiers.

The classifiers we built achieved relatively good accuracy in our cross-validation tests; however, for this competition, the results obtained by our systems are not among the top-scoring ones and turn out to be a poor fit, with a significant tendency towards biased results.

In addition to describing our submissions, in this report we analyze the errors of our systems and bring into discussion several topic-independent features to: 1) test the effectiveness of part-of-speech n-grams, function words, and pronouns on the task of identifying misogynous and aggressive texts on social media, and 2) observe whether texts labeled as misogynous or aggressive have a particular bias towards certain grammatical structures.

2 System Description

At the basis of our submissions is a logistic regression classifier with the liblinear (Fan et al., 2008) solver, l2 penalty, and regularization constant C = 3, chosen based on different cross-validation iterations. In addition, we introduced a heuristic at prediction time in which we predict a text not to be aggressive if it was not categorized as misogynous.

The difference between our three submissions for Task A consists in the feature extraction process:

MDD.A.r.c.run1 is the logreg model trained on tf-idf of word n-grams, with n ranging from 1 to 5;

MDD.A.r.u.run2 is the logreg model trained on pre-trained GloVe Twitter embeddings of size 200, trained on 27 billion words (English model GloVe.twitter.27B.200d, https://nlp.stanford.edu/projects/GloVe/);

MDD.A.r.u.run3 is the logreg model using spaCy (Honnibal and Montani, 2017) FastText CBOW embeddings pre-trained on Wikipedia and OSCAR (Common Crawl) (model it_core_news_lg, version 2.3.0, https://spacy.io/models/it).
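For reference, a minimal sketch of the base classifier and the prediction-time heuristic is given below. It assumes scikit-learn; the helper names and the binary 0/1 label encoding are ours and not necessarily the exact submission code.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Base classifier shared by all runs: liblinear solver, l2 penalty, C = 3.
def make_classifier():
    return LogisticRegression(solver="liblinear", penalty="l2", C=3)

def predict_with_heuristic(X_train, y_mis, y_agg, X_test):
    # X_* are feature matrices (tf-idf or averaged embeddings);
    # y_mis and y_agg are binary (0/1) misogyny and aggressiveness labels.
    mis_clf = make_classifier().fit(X_train, y_mis)
    agg_clf = make_classifier().fit(X_train, y_agg)
    mis_pred = mis_clf.predict(X_test)
    agg_pred = agg_clf.predict(X_test)
    # Prediction-time heuristic: a text not predicted as misogynous
    # is never predicted as aggressive.
    agg_pred = np.where(mis_pred == 0, 0, agg_pred)
    return mis_pred, agg_pred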
The second run is trained on English GloVe embeddings that, surprisingly, contain representations for more than half of our Italian vocabulary, i.e., approximately 9,500 words out of the total vocabulary size of 15,000 in our data. The English GloVe embeddings cover code-switching, emojis, and basic Italian words. Despite this run having the lowest evaluation score of our submissions (0.666 macro F1), we believe it provides a decent estimation for identifying non-misogynous texts.

2.1 Feature Extraction

Our feature extraction processes for the submissions are simple. The first one uses the tf-idf vectorizer (Buitinck et al., 2013) on word n-grams, with n ranging from 1 to 5, to cover more word context. Tf-idf features were used for their ability to quantify the importance of an n-gram with respect to the entire corpus. The second feature set is based on pre-trained word representations: we look up the embedding of every word in the text and average them to obtain a single representation. For words not present in the embeddings, an array of zeros with the same dimensions was used.
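A minimal sketch of both feature extractors is shown below, assuming scikit-learn, a word_vectors dictionary mapping tokens to NumPy arrays (e.g., loaded from the GloVe files), and a naive whitespace tokenizer; the actual tokenization may differ.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# tf-idf over word n-grams with n ranging from 1 to 5.
def tfidf_features(train_texts, test_texts):
    vectorizer = TfidfVectorizer(ngram_range=(1, 5))
    return vectorizer.fit_transform(train_texts), vectorizer.transform(test_texts)

# Average-pooled word embeddings; out-of-vocabulary words contribute
# a zero vector of the same dimensionality.
def embedding_features(texts, word_vectors, dim=200):
    rows = []
    for text in texts:
        vectors = [word_vectors.get(w, np.zeros(dim)) for w in text.split()]
        rows.append(np.mean(vectors, axis=0) if vectors else np.zeros(dim))
    return np.vstack(rows)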
Preprocessing

Our submissions use raw, unprocessed texts, including tags and URLs. We have also experimented with different preprocessing and feature extraction steps for which we did not make any submission. We consider multiple approaches in this direction:

1. clean - changing the entire text to lowercase and removing hashtags and links;

2. nps - replacing the text with its noun phrases; these features contain the nouns and surrounding attributes that can highlight misogynous remarks;

3. fct words - classification based on function word occurrence; these words cover stylistic information of texts. We collected a list of Italian conjunctions, prepositions, connectors, etc. for this purpose;

4. POS n-grams - n-grams with n ranging from 1 to 5 over part-of-speech tags; these features would indicate a certain syntactic and stylistic pattern in misogynous or aggressive texts;

5. pronouns - n-grams with n ranging from 1 to 5 over the pronouns and pronoun properties from the texts; we observed an increased usage of second-person pronouns in aggressive expressions;

6. filter POS - n-grams over a filtered set of words and POS tags.

For POS tagging and noun phrase extraction, we use the default outputs from the Italian spaCy model, trained on the dataset provided by Bosco et al. (2014). In addition, for each word we use the tag that covers an entire set of morphological features, separated by whitespace; e.g., "Gender=Masc, Number=Sing, Person=2, PronType=Prs" becomes "Masc Sing 2 Prs".

We expect the noun phrases to be less effective at detecting aggressive behaviour because aggressiveness often involves verbal constructs and actions.
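The sequences for the POS, morphological-tag, and pronoun features can be obtained from spaCy roughly as follows; the snippet assumes a spaCy 3.x morph attribute and the it_core_news_lg model, and the helper name is ours. Noun phrases were taken from the parser output in a similar way (not shown).

import spacy

# Italian spaCy model used for tagging (assumed installed).
nlp = spacy.load("it_core_news_lg")

def shallow_sequences(text):
    doc = nlp(text)
    # Coarse part-of-speech sequence, later fed to an n-gram vectorizer.
    pos_seq = " ".join(tok.pos_ for tok in doc)
    # Flattened morphological tags: "Gender=Masc|Number=Sing|Person=2|PronType=Prs"
    # becomes "Masc Sing 2 Prs".
    morph_seq = " ".join(
        " ".join(feat.split("=")[1] for feat in str(tok.morph).split("|") if feat)
        for tok in doc if str(tok.morph)
    )
    # Pronoun tokens only, used for the "pronouns" feature set.
    pron_seq = " ".join(tok.text.lower() for tok in doc if tok.pos_ == "PRON")
    return pos_seq, morph_seq, pron_seq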
3 Results and Discussion

In this work, we only describe the submissions for Task A of the competition, which is a classification task for the identification of misogynous and aggressive texts. Task B measures the bias of such classifiers with respect to certain concepts. Our submissions for Task B are extracted from tf-idf representations of word n-grams and obtain the lowest scores in the competition.

Table 1 contains the submitted runs for Task A and the experiments we carried out to get a better understanding of the subtleties that misogynistic and/or aggressive tweets contain. The CV F1 columns contain the average F1 scores computed over 10-fold cross-validation repeated for ten iterations. Each cross-validation train-test split is stratified to preserve the proportions of misogynous and/or aggressive texts in both splits. The Test F1 columns are the results obtained on the gold standard test set. In the last column, we provide the macro F1 resulting from averaging the F1 scores of the aggressiveness and misogyny predictions.
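As a rough sketch of this evaluation protocol (assuming scikit-learn, with make_classifier as in the earlier sketch, and X, y standing for a feature matrix and a label vector):

import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Ten iterations of stratified 10-fold cross-validation; the reported
# CV F1 is the average over all folds and iterations.
def repeated_cv_f1(X, y, iterations=10):
    scores = []
    for seed in range(iterations):
        cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
        scores.extend(cross_val_score(make_classifier(), X, y, cv=cv, scoring="f1"))
    return float(np.mean(scores))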
Feature               Misogyny            Aggressiveness
                      CV F1   Test F1     CV F1   Test F1     F1 Macro
tf-idf, run1          0.883   0.71        0.8     0.652       0.681
glove, run2           0.818   0.717       0.741   0.616       0.666
spacy, run3           0.842   0.733       0.767   0.635       0.684
clean, tf-idf         0.881   0.706       0.791   0.669       0.688
clean, glove          0.847   0.722       0.766   0.618       0.67
clean, spacy          0.846   0.746       0.784   0.655       0.7
nps, clean, tf-idf    0.876   0.714       0.79    0.654       0.684
nps, clean, spacy     0.837   0.728       0.768   0.646       0.687
fct words             0.672   0.628       0.614   0.564       0.596
POS n-grams           0.754   0.573       0.723   0.607       0.59
pronouns              0.594   0.596       0.656   0.636       0.616
filter POS            0.832   0.731       0.765   0.657       0.694

Table 1: Cross-validation and test-set results of the logistic regression classifier with different feature extraction processes.

The submitted runs show that the tf-idf vectorizer from run1, although it scored better during the cross-validation stage, ended up being outperformed by the word embeddings extracted from spaCy (run3, 0.684 macro F1), being unable to generalize to the new texts. The second run (run2, 0.666 macro F1) uses the GloVe pre-trained embeddings for English. This result represents the biggest surprise of the three, since it did not use Italian embeddings. We observe that the English GloVe representations cover more than 60% of our Italian vocabulary.

Cleaned texts aid the classifier by a significant margin. In our experiments, removing tags and URLs led to a significant increase in macro scores for the same approaches applied to the cleaned texts. The best result we obtained so far (0.7 macro score) uses the Italian spaCy average vector representations extracted from clean texts.

Noun phrases extracted from each cleaned text do not indicate significant increases in the detection of misogynous or aggressive texts. Using these features yields scores comparable to the best of our methods, surpassing the classification attempts on uncleaned texts. This indicates that noun phrases alleviate the noise captured by the tf-idf vectorizer: the model was less prone to overfitting and, therefore, better able to adapt to the unseen data.

Function words are features with grammatical roles, consisting of conjunctions, prepositions, articles, etc., encompassing stylistic aspects of the texts. We tested the accuracy of a simple logistic regression using function words, and the results were higher than 50% by a non-trivial amount. This is a potential indicator that misogynistic and/or aggressive tweets have a slightly different syntax than those that do not fit into either of the two categories. Moreover, using the tf-idf vectorizer on plain function words achieved 0.628 F1 on the test set for misogyny identification, a result that is not at all negligible, given that these words do not encapsulate meaning.

POS n-grams are yet another set of features capable of capturing shallow syntactic constructs. Using this feature set, we observed a strong overfitting tendency in the cross-validation scenarios (average F1 of 0.754 for misogyny and 0.723 for aggressiveness), while on the gold test set the macro F1 score is 0.59. This is an indicator that certain syntactic patterns do occur in misogynistic and aggressive texts, weakly differentiating them from other types of texts. However, these features have little power to generalize to new samples.

Pronouns reveal the most interesting result, for two reasons: 1) the features did not overfit the data, as indicated by cross-validation F1 scores that are close to the actual scores on the gold test set; 2) aggressive texts can be differentiated using only pronouns, with an F1 score (0.636) that is comparable to more advanced methods using richer features such as embeddings (0.655, for the embeddings over clean texts) or the tf-idf vectorizer (0.669, for tf-idf over clean texts). Therefore, in terms of aggressiveness, it is clear that certain expressions using forms of second-person pronouns are typically used to construct call-out phrases or curse-word expressions. The most common pronoun observed in aggressive texts is ti - the second-person singular accusative of the pronoun tu ('you').

Filter POS takes into account the n-grams of words and POS tags extracted from the following categories: nouns, adverbs, adpositions, determiners, adjectives, verbs, pronouns, and auxiliary verbs. These features obtain the second-best result (0.694 F1 macro score) of all our attempts. Again, in this situation, we are facing a big difference between the cross-validation results and the results on the released test set.

4 Discussion

The results show that the vanilla feature extraction methods suffered from a non-trivial amount of overfitting. Despite the fact that we carried out a stratified 10-fold cross-validation over ten iterations, the average F1 scores obtained on the test set were considerably lower than the ones we obtained in our separate experiments. The cross-validation scores of these methods reached over 88%. In the cross-validation evaluation on the training set, tf-idf produced the best results; on the test set, embeddings proved to generalize better.

Preprocessing the texts by removing stopwords, hashtags, links, and other types of noise proved to be beneficial for the classifier. The best results were obtained by extracting average embeddings from clean texts. Overall, word embeddings were more consistent when comparing cross-validation results with the test results for misogyny detection.

At a shallow eye-check, we noticed in the test set several examples labeled as misogynous with no apparent reason: "troppo acida... non mangio yogurt" ('too sour... I don't eat yogurt'), "Impiccati" ('hang yourself'), "#nome?". We can only assume that the misogynistic character of these comments is given by the context in which they were posted. On the test set, it also appears that the majority of misogynistic comments are remarks on different body parts, most likely posted as comments to pictures. It is, therefore, difficult to assess the misogynistic character of a short text without having at hand the full multi-modal context: to whom it was posted, what kind of relation exists between the "commenter" and the "commentee", whether the tweet is a reply or a single post, and so on.

It is worth noting that most text classification papers mention or use BERT (Bidirectional Encoder Representations from Transformers), as it has proven to be one of the most accurate models when facing different types of data (Pamungkas et al., 2020). Other state-of-the-art methods are LSTMs (Long Short-Term Memory networks) and XLNet, the latter overtaking BERT on various tasks (Yang et al., 2019). A current issue with such methods and with word embeddings is that they transfer the human bias present in large corpora. This is becoming a bigger problem as AI filters are prevalent in today's society, and therefore discriminatory traits of the models become discriminatory real-world actions. For example, textual embeddings trained on Wikipedia data show discriminatory traits towards minorities, such as associating foreigners with criminals, homosexuality with corruption, men with aggression, and women with the idea of the loving wife (Papakyriakopoulos et al., 2020). Basta et al. (2019) find that word embeddings are more likely to be discriminatory and biased than their contextualized counterparts, implying that state-of-the-art methods are moving in the right direction. However, as the models are getting closer to understanding language, one cannot help but wonder whether this will have a negative impact on their bias if precautions aren't taken, as they might be overly impacted by the ubiquitous bias humans carry. Due to the widespread automatisation of daily tasks using machine learning models, mitigating prejudice becomes a responsibility of the developers, as it is crucial for obtaining equal opportunities and treatment of minorities.

5 Conclusions

Our results indicate that simple feature engineering and vanilla classifiers cannot distinguish misogynistic/aggressive tweets with reliable accuracy and that more research is needed to understand the important features for this task. However, the experiments imply a correlation between a text's syntax and its misogynistic/aggressive value. This suggests that text falling into either category (or perhaps even hate speech in general) has a slightly more recognisable grammatical pattern than text that does not. Whether it is the POS n-grams, the pronouns, or just the function words, the wording matters and is worth looking into for more advanced feature engineering.

References
Valerio Basile, Cristina Bosco, Elisabetta Fersini, Debora Nozza, Viviana Patti, Francisco Manuel Rangel Pardo, Paolo Rosso, Manuela Sanguinetti, et al. 2019. SemEval-2019 Task 5: Multilingual detection of hate speech against immigrants and women in Twitter. In 13th International Workshop on Semantic Evaluation, pages 54-63. Association for Computational Linguistics.

Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro. 2020. EVALITA 2020: Overview of the 7th evaluation campaign of natural language processing and speech tools for Italian. In Proceedings of Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.

Christine Basta, Marta Ruiz Costa-jussà, and Noe Casas. 2019. Evaluating the underlying gender bias in contextualized word embeddings. CoRR, abs/1904.08783.

Cristina Bosco, Felice Dell'Orletta, Simonetta Montemagni, Manuela Sanguinetti, and Maria Simi. 2014. The EVALITA 2014 dependency parsing task. In EVALITA 2014 Evaluation of NLP and Speech Tools for Italian, pages 1-8. Pisa University Press.

Lars Buitinck, Gilles Louppe, Mathieu Blondel, Fabian Pedregosa, Andreas Mueller, Olivier Grisel, Vlad Niculae, Peter Prettenhofer, Alexandre Gramfort, Jaques Grobler, Robert Layton, Jake VanderPlas, Arnaud Joly, Brian Holt, and Gaël Varoquaux. 2013. API design for machine learning software: experiences from the scikit-learn project. In ECML PKDD Workshop: Languages for Data Mining and Machine Learning, pages 108-122.

Alina Maria Ciobanu, Sergiu Nisioi, and Liviu P. Dinu. 2016. Vanilla classifiers for distinguishing between similar languages. In Proceedings of the VarDial Workshop.

Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871-1874.

Elisabetta Fersini, Debora Nozza, and Paolo Rosso. 2018a. Overview of the EVALITA 2018 task on automatic misogyny identification (AMI). EVALITA Evaluation of NLP and Speech Tools for Italian, 12:59.

Elisabetta Fersini, Paolo Rosso, and Maria Anzovino. 2018b. Overview of the task on automatic misogyny identification at IberEval 2018. IberEval@SEPLN, 2150:214-228.

Elisabetta Fersini, Debora Nozza, and Paolo Rosso. 2020. AMI @ EVALITA2020: Automatic misogyny identification. In Proceedings of the 7th evaluation campaign of Natural Language Processing and Speech tools for Italian (EVALITA 2020), Online. CEUR.org.

Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.

Shervin Malmasi, Keelan Evanini, Aoife Cahill, Joel R. Tetreault, Robert A. Pugh, Christopher Hamill, Diane Napolitano, and Yao Qian. 2017. A report on the 2017 native language identification shared task. In BEA@EMNLP, pages 62-75. Association for Computational Linguistics.

Endang Wahyu Pamungkas, Valerio Basile, and Viviana Patti. 2020. Misogyny detection in Twitter: a multilingual and cross-domain study. Information Processing & Management, 57(6):102360.

Orestis Papakyriakopoulos, Simon Hegelich, Juan Carlos Medina Serrano, and Fabienne Marco. 2020. Bias in word embeddings. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, FAT* '20, pages 446-457, New York, NY, USA. Association for Computing Machinery.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding.

Marcos Zampieri, Shervin Malmasi, Nikola Ljubešić, Preslav Nakov, Ahmed Ali, Jörg Tiedemann, Yves Scherrer, and Noëmi Aepli. 2017. Findings of the VarDial evaluation campaign 2017. In Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), pages 1-15, Valencia, Spain. Association for Computational Linguistics.