MB-Courage @ EXIST: GCN Classification for Sexism Identification in Social Networks? Rodrigo Wilkens and Dimitri Ognibene University of Milano-Bicocca, Italy {rodrigo.souzawilkens,dimitri.ognibene}@unimib.it Abstract. We describe our approach (MB-Courage team) in the Sexism Identification in Social Networks Shared Task (EXIST). We submitted three runs for each task, two of them based on Graph Convolutional Neural Networks (GCN) exploring different edge creation strategies and one combining graph embeddings from different GCN through ensemble methods. In addition, we explored different GCN models and text-to- graph strategies. We identified that in Task 2 the models take advantage of the syntactic relationship between words encoded in the graph, while it did not strongly impact Task 1. Moreover, the models generalized the task while maintaining similar (in some cases better) results in the social network that was not used in training. On average, our best models performed similarly across languages and social media, ranking 37th (out of 72 runs) for Task 1 and 40th (out of 63) for Task 2. Keywords: Graph Neural Network · MeanPooling · set2set · EXIST. 1 Introduction Social media (SM) advent has been described as nothing less than a shift in the communication paradigm [1], or in other words, the freedom to publish marks the birth of a new era altogether [2]. There is obviously ample evidence of SM use’s positive effects that go beyond just-in-time connectivity with a network of friends and like-minded people. However, far from creating a global space for mu- tual understanding, truthful and objective information, the large-scale growth of SM has also fostered negative social phenomena [24]. Therefore, threats on Social Media have become extensively studied, and Hate Speech (HS) is a fre- quent topic. Classification of text generated by social media, and Twitter, in particular [22], poses several significant challenges, between which their infor- mality, noisiness, and limited size, leading first to a lack of features for classifica- tion, negatively affecting results, and second to a lack of context and ambiguity. Due to the tasks’ difficulty and the necessity of generating different responses, some works discriminate the type and target of the hate, including sexism (e.g., ? IberLEF 2021, September 2021, Málaga, Spain. Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). [35, 36, 34, 13, 9, 21, 25, 7, 19, 5, 29]). This line of research may support a measure of collective well-being on social media and social media governance strategies that aim at improving it [24]. Recently, the sEXism Identification in Social neTworks shared task (EXIST) [28] in the IberLEF 2021 [23] asked for systems capable of classifying sexism on microblogs. This shared task is divided into two classifying tasks: (1) the presence of sexist content and (2) the type of sexism. Moreover, this shared task provides posts written in both English and Spanish from Twitter (train and test) and gab.com (only test). In this paper, we describe the participation of the MB-Courage team at the EXIST shared task. In this, we aimed to explore Graph Convolutional Neural Networks (GCN) from the perspective of a Graph Classification task. In other words, we focus on a GCN that models each document as an in- dependent graph, which enables the inclusion of new documents without the need to retrain the model. Moreover, we examine different approaches for word association and encoding. The models are available at https://github.com/ rswilkens/courage-at-exist. In specific, this paper is organized as follows. We start presenting initiatives for sexism classification, then describing GCN models employed in this work. We explore three different GCN models: Mean- Pool, SAGpoolh and set2set. Section 3 presents the word encoding, text-to-graph strategies and the methodology to choose the submitted model. A discussion of the performance of the different models is presented in Section 4. Finally, in Section 5, we summarize our finds. 2 Related Work There exist a broad number of possible approaches for carrying out HS clas- sification. For example, Canós [6] resorted to Support Vector Machine (SVM) and TF-IDF. In complementary work, Liu et al. [18] employed SVM, Random Forests and Gradient Boosted Trees in an embedding space created by doc2vec [16] as well as a soft vote approach combining classifiers. Similarly, Shushkevich and Cardiff [30] used a blended model [33] combining Naı̈ve Bayes and SVM. Gambino and Pirrone [10] proposed combining embeddings from fastText [4] and PoS tagging embedding as input features. Hoffmann and Kruschwitz [11] explored an ensemble approach of three SVMs trained with transformer docu- ment embeddings, document pool embeddings, and TF-IDF. Rodrı́guez-Sánchez et al. [27] compared different word encodings (i.e., TF-IDF and word embed- ding) and models (i.e., LR, SVM, and Bi-LSTM), including multilingual BERT (mBERT) fine-tuned to their dataset. In these works, the text is generally en- coded, discarding the relations between the words. Exceptions, for example, are the RNN and LSTM models that encode the sequential information in their in- ternal state, but this representation relies on learning to account for elements at a longer distance. Moreover, despite some methods that use contextualized embeddings from BERT, only a summary vector (which contains limited word information) is used for the classification [8]. GCN, convolutional networks that operate on graphs, can explicitly model the relationships between words by representing words as nodes and their rela- tions as edges in the graph. Thus, in this work, we explore the GCN as a solution for identifying sexism on social networks. 2.1 Graph Neural Networks Applying deep learning models to structured data such as graphs has been pro- posed in recent years. In particular, studies have focused on generalizing con- volutional neural networks to graph data, which includes redefining the convo- lution and the downsampling (pooling) operations for graphs [17]. In a broad sense, graphical networks may be seen as a combination of simple map-reduce operations on graphs, corresponding to transformation and several aggregation operations on graphs [12]. The aggregation (also named as message passing) aims to aggregate multiple messages between a node and its context and re- duce them into one element. The graph pooling aims to aggregate elements in a graph, reducing them into high-order graph-level representations. GCN mainly differs from other neural networks in the forward step that connects nodes by considering an adjacency matrix. In terms of document classification, Kipf and Welling [14] compared the performance of GCN with different classification algorithms. Later, Yao et al. [37] proposed the TextGCN by extending the Kipf and Welling work by enriching the adjacent matrix. Although the interesting results achieved by these works, their models need to be retrained for every new document since the corpus is represented as a single graph and the model uses transductive learning. While this is not a general issue, it poses a problem for post-classification, given the speed at which new data is created on social media. In this work, we face the GCN models looking to overcome this limitation. In the remain of this section, we describe the three GCN architectures explored in this work. The MeanPool is a naı̈ve graph pooling model, which obtains graph repre- sentations by concatenating the mean pooling and max pooling results of GCNs, then the pooled graph feeds a classification layer [12]. At the most basic level, it aggregates the nodes’ neighbors after the multiplication of the node features by the weights, then sums the bias and performs the activation function. This pro- cess is realized twice. Then, the node features are averaged, resulting in a vector representation with the same dimension of the node features for each graph. Self-Attention GraphPooling (SAGpoolh ) [17] exploits the self-attention mechanism to distinguish nodes that should be dropped. The architecture com- prises three blocks (convolutional layer, pooling layer and readout layer) applied sequentially. The outputs of these blocks feed a classification layer. The con- volutional layer in SAGpoolh comprises a GCN and a self-attention mask (or intra-attention), aiming to focus on relevant input features [31]. Vinyals et al. [32], motivated by the impact of the sequence that the data is presented to the models, proposed set2set as an extension of the seq2seq framework. In set2set, the network’s next state is the concatenation of LSTM and an attention readout. The latter is defined as the multiplication of a memory vector by softmax of a scalar function (e.g., dot product) between the memory and the LSTM state. In this work, we use the set2set implementation of Hu et al. [12] in which the input of the set2set module is two stacked GCN. 3 Methodology We trained all the models following a stratified 10-fold cross-validation approach (10% of each fold as validation). We firstly trained the GCN discussed in Section 2.1, using the parser and ngram text-to-graph approaches (see Section 3.1). In addition, inspired by the performance of ensemble methods in HS (e.g., [17, 33, 11]), we apply them, aiming to explore model complementarity. In this step, we extract the GCN output embedding (classification layer input) and train an XGBoost. All these processes resulted in 7 models for each fold and task (i.e., SAGpoolh , set2set and MeanPool using parser or ngram, and XGBoost). Figure 1 illustrates this process, highlighting the main steps and the different models. 3.1 Preprocessing and text to graph We start our pipeline by cleaning and tokenizing the text before training the models. The cleaning step tokenizes the text, standardizes symbols, and replaces URL and emojis by the domain and the emoji textual representation based on the post’s language. The text is then annotated with part-of-speech, morphological feats, dependency relation (tag and attachment), and NER tag using the stanza parser [26] aiming at a rich syntactic representation. We encoded each annotation in a one-hot vector and concatenate these five vectors. Aiming at a semantic representation, we also encoded the text with mBERT embeddings. For that, we take advantage of the feature-extraction pipeline from huggingface (https: //huggingface.co), and truncate all sentences with more than 300 words long due to computational limitation. This threshold is particularly relevant for the posts from GAB in the test set. These encoding processes result in a vector of 1172 dimensions for each token (404 dimensions from syntactic annotation and 768 from mBERT). Our text-to-graph approach takes the tokens’ vectors as nodes, and we ex- plore two different strategies for the edges. The parser strategy, inspired by syntacticGCN [20, 3], links nodes using the dependency attachment, while the n-gram strategy associates all words in a context window creating mesh-like graphs. Furthermore, we set the edge weight in both strategies as the cosine similarity between mBERT features at the node level. 3.2 Submission Selection The models quickly converge to their best score in both tasks (Figure 2). For the cross-validation, we opted to use the best epoch of each model for each fold, taking the vote of the folds as the final classification. The XGBoost is trained using the 15th epoch of the GCNs for simplicity. Our preliminary tests (Table 1) Text to graph #word on post 1172 dimensions #word on post #word on post Adjacency matrix Node features parser ngram Annotation MeanPool set2set GCN SAGpoolh GCN GCN Graph Pooling Convolution GCN GCN GCN Meal Pooling set2set Graph Pooling Convolution Preprocessing ● Tokenizer Pooling Classification GCN ● Replace symbols Layer Classification ● Emoji to text Layer Graph Pooling Convolution Classification Layer XGBoost Documents Models output Fig. 1: System pipeline with the main steps (preprocessing, annotation and text to graph) and the 4 model explored, three GCN (MeanPool, SAGpoolh and set2set) and one ensemble (XGBoost) pointed XGBoost, which combines the architectures and the edge approaches, as the best model, in a second level, set2setngram (Task 1) and set2setngram= , and the set2setparser (Task 1 and 2) in a third place. We submitted the parser one for both tasks and the ngram (Task 1) and ngram= (Task 2). Table 1 shows the results of the different models (GNN and ensemble) using the training set. 4 Results The official evaluation ranked XGBoost in the 53rd (Task 1) and 49th (Task 2) position, set2setngram in 37th, set2setngram= in 44th, and set2setparser in 40th for Task 1 and 2. The consistently poor performance of XGBoost is surprising considering it combines the set2set models. To understand this result, we evaluated the scores of the XGBoost’s inputs (i.e., the output of the GCN), observing that the models (a) Average accuracy in Task 1 (b) Average macro F1-score in Task 2 Fig. 2: Average result for each 10-fold cross-validation model Table 1: Average (and standard deviation) scores per task, specifying the archi- tectures and the strategies for the graph’s edge. Best results are in bold, stars indicate submitted models, and the cross marks a mBERT only setup Architecture Edge approach Task 1 Task 2 set2set parser 0.707 (0.04)* 0.429 (0.04)* set2set ngram 0.713 (0.04)* 0.434 (0.04) set2set ngram= 0.71 (0.04) 0.453 (0.03)* SAGpoolh parser 0.683 (0.02) 0.342 (0.03) SAGpoolh ngram 0.674 (0.02) 0.358 (0.03) MeanPool parser 0.698 (0.02) 0.418 (0.02) MeanPool ngram 0.711 (0.01) 0.418 (0.02) XGBoost both 0.823 (0.01)* 0.669 (0.04)* score worse than expected (e.g., set2setngram scored 3.3% below the official score and set2setparser scored 4.5%). Looking carefully at the folds’ learning curves, we noticed that the models quickly memorize the data. Hence the models used to train the XGBoost are biased towards the training set. Moving forward in studying the set2set models, we evaluate the influence of the language and the source of the post in our results (see Table 2). Looking at the performance of the different languages in the same social media source, we perceive a substantial variation in the results when the source is GAB, but this is not the case when it is Twiter (except for set2setngram in Task 2). This is expected since the models were trained only using tweets. For example, this is marked in the difference of 0.09 and 0.06 points of F-score in set2setngram , respectively, for English and Spanish for Task 2 as well as 0.07 and 0.04 in set2setparser . Table 2 also highlights that set2setngram performs better in Task 1 while set2setparser is better in Task 2. Looking deeper at the models’ performance, we calculate their F1 score for each class (Figure 3). The first observation of this evaluation is that our models perform similarly in Task 1, except GABEN . Moreover, the negative class also score similarly in both Task 1 and 2. However, that is not true for the positive classes in Task 2. In general, the models learned better ideaological-inequality class than the other four positive classes. In GABES , they score better than the negative class, which contains more instances, and even the binary classification. The models could not learn the misogyny-non-sexual-violence class correctly, presenting consistently poor scores in all studied cases. Moreover, we observed that set2setparser performs better than set2setngram in all classes except for objectification. Table 2: Results for Task 1 (accuracy) and 2 (macro F1) of the three submitted runs. Bold results indicate the best score for a task, and italic ones are the worst set2set ngram ngram= parser XGBoost Language Source Task 1 Task 2 Task 1 Task 2 Task 1 Task 2 Both Both 0.714 0.449 0.708 0.459 0.680 0.421 EN GAB 0.745 0.347 0.727 0.392 0.705 0.297 ES GAB 0.665 0.419 0.687 0.423 0.622 0.340 EN Twitter 0.711 0.438 0.699 0.464 0.689 0.440 ES Twitter 0.723 0.479 0.717 0.461 0.682 0.440 Fig. 3: Macro average F-score per class for each model and class, also discrimi- nating the performance by source and language 5 Conclusions This paper reported the participation of the MB-Courage team at the EXIST shared task. Our participation focused on using Graph Convolutional Neural Networks (GCN) as a solution for sexism identification. In this work, we ex- plored three different GCN models (set2set, SAGpoolh and MeanPool) and their combination through ensemble methods. We also explored two different word association strategies. Concerning the GCN models, we identified a poor perfor- mance of the SAGpoolh . This is probably due to an incompatibility between the limited text size and the attention strategy. On the other hand, the set2set mod- els achieved our best scores, pointing to the importance of the word sequence in this task. As a general note about the models, we identified that they quickly learn the task, but at the same time, they also quickly memorize the training set. Assessing our best models’ performance, we noticed a similar performance in sexism identification (Task 1) for both text-to-graph approaches, but that is not the case for identifying the type of sexism. This is probably due to a lack of language knowledge in our models since mBERT poorly encodes the language compared to language-specific versions of BERT [15]. As for the nodes associa- tion strategies explored in this work, we observed better scores using the n-gram association in Task 1, while in task 2 the best score came from the parser-based word association. Again, this is probably due to a need for more language in- formation for identifying the type of sexism. Moreover, we see no significant improvement when we explicitly use syntactic annotations as features. This may have been caused by a mismatch of (sparse and dense) features or by BERT already encoding the same information. The results obtained in this work point out that GCN may be a good solution for identifying threats on social media. Moreover, these models can easily model other social media aspects (e.g., users’ relationships). However, in terms of NLP, it is not completely clear how these models perform for other threats, such as hate speech, and longer or less structured texts. Acknowledgements This work has been developed in the framework of the project COURAGE - A social media companion safeguarding and educating students (no. 95567), funded by the Volkswagen Foundation in the topic Artificial Intelligence and the Society of the Future. Bibliography [1] Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval. Addison Wesley (1999) [2] Baeza-Yates, R., Ribeiro-Neto, B. (eds.): Modern Information Retrieval. Addison-Wesley, 2nd edn. (2010) [3] Bastings, J., Titov, I., Aziz, W., Marcheggiani, D., Sima’an, K.: Graph convolutional encoders for syntax-aware neural machine translation. arXiv preprint arXiv:1704.04675 (2017) [4] Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Transactions of the Association for Computa- tional Linguistics 5, 135–146 (2017) [5] Bosco, C., Felice, D., Poletto, F., Sanguinetti, M., Maurizio, T.: Overview of the evalita 2018 hate speech detection task. In: EVALITA 2018-Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. vol. 2263, pp. 1–9. CEUR (2018) [6] Canós, J.S.: Misogyny identification through svm at ibereval 2018. In: Eval- uation of Human Language Technologies for Iberian Languages: IberEval 2018 (IberEval@SEPLN). pp. 229–233 (2018) [7] Davidson, T., Warmsley, D., Macy, M., Weber, I.: Automated hate speech detection and the problem of offensive language. In: Proceedings of the 11th International AAAI Conference on Web and Social Media. pp. 512–515. ICWSM ’17 (2017) [8] Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018) [9] Fersini, E., Rosso, P., Anzovino, M.: Overview of the task on automatic misogyny identification at ibereval 2018. Evaluation of Human Language Technologies for Iberian Languages: IberEval 2018 (IberEval@SEPLN) 2150, 214–228 (2018) [10] Gambino, G., Pirrone, R.: Chilab@ haspeede 2: Enhancing hate speech de- tection with part-of-speech tagging. In: 7th Evaluation Campaign of Nat- ural Language Processing and Speech Tools for Italian. Final Workshop, EVALITA 2020 (2020) [11] Hoffmann, J., Kruschwitz, U.: Ur nlp@ haspeede 2 at evalita 2020: To- wards robust hate speech detection with contextual embeddings. In: 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop, EVALITA 2020 (2020) [12] Hu, J., Qian, S., Fang, Q., Wang, Y., Zhao, Q., Zhang, H., Xu, C.: Effi- cient graph deep learning in tensorflow with tf geometric. arXiv preprint arXiv:2101.11552 (2021) [13] Jha, A., Mamidi, R.: When does a compliment become sexist? analysis and classification of ambivalent sexism using twitter data. In: Proceedings of the Second Workshop on NLP and Computational Social Science. pp. 7–16 (2017) [14] Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolu- tional networks. arXiv preprint arXiv:1609.02907 (2016) [15] Lavergne, E., Saini, R., Kovács, G., Murphy, K.: Thenorth@ haspeede 2: Bert-based language model fine-tuning for italian hate speech detection. In: 7th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop, EVALITA 2020. vol. 2765. CEUR-WS (2020) [16] Le, Q., Mikolov, T.: Distributed representations of sentences and docu- ments. In: International conference on machine learning. pp. 1188–1196. PMLR (2014) [17] Lee, J., Lee, I., Kang, J.: Self-attention graph pooling. In: International Conference on Machine Learning. pp. 3734–3743. PMLR (2019) [18] Liu, H., Chiroma, F., Cocea, M.: Identification and classification of misogy- nous tweets using multi-classifier fusion. In: Evaluation of Human Language Technologies for Iberian Languages: IberEval 2018 (IberEval@SEPLN). pp. 268–273. CEUR Workshop Proceedings (2018) [19] Mandl, T., Modha, S., Mandlia, C., Patel, D., Patel, A., Dave, M.: Hasoc- hate speech and offensive content identification in indo-european languages (2019) [20] Marcheggiani, D., Titov, I.: Encoding sentences with graph convolutional networks for semantic role labeling. arXiv preprint arXiv:1703.04826 (2017) [21] Mathur, P., Sawhney, R., Ayyar, M., Shah, R.: Did you offend me? classi- fication of offensive tweets in hinglish language. In: Proceedings of the 2nd Workshop on Abusive Language Online (ALW2). pp. 138–148. Association for Computational Linguistics (2018), http://aclweb.org/anthology/ W18-5118 [22] Michelson, M., Macskassy, S.A.: Discovering users’ topics of interest on twit- ter: a first look. In: Proceedings of the fourth workshop on Analytics for noisy unstructured text data. pp. 73–80 (2010) [23] Montes, M., Rosso, P., Gonzalo, J., Aragón, E., Agerri, R., Ángel Álvarez Carmona, M., Álvarez Mellado, E., de Albornoz, J.C., Chiruzzo, L., Freitas, L., Adorno, H.G., Gutiérrez, Y., Zafra, S.M.J., Lima, S., de Arco, F.M.P., Taulé, M. (eds.): Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2021), CEUR Workshop Proceedings, 2021 (2021) [24] Ognibene, D., Taibi, D., Kruschwitz, U., Wilkens, R.S., Hernandez-Leo, D., Theophilou, E., Scifo, L., Lobo, R.A., Lomonaco, F., Eimler, S., et al.: Challenging social media threats using collective well-being aware recom- mendation algorithms and an educational virtual companion. arXiv preprint arXiv:2102.04211 (2021) [25] de Pelle, R.P., Moreira, V.P.: Offensive comments in the brazilian web: a dataset and baseline results (2017) [26] Qi, P., Zhang, Y., Zhang, Y., Bolton, J., Manning, C.D.: Stanza: A python natural language processing toolkit for many human languages. arXiv preprint arXiv:2003.07082 (2020) [27] Rodrı́guez-Sánchez, F., Carrillo-de Albornoz, J., Plaza, L.: Automatic clas- sification of sexism in social networks: An empirical study on twitter data. IEEE Access 8, 219563–219576 (2020) [28] Rodrı́guez-Sánchez, F., de Albornoz, J.C., Plaza, L., Gonzalo, J., Rosso, P., Comet, M., Donoso, T.: Overview of exist 2021: sexism identification in social networks. Procesamiento del Lenguaje Natural 67(0) (2021) [29] Sanguinetti, M., Comandini, G., Di Nuovo, E., Frenda, S., Stranisci, M., Bosco, C., Caselli, T., Patti, V., Russo, I., Pisa, I.: Haspeede 2@ evalita2020: Overview of the evalita 2020 hate speech detection task. In: Proceedings of Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR. org (2020) [30] Shushkevich, E., Cardiff, J.: Classifying misogynistic tweets using a blended model: The ami shared task in ibereval 2018. In: Evaluation of Human Language Technologies for Iberian Languages: IberEval 2018 (IberEval@SEPLN). pp. 255–259 (2018) [31] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. arXiv preprint arXiv:1706.03762 (2017) [32] Vinyals, O., Bengio, S., Kudlur, M.: Order matters: Sequence to sequence for sets. arXiv preprint arXiv:1511.06391 (2015) [33] Wang, S.I., Manning, C.D.: Baselines and bigrams: Simple, good sentiment and topic classification. In: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). pp. 90–94 (2012) [34] Waseem, Z.: Are you a racist or am i seeing things? annotator influ- ence on hate speech detection on twitter. In: Proceedings of the First Workshop on NLP and Computational Social Science. pp. 138–142. As- sociation for Computational Linguistics, Austin, Texas (November 2016), http://aclweb.org/anthology/W16-5618 [35] Waseem, Z., Hovy, D.: Hateful symbols or hateful people? predictive features for hate speech detection on twitter. In: Proceedings of the NAACL student research workshop. pp. 88–93 (2016) [36] Waseem, Z., Hovy, D.: Hateful symbols or hateful people? predictive features for hate speech detection on twitter. In: Proceedings of the NAACL Student Research Workshop. pp. 88–93. Association for Computational Linguistics, San Diego, California (June 2016), http://www.aclweb.org/anthology/ N16-2013 [37] Yao, L., Mao, C., Luo, Y.: Graph convolutional networks for text classifi- cation. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 33, pp. 7370–7377 (2019)