<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>EVALITA 2023: 8th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian</journal-title>
      </journal-title-group>
      <issn>1613-0073</issn>
      <publisher>
        <publisher-name>CEUR Workshop Proceedings (CEUR-WS.org)</publisher-name>
      </publisher>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>minimalist approach</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Irene Siragusa</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roberto Pirrone</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dipartimento di Ingegneria, Università degli Studi di Palermo</institution>
          <addr-line>Viale delle Scienze, Edificio 6, 90129 Palermo</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>This technical report illustrates the system developed by the CHILab team for the HODI shared task at EVALITA 2023. The key idea of the method we propose for HODI Subtask A - Homotransphobia detection - is to develop different systems arranged as suitable combinations of a Pre-Trained Language Model (PTLM) for embedding extraction, a neural architecture for further elaboration of the embeddings, and a classifier. In particular, dense layers, LSTM, BiLSTM and Transformers were used as neural architectures. The best performing system among the ones investigated in this report couples embeddings extracted via AlBERTo with a Transformer, reaching a macro-F1 score of 0.753.</p>
      </abstract>
      <kwd-group>
        <kwd>homotransphobia detection</kwd>
        <kwd>transformer</kwd>
        <kwd>language model</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Warning</title>
      <p>This paper contains examples of potentially offensive content.1</p>
    </sec>
    <sec id="sec-2">
      <title>2. Introduction</title>
      <p>The increasing interest in gender-inclusive and nondiscriminatory language has its counterpart in the hate speech that is spreading widely on social networks, particularly against the LGBTQIA+ community. The NLP community is currently involved in developing systems for hate speech detection, as in MAMI (Multimedia Automatic Misogyny Identification) [2] and EDOS (Explainable Detection of Online Sexism) [3], where the focus was on the detection of misogyny and sexism; these datasets, however, are focused neither on Italian nor on detecting hate speech against people from the LGBTQIA+ community.</p>
      <p>This paper introduces the architecture proposed by the CHILab team for the EVALITA 2023 campaign [4], and in particular for the Homotransphobia Detection in Italian task (HODI Subtask A - Homotransphobia detection) [5]. The general approach relies on encoding the text into suitable word embeddings that are processed via neural architectures such as LSTM, BiLSTM or Transformers. Finally, an output classifier detects the presence of homotransphobic content.</p>
      <p>We conceived our pipelines as “minimalist” architectures. No generative models [6, 7] were considered in this respect to derive embeddings. Moreover, we decided not to fine-tune our PTLMs, to stress the use of light networks that can be trained with low computing resources. Finally, we set up a unique approach for all the tasks in which we participated at EVALITA 2023.</p>
      <p>The paper is arranged as follows: Section 3 reports a description of our systems along with data pre-processing, while results are reported and discussed in Section 4. Concluding remarks are in Section 5.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Description of the system</title>
      <p>The data set provided by the HODI organizers contains 6000 Italian tweets annotated according to the presence of homotransphobic content. Since the training set released for the competition was made up of 5000 samples, it was randomly split into a training and a validation set using an 80-20 ratio, resulting in 4000 and 1000 samples respectively.</p>
      <sec id="sec-4-1">
        <title>3.1. Pre-processing</title>
        <p>The [URL] tag, mention references, and retweet notes were removed since they were not considered meaningful: in particular, mentions refer to anonymized accounts, so they add no special information. This was done after an analysis of the most cited words and hashtags2. As reported in Table 1, the [URL] tag is the most frequent token in both classes and adds no information, just like the anonymized mentions, which are not reported.</p>
        <p>During this analysis it was interesting to notice that the most cited words are slurs directed at LGBTQIA+ members. Although a first idea for approaching the task was to rely on such words, the analysis shows clearly that slurs are not a good indicator of homotransphobic content. Slurs, in fact, are widely used</p>
      </sec>
    </sec>
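    <p>The 80-20 random split described in Section 3 can be sketched as follows. This is a minimal illustration with Python's standard library; the seed, the ratio argument and the variable names are assumptions for the sketch, not details from the paper:</p>
    <preformat>
```python
import random

def train_val_split(samples, val_ratio=0.2, seed=42):
    """Randomly split a list of samples into train and validation sets."""
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    rng.shuffle(indices)
    n_val = int(len(samples) * val_ratio)
    val_idx = set(indices[:n_val])
    train = [s for i, s in enumerate(samples) if i not in val_idx]
    val = [s for i, s in enumerate(samples) if i in val_idx]
    return train, val

# 5000 released training samples -> 4000 train / 1000 validation
tweets = ["tweet_%d" % i for i in range(5000)]
train_set, val_set = train_val_split(tweets)
```
    </preformat>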
    <sec id="sec-5">
      <title>Note 1: profanities have been obfuscated with PrOf (https://github.com/dnozza/profanity-obfuscation) [1]</title>
    </sec>
    <sec id="sec-6">
      <title>Note 2: for this analysis all the words were reported in their lower case form</title>
      <p>from the LGBTQIA+ people as a self-definition method, suggesting a (re-)appropriation of the term itself [8]; obviously, tweets of this kind cannot be considered homotransphobic, and the slur word loses its negative connotation, as in the tweet:</p>
      <p>firmato una fr*cia in sessione :( 3</p>
      <p>Here the term fr*cia does not carry any negative connotation. Therefore, no word-dependent consideration about the polarization of homotransphobic speech can be made: the presence of slur words does not necessarily convey negative content, i.e. slurs cannot be regarded as representative elements for separating the classes. The same considerations hold for the hashtags, as reported in Table 2, where the most frequent ones are neutral words. Hence, the hashtag symbol was removed and the subsequent word was kept along with its meaning inside the tweet.</p>
      <p>Similar considerations were made for emojis: also in this case, a strong polarization in the use of emojis was not found, in particular for the ones most associated with disgust and hate (Table 3). Since emojis are widely used in social media communication, they were kept. Based on the statistics reported in Table 3, the emoticons contained in the data set were manually substituted with the corresponding most frequent emoji. As an example, the “:(” emoticon was translated into an emoji even if the correspondence is not exact. This approach does not inject bias into the data set, as the different emoticons were very few, while their rough meaning is preserved, thus avoiding treating them as mere sequences of punctuation marks. No further elaboration was made over the tweets: words were not reported to their lower case form, thus allowing a more accurate extraction of</p>
    </sec>
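    <p>The pre-processing steps described in Section 3.1 (removing [URL] tags, mentions and retweet notes, stripping the hashtag symbol while keeping the word, and mapping the few emoticons to an emoji) can be sketched as follows. The exact tag spellings and the emoticon-to-emoji table are assumptions for illustration, not taken from the released data:</p>
    <preformat>
```python
import re

# Assumed emoticon table: the paper maps ":(" to its most frequent
# corresponding emoji; the exact glyph used here is an illustrative choice.
EMOTICON_MAP = {":(": "\U0001F61E"}

def clean_tweet(text):
    """Drop [URL] tags, @mentions and RT notes; keep hashtag words as-is."""
    text = text.replace("[URL]", " ")
    text = re.sub(r"\bRT\b", " ", text)   # retweet notes
    text = re.sub(r"@\w+", " ", text)     # anonymized mentions
    text = text.replace("#", "")          # keep the word, drop the symbol
    for emoticon, emoji in EMOTICON_MAP.items():
        text = text.replace(emoticon, emoji)
    # No lower-casing: case-sensitive PTLMs see the original casing.
    return re.sub(r"\s+", " ", text).strip()
```
    </preformat>
    <p>For instance, clean_tweet("RT @user ciao #pride [URL] :(") keeps only the words ciao and pride, followed by the substituted emoji.</p>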
    <sec id="sec-7">
      <title>Note 3: “signed by a queer during the exam session” (translation of the tweet above)</title>
      <p>[Tables 1 and 2, listing the most frequent words and hashtags per class, could not be recovered from the source. Recoverable entries - words: culo, rotto, url, gay, c*zzo, r*cchione, fare, caghino, solo, me; hashtags: pride, prelemi, eurovision, gay, tellonym, pridemonth, dazn, meloni, omofobia, escita; column headers: “NH tweets”, “freq”.]</p>
      <p>Note 4: https://huggingface.co/docs/transformers/index</p>
      <p>embeddings for the case-sensitive PTLMs. As for emojis, uppercase text has a specific meaning in social media communication in terms of prosody and emotion interpretation [9, 10].</p>
      <sec id="sec-7-2">
        <title>3.2. Network architectures</title>
        <p>Different models were developed that share the same macro structure shown in Figure 1. The key idea was to stress, as much as possible, existing neural architectures for sequence processing, namely LSTM [11], BiLSTM and Transformers [12]. These architectures are used to further process the extracted embeddings.</p>
        <p>After pre-processing, the input sentences were padded to h + 2 tokens, where h is the size of the longest sentence and the two remaining tokens are the [CLS] and [SEP] ones. Either a pre-trained language model or a static context-free embedding model was used for embedding extraction. In the latter case, fastText [13] was used, which generates a 300-dimensional embedding, while a 768-dimensional embedding is obtained as usual from the different PTLMs. We used the following encoder-based language models in the experiments: BERT base multilingual cased [14], BERT base italian uncased [15], XLM-RoBERTa [16] and AlBERTo [17], provided by the HuggingFace Transformers library4. The embeddings were extracted from the last layer of the PTLMs without fine-tuning. Fine-tuning in these configurations is an option that was not taken into account, since the main idea is to stress the use of light networks that can be trained with low computing resources.</p>
        <p>The extracted sequence of embeddings is fed into a neural module that consists of an LSTM, a BiLSTM or a Transformer5. The output feature vector has the same size as the word embedding, with the exception of the BiLSTM, which generates a double-length output. Finally, the feature vector is passed to a classifier made of either 300 or 768 linear units, depending on the length of the embedding, and a sigmoidal output to achieve binary classification (Figure 1.a). Some experimental configurations add an extra ReLU dense layer of exactly the same size before the aforementioned classifier. These architectures are referred to as LSTM-Deep, BiLSTM-Deep and Transformer-Deep (Figure 1.b).</p>
        <p>The illustrated architectures were trained only on the given data set, using a machine equipped with two Intel Xeon E5 CPUs, 96GB RAM and an NVIDIA TITAN Xp GPU with 12GB RAM. Hyperparameters were selected as follows: dropout values in {0.1, 0.2}, batch size 32, Adam optimizer [18] with learning rate 0.01, and a Binary Cross Entropy loss. Models were trained for a maximum of 1000 epochs with a patience value of 50.</p>
        <p>Different feature extractors were implemented using 1, 2 or 3 LSTM/BiLSTM/Transformer layers, but the best results were obtained by the single-layer feature extraction modules. In addition, the developed models are relatively small, with trainable parameters ranging from 1M to 10M.</p>
      </sec>
      <sec id="sec-7-3">
        <title>4. Results</title>
        <p>The best models during the evaluation window6 were BERT-it/Transformer (run 1), AlBERTo/Transformer-Deep (run 2) and AlBERTo/LSTM (run 3), and they placed at the bottom of the rank, below the baseline [5]. Due to an internal error in the code of the training procedure, the submitted results are intrinsically wrong; for this reason, we repeated all the experiments using the correct architecture after the release of the golden labels. An overview of all the developed models is reported in Table 4, while Table 5 shows the submitted runs, their fixed counterparts and the baseline value. In both tables the results refer to the macro-F1 score over the test set and, although all possible configurations were run, in Table 4 we report only the significant architectures, i.e. the configurations that placed above the baseline. The results show that the AlBERTo/Transformer architecture with a two-dense-layer classifier (Transformer-Deep) has the best performance, and it would be expected to rank at the 10th place on the leaderboard.</p>
        <p>Moreover, LSTM-Deep and BiLSTM models exhibit comparable performance: bi-directional sequence processing compensates for the reduced classifier capacity. In general, the Transformer-Deep architectures performed better than the Transformer ones.</p>
        <p>As expected, only the models based on fastText benefit from removing the stop words. The models using AlBERTo and BERT-it achieved almost the best results both in the training phase and in the evaluation, because the network can take advantage of PTLMs that are specifically fine-tuned on the target language. In particular, AlBERTo was trained on a corpus of Italian tweets.</p>
      </sec>
      <sec id="sec-7-1">
        <title>4.1. Error analysis</title>
        <p>As suggested by the organizers of the shared task, an error analysis was performed, particularly on the tweets that were mis-classified by the models reported in Table 4 that perform better with reference to the baseline. All classifiers agreed incorrectly on 40 tweets, 80% of which were homotransphobic. Thanks to a direct analysis of their content, the following considerations can be made.</p>
        <p>The very first consideration is that the majority of the mis-classified tweets contain slurs. As shown in Section 3.1, slur words are widely used by LGBTQIA+ people as self-reference without any discriminatory intent, so an automatic classifier may not recognize these shades of meaning, as in:</p>
        <p>Fanculo Dolce &amp; Gabbana non metto la roba fr*cia7</p>
        <p>Moreover, many non-homotransphobic tweets actually share some linguistic similarities with the homotransphobic ones:</p>
        <p>DI ANORMALE c’è solo che una cripto ch*cca repressa e omofoba quale lei è #Pillon sia miserabile Senatore della Repubblica pagato dagli italiani e che peggio getta discredito sulla nostra Nazione con esternazioni quotidiane di puro, spregevole letame.8</p>
        <p>Here some hateful content is directed towards a person who is considered homotransphobic. In such cases, the presence of hate speech is correctly detected, but it does not meet the homotransphobic requirement.</p>
      </sec>
    </sec>
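    <p>The input layout described in Section 3.2 (a [CLS] token, the sentence, a [SEP] token, padded to h + 2 positions) and the size of the feature vector produced by the neural module can be sketched as follows; the [PAD] token name, whitespace tokenization and the helper names are illustrative assumptions:</p>
    <preformat>
```python
def pad_sentence(tokens, h, pad="[PAD]"):
    """Wrap a token list with [CLS]/[SEP] and pad it to h + 2 positions,
    where h is the length of the longest sentence in the data set."""
    tokens = tokens[:h]
    padded = ["[CLS]"] + tokens + ["[SEP]"]
    padded += [pad] * (h + 2 - len(padded))
    return padded

def feature_dim(module, embed_dim):
    """Feature-vector size after the neural module: the BiLSTM doubles the
    embedding length (300 for fastText, 768 for the PTLMs), while the LSTM
    and the Transformer preserve it."""
    return 2 * embed_dim if module == "bilstm" else embed_dim
```
    </preformat>
    <p>With h = 5, a two-token sentence becomes a 7-position sequence, and a BiLSTM over 768-dimensional AlBERTo embeddings yields a 1536-dimensional feature vector.</p>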
    <sec id="sec-8">
      <title>Note 7: “Fuck Dolce &amp; Gabbana, I do not wear f*g stuff” (translation of the tweet above)</title>
      <p>Note 8: “ABNORMAL: there is only a repressed and homophobic crypto queer like you; #Pillon is a miserable Senator of the Republic, paid by the Italians, and worse, discredits our nation with daily utterances of pure, despicable manure” (translation of the tweet above).</p>
      <sec id="sec-8-1">
        <title>Acknowledgments</title>
        <p>This work is supported by the PO FESR 2014-2020 grant n. 086201000543, “SCuSi - Smart Culture in Sicily”.</p>
      </sec>
      <sec id="sec-8-2">
        <title>5. Conclusion</title>
        <p>This paper reported the architectures developed by the CHILab team for HODI Subtask A, promoted at the EVALITA 2023 campaign. Our models show that a relatively small classical pipeline, made of embedding extraction plus further neural elaboration, can reach satisfactory performance in homotransphobic speech detection without the need for fine-tuning PTLMs and using few computational resources. The use of such a “minimalist” architecture is intended to allow for the future development of compact explainable models, where explicit linguistic knowledge is injected into the network to improve its performance.</p>
      </sec>
      <sec id="sec-8-3">
        <title>References</title>
        <p>doi:10.5281/zenodo.4263142.</p>
        <p>[16] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, CoRR abs/1911.02116 (2019). URL: http://arxiv.org/abs/1911.02116. arXiv:1911.02116.</p>
        <p>[17] M. Polignano, P. Basile, M. de Gemmis, G. Semeraro, V. Basile, AlBERTo: Italian BERT Language Understanding Model for NLP Challenging Tasks Based on Tweets, in: Proceedings of the Sixth Italian Conference on Computational Linguistics (CLiC-it 2019), volume 2481, CEUR, 2019. URL: https://www.scopus.com/inward/record.uri?eid=2-s2.0-85074851349&amp;partnerID=40&amp;md5=7abed946e06f76b3825ae5e294ffac14.</p>
        <p>[18] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>