<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>SEPLN-</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Analysis on BERT-based Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Pol Pastells</string-name>
          <email>pol.pastells@ub.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wolfgang S. Schmeisser-Nieto</string-name>
          <email>wolfgang.schmeisser@ub.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Simona Frenda</string-name>
          <email>simona.frenda@unito.it</email>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mariona Taulé</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Centre de Llenguatge i Computació (CLiC), Universitat de Barcelona</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Institute of Complex Systems (UBICS), Universitat de Barcelona</institution>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>aequa-tech</institution>
          ,
          <addr-line>Turin</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>40</volume>
      <fpage>24</fpage>
      <lpage>27</lpage>
      <abstract>
        <p>Conversational context plays a pivotal role in disambiguating messages in human communication. In this study, we investigate the impact of contextual information on detecting stereotypes related to immigrants using various BERT-based models. We use two Spanish corpora containing news comments and tweets, together with their conversational threads, annotated with stereotypes related to immigrants in Spain. The results show that the influence of context on stereotype detection varies across different models, corpora and context levels. Although context can enhance performance in specific scenarios, it does not consistently improve stereotype detection across all the levels of contexts. Our comprehensive evaluation underscores the complex relationship between context and stereotype identification when we use BERT-based Language Models. In particular, we found that the number of texts benefiting from contextual analysis may be too limited for the models to effectively learn. Warning: This paper contains derogatory language that may be offensive to some readers.</p>
      </abstract>
      <kwd-group>
        <kwd>Stereotype Detection</kwd>
        <kwd>Context</kwd>
        <kwd>Conversational Thread</kwd>
        <kwd>Immigration</kwd>
        <kwd>BERT-based Models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Language that stigmatizes vulnerable social groups such as immigrants has increased during the last decade [1]. Stereotypes are oversimplified, generalized beliefs or perceptions about particular groups of people, often based on prejudices or misconceptions, and social networks have facilitated and aggravated the spread and reinforcement of these stereotypes about marginalized groups.</p>
      <p>The identification of negative stereotypes related to immigrants is not simple and involves knowledge of the situation of the analyzed society and an understanding of the conventional meanings and secondary references used by speakers in that society. These meanings and references can be expressed at the discourse level through, for instance, anaphora and ellipsis. In human communication, context helps to disambiguate and narrow down the interpretation of a particular message. This is observed in the percentage of data2 that requires knowledge of the context to identify the presence of negative stereotypes related to immigrants in Spain: each annotator needs to read the context.</p>
      <p>Given the critical role of context and the pervasive impact of stereotypes on marginalized communities, we investigated the impact of context on the detection of stereotypes related to immigrants. For human annotators, detecting stereotypes in textual data is a complex task that requires understanding the underlying context and nuances, especially if the stereotype is implicit, that is, when the stereotype is not directly stated in the text and an inference process is needed to interpret it. Example (1) shows a tweet from the Multilingual Stereotypes Corpus (MSC) [2] with an implicit stereotype that requires contextual information to classify. It shows the gold standard annotation for the tweet and its context.</p>
      <p>(1) MSC Tweet: ¿Y quién sufre los recortes? Los que hemos pagado impuestos toda la vida.3
'And who suffers the cuts? Those of us who have paid taxes all our lives.'
Annotation: [+stereotype] [+implicit] [+contextual]
Previous tweet: RECETA PARA COCTEL XENOFÓBICO. Toma una medida de "Un ilegal tiene los mismos derechos que tú, pero sin pagar impuestos". Añade una medida de "Que entren todos". Agita bien, y ya tienes un partido anti-inmigración a la europea. Servir bien caliente.
'RECIPE FOR A XENOPHOBIC COCKTAIL. Take a measure of "An illegal has the same rights as you, but without paying taxes". Add a measure of "Let them all in". Shake well, and you already have a European-style anti-immigration party. Serve while hot.'
Annotation: [+stereotype] [+implicit] [−contextual]
Fake news: Costear la sanidad de los inmigrantes ilegales cuesta 1.100 millones de euros.
'Paying for the health care of illegal immigrants costs 1.1 billion euros.'</p>
      <p>Despite the importance of context in human communication and the evident challenge of stereotype identification, there is a noticeable gap in the literature concerning the influence of context on stereotype detection in Natural Language Processing (NLP). Although there is a growing body of research in related areas such as irony [3] and hate speech detection [4], the role of context in resolving stereotype identification has been largely overlooked.</p>
      <p>In this paper, we propose adding context to fine-tuned BERT-based models to observe whether discursive context plays a role in interpreting and disambiguating a message in NLP, as it does in natural language. We use the only two existing corpora in Spanish annotated with stereotypes against immigrants that also contain context information: DETESTS [12], consisting of online news comments, and MSC, consisting of tweets. Both corpora feature texts embedded in conversational threads, where the contextual utterances include: 1) preceding sentences, 2) previous comments/tweets, 3) the first comment/tweet of the thread, and 4) the wider discourse, such as the news title or the fake news (or hoax) that generates the conversation.</p>
      <p>We propose adding these different levels of context after the [SEP] token of the models. We evaluate the quantitative performance of the models and the linguistic characteristics of the texts containing stereotypes, to understand their impact on the models' performance.</p>
      <p>The remainder of this paper is as follows: Section 2 reviews related work in the field of stereotype detection. Section 3 details the methodology, including the dataset, experimental setup, and evaluation metrics. Section 4 presents the experimental results and quantitative analysis, followed by a qualitative analysis in Section 5. Finally, Section 6 concludes the paper and outlines potential directions for future research.</p>
      <p>© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.</p>
      <p>1Code available on GitHub: https://github.com/pastells/context-aware-stereotype-detection</p>
      <p>2This range of percentage is extracted from the annotation of the MSC.</p>
      <p>3All examples have been manually translated.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>With the development of virtual communications, such as social media, chats and online news comments, there has been a growth of interactions, accompanied by an increase in abusive language, such as stereotypes.</p>
      <p>Stereotypes are cognitive resources that humans use to organize the reality they live in and to categorize social groups that they perceive as different. Social groups undergo a categorization process, in which the features associated with that group are attributed to all of its members [5]. Stereotypes are sets of exaggerated beliefs about a social group [6].</p>
      <p>Several studies have been undertaken to mitigate this phenomenon. For instance, every year there are more shared tasks oriented at solving automatic stereotype detection affecting various target groups, such as women and immigrants [7, 8, 9, 10, 11, 12]. Other works have taken into account the different textual expressions in which stereotypes appear, especially focusing on implicit forms of stereotypes that are spread through discourses. [13] propose a conceptual formalism to model pragmatic frames in which people project stereotypes onto others. [14] extract microportraits, i.e., descriptions, of Muslims from texts. [15] present a corpus of stereotypes related to immigrants from mentions at the Spanish Parliament. Nevertheless, to our knowledge, the role of conversational context has not yet been studied within the phenomenon of stereotypes, although there are some studies on context-aware models for the detection of abusive language, with rather inconclusive results.</p>
      <p>[16] evaluate toxic language in conversational threads from Wikipedia using two types of GRU, CNN and LSTM models, one trained with single comments and another one considering its context. However, the context-sensitive models did not significantly outperform the single-comment ones. In [17] the authors tried a range of different approaches to add context to LSTM, CNN and BERT-like models for the detection of hate speech, all with negative or neutral results. The authors hypothesized that context-sensitive comments are not frequent enough for the models to learn from them. Therefore, the majority of comments would not need context for the correct classification, and those that would require context would not get sufficient attention.</p>
      <p>[18] use a dataset of Facebook posts to identify hate speech with a Dutch pre-trained language model, BERTje. Contrary to the previous works, they obtain positive results when training context-aware models when those contexts are controlled and manually annotated as relevant for the classification of hate speech. In the same line of positive results, [19] explore context-aware models for the detection of hate speech. Their dataset consists of Twitter posts from Argentinian news outlet accounts. For their experiments, they trained BETO, a BERT-based model in Spanish, concluding that some contextual information is beneficial for hate speech detection. In particular, the smallest context, which corresponds to the news title tweet, gave the best results.</p>
      <p>In relation to the length of contexts, [20] present their participation in a shared task on context-aware sarcasm detection using BiLSTM, BERT, and SVM classifiers on Twitter and Reddit posts. The models were trained with five scenarios: zero context, the last sentence of the context, two sentences, three sentences, or all the sentences of the context. Likewise, we use different types of contexts, described in Section 3.1. They obtained the best results when only the last sentence was provided.</p>
      <p>From this related work, to our knowledge, there are no works so far that inject this type of context into stereotype detection in Spanish; however, we are aware of the inconclusive results that previous studies show.</p>
    </sec>
    <sec id="sec-11">
      <title>3. Methodology</title>
      <p>To analyze the models' behavior when provided with different levels of context, we used two existing datasets annotated with the presence of negative stereotypes regarding immigrants. In this section, we describe the used datasets and models.</p>
      <sec id="sec-11-1">
        <title>3.1. Datasets</title>
        <p>We used two Spanish corpora annotated with binary values indicating the presence of immigration stereotypes and whether the stereotypes are expressed explicitly or implicitly in the text. Table 1 summarizes the two corpora.</p>
        <p>DETESTS [12] consists of sentences extracted from comments posted in response to news articles in Spanish newspapers (such as ABC, elDiario.es and El Mundo) and discussion forums (such as Menéame). The articles were manually selected based on their immigration-related subject and potential toxicity. Each comment was segmented into sentences. The comment to which every sentence belongs and its position within the comment and thread are indicated in the corpus. Each sentence was annotated by three trained annotators, who had access to the entire comment the sentence belonged to when annotating, along with the news title and the rest of the comment thread. Example (2) shows an implicit stereotype and its contexts:</p>
        <p>(2) DETESTS Sentence: Y las violaciones.
'And the rapes.'
Annotation: [+stereotype] [+implicit]
Previous comment: Y que siga la fiestaaaaa!!!!
'And let the party continue!!!!'
News title: Inmigrantes ilegales paralizan el aeropuerto de Palma al huir de un avión marroquí.
'Illegal immigrants paralyze Palma airport when fleeing a Moroccan plane.'</p>
        <p>MSC [2] is a corpus of Twitter posts (tweets) responding to hoaxes that disseminated fake news against immigrants in newspapers or social media. The tweets were annotated by three trained annotators for the presence of stereotypes and their implicitness. Furthermore, during the annotation process, annotators considered the need to look into the context to decide if there was a stereotype. In those cases, the tweet was annotated as contextual. Out of the 1,604 tweets with stereotypes, 590 (37%) were annotated as contextual, with 253 (16%) of this subset also categorized as implicit. An example of this last case is shown in Example (1). MSC differs from DETESTS in that the corpus does not contain the full Twitter threads, but rather a subset of them (previous tweet, first tweet and the hoax). Therefore, the annotators did not have access to the entire conversational context, as they did in DETESTS.</p>
        <p>Another notable distinction between the texts in both corpora is that DETESTS comprises individual sentences, with a median length of 13 words4, whereas MSC consists of full, unsegmented tweets, with a median of 26 words5.</p>
        <p>The corpora are structured into threads, where the first direct comment or tweet (text from now on) on the article or post is the root of the thread. Each text can then have multiple responses, forming a tree structure. We identified a range of different contexts to which the annotators had access, in order to provide them to the models. We structured the contexts into four levels, summarized in Table 2:</p>
      </sec>
    </sec>
    <sec id="sec-12">
      <title>1. Previous sentences in the same comment (level 1).</title>
      <p>This level is only available for DETESTS, as MSC tweets were not split into sentences. Additionally, this level does not apply to the first sentence of each comment, which constitutes 45% of the sentences in DETESTS.</p>
      <p>2. Previous text in the thread (level 2). This level is absent for the first comment in each thread.</p>
      <p>4With Q1 = 7 and Q3 = 20.</p>
      <p>5With Q1 = 14 and Q3 = 41.</p>
      <p>Even though the contexts for DETESTS are formed by various sentences, they are still smaller (median of 21 words for previous sentences, with Q1 = 13 and Q3 = 41) than the MSC contexts (median of 34 words for the root text, with Q1 = 22 and Q3 = 49). This is due to the distribution of the comment threads, most of them having few comments.</p>
      <sec id="sec-12-1">
        <title>3.2. Models</title>
      </sec>
    </sec>
    <sec id="sec-13">
      <p>We fine-tuned three pretrained models from the BERT family for the classification task of stereotype detection. The models were trained to output a binary label: 0 for no stereotype, and 1 for stereotype. We are aware of the subjectivity of this task [21]; however, considering the evaluative scope of this work, we focused on the gold standard version of the above-mentioned corpora.</p>
      <p>We used two different models pretrained in Spanish and also multilingual BERT [22]. The selected models, obtained from the Huggingface transformers library
(https://huggingface.co/), were:</p>
      <p>BETO dccuchile/bert-base-spanish-wwm-cased [23],
based on the BERT-Base architecture, was trained with
the Whole Word Masking technique.</p>
      <p>MarIA PlanTL-GOB-ES/roberta-base-bne [24], based
on the RoBERTa-Base model, pre-trained using 570 GB of
Spanish texts, extracted from the Spanish Web Archive
crawled by the National Library of Spain.</p>
      <p>M-BERT [22] google-bert/bert-base-multilingual-cased,
based on BERT-Base, pre-trained on the top 104
languages with the largest Wikipedia using the original
masked language modeling objective.</p>
      <p>For each of the three models and both DETESTS and MSC, we fine-tuned a model without context (as a baseline) and a different model incorporating each possible context level. To add the context to the input, we used the sequence text + [SEP] + context, where [SEP] is the special BERT token that is usually used to split sequences in BERT-based models.</p>
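      <p>As a concrete illustration, the input composition could be sketched as follows (a minimal sketch, not the authors' code; the helper name and the plain string join are our assumptions — in practice a HuggingFace tokenizer called with a sentence pair, e.g. tokenizer(text, context, truncation=True, max_length=512), inserts [SEP] automatically):</p>

```python
def build_input(text, context=None):
    """Compose the model input as `text + [SEP] + context`.

    The no-context baseline receives the text alone; each context-aware
    model receives the text followed by one context level after [SEP].
    """
    if context is None:
        return text
    return f"{text} [SEP] {context}"

# Baseline vs. level-4 (news title) input for a DETESTS sentence:
baseline_input = build_input("Y las violaciones.")
context_input = build_input("Y las violaciones.", "Inmigrantes ilegales paralizan el aeropuerto de Palma")
```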
      <p>To address the issue of missing contexts during the fine-tuning process, we employed a hierarchical filling strategy. Specifically, if a lower-level context (e.g., level 1) was absent, it was replaced with the next highest level (e.g., level 2). If both level 1 and level 2 were lacking, they were both filled with level 3, and so on. This approach was taken into consideration during the qualitative analysis, ensuring that any observed improvements were attributed to the filled context rather than the missing one.</p>
      <p>Level 2 is missing for 45% of comments in DETESTS and 8% of tweets in MSC. The remaining two context levels are: 3. Root text (level 3). This level does not exist for the first comment of each thread and is identical to the previous comment for the second comment on each thread. It is missing in 45% of comments and 16% of tweets. Note that DETESTS has full threads, so the comments missing level 2 and the ones missing level 3 are the same, while for MSC they are different, although overlapping, sets. 4. News title for DETESTS or fake news text for MSC (level 4). This level is always present and differs from the others in that it does not represent an instance of the dataset, but an external reference.</p>
      <p>Both corpora were split in a stratified manner to maintain the same proportion of stereotypes, implicitness and stereotype topics6 [12].</p>
      <p>To prevent variability in the results, we decided to use 50 random seeds for training the models and report the average of their results. The data split was the same for all seeds. All models were trained7 with a 512 token window, using batches of 32 texts and evaluating the results every 50 steps, with early stopping.</p>
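      <p>The hierarchical filling strategy described above can be sketched as follows (a minimal illustration under our reading of the text; the function name and the dictionary layout are our assumptions, not the authors' code):</p>

```python
def fill_context(contexts, requested_level):
    """Return the context for `requested_level`, falling back to the
    next highest level when it is missing.

    `contexts` maps a level (1-4) to a string or None; level 4 (the
    news title for DETESTS or the hoax text for MSC) is always present.
    """
    for level in range(requested_level, 5):
        if contexts.get(level) is not None:
            return contexts[level]
    raise ValueError("level 4 context should always be present")

# The first sentence of a thread's root comment lacks levels 1-3,
# so a level-1 request falls back all the way to the news title.
example = {1: None, 2: None, 3: None, 4: "news title"}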
      <sec id="sec-13-1">
        <title>4. Quantitative Analysis</title>
        <p>We first compared the models with and without context using various metrics. Figures 1 and 2 show the F1 metric, precision, and recall for both the negative and the positive classes, i.e., the texts with or without stereotypes in the gold standard annotation. The bars represent the median across 50 seeds, with the error bars indicating the first and third quartiles. Furthermore, arrows mark a p-value smaller than 0.05 in a Welch's t-test for each metric, comparing the 50 seeds with and without context. The direction of the arrows denotes an improvement (up) or a deterioration (down) with respect to the model without context.</p>
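        <p>The per-metric comparison can be reproduced with a Welch's t-test over the two samples of 50 seed scores. The sketch below (our own illustration, not the authors' code) computes the t statistic and the Welch–Satterthwaite degrees of freedom from scratch; in practice scipy.stats.ttest_ind(a, b, equal_var=False) returns the p-value directly:</p>

```python
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and degrees of freedom for two samples with
    possibly unequal variances (e.g. one metric over 50 seeds with
    context vs. 50 seeds without)."""
    va = variance(a) / len(a)  # squared standard error of sample a
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / (va + vb) ** 0.5
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df
```

The p-value is then obtained from the t-distribution with `df` degrees of freedom; a value below 0.05 corresponds to an arrow in Figures 1 and 2.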
        <p>We further examined the texts whose predictions
changed upon adding context, in order to focus on the
differences between the models. Given the numerous
seeds used in our models, we identified texts with
consistent classification changes in more than 65% of the
seeds. For instance, for true positives (TP), we
considered a text classification to have changed if more than
65% of the seeds without context failed to classify it as
a stereotype, while more than 65% of the models with
a specific context correctly identified it as a stereotype.
Moreover, we examined all potential changes, including
TP, true negatives (TN), false positives (FP), and false
negatives (FN). These cases are shown in Tables 3 and 4
and are the same ones subjected to qualitative analysis
in Section 5.</p>
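        <p>The 65% consistency criterion can be expressed as below (a simplified sketch; the function and variable names are ours, and predictions are encoded as 1 = stereotype, 0 = no stereotype):</p>

```python
def consistent_change(preds_without, preds_with, gold, threshold=0.65):
    """Classify a text whose prediction flips consistently across seeds.

    `preds_without` / `preds_with` hold the per-seed predictions (0/1)
    of the models without and with a given context level. Returns
    'TP', 'TN', 'FP', 'FN', or None when the flip is not consistent
    in more than `threshold` of the seeds on both sides.
    """
    frac_pos_without = sum(preds_without) / len(preds_without)
    frac_pos_with = sum(preds_with) / len(preds_with)
    went_positive = frac_pos_without < 1 - threshold and frac_pos_with > threshold
    went_negative = frac_pos_without > threshold and frac_pos_with < 1 - threshold
    if gold == 1 and went_positive:
        return "TP"  # context helps: missed before, detected now
    if gold == 0 and went_negative:
        return "TN"  # context helps: false alarm before, correct now
    if gold == 0 and went_positive:
        return "FP"  # context hurts: correct before, false alarm now
    if gold == 1 and went_negative:
        return "FN"  # context hurts: detected before, missed now
    return None
```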
      </sec>
    </sec>
    <sec id="sec-14">
      <title>DETESTS predictions</title>
      <p>Initially, we looked at the difference in the F1 metric for the negative and the positive classes. To provide a more comprehensive analysis, we also added the precision and recall metrics. This was crucial, as in some instances a consistent F1 value obscured variations in precision and recall, either in terms of improvement or decline. These metrics are presented in Figure 1.</p>
      <p>6Although not used for this work, the corpora were also annotated with topics.</p>
      <p>7We used a single GeForce RTX 4090 GPU, with 24 GB of RAM.</p>
    </sec>
    <sec id="sec-15">
      <p>For BETO, there was a slight, yet statistically significant, deterioration in performance for the negative class. When evaluating the F1 metric for the positive class, level 1 is the only context that improves. The enhancement was driven by an increase in recall, but counterbalanced by a decrease in precision. A comparable trend was observed for levels 2 and 3. BETO showed an increase in FP cases and a drop in FN, indicating a tendency to classify more sentences as containing stereotypes when some context is provided.</p>
      <p>Models using the news title as context show a wide variability across the two classes in BETO and M-BERT, as evidenced by the disparity between the first and the third quartiles, with an overall worsening tendency.</p>
      <p>In contrast, MarIA behaves differently. It showed a general decline in performance on the F1 metric for the positive class, primarily due to an increased classification of sentences as not containing stereotypes, except when the model, informed with the news title context, reports a significant improvement. Lastly, M-BERT's performance, providing the context, shows no significant change in all scenarios.</p>
      <p>Table 3 shows the individual texts that change for each model and context, grouped by category, according to the predictions of the models with context, similarly to a confusion matrix. The arrows denote an improvement (TN and TP) or a deterioration (FN and FP). FP and FN changes are misclassified texts with context that were correctly classified without context, and therefore cases where the context does not help the models. TP and TN changes, instead, are the instances where the context helps the models make the correct prediction. For example, 163 is the number of sentences that did not have stereotypes in their gold label, but were classified as having one (FP) in more than 65% of the seeds for the BETO model without context. The model with level 1 contexts has 40 more FP (a 25% increase).</p>
      <p>Looking at this table, BETO shows the biggest change in FP, with similar numbers for level 1, level 2 and level 3 contexts. It also shows a slight improvement in TP for the same contexts. This behavior can be explained by the model just tending to classify more texts as stereotypes, in agreement with the metrics in Figure 1.</p>
      <p>MarIA shows a similar behavior, although with the contexts reversed. It tends to classify more sentences as stereotypes when given the news title as context, but not so much for the rest of the contexts. Instead, level 2 and level 3 appear to worsen the negative class, with an increase in FN. M-BERT is the model with the least consistent changes, with only a change of more than 10% in the FP with level 3 context, similarly to MarIA.</p>
      <p>MSC predictions. All classification models became biased toward predicting 0, that is, the models tend to predict fewer stereotypes. This can be seen in Figure 2, with the negative class precision and positive class recall worsening, while the negative class recall and positive class precision tend to improve, except for M-BERT. It is also made evident in Table 4: for all three models, adding any context makes the models' FN increase significantly.</p>
      <p>Similarly to DETESTS, the metrics for the level 4 context, the racial hoax text, had a big variability for BETO's positive class and M-BERT's negative and positive classes.</p>
      <p>On DETESTS, the increased sensibility towards the positive class is justified by an improvement of the recall (Figure 1). The FP cases had in common that their contexts tended to contain stereotypes. Example (3) shows a FP for BETO and M-BERT, where the text was annotated with no presence of stereotypes. However, its context does contain a stereotype:</p>
      <p>(3) DETESTS Sentence: Si aprenden catalán, serán catalanes, quizás catalanistas.
'If they learn Catalan, they will be Catalan, maybe Catalan nationalists.'
Annotation: [−stereotype]
Context: Los detuvo, pero quedarán libres y se irán de rositas. Se quedarán en el país para siempre, se llevarán todo tipo de ayudas y traerán a toda la familia.
'They arrested them, but they will be released and will walk away with ease. They will stay in the country forever, they will take all kinds of aid and they will bring the whole family.'
Annotation: [+stereotype] [+implicit]8</p>
      <p>As in the previous example, the classified texts neither focus on immigrants nor evaluate the in-group regarding immigrants. Instead, the topics of these messages predominantly concern evaluations of the in-group, with conclusions that do not necessarily pertain to the target group. Example (4), a FP for both BETO and MarIA, shows an evaluation and a consequence derived from previous texts. Although the sentence has no stereotype, both the previous sentences and the previous comment contexts contain stereotypes.</p>
      <sec id="sec-15-1">
        <title>5. Qualitative Analysis</title>
        <p>In this section, we present a qualitative analysis of the instances that improved or deteriorated their classification on the models trained with different levels of context, as presented in Tables 3 and 4. Our aim is to gain a deeper understanding of the impact of context on the models' predictions from a linguistic perspective. We describe linguistic patterns by comparing three levels of analysis: by models, by datasets, and by levels of contexts.</p>
        <p>In the predictions on DETESTS (Table 3), we observed an increase of sensibility towards the positive class.</p>
        <p>(4) DETESTS Sentence: Dentro de 20 o 30 años, nuestros hijos y nietos nos maldecirán mil veces por el infierno que les hemos dejado.
'In 20 or 30 years, our children and grandchildren will curse us a thousand times for the hell we have left them.'
Annotation: [−stereotype]
Previous Sentences: y ya es tarde, el Caballo de Troya lo tenemos dentro.
'and it's too late, we have the Trojan Horse within us.'
Annotation: [+stereotype] [+implicit]
Previous Comment: […] Están moviendo los hilos de esta invasión, que aprovechan para usar a los Ilegales como sicarios, para agredir y amedrentar a los españoles de bien. […]
'[…] They are pulling the strings of this invasion, which they take advantage of to use the Illegals as hitmen, to attack and intimidate good Spaniards. […]'
Annotation: [+stereotype] [−implicit]</p>
        <p>Another case of FP, for BETO, was found in Example (5). Even though the text concerns immigrants, with keywords corresponding to the target group, it contains no stereotype according to the annotators. Its context, however, was annotated with a stereotype, even though there is no explicit reference to immigrants. This shows that the model attends to enough tokens from the context to determine the presence of a stereotype, which drives the model to a positive classification.</p>
        <p>(5) DETESTS Sentence: En Francia, el paro es de 15% en la población general y de 40% en la inmigrada.
'In France, unemployment is 15% in the general population and 40% in the immigrant population.'
Annotation: [−stereotype]
Context: Pobres incautos. Salen como locos en vuelo directo a los invoxnaderos a trabajar por 3 € la hora.
'Poor dupes. They leave like crazy on a direct flight to the invoxnaderos9 to work for €3 an hour.'
Annotation: [+stereotype] [+implicit]</p>
        <p>8In fact, both sentences from the previous comment contain an implicit stereotype.</p>
      </sec>
    </sec>
    <sec id="sec-16">
      <p>Nonetheless, out of the eleven DETESTS sentences that were classified as FP by BETO with context levels 1 to 3, only two cases have no stereotypes in any of their contexts. For instance, in Example (6), there is no interpretation of stereotypes either by the human annotators or by the decision of the models without context. However, when adding the context, which was annotated as containing no stereotype, the prediction of the model yielded a FP.</p>
      <p>(6) DETESTS Sentence: Que los pececitos coman cachalote franquista.
'Let the little fish eat Francoist sperm whale.'
Annotation: [−stereotype]
Context: Pues lanza a tu madre.
'Then throw your mother.'
Annotation: [−stereotype]</p>
    </sec>
    <sec id="sec-17">
      <p>
        Furthermore, the opposite phenomenon occurs when
MarIA is fine-tuned: it shows a 24% deterioration in
DETESTS’s FP when the news title is fed as context. It
is worth noting that out of the twelve news articles that
were used to create DETESTS, six of them contained in
their title a word related directly to the target group,
such as immigrant or dinghy, as shown in Example (7). The
misclassified texts belong to five of these conversation
threads with keywords in their title, which might be an
indication that the model was affected by the vocabulary
used.
We observed the instances commonly misclassified as
not containing stereotypes by the majority of the models
(15 instances). We noticed that, in general, the presence
of the hoax as context (level 4) negatively affects the
decision of the model. Additionally, upon further analysis,
we consider that most of these instances contain implicit
expressions, requiring context to be understood, as seen
in Example (8).
(8) MSC Tweet: ...fuerzas políticas, ni policiales, ni legales,
para empezar a resolver la situación creada. Y yo creo
que ni voluntad de hacerlo. Aquello está lejos y a los
peninsulares no les preocupa lo más mínimo. Grave
error; gravisimo. Una vez controlen las islas vendrán
aquí a reclamar...
‘...political forces, neither police nor legal, to begin to resolve the
situation created. And I believe that there is no desire to do so.
That is far away and the peninsular people are not the least bit
worried. Serious mistake; very serious. Once they control the
islands they will come here to complain...’
Annotation: [+stereotype] [+implicit] [+contextual]
Level 2: Canarias ya está “ocupada” por marroquíes y
mauritanos. En las islas orientales, Fuerteventura y Lanzarote,
el número de moros ya es mayor que el de la población
autóctona. Es una estrategia marroquí que empieza a darle
resultados: la toma ’pacífica’ de territorios ...
‘The Canary Islands are already “occupied” by Moroccans
and Mauritanians. On the eastern islands, Fuerteventura and
Lanzarote, the number of Moors is already greater than that of
the native population. It is a Moroccan strategy that is beginning
to give results: the ’peaceful’ seizure of territories...’
      </p>
    </sec>
    <sec id="sec-18">
      <p>Considering this analysis, we plan to investigate further
the role played by the context in future work, exploring
other models and their common behaviors.</p>
      <sec id="sec-18-1">
        <title>6. Conclusions</title>
        <p>
          Taking into account the importance of context during
the identification of stereotypes in online conversational
threads, in this work, we analyzed the impact of
different levels of context on stereotype detection in news
comments and tweets.
In particular, we performed quantitative and
qualitative analyses on predictions obtained with fine-tuned
language models informed with different context levels.
Quantitatively, no general improvement was seen when
adding contextual information after the [SEP] token to
BERT-based models. The results were highly dependent
on the dataset used. In DETESTS, only BETO becomes
more sensitive to stereotypes when some context is
provided, as does MarIA when informed with the news
title text. In MSC, by contrast, the models are biased
towards the negative class. We hypothesize that the
number of texts that benefit from looking at the context
is too small for the models to learn from, as suggested
by the number of contextual-labeled tweets. The models
may also be looking into subtleties other than the
presence of stereotypes.
(7) News Title 1: La otra crisis con la que lidia Ceuta: un
tercio de los contagios son de inmigrantes acogidos.
‘The other crisis that Ceuta is dealing with: a third of the
infections are from received immigrants.’
News Title 2: Una “patera aérea”, una nueva e insólita
manera de entrar en España de forma irregular.
‘A “flying dinghy”, a new and unusual way to enter Spain
irregularly.’
Looking at Table 4, we notice an interesting tendency
related to FN in all the models informed with context.
The model performance worsens if we introduce context,
regardless of the level. To understand the behavior of the
models, we examined the commonly misclassified instances.
9 Word play in which the main word invernadero ‘greenhouse’
is embedded with the name of the far-right party Vox, resulting
in ‘invoxnadero’.
        </p>
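        <p>The text-plus-context input format described above can be sketched schematically. The snippet below is an illustrative stand-in, not the authors' code: a toy whitespace tokenizer mimics how a sentence and its context are packed into a single sequence around the [SEP] token, with segment ids distinguishing the two parts (real BERT tokenizers use subword vocabularies and truncation strategies that this sketch omits).

```python
# Schematic of context-informed input for a BERT-style classifier:
# [CLS] sentence [SEP] context [SEP], with segment ids 0 for the
# sentence and 1 for the appended context.
# Toy whitespace "tokenizer"; a hypothetical stand-in for a real one.

def build_input(sentence: str, context: str = "", max_len: int = 128):
    tokens = ["[CLS]"] + sentence.split() + ["[SEP]"]
    segment_ids = [0] * len(tokens)
    if context:
        ctx_tokens = context.split() + ["[SEP]"]
        tokens += ctx_tokens
        segment_ids += [1] * len(ctx_tokens)
    # Truncate to the model's maximum sequence length.
    return tokens[:max_len], segment_ids[:max_len]

tokens, segments = build_input(
    "En Francia, el paro es de 15% en la inmigrada.",
    context="Pobres incautos.",
)
print(tokens[0], tokens[-1])  # [CLS] [SEP]
print(set(segments))          # {0, 1}
```

With an empty context string the function degenerates to the no-context baseline, which is how the level-0 condition can be simulated with the same pipeline.</p>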
      </sec>
    </sec>
    <sec id="sec-19">
      <p>For example, the context in Example (6) has a negative
sentiment, even though it does not contain a stereotype.</p>
      <p>Future work may require more involved methods of
analysis on the quantitative side, using different
embeddings for the text and the context, or approaches such as
mechanistic interpretability.</p>
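        <p>One alternative mentioned above is to encode the text and the context with separate embeddings rather than packing both into one input. The sketch below is purely illustrative: encode() is a hypothetical stand-in for a sentence encoder (e.g. a frozen [CLS] vector), and the point is only the shape of the joint representation fed to a classifier.

```python
# Dual-encoder sketch: embed comment and context separately, then
# concatenate the two views into one feature vector for classification.

def encode(text: str, dim: int = 4):
    # Toy bag-of-characters embedding (hypothetical stand-in for a
    # real sentence encoder).
    vec = [0.0] * dim
    for i, ch in enumerate(text):
        vec[i % dim] += ord(ch) / 1000.0
    return vec

def joint_representation(sentence: str, context: str):
    # Concatenation keeps the two views distinct, unlike appending the
    # context after [SEP] in a single sequence.
    return encode(sentence) + encode(context)

rep = joint_representation("el paro es de 15%", "Pobres incautos.")
print(len(rep))  # 8
```

A classifier trained on such concatenated vectors could, in principle, weight the context view independently of the text view, which single-sequence [SEP] packing does not make explicit.</p>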
    </sec>
    <sec id="sec-20">
      <title>Limitations</title>
      <p>Our work was exclusively focused on
the Spanish language and employed solely BERT and
RoBERTa models. More advanced generative models,
such as Llama 2 [25] or Mixtral 8x7B [26], may offer
different ways of capturing context.</p>
      <p>Among the various levels of context considered, which
differed between the two corpora, only level 4 was
consistently present. The other levels had to be filled to prevent
the loss of valuable data. Exploring data augmentation
techniques, using synthetic data or curating a dataset
without missing contexts, could be a promising direction
for future research.</p>
    </sec>
    <sec id="sec-21">
      <title>Acknowledgments</title>
      <p>This work was supported by the international project
STERHEOTYPES: STudying European Racial Hoaxes and
sterEOTYPES, funded by the Compagnia di San Paolo and
VolksWagen Stiftung under the Challenges for Europe
call (CUP: B99C20000640007); the SGR CLiC project (2021
SGR 00313), funded by the Generalitat de Catalunya; and
the FairTransNLP-Language project (PID2021-124361OBC33),
funded by MICIU/AEI/10.13039/501100011033/ and
by FEDER, UE.</p>
      <p>[12] A. Ariza-Casabona, W. S. Schmeisser-Nieto, M. Nofre,
M. Taulé, E. Amigó, B. Chulvi, P. Rosso, Overview of
DETESTS at IberLEF 2022: DETEction and classification of
racial STereotypes in Spanish, Procesamiento del Lenguaje
Natural 69 (2022) 217–228.</p>
      <p>[13] M. Sap, S. Gabriel, L. Qin, D. Jurafsky, N. A. Smith,
Y. Choi, Social bias frames: Reasoning about social and
power implications of language, in: D. Jurafsky, J. Chai,
N. Schluter, J. Tetreault (Eds.), Proceedings of the 58th
Annual Meeting of the Association for Computational
Linguistics, Association for Computational Linguistics,
Online, 2020, pp. 5477–5490. URL:
https://aclanthology.org/2020.acl-main.486.
doi:10.18653/v1/2020.acl-main.486.</p>
      <p>[14] A. Fokkens, N. Ruigrok, C. Beukeboom, G. Sarah,
W. Van Atteveldt, Studying muslim stereotyping through
microportrait extraction, in: Proceedings of the Eleventh
International Conference on Language Resources and
Evaluation (LREC 2018), 2018, pp. 3734–3741.</p>
      <p>[15] J. J. Sánchez-Junquera, B. Chulvi, P. Rosso, S. P.
Ponzetto, How do you speak about immigrants? taxonomy
and stereoimmigrants dataset for identifying stereotypes
about immigrants, Applied Sciences 11 (2021). URL:
https://www.mdpi.com/2076-3417/11/8/3610.
doi:10.3390/app11083610.</p>
      <p>[16] M. Karan, J. Šnajder, Preemptive toxic language
detection in wikipedia comments using thread-level
context, in: Proceedings of the Third Workshop on Abusive
Language Online, 2019, pp. 129–134.</p>
      <p>[17] J. Pavlopoulos, J. Sorensen, L. Dixon, N. Thain,
I. Androutsopoulos, Toxicity Detection: Does Context
Really Matter?, in: Proceedings of the 58th Annual Meeting
of the Association for Computational Linguistics,
Association for Computational Linguistics, Online, 2020,
pp. 4296–4305. URL:
https://aclanthology.org/2020.acl-main.396.
doi:10.18653/v1/2020.acl-main.396.</p>
      <p>[18] I. Markov, W. Daelemans, The Role of Context in
Detecting the Target of Hate Speech, in: Proceedings of
the Third Workshop on Threat, Aggression and Cyberbullying
(TRAC 2022), Association for Computational Linguistics,
Gyeongju, Republic of Korea, 2022, pp. 37–42. URL:
https://aclanthology.org/2022.trac-1.5.</p>
      <p>[19] J. M. Pérez, F. M. Luque, D. Zayat, M. Kondratzky,
A. Moro, P. S. Serrati, J. Zajac, P. Miguel, N. Debandi,
A. Gravano, et al., Assessing the impact of contextual
information in hate speech detection, IEEE Access 11
(2023) 30575–30590.</p>
      <p>[20] A. Baruah, K. Das, F. Barbhuiya, K. Dey,
Context-aware sarcasm detection using bert, in: Proceedings of the
Second Workshop on Figurative Language Processing, 2020,
pp. 83–87.</p>
      <p>[21] W. S. Schmeisser-Nieto, P. Pastells, S. Frenda,
M. Taulé, Human vs. machine perceptions on immigration
stereotypes, in: N. Calzolari, M.-Y. Kan, V. Hoste,
A. Lenci, S. Sakti, N. Xue (Eds.), Proceedings of the 2024
Joint International Conference on Computational
Linguistics, Language Resources and Evaluation (LREC-COLING
2024), ELRA and ICCL, Torino, Italia, 2024, pp. 8453–8463.
URL: https://aclanthology.org/2024.lrec-main.741.</p>
      <p>[22] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT:
Pre-training of Deep Bidirectional Transformers for
Language Understanding, in: Proceedings of the 2019
Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language
Technologies, Volume 1 (Long and Short Papers),
Association for Computational Linguistics, Minneapolis,
Minnesota, 2019, pp. 4171–4186. URL:
https://www.aclweb.org/anthology/N19-1423.
doi:10.18653/v1/N19-1423.</p>
      <p>[23] J. Cañete, G. Chaperon, R. Fuentes, J.-H. Ho, H. Kang,
J. Pérez, Spanish pre-trained bert model and evaluation
data, in: PML4DC at ICLR 2020, 2020.</p>
      <p>[24] A. G. Fandiño, J. A. Estapé, M. Pàmies, J. L. Palao,
J. S. Ocampo, C. P. Carrino, C. A. Oller, C. R. Penagos,
A. G. Agirre, M. Villegas, MarIA: Spanish language models,
Procesamiento del Lenguaje Natural 68 (2022). URL:
https://upcommons.upc.edu/handle/2117/367156#.YyMTB4X9A-0.mendeley.
doi:10.26342/2022-68-3.</p>
      <p>[25] H. Touvron, L. Martin, K. Stone, P. Albert,
A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra,
P. Bhargava, S. Bhosale, et al., Llama 2: Open foundation
and fine-tuned chat models, arXiv preprint arXiv:2307.09288
(2023).</p>
      <p>[26] A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch,
B. Savary, C. Bamford, D. S. Chaplot, D. d. l. Casas,
E. B. Hanna, F. Bressand, et al., Mixtral of experts,
arXiv preprint arXiv:2401.04088 (2024).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] M. Ekman, Anti-immigration and racist discourse
in social media, European Journal of Communication 34
(2019) 606–618.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] T. Bourgeade, A. T. Cignarella, S. Frenda,
M. Laurent, V. Moriceau, V. Patti, M. Taulé, A Multilingual
Dataset of Racial Stereotypes in Social Media
Conversational Threads, in: Proceedings of the 17th Conference of
the European Chapter of the Association for Computational
Linguistics (EACL 2023), 2023.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] B. C. Wallace, D. K. Choe, E. Charniak, Sparse,
contextually informed models for irony detection:
Exploiting user communities, entities and sentiment, in:
Proceedings of the 53rd Annual Meeting of the Association
for Computational Linguistics and the 7th International
Joint Conference on Natural Language Processing (Volume 1:
Long Papers), Association for Computational Linguistics,
Beijing, China, 2015, pp. 1035–1044. URL:
https://aclanthology.org/P15-1100. doi:10.3115/v1/P15-1100.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] L. Gao, R. Huang, Detecting online hate speech
using context aware models, in: Proceedings of the
International Conference Recent Advances in Natural Language
Processing, RANLP 2017, INCOMA Ltd., Varna, Bulgaria,
2017, pp. 260–266. URL:
https://doi.org/10.26615/978-954-452-049-6_036.
doi:10.26615/978-954-452-049-6_036.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[5] G. W. Allport, K. Clark, T. Pettigrew, The nature
of prejudice, Addison-Wesley, Reading, MA, 1954.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[6] D. L. Hamilton, Cognitive processes in
stereotyping [...].</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[7] E. Fersini, D. Nozza, P. Rosso, Overview of the
EVALITA 2018 task on automatic misogyny identification,
EVALITA Evaluation of NLP and Speech Tools for Italian 12
(2018) 59.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[8] E. Fersini, D. Nozza, P. Rosso, AMI @ EVALITA2020:
[...], in: Final Workshop (EVALITA 2020), Online event,
December 17th, 2020, volume 2765 of CEUR Workshop
Proceedings, CEUR-WS.org, 2020. URL:
http://ceur-ws.org/Vol-2765/paper161.pdf.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[9] F. Rodríguez-Sánchez, J. C. de Albornoz, P. Rosso,
Overview of EXIST 2022: sexism identification in social
networks, Procesamiento del Lenguaje Natural 69 (2022)
229–240. URL:
http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6443.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[10] P. Chiril, F. Benamara, V. Moriceau, “be nice to
your wife! the restaurants are closed”: Can gender
stereotype detection improve sexism classification?, in:
Findings of the Association for Computational Linguistics:
EMNLP 2021, Association for Computational Linguistics,
2021, pp. 2833–2844. URL:
https://aclanthology.org/2021.findings-emnlp.242.
doi:10.18653/v1/2021.findings-emnlp.242.</mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>[11] M. Sanguinetti, G. Comandini, E. di Nuovo,
S. Frenda, M. Stranisci, C. Bosco, T. Caselli, V. Patti,
I. Russo, Haspeede 2 @ EVALITA2020: Overview of the
EVALITA 2020 hate speech detection task, in: Proceedings
of the Seventh Evaluation Campaign of Natural Language
Processing and Speech Tools for Italian. Final Workshop
(EVALITA 2020), volume 2765, CEUR Workshop Proceedings
(CEUR-WS.org), 2020.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>