<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>SEPLN-</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Analysis on BERT-based Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Pol Pastells</string-name>
          <email>pol.pastells@ub.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wolfgang S. Schmeisser-Nieto</string-name>
          <email>wolfgang.schmeisser@ub.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Simona Frenda</string-name>
          <email>simona.frenda@unito.it</email>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mariona Taulé</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Centre de Llenguatge i Computació (CLiC), Universitat de Barcelona</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Institute of Complex Systems (UBICS), Universitat de Barcelona</institution>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>aequa-tech</institution>
          ,
          <addr-line>Turin</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>40</volume>
      <fpage>24</fpage>
      <lpage>27</lpage>
      <abstract>
        <p>Conversational context plays a pivotal role in disambiguating messages in human communication. In this study, we investigate the impact of contextual information on detecting stereotypes related to immigrants using various BERT-based models. We use two Spanish corpora containing news comments and tweets, together with their conversational threads, annotated with stereotypes related to immigrants in Spain. The results show that the influence of context on stereotype detection varies across different models, corpora and context levels. Although context can enhance performance in specific scenarios, it does not consistently improve stereotype detection across all the levels of contexts. Our comprehensive evaluation underscores the complex relationship between context and stereotype identification when we use BERT-based Language Models. In particular, we found that the number of texts benefiting from contextual analysis may be too limited for the models to effectively learn. Warning: This paper contains derogatory language that may be offensive to some readers.</p>
      </abstract>
      <kwd-group>
        <kwd>Stereotype Detection</kwd>
        <kwd>Context</kwd>
        <kwd>Conversational Thread</kwd>
        <kwd>Immigration</kwd>
        <kwd>BERT-based Models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Language that stigmatizes vulnerable social groups such as immigrants has increased during the last decade [1]. Stereotypes are oversimplified, generalized beliefs or perceptions about particular groups of people, often based on prejudices or misconceptions, and social networks have facilitated and aggravated the spread and reinforcement of these stereotypes about marginalized groups.</p>
      <p>The identification of negative stereotypes related to immigrants is not simple and involves knowledge of the situation of the analyzed society and an understanding of the conventional meanings and secondary references used by speakers in that society. These meanings and references can be expressed at the discourse level through, for instance, anaphora and ellipsis. In human communication, context helps to disambiguate and narrow down the interpretation of a particular message. This is observed in the percentage of data2 that requires knowledge of the context to identify the presence of negative stereotypes related to immigrants in Spain: each annotator needs to read the context.</p>
      <p>Given the critical role of context and the pervasive impact of stereotypes on marginalized communities, we investigated the impact of context on the detection of stereotypes related to immigrants. For human annotators, detecting stereotypes in textual data is a complex task that requires understanding the underlying context and nuances, especially if the stereotype is implicit, that is, when the stereotype is not directly stated in the text and an inference process is needed to interpret it. Example (1) shows a tweet from the Multilingual Stereotypes Corpus (MSC) [2] with an implicit stereotype that requires contextual information to classify. It shows the gold standard annotation for the tweet and its context.</p>
      <p>(1) MSC Tweet: ¿Y quién sufre los recortes? Los que hemos pagado impuestos toda la vida.3
'And who suffers the cuts? Those of us who have paid taxes all our lives.'
Annotation: [+stereotype] [+implicit] [+contextual]
Previous tweet: RECETA PARA COCTEL XENOFÓBICO. Toma una medida de "Un ilegal tiene los mismos derechos que tú, pero sin pagar impuestos". Añade una medida de "Que entren todos". Agita bien, y ya tienes un partido anti-inmigración a la europea. Servir bien caliente.
'RECIPE FOR A XENOPHOBIC COCKTAIL. Take a measure of "An illegal has the same rights as you, but without paying taxes". Add a measure of "Let them all in". Shake well, and you already have a European-style anti-immigration party. Serve while hot.'
Annotation: [+stereotype] [+implicit] [−contextual]
Fake news: Costear la sanidad de los inmigrantes ilegales cuesta 1.100 millones de euros.
'Paying for the health care of illegal immigrants costs 1.1 billion euros.'</p>
      <p>Despite the importance of context in human communication and the evident challenge of stereotype identification, there is a noticeable gap in the literature concerning the influence of context on stereotype detection in Natural Language Processing (NLP). Although there is a growing body of research in related areas such as irony [3] and hate speech detection [4], the role of context in resolving stereotype identification has been largely overlooked.</p>
      <p>In this paper, we propose adding context to fine-tuned BERT-based models to observe whether discursive context plays a role in interpreting and disambiguating a message in NLP, as it does in natural language. We use the only two existing corpora in Spanish annotated with stereotypes against immigrants that also contain context information: DETESTS [12], consisting of online news comments, and MSC, consisting of tweets. Both corpora feature texts embedded in conversational threads, where the contextual utterances include: 1) preceding sentences, 2) previous comments/tweets, 3) the first comment/tweet of the thread, and 4) the wider discourse, such as the news title or the fake news (or hoax) that generates the conversation.</p>
      <p>We propose adding these different levels of context after the [SEP] token of the models. We evaluate the quantitative performance of the models and the linguistic characteristics of the texts containing stereotypes, to understand their impact on the models' performance.</p>
      <p>The remainder of this paper is as follows: Section 2 reviews related work in the field of stereotype detection. Section 3 details the methodology, including the dataset, experimental setup, and evaluation metrics. Section 4 presents the experimental results and quantitative analysis, followed by a qualitative analysis in Section 5. Finally, Section 6 concludes the paper and outlines potential directions for future research.</p>
      <p>© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.</p>
      <p>1Code available on GitHub: https://github.com/pastells/context-aware-stereotype-detection</p>
      <p>2This range of percentage is extracted from the annotation of the MSC.</p>
      <p>3All examples have been manually translated.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>With the development of virtual communications, such as social media, chats and online news comments, there has been a growth of interactions, accompanied by an increase in abusive language, such as stereotypes.</p>
      <p>Stereotypes are cognitive resources that humans use to organize the reality they live in and to categorize social groups that they perceive as different. Social groups undergo a categorization process, in which the features associated with that group are attributed to all of its members [5]. Stereotypes are sets of exaggerated beliefs about a social group [6].</p>
      <p>Several studies have been undertaken to mitigate this phenomenon. For instance, every year there are more shared tasks oriented at solving automatic stereotype detection affecting various target groups, such as women and immigrants [7, 8, 9, 10, 11, 12]. Other works have taken into account the different textual expressions in which stereotypes appear, especially focusing on implicit forms of stereotypes that are spread through discourses. [13] propose a conceptual formalism to model pragmatic frames in which people project stereotypes onto others. [14] extract microportraits, i.e., descriptions, of Muslims from texts. [15] present a corpus of stereotypes related to immigrants from mentions at the Spanish Parliament. Nevertheless, to our knowledge, the role of conversational context has not yet been studied within the phenomenon of stereotypes, although there are some studies on context-aware models for the detection of abusive language, with rather inconclusive results.</p>
      <p>[16] evaluate toxic language in conversational threads from Wikipedia using two types of GRU, CNN and LSTM models, one trained with single comments and another one considering its context. However, the context-sensitive models did not significantly outperform the single-comment ones. In [17] the authors tried a range of different approaches to add context to LSTM, CNN and BERT-like models for the detection of hate speech, all with negative or neutral results. The authors hypothesized that context-sensitive comments are not frequent enough for the models to learn from them. Therefore, the majority of comments would not need context for the correct classification, and those that would require context would not get sufficient attention.</p>
      <p>[18] use a dataset of Facebook posts to identify hate speech with a Dutch pre-trained language model, BERTje. Contrary to the previous works, they obtain positive results when training context-aware models when those contexts are controlled and manually annotated as relevant for the classification of hate speech. In the same line of positive results, [19] explore context-aware models for the detection of hate speech. Their dataset consists of Twitter posts from Argentinian news outlet accounts. For their experiments, they trained BETO, a BERT-based model in Spanish, concluding that some contextual information is beneficial for hate speech detection. In particular, the smallest context, which corresponds to the news title tweet, gave the best results.</p>
      <p>In relation to the length of contexts, [20] present their participation in a shared task on context-aware sarcasm detection using BiLSTM, BERT, and SVM classifiers on Twitter and Reddit posts. The models were trained with five scenarios: zero context, the last sentence of the context, two sentences, three sentences, or all the sentences of the context. Likewise, we use different types of contexts, described in Section 3.1. They obtained the best results when only the last sentence was provided.</p>
      <p>From this related work, to our knowledge, there are no works so far that inject this type of context into stereotype detection in Spanish; however, we are aware of the inconclusive results that previous studies show.</p>
    </sec>
    <sec id="sec-11">
      <title>3. Methodology</title>
      <p>To analyze the models' behavior when provided with different levels of context, we used two existing datasets annotated with the presence of negative stereotypes regarding immigrants. In this section, we describe the used datasets and models.</p>
      <sec id="sec-11-1">
        <title>3.1. Datasets</title>
        <p>We used two Spanish corpora annotated with binary values indicating the presence of immigration stereotypes and whether the stereotypes are expressed explicitly or implicitly in the text. Table 1 summarizes the two corpora.</p>
        <p>DETESTS [12] consists of sentences extracted from comments posted in response to news articles in Spanish newspapers (such as ABC, elDiario.es and El Mundo) and discussion forums (such as Menéame). The articles were manually selected based on their immigration-related subject and potential toxicity. Each comment was segmented into sentences. The comment to which every sentence belongs and its position within the comment and thread are indicated in the corpus. Each sentence was annotated by three trained annotators, who had access to the entire comment the sentence belonged to when annotating, along with the news title and the rest of the comment thread. Example (2) shows an implicit stereotype and its contexts:</p>
        <p>(2) DETESTS Sentence: Y las violaciones.
'And the rapes.'
Annotation: [+stereotype] [+implicit]
Previous comment: Y que siga la fiestaaaaa!!!!
'And let the party continue!!!!'
News title: Inmigrantes ilegales paralizan el aeropuerto de Palma al huir de un avión marroquí.
'Illegal immigrants paralyze Palma airport when fleeing a Moroccan plane.'</p>
        <p>MSC [2] is a corpus of Twitter posts (tweets) responding to hoaxes that disseminated fake news against immigrants in newspapers or social media. The tweets were annotated by three trained annotators for the presence of stereotypes and their implicitness. Furthermore, during the annotation process, annotators considered the need to look into the context to decide if there was a stereotype. In those cases, the tweet was annotated as contextual. Out of the 1,604 tweets with stereotypes, 590 (37%) were annotated as contextual, with 253 (16%) of this subset also categorized as implicit. An example of this last case is shown in Example (1). MSC differs from DETESTS in that the corpus does not contain the full Twitter threads, but rather a subset of them (previous tweet, first tweet and the hoax). Therefore, the annotators did not have access to the entire conversational context, as they did in DETESTS.</p>
        <p>Another notable distinction between the texts in both corpora is that DETESTS comprises individual sentences, with a median length of 13 words4, whereas MSC consists of full, unsegmented tweets, with a median of 26 words5.</p>
        <p>The corpora are structured into threads, where the first direct comment or tweet (text from now on) on the article or post is the root of the thread. Each text can then have multiple responses, forming a tree structure. We identified a range of different contexts to which the annotators had access, in order to provide them to the models. We structured the contexts into four levels, summarized in Table 2:</p>
      </sec>
    </sec>
    <sec id="sec-12">
      <title>1. Previous sentences in the same comment (level 1).</title>
      <p>This level is only available for DETESTS, as MSC tweets were not split into sentences. Additionally, this level does not apply to the first sentence of each comment, which constitutes 45% of the sentences in DETESTS.</p>
      <p>2. Previous text in the thread (level 2). This level is absent for the first comment in each thread.</p>
      <p>4With Q1 = 7 and Q3 = 20.</p>
      <p>5With Q1 = 14 and Q3 = 41.</p>
      <p>Even though the contexts for DETESTS are formed by various sentences, they are still smaller (median of 21 words for previous sentences, with Q1 = 13 and Q3 = 41) than the MSC contexts (median of 34 words for the root text, with Q1 = 22 and Q3 = 49). This is due to the distribution of the comment threads, most of them having few comments.</p>
      <sec id="sec-12-1">
        <title>3.2. Models</title>
      </sec>
    </sec>
    <sec id="sec-13">
      <p>We fine-tuned three pretrained models from the BERT family for the classification task of stereotype detection. The models were trained to output a binary label: 0 for no stereotype, and 1 for stereotype. We are aware of the subjectivity of this task [21]; however, considering the evaluative scope of this work, we focused on the gold standard version of the above-mentioned corpora.</p>
      <p>We used two different models pretrained in Spanish and also multilingual BERT [22]. The selected models, obtained from the Huggingface transformers library
(https://huggingface.co/), were:</p>
      <p>BETO dccuchile/bert-base-spanish-wwm-cased [23],
based on the BERT-Base architecture, was trained with
the Whole Word Masking technique.</p>
      <p>MarIA PlanTL-GOB-ES/roberta-base-bne [24], based
on the RoBERTa-Base model, pre-trained using 570 GB of
Spanish texts, extracted from the Spanish Web Archive
crawled by the National Library of Spain.</p>
      <p>M-BERT [22] google-bert/bert-base-multilingual-cased,
based on BERT-Base, pre-trained on the top 104
languages with the largest Wikipedia using the original
masked language modeling objective.</p>
      <p>For each of the three models and both DETESTS and MSC, we fine-tuned a model without context (as a baseline) and a different model incorporating each possible context level. To add the context to the input, we used the sequence text + [SEP] + context, where [SEP] is the special BERT token that is usually used to split sequences in BERT-based models.</p>
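      <p>As a concrete illustration, the input composition could be sketched as follows (a minimal sketch, not the authors' code; the helper name and the plain string join are our assumptions — in practice a HuggingFace tokenizer called with a sentence pair, e.g. tokenizer(text, context, truncation=True, max_length=512), inserts [SEP] automatically):</p>

```python
def build_input(text, context=None):
    """Compose the model input as `text + [SEP] + context`.

    The no-context baseline receives the text alone; each context-aware
    model receives the text followed by one context level after [SEP].
    """
    if context is None:
        return text
    return f"{text} [SEP] {context}"

# Baseline vs. level-4 (news title) input for a DETESTS sentence:
baseline_input = build_input("Y las violaciones.")
context_input = build_input("Y las violaciones.", "Inmigrantes ilegales paralizan el aeropuerto de Palma")
```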
      <p>To address the issue of missing contexts during the fine-tuning process, we employed a hierarchical filling strategy. Specifically, if a lower-level context (e.g., level 1) was absent, it was replaced with the next highest level (e.g., level 2). If both level 1 and level 2 were lacking, they were both filled with level 3, and so on. This approach was taken into consideration during the qualitative analysis, ensuring that any observed improvements were attributed to the filled context rather than the missing one.</p>
      <p>Level 2 is missing for 45% of comments in DETESTS and 8% of tweets in MSC. The remaining two context levels are: 3. Root text (level 3). This level does not exist for the first comment of each thread and is identical to the previous comment for the second comment on each thread. It is missing in 45% of comments and 16% of tweets. Note that DETESTS has full threads, so the comments missing level 2 and the ones missing level 3 are the same, while for MSC they are different, although overlapping, sets. 4. News title for DETESTS or fake news text for MSC (level 4). This level is always present and differs from the others in that it does not represent an instance of the dataset, but an external reference.</p>
      <p>Both corpora were split in a stratified manner to maintain the same proportion of stereotypes, implicitness and stereotype topics6 [12].</p>
      <p>To prevent variability in the results, we decided to use 50 random seeds for training the models and report the average of their results. The data split was the same for all seeds. All models were trained7 with a 512 token window, using batches of 32 texts and evaluating the results every 50 steps, with early stopping.</p>
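      <p>The hierarchical filling strategy described above can be sketched as follows (a minimal illustration under our reading of the text; the function name and the dictionary layout are our assumptions, not the authors' code):</p>

```python
def fill_context(contexts, requested_level):
    """Return the context for `requested_level`, falling back to the
    next highest level when it is missing.

    `contexts` maps a level (1-4) to a string or None; level 4 (the
    news title for DETESTS or the hoax text for MSC) is always present.
    """
    for level in range(requested_level, 5):
        if contexts.get(level) is not None:
            return contexts[level]
    raise ValueError("level 4 context should always be present")

# The first sentence of a thread's root comment lacks levels 1-3,
# so a level-1 request falls back all the way to the news title.
example = {1: None, 2: None, 3: None, 4: "news title"}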
      <sec id="sec-13-1">
        <title>4. Quantitative Analysis</title>
        <p>We first compared the models with and without context using various metrics. Figures 1 and 2 show the F1 metric, precision, and recall for both the negative and the positive classes, i.e., the texts with or without stereotypes in the gold standard annotation. The bars represent the median across 50 seeds, with the error bars indicating the first and third quartiles. Furthermore, arrows mark a p-value smaller than 0.05 in a Welch's t-test for each metric, comparing the 50 seeds with and without context. The direction of the arrows denotes an improvement (up) or a deterioration (down) with respect to the model without context.</p>
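        <p>The per-metric comparison can be reproduced with a Welch's t-test over the two samples of 50 seed scores. The sketch below (our own illustration, not the authors' code) computes the t statistic and the Welch–Satterthwaite degrees of freedom from scratch; in practice scipy.stats.ttest_ind(a, b, equal_var=False) returns the p-value directly:</p>

```python
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and degrees of freedom for two samples with
    possibly unequal variances (e.g. one metric over 50 seeds with
    context vs. 50 seeds without)."""
    va = variance(a) / len(a)  # squared standard error of sample a
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / (va + vb) ** 0.5
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df
```

The p-value is then obtained from the t-distribution with `df` degrees of freedom; a value below 0.05 corresponds to an arrow in Figures 1 and 2.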
        <p>We further examined the texts whose predictions
changed upon adding context, in order to focus on the
differences between the models. Given the numerous
seeds used in our models, we identified texts with
consistent classification changes in more than 65% of the
seeds. For instance, for true positives (TP), we
considered a text classification to have changed if more than
65% of the seeds without context failed to classify it as
a stereotype, while more than 65% of the models with
a specific context correctly identified it as a stereotype.
Moreover, we examined all potential changes, including
TP, true negatives (TN), false positives (FP), and false
negatives (FN). These cases are shown in Tables 3 and 4
and are the same ones subjected to qualitative analysis
in Section 5.</p>
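        <p>The 65% consistency criterion can be expressed as below (a simplified sketch; the function and variable names are ours, and predictions are encoded as 1 = stereotype, 0 = no stereotype):</p>

```python
def consistent_change(preds_without, preds_with, gold, threshold=0.65):
    """Classify a text whose prediction flips consistently across seeds.

    `preds_without` / `preds_with` hold the per-seed predictions (0/1)
    of the models without and with a given context level. Returns
    'TP', 'TN', 'FP', 'FN', or None when the flip is not consistent
    in more than `threshold` of the seeds on both sides.
    """
    frac_pos_without = sum(preds_without) / len(preds_without)
    frac_pos_with = sum(preds_with) / len(preds_with)
    went_positive = frac_pos_without < 1 - threshold and frac_pos_with > threshold
    went_negative = frac_pos_without > threshold and frac_pos_with < 1 - threshold
    if gold == 1 and went_positive:
        return "TP"  # context helps: missed before, detected now
    if gold == 0 and went_negative:
        return "TN"  # context helps: false alarm before, correct now
    if gold == 0 and went_positive:
        return "FP"  # context hurts: correct before, false alarm now
    if gold == 1 and went_negative:
        return "FN"  # context hurts: detected before, missed now
    return None
```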
      </sec>
    </sec>
    <sec id="sec-14">
      <title>DETESTS predictions</title>
      <p>Initially, we looked at the difference in the F1 metric for the negative and the positive classes. To provide a more comprehensive analysis, we also added the precision and recall metrics. This was crucial, as in some instances a consistent F1 value obscured variations in precision and recall, either in terms of improvement or decline. These metrics are presented in Figure 1.</p>
      <p>6Although not used for this work, the corpora were also annotated with topics.</p>
      <p>7We used a single GeForce RTX 4090 GPU, with 24 GB of RAM.</p>
    </sec>
    <sec id="sec-15">
      <p>For BETO, there was a slight, yet statistically significant, deterioration in performance for the negative class. When evaluating the F1 metric for the positive class, level 1 is the only context that improves. The enhancement was driven by an increase in recall, but counterbalanced by a decrease in precision. A comparable trend was observed for levels 2 and 3. BETO showed an increase in FP cases and a drop in FN, indicating a tendency to classify more sentences as containing stereotypes when some context is provided.</p>
      <p>Models using the news title as context show a wide variability across the two classes in BETO and M-BERT, as evidenced by the disparity between the first and the third quartiles, with an overall worsening tendency.</p>
      <p>In contrast, MarIA behaves differently. It showed a general decline in performance on the F1 metric for the positive class, primarily due to an increased classification of sentences as not containing stereotypes, except when the model, informed with the news title context, reports a significant improvement. Lastly, M-BERT's performance, providing the context, shows no significant change in all scenarios.</p>
      <p>Table 3 shows the individual texts that change for each model and context, grouped by category, according to the predictions of the models with context, similarly to a confusion matrix. The arrows denote an improvement (TN and TP) or a deterioration (FN and FP). FP and FN changes are misclassified texts with context that were correctly classified without context, and therefore cases where the context does not help the models. TP and TN changes, instead, are the instances where the context helps the models make the correct prediction. For example, 163 is the number of sentences that did not have stereotypes in their gold label, but were classified as having one (FP) in more than 65% of the seeds for the BETO model without context. The model with level 1 contexts has 40 more FP (a 25% increase).</p>
      <p>Looking at this table, BETO shows the biggest change in FP, with similar numbers for level 1, level 2 and level 3 contexts. It also shows a slight improvement in TP for the same contexts. This behavior can be explained by the model just tending to classify more texts as stereotypes, in agreement with the metrics in Figure 1.</p>
      <p>MarIA shows a similar behavior, although with the contexts reversed. It tends to classify more sentences as stereotypes when given the news title as context, but not so much for the rest of the contexts. Instead, level 2 and level 3 appear to worsen the negative class, with an increase in FN. M-BERT is the model with the least consistent changes, with only a change of more than 10% in the FP with level 3 context, similarly to MarIA.</p>
      <p>MSC predictions. All classification models became biased toward predicting 0, that is, the models tend to predict fewer stereotypes. This can be seen in Figure 2, with the negative class precision and positive class recall worsening, while the negative class recall and positive class precision tend to improve, except for M-BERT. It is also made evident in Table 4: for all three models, adding any context makes the models' FN increase significantly.</p>
      <p>Similarly to DETESTS, the metrics for the level 4 context, the racial hoax text, had a big variability for BETO's positive class and M-BERT's negative and positive classes.</p>
      <p>On DETESTS, the increased sensibility towards the positive class is justified by an improvement of the recall (Figure 1). The FP cases had in common that their contexts tended to contain stereotypes. Example (3) shows a FP for BETO and M-BERT, where the text was annotated with no presence of stereotypes. However, its context does contain a stereotype:</p>
      <p>(3) DETESTS Sentence: Si aprenden catalán, serán catalanes, quizás catalanistas.
'If they learn Catalan, they will be Catalan, maybe Catalan nationalists.'
Annotation: [−stereotype]
Context: Los detuvo, pero quedarán libres y se irán de rositas. Se quedarán en el país para siempre, se llevarán todo tipo de ayudas y traerán a toda la familia.
'They arrested them, but they will be released and will walk away with ease. They will stay in the country forever, they will take all kinds of aid and they will bring the whole family.'
Annotation: [+stereotype] [+implicit]8</p>
      <p>As in the previous example, the classified texts neither focus on immigrants nor evaluate the in-group regarding immigrants. Instead, the topics of these messages predominantly concern evaluations of the in-group, with conclusions that do not necessarily pertain to the target group. Example (4), a FP for both BETO and MarIA, shows an evaluation and a consequence derived from previous texts. Although the sentence has no stereotype, both the previous sentences and the previous comment contexts contain stereotypes.</p>
      <sec id="sec-15-1">
        <title>5. Qualitative Analysis</title>
        <p>In this section, we present a qualitative analysis of the instances that improved or deteriorated their classification on the models trained with different levels of context, as presented in Tables 3 and 4. Our aim is to gain a deeper understanding of the impact of context on the models' predictions from a linguistic perspective. We describe linguistic patterns by comparing three levels of analysis: by models, by datasets, and by levels of contexts.</p>
        <p>In the predictions on DETESTS (Table 3), we observed an increase of sensibility towards the positive class.</p>
        <p>(4) DETESTS Sentence: Dentro de 20 o 30 años, nuestros hijos y nietos nos maldecirán mil veces por el infierno que les hemos dejado.
'In 20 or 30 years, our children and grandchildren will curse us a thousand times for the hell we have left them.'
Annotation: [−stereotype]
Previous Sentences: y ya es tarde, el Caballo de Troya lo tenemos dentro.
'and it's too late, we have the Trojan Horse within us.'
Annotation: [+stereotype] [+implicit]
Previous Comment: […] Están moviendo los hilos de esta invasión, que aprovechan para usar a los Ilegales como sicarios, para agredir y amedrentar a los españoles de bien. […]
'[…] They are pulling the strings of this invasion, which they take advantage of to use the Illegals as hitmen, to attack and intimidate good Spaniards. […]'
Annotation: [+stereotype] [−implicit]</p>
        <p>Another case of FP, for BETO, was found in Example (5). Even though the text concerns immigrants, with keywords corresponding to the target group, it contains no stereotype according to the annotators. Its context, however, was annotated with a stereotype, even though there is no explicit reference to immigrants. This shows that the model attends to enough tokens from the context to determine the presence of a stereotype, which drives the model to a positive classification.</p>
        <p>(5) DETESTS Sentence: En Francia, el paro es de 15% en la población general y de 40% en la inmigrada.
'In France, unemployment is 15% in the general population and 40% in the immigrant population.'
Annotation: [−stereotype]
Context: Pobres incautos. Salen como locos en vuelo directo a los invoxnaderos a trabajar por 3 € la hora.
'Poor dupes. They leave like crazy on a direct flight to the invoxnaderos9 to work for €3 an hour.'
Annotation: [+stereotype] [+implicit]</p>
        <p>8In fact, both sentences from the previous comment contain an implicit stereotype.</p>
      </sec>
    </sec>
    <sec id="sec-16">
      <p>Nonetheless, out of the eleven DETESTS sentences that were classified as FP by BETO with context levels 1 to 3, only two cases have no stereotypes in any of their contexts. For instance, in Example (6), there is no interpretation of stereotypes either by the human annotators or by the decision of the models without context. However, when adding the context, which was annotated as containing no stereotype, the prediction of the model yielded a FP.</p>
      <p>(6) DETESTS Sentence: Que los pececitos coman cachalote franquista.
'Let the little fish eat Francoist sperm whale.'
Annotation: [−stereotype]
Context: Pues lanza a tu madre.
'Then throw your mother.'
Annotation: [−stereotype]</p>
    </sec>
    <sec id="sec-17">
      <p>
        Furthermore, the opposite phenomenon occurs when
MarIA is fine-tuned: it shows a 24% deterioration in
DETESTS’s FP when the news title is fed as context. It
is worth noting that out of the twelve news articles that
were used to create DETESTS, six of them contained in
their title a word related directly to the target group,
such as immigrant or dinghy, as shown in Example (7). The
misclassified texts belong to five of these conversation
threads with keywords in their title, which might be an
indication that the model was affected by the vocabulary
used.
We observed the instances commonly misclassified as
not containing stereotypes by the majority of the models
(15 instances). We noticed that, in general, the presence
of the hoax as context (level 4) negatively affects the
decision of the model. Additionally, upon further analysis,
we consider that most of these instances contain implicit
expressions, requiring context to be understood, as seen
in Example (8).
(8) MSC Tweet: ...fuerzas políticas, ni policiales, ni legales,
para empezar a resolver la situación creada. Y yo creo
que ni voluntad de hacerlo. Aquello está lejos y a los
peninsulares no les preocupa lo más mínimo. Grave
error; gravisimo. Una vez controlen las islas vendrán
aquí a reclamar...
‘...political forces, neither police nor legal, to begin to resolve the
situation created. And I believe that there is no desire to do so.
That is far away and the peninsular people are not the least bit
worried. Serious mistake; very serious. Once they control the
islands they will come here to complain...’
Annotation: [+stereotype] [+implicit] [+contextual]
Level 2: Canarias ya está “ocupada” por marroquíes y
mauritanos. En las islas orientales, Fuerteventura y Lanzarote,
el número de moros ya es mayor que el de la población
autóctona. Es una estrategia marroquí que empieza a darle
resultados: la toma ’pacífica’ de territorios ...
‘The Canary Islands are already “occupied” by Moroccans
and Mauritanians. On the eastern islands, Fuerteventura and
Lanzarote, the number of Moors is already greater than that of
the native population. It is a Moroccan strategy that is beginning
to give results: the ’peaceful’ seizure of territories...’
      </p>
    </sec>
    <sec id="sec-18">
      <p>Considering this analysis, we plan to investigate further
the role played by the context in future work, exploring
other models and their common behaviors.</p>
      <sec id="sec-18-1">
        <title>6. Conclusions</title>
        <p>
          Taking into account the importance of context during
the identification of stereotypes in online conversational
threads, in this work, we analyzed the impact of
different levels of context on stereotype detection in news
comments and tweets.
In particular, we performed quantitative and
qualitative analyses on predictions obtained with fine-tuned
language models informed with different context levels.
Quantitatively, no general improvement was seen when
adding contextual information after the [SEP] token to
BERT-based models. The results were highly dependent
on the dataset used. In DETESTS, only BETO becomes
more sensitive to stereotypes when some context is
provided, as does MarIA when informed with the news
title text. In MSC, by contrast, the models are biased
towards the negative class. We hypothesize that the
number of texts that benefit from looking at the context
is too small for the models to learn from, as suggested
by the number of contextual-labeled tweets. The models
may also be looking into subtleties other than the
presence of stereotypes.
(7) News Title 1: La otra crisis con la que lidia Ceuta: un
tercio de los contagios son de inmigrantes acogidos.
‘The other crisis that Ceuta is dealing with: a third of the
infections are from received immigrants.’
News Title 2: Una “patera aérea”, una nueva e insólita
manera de entrar en España de forma irregular.
‘A “flying dinghy”, a new and unusual way to enter Spain
irregularly.’
Looking at Table 4, we notice an interesting tendency
related to FN in all the models informed with context.
The model performance worsens if we introduce context,
regardless of the level. To understand the behavior of the
models, we examined the commonly misclassified instances.
9 Word play in which the main word invernadero ‘greenhouse’
is embedded with the name of the far-right party Vox, resulting
in ‘invoxnadero’.
        </p>
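        <p>The text-plus-context input format described above can be sketched schematically. The snippet below is an illustrative stand-in, not the authors' code: a toy whitespace tokenizer mimics how a sentence and its context are packed into a single sequence around the [SEP] token, with segment ids distinguishing the two parts (real BERT tokenizers use subword vocabularies and truncation strategies that this sketch omits).

```python
# Schematic of context-informed input for a BERT-style classifier:
# [CLS] sentence [SEP] context [SEP], with segment ids 0 for the
# sentence and 1 for the appended context.
# Toy whitespace "tokenizer"; a hypothetical stand-in for a real one.

def build_input(sentence: str, context: str = "", max_len: int = 128):
    tokens = ["[CLS]"] + sentence.split() + ["[SEP]"]
    segment_ids = [0] * len(tokens)
    if context:
        ctx_tokens = context.split() + ["[SEP]"]
        tokens += ctx_tokens
        segment_ids += [1] * len(ctx_tokens)
    # Truncate to the model's maximum sequence length.
    return tokens[:max_len], segment_ids[:max_len]

tokens, segments = build_input(
    "En Francia, el paro es de 15% en la inmigrada.",
    context="Pobres incautos.",
)
print(tokens[0], tokens[-1])  # [CLS] [SEP]
print(set(segments))          # {0, 1}
```

With an empty context string the function degenerates to the no-context baseline, which is how the level-0 condition can be simulated with the same pipeline.</p>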
      </sec>
    </sec>
    <sec id="sec-19">
      <p>For example, the context in Example (6) has a negative
sentiment, even though it does not contain a stereotype.</p>
      <p>Future work may require more involved methods of
analysis on the quantitative side, using different
embeddings for the text and the context, or approaches such as
mechanistic interpretability.</p>
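        <p>One alternative mentioned above is to encode the text and the context with separate embeddings rather than packing both into one input. The sketch below is purely illustrative: encode() is a hypothetical stand-in for a sentence encoder (e.g. a frozen [CLS] vector), and the point is only the shape of the joint representation fed to a classifier.

```python
# Dual-encoder sketch: embed comment and context separately, then
# concatenate the two views into one feature vector for classification.

def encode(text: str, dim: int = 4):
    # Toy bag-of-characters embedding (hypothetical stand-in for a
    # real sentence encoder).
    vec = [0.0] * dim
    for i, ch in enumerate(text):
        vec[i % dim] += ord(ch) / 1000.0
    return vec

def joint_representation(sentence: str, context: str):
    # Concatenation keeps the two views distinct, unlike appending the
    # context after [SEP] in a single sequence.
    return encode(sentence) + encode(context)

rep = joint_representation("el paro es de 15%", "Pobres incautos.")
print(len(rep))  # 8
```

A classifier trained on such concatenated vectors could, in principle, weight the context view independently of the text view, which single-sequence [SEP] packing does not make explicit.</p>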
    </sec>
    <sec id="sec-20">
      <title>Limitations</title>
      <p>Our work was exclusively focused on
the Spanish language and employed solely BERT and
RoBERTa models. More advanced generative models,
such as Llama 2 [25] or Mixtral 8x7B [26], may offer
different ways of capturing context.</p>
      <p>Among the various levels of context considered, which
differed between the two corpora, only level 4 was
consistently present. The other levels had to be filled to prevent
the loss of valuable data. Exploring data augmentation
techniques, using synthetic data or curating a dataset
without missing contexts, could be a promising direction
for future research.</p>
    </sec>
    <sec id="sec-21">
      <title>Acknowledgments</title>
      <p>This work was supported by the international project
STERHEOTYPES: STudying European Racial Hoaxes and
sterEOTYPES, funded by the Compagnia di San Paolo and
VolksWagen Stiftung under the Challenges for Europe
call (CUP: B99C20000640007); the SGR CLiC project (2021
SGR 00313), funded by the Generalitat de Catalunya; and
the FairTransNLP-Language project (PID2021-124361OBC33),
funded by MICIU/AEI/10.13039/501100011033/ and
by FEDER, UE.</p>
      <p>[12] A. Ariza-Casabona, W. S. Schmeisser-Nieto, M. Nofre,
M. Taulé, E. Amigó, B. Chulvi, P. Rosso, Overview of
DETESTS at IberLEF 2022: DETEction and classification of
racial STereotypes in Spanish, Procesamiento del Lenguaje
Natural 69 (2022) 217–228.</p>
      <p>[13] M. Sap, S. Gabriel, L. Qin, D. Jurafsky, N. A. Smith,
Y. Choi, Social bias frames: Reasoning about social and
power implications of language, in: D. Jurafsky, J. Chai,
N. Schluter, J. Tetreault (Eds.), Proceedings of the 58th
Annual Meeting of the Association for Computational
Linguistics, Association for Computational Linguistics,
Online, 2020, pp. 5477–5490. URL:
https://aclanthology.org/2020.acl-main.486.
doi:10.18653/v1/2020.acl-main.486.</p>
      <p>[14] A. Fokkens, N. Ruigrok, C. Beukeboom, G. Sarah,
W. Van Atteveldt, Studying muslim stereotyping through
microportrait extraction, in: Proceedings of the Eleventh
International Conference on Language Resources and
Evaluation (LREC 2018), 2018, pp. 3734–3741.</p>
      <p>[15] J. J. Sánchez-Junquera, B. Chulvi, P. Rosso, S. P.
Ponzetto, How do you speak about immigrants? taxonomy
and stereoimmigrants dataset for identifying stereotypes
about immigrants, Applied Sciences 11 (2021). URL:
https://www.mdpi.com/2076-3417/11/8/3610.
doi:10.3390/app11083610.</p>
      <p>[16] M. Karan, J. Šnajder, Preemptive toxic language
detection in wikipedia comments using thread-level
context, in: Proceedings of the Third Workshop on Abusive
Language Online, 2019, pp. 129–134.</p>
      <p>[17] J. Pavlopoulos, J. Sorensen, L. Dixon, N. Thain,
I. Androutsopoulos, Toxicity Detection: Does Context
Really Matter?, in: Proceedings of the 58th Annual Meeting
of the Association for Computational Linguistics,
Association for Computational Linguistics, Online, 2020,
pp. 4296–4305. URL:
https://aclanthology.org/2020.acl-main.396.
doi:10.18653/v1/2020.acl-main.396.</p>
      <p>[18] I. Markov, W. Daelemans, The Role of Context in
Detecting the Target of Hate Speech, in: Proceedings of
the Third Workshop on Threat, Aggression and Cyberbullying
(TRAC 2022), Association for Computational Linguistics,
Gyeongju, Republic of Korea, 2022, pp. 37–42. URL:
https://aclanthology.org/2022.trac-1.5.</p>
      <p>[19] J. M. Pérez, F. M. Luque, D. Zayat, M. Kondratzky,
A. Moro, P. S. Serrati, J. Zajac, P. Miguel, N. Debandi,
A. Gravano, et al., Assessing the impact of contextual
information in hate speech detection, IEEE Access 11
(2023) 30575–30590.</p>
      <p>[20] A. Baruah, K. Das, F. Barbhuiya, K. Dey,
Context-aware sarcasm detection using bert, in: Proceedings of the
Second Workshop on Figurative Language Processing, 2020,
pp. 83–87.</p>
      <p>[21] W. S. Schmeisser-Nieto, P. Pastells, S. Frenda,
M. Taulé, Human vs. machine perceptions on immigration
stereotypes, in: N. Calzolari, M.-Y. Kan, V. Hoste,
A. Lenci, S. Sakti, N. Xue (Eds.), Proceedings of the 2024
Joint International Conference on Computational
Linguistics, Language Resources and Evaluation (LREC-COLING
2024), ELRA and ICCL, Torino, Italia, 2024, pp. 8453–8463.
URL: https://aclanthology.org/2024.lrec-main.741.</p>
      <p>[22] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT:
Pre-training of Deep Bidirectional Transformers for
Language Understanding, in: Proceedings of the 2019
Conference of the North American Chapter of the
Association for Computational Linguistics: Human Language
Technologies, Volume 1 (Long and Short Papers),
Association for Computational Linguistics, Minneapolis,
Minnesota, 2019, pp. 4171–4186. URL:
https://www.aclweb.org/anthology/N19-1423.
doi:10.18653/v1/N19-1423.</p>
      <p>[23] J. Cañete, G. Chaperon, R. Fuentes, J.-H. Ho, H. Kang,
J. Pérez, Spanish pre-trained bert model and evaluation
data, in: PML4DC at ICLR 2020, 2020.</p>
      <p>[24] A. G. Fandiño, J. A. Estapé, M. Pàmies, J. L. Palao,
J. S. Ocampo, C. P. Carrino, C. A. Oller, C. R. Penagos,
A. G. Agirre, M. Villegas, MarIA: Spanish language models,
Procesamiento del Lenguaje Natural 68 (2022). URL:
https://upcommons.upc.edu/handle/2117/367156#.YyMTB4X9A-0.mendeley.
doi:10.26342/2022-68-3.</p>
      <p>[25] H. Touvron, L. Martin, K. Stone, P. Albert,
A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra,
P. Bhargava, S. Bhosale, et al., Llama 2: Open foundation
and fine-tuned chat models, arXiv preprint arXiv:2307.09288
(2023).</p>
      <p>[26] A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch,
B. Savary, C. Bamford, D. S. Chaplot, D. d. l. Casas,
E. B. Hanna, F. Bressand, et al., Mixtral of experts,
arXiv preprint arXiv:2401.04088 (2024).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] M. Ekman, Anti-immigration and racist discourse
in social media, European Journal of Communication 34
(2019) 606–618.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] T. Bourgeade, A. T. Cignarella, S. Frenda,
M. Laurent, V. Moriceau, V. Patti, M. Taulé, A Multilingual
Dataset of Racial Stereotypes in Social Media
Conversational Threads, in: Proceedings of the 17th Conference of
the European Chapter of the Association for Computational
Linguistics (EACL 2023), 2023.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] B. C. Wallace, D. K. Choe, E. Charniak, Sparse,
contextually informed models for irony detection:
Exploiting user communities, entities and sentiment, in:
Proceedings of the 53rd Annual Meeting of the Association
for Computational Linguistics and the 7th International
Joint Conference on Natural Language Processing (Volume 1:
Long Papers), Association for Computational Linguistics,
Beijing, China, 2015, pp. 1035–1044. URL:
https://aclanthology.org/P15-1100. doi:10.3115/v1/P15-1100.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] L. Gao, R. Huang, Detecting online hate speech
using context aware models, in: Proceedings of the
International Conference Recent Advances in Natural Language
Processing, RANLP 2017, INCOMA Ltd., Varna, Bulgaria,
2017, pp. 260–266. URL:
https://doi.org/10.26615/978-954-452-049-6_036.
doi:10.26615/978-954-452-049-6_036.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[5] G. W. Allport, K. Clark, T. Pettigrew, The nature
of prejudice, Addison-Wesley, Reading, MA, 1954.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[6] D. L. Hamilton, Cognitive processes in
stereotyping [...].</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[7] E. Fersini, D. Nozza, P. Rosso, Overview of the
EVALITA 2018 task on automatic misogyny identification,
EVALITA Evaluation of NLP and Speech Tools for Italian 12
(2018) 59.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[8] E. Fersini, D. Nozza, P. Rosso, AMI @ EVALITA2020:
[...], in: Final Workshop (EVALITA 2020), Online event,
December 17th, 2020, volume 2765 of CEUR Workshop
Proceedings, CEUR-WS.org, 2020. URL:
http://ceur-ws.org/Vol-2765/paper161.pdf.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[9] F. Rodríguez-Sánchez, J. C. de Albornoz, P. Rosso,
Overview of EXIST 2022: sexism identification in social
networks, Procesamiento del Lenguaje Natural 69 (2022)
229–240. URL:
http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6443.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[10] P. Chiril, F. Benamara, V. Moriceau, “be nice to
your wife! the restaurants are closed”: Can gender
stereotype detection improve sexism classification?, in:
Findings of the Association for Computational Linguistics:
EMNLP 2021, Association for Computational Linguistics,
2021, pp. 2833–2844. URL:
https://aclanthology.org/2021.findings-emnlp.242.
doi:10.18653/v1/2021.findings-emnlp.242.</mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>[11] M. Sanguinetti, G. Comandini, E. di Nuovo,
S. Frenda, M. Stranisci, C. Bosco, T. Caselli, V. Patti,
I. Russo, Haspeede 2 @ EVALITA2020: Overview of the
EVALITA 2020 hate speech detection task, in: Proceedings
of the Seventh Evaluation Campaign of Natural Language
Processing and Speech Tools for Italian. Final Workshop
(EVALITA 2020), volume 2765, CEUR Workshop Proceedings
(CEUR-WS.org), 2020.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>