
Unveiling Stereotypes: Combining Knowledge Graphs and LLMs for Implied Stereotype Generation

Marco Cuccarini


Lia Draetta

Beatrice Fiumanò

Stefano Bistarelli

Rossana Damiano

Valentina Presutti

University of Bologna, Department of Modern Languages, Literatures, and Cultures
University of Naples Federico II, Department of Biology
University of Perugia, Department of Mathematics and Computer Science
University of Turin, Department of Computer Science

2025


In recent years, hate speech detection models have achieved significantly improved results, largely due to advances in Large Language Models (LLMs). As a result, research has increasingly focused on more nuanced phenomena, such as the detection of implicit hate and stereotypes. Although the challenge of identifying implicit language has been widely explored, it remains an open issue for state-of-the-art models due to their limited ability to grasp contextual and culturally specific knowledge. In this work, we address the task of identifying stereotypes implicitly encoded in hate speech messages, and propose a method for generating them by leveraging the combined potential of LLMs and Knowledge Graphs (KGs). As a first step, we designed an ontology specifically tailored to represent implicit hate speech. We then populated the ontology using a subset of an Italian-language hate speech dataset, in which targets and implied stereotype statements were manually annotated. The remaining portion of the dataset was reserved as a test set to evaluate the impact of knowledge graph-derived information on LLM-generated stereotypes. For each input sentence, relevant knowledge was extracted from the ontology using SPARQL queries and used to enrich the prompt provided to various LLMs. We compared the results of the knowledge-enhanced approach against those of a baseline few-shot learning approach. Evaluation was conducted using BLEU, BERTScore and ROUGE metrics. Additionally, given the high subjectivity of the task, we performed a manual qualitative analysis on a subset of the model outputs to assess both the quality of the evaluation and the soundness of the generated stereotypes. Warning: This paper contains examples of explicitly offensive content.

Keywords: Hate speech detection, Stereotype, Large Language Models, Knowledge Graph, Retrieval Augmented Generation

1. Introduction

resources. We then populate the ontology using a subset of an Italian dataset on implicit stereotypes, which comprises manual annotations on HS targets, hateful chunks and stereotypes. Finally, starting from the target entities in each sentence, we extract relevant knowledge from the KG and integrate it into the prompt of three different LLMs. We task the models with generating the implicit stereotype that underlies each hate speech message. We compare these stereotypes with those generated by a baseline model using a non-KG-enhanced prompt. The main contributions of this work are the following:

• StereoGraph: a Knowledge Graph grounded in a dedicated ontology designed to represent implicit hate expressed in social media posts.
• A graph-based methodology to generate explicit stereotypes encoded in hateful messages.
• A fine-grained manual assessment and error analysis to evaluate the suitability of the evaluation metrics used to compare both the baseline and KG-enhanced outputs against the gold standard. This was particularly relevant given the highly subjective and culturally specific nature of the task.

In the following section (Section 2) we present relevant related works on detection and analysis of subtle hate speech (2.1), together with graph-based approaches (2.2) to the same tasks. Section 3 describes the adopted methodology, the dataset we used for constructing the KG, and the ontology design process. The experimental setup is detailed in Section 4, while the results, including quantitative evaluation, human assessment, and error analysis, are discussed in Section 5. Finally, the conclusions and limitations are presented in Sections 6 and 7, respectively. All data and code for reproducibility can be found on the following GitHub page¹.

2. Related Works

2.1. Subtle Hate Speech Explanation

Unlike explicit hate speech, the interpretation of implicit hate speech often requires inference and integration of background knowledge [12, 13], particularly since hate expressions are usually socio-culturally dependent and rely on contextual knowledge [14]. These factors contribute to the challenge of detecting implicit hate speech and highlight the ongoing need for more sophisticated detection systems, as current state-of-the-art models still struggle to efficiently handle this task [15]. Some studies have attempted to identify subtle hate speech by leveraging different approaches.

Several approaches have been explored to identify subtle hate speech, including transformer-based models [16, 17, 18], neural networks [19] or leveraging semantic information embedded in texts [19, 20]. Other approaches tried to tackle this task by incorporating the potential of external sources of knowledge, such as Knowledge Graphs [21].

In this context, few studies have directly addressed the challenge of unveiling or explaining subtle hate speech. Some researchers [16, 22] have focused on the role of social stereotypes, aiming to uncover their implicit meanings and to develop benchmarks for explanation-oriented tasks. Other works have specifically addressed the task of implicit hate speech explanation. Kim and colleagues [23] present a pipeline that guides transformer models' predictive decisions through the identification of key rationales. More recent studies have leveraged the generative capabilities of LLMs. For example, Huang and colleagues [24] propose a Chain-of-Explanation prompting method to generate stereotypes. Similarly, Yang et al. [25] introduce a step-by-step approach that combines LLM-based chain-of-thought prompting with a human-annotated benchmark.

While several studies have focused on creating benchmarks and providing insights into implicit hate speech in English, resources for the Italian language remain limited, with only a few datasets addressing the hate speech phenomenon in depth. Notable studies [26, 27, 28, 29] have provided valuable annotated resources that distinguish between implicit and explicit hate speech and stereotypes, with the goal of detecting the more subtle and less recognizable nuances of hate. Nevertheless, research on stereotype explication remains limited. For example, Muti and colleagues [30] investigate the ability of LLMs to accurately identify implicit messages in misogynistic contexts, also exploring how prompts can reconstruct subtle meanings to make the messages explicit. However, to our knowledge, no previous work about embedded stereotypes has been carried out in the Italian cultural context. We suggest that the generation of implicit stereotypes can support the development of more comprehensive benchmarks, improving models' performance in detecting subtle forms of hate speech.

2.2. Knowledge-Enhanced Approaches

Knowledge-enhanced and Retrieval-Augmented Generation (RAG) methods [31] have emerged as a powerful paradigm to address key limitations of LLMs. More recently, this line of work has incorporated structured, graph-based knowledge, particularly KGs [8], to enhance retrieval and reasoning capabilities.

In the domain of hate speech research, knowledge-enhanced approaches have provided solutions to address the challenges posed by implicit hate speech across various tasks.

¹ https://github.com/marcocuccarini/StereoGraphUnveilingStereotypes

Zhao et al. [21] propose MetaTox, a RAG-based approach that integrates a meta-toxic knowledge graph with LLMs for hate speech detection. First, LLMs are used to construct the KG by combining data from three English datasets. Then Qwen and LLaMA3.1 are prompted to classify tweets as toxic or non-toxic. The authors demonstrate that the MetaTox method reduces false positives, leading to better generalization and fewer hallucinations from LLMs. Lin [13] combines Entity Linking techniques with summarized Wikipedia descriptions to improve performance in implicit hate speech detection and classification tasks. Although it does not follow a standard RAG approach, the paper proposes feeding a Multi-Layer Perceptron with embeddings of concatenated tweet and external knowledge representations, training it to perform a multi-label classification of implicit hate speech types. This approach demonstrated significant improvements when entity triggers were mentioned in text, although limitations remained for the classification of tweets requiring pragmatic understanding.

In the context of implicit hate speech, Yadav et al. [32] introduce Tox-BART, a BART-based architecture enhanced with toxicity attributes, i.e. structured meta-information on tweets, encompassing target groups, insult types, and hate intensity levels. This approach addresses limitations derived from the poor quality of retrieved KG tuples, which can hinder KG-augmented approaches. Using different evaluation metrics, they demonstrate that infusion of toxicity attributes achieves performance comparable to simple KG-infusion. In the Italian context, Di Bonaventura and colleagues [33] implemented a knowledge-enhanced approach for detecting homotransphobic hate speech. The system leverages the O-Dang knowledge graph, which contains information about named entities in the Italian HT context. The approach showed promising results, outperforming baseline scores.

Compared to the reviewed literature, our approach represents a step forward, particularly in the area of Italian language hate speech detection. While most prior work has focused on the detection of implicit hate speech, our study shifts the emphasis toward the explanatory capabilities of LLMs, specifically investigating how these can be enhanced through the integration of structured knowledge. Furthermore, by focusing on stereotypes and adopting a hybrid evaluation approach (automatic and human-based), our work also provides valuable insights into the ability of LLMs to uncover sound and coherent stereotypes from implicit language, as well as into the reliability of the evaluation metrics used.

3. Methodology

In this work, we aim to perform the task of implicit stereotype generation using LLMs, comparing a baseline approach with a KG-enhanced alternative. Given a sentence and its associated hate speech target, the model is prompted to generate the subtle stereotype that contributes to the message's hateful nature. In the following sections, we briefly present the proposed pipeline (Section 3.1), describe the dataset used (Section 3.2), and outline the construction of the ontology that serves as the foundation for the knowledge graph (Section 3.3).

3.1. Pipeline Overview

Our methodology is designed to make subtle stereotypes conveyed in hateful content explicit. This is a particularly challenging task, as it requires nuanced contextual understanding and awareness of culturally specific stereotypes associated with the target. By integrating external knowledge, we investigate whether language models can effectively contextualize such messages and generate more accurate and transparent stereotypes.

The proposed approach is illustrated in Figure 1. Given an input sentence and its associated HS target, retrieved from the annotated dataset, we use the target to query the KG via a SPARQL query, retrieving all triples in which the target is linked to its stereotypes. We then adopt a few-shot learning approach, integrating into the prompt the external knowledge retrieved from the KG in RDF format. The evaluation phase consists of a comparison between the results (i.e. generated stereotypes) obtained using the knowledge-enhanced and the baseline approach. A hybrid evaluation was performed comparing automatic metrics with human assessment.

3.2. Dataset

To address the task of subtle stereotype generation, we leveraged the Open Stereotype Corpus² [34], containing 3,578 Italian tweets collected between October 2018 and June 2019 from the Contro l'Odio dataset [35]. The dataset was annotated by five different annotators. For each message, the annotators identified the specific chunk (trigger) containing the hate content, the implicit stereotype (if present) and the stereotype cluster (a more general class aimed at creating a stereotype categorization). In the original dataset the authors automatically distinguished between agent and patient by parsing each rationale; we chose to simplify this distinction by aggregating the two columns under a single class named "target". An example of the dataset structure along with a subset of annotations is presented in Figure 3. From the dataset

3.3. Ontology Design

For the ontology design process we adopted a fully manual approach to ensure the quality of the resulting resource through several means: aligning it with foundational ontologies and related semantic resources, ensuring the conceptual correctness of the defined classes, and minimizing the potential introduction of bias. The ontology includes four top-level classes: Situation, Stereotype, Agent, and Type. The class Situation is aligned with the homonymous class from the foundational ontology DOLCE [36]. Its purpose is to link a given target and its associated stereotype to a specific occurrence, such as a Twitter post, in order to avoid the introduction of bias or overly generic statements about stereotypes. The class Stereotype captures the implicit assertions conveyed in a given sentence. The class Agent, aligned with the FOAF (Friend of a Friend) ontology³, has two subclasses: Group and Person. These subclasses represent different types of targets and are connected to specific situations via the hasTarget relation, which links a message to its corresponding target. The class Type is designed to provide a taxonomy for both targets (e.g., racial target, religious target) and stereotypes (e.g., 'are dangerous', 'are unclean'). The ontology was subsequently populated using SPARQLAnything⁴ [37], leveraging the dataset described in the previous section as data source. After this process we obtained a knowledge graph containing triples such as the following:

    ster:_803176483174780929 rdf:type dul:Situation ;
        rdfs:label "Forza ragazzi, 180mila clandestini all anno, rom da tutte le parti, illegalita totale, Coop rosse e bianche che lucrano. ora sapete cosa votare" ;
        dul:hasTarget ster:immigrati ;
        ster:hasStereoManifestation ster:180mila-clandestini-allanno ;
        ster:hasStereotype ster:invadendo-italia .

    ster:invadendo-italia rdf:type ster:Stereotype ;
        rdfs:label "invadendo italia" ;
        ster:hasType "SonoInvasori" .

    ster:immigrati rdf:type foaf:Group .

This means that a specific post, identified by the ID ster:_803176483174780929, is an instance of the class Situation. It has a specific content, expressed through the relation rdfs:label, and it is associated with a specific stereotype chunk through the relation ster:hasStereoManifestation. The tweet is then associated with a particular target, ster:immigrati, as well as a stereotype, ster:invadendo-italia. The stereotype is then defined as an instance of the class ster:Stereotype and linked to a specific cluster, SonoInvasori, through the relation ster:hasType.

³ http://xmlns.com/foaf/spec/
⁴ https://sparql-anything.cc/

4. Experiment Setting

In the next sections, the experimental setting is presented. The approach consists of three main steps: Knowledge retrieval, where relevant information is retrieved from the KG (Section 4.1); Prompting, where three models are prompted using both a few-shot baseline and a few-shot KG-enriched approach (Section 4.2); and Evaluation (Section 4.3), where the results are assessed using both automatic metrics and manual evaluation.

4.1. Knowledge Extraction

For every sentence of the test set we extracted relevant knowledge from the Knowledge Graph leveraging the following SPARQL query:

    SELECT ?s ?stereotype
    WHERE {{
        ?s a dul:Situation ;
           dul:hasTarget <{target_uri}> ;
           ster:hasStereotype ?stereotype .
    }}

Using this query we were able to retrieve all the stereotypes associated with a certain tweet that has the specified target. For example, using "immigrati" as target we are able to extract triples like the following, in which the first element is the ID, the second the gold stereotype and the third the hateful span:

    ster:_id sono-irregolari clandestini-musulmani
    ster:_id non-rispettano-legge nn-amano-subire-le-nostre-leggi-sti-migranti
    ster:_id spacciano immigrati-spacciatori-e-stupratori

Since our goal is to prove that this integrated information could improve implicit stereotype generation, we rely on the gold-standard targets provided in the dataset. This avoids the noise introduced by potential errors in target prediction. One limitation encountered is the over-representation of certain targets, which appear with a high number of samples. To reduce the impact of the "lost in the middle" phenomenon [38] and to balance the quantity of information, we randomly sample 20 stereotypes per target.

4.2. Prompt Construction

We decided to test three different models, LLaMA-3.1-8B, gemma-2-9b-it [39] and Mistral-7B-Instruct-v0.2 [40], to explore their ability to understand the subtle stereotype embedded in the message. We selected these three distinct LLMs because they are state-of-the-art, multilingual, open-source models with comparable architecture and medium scale size.

The task is conducted in the Italian language. For the baseline, we used a few-shot learning approach, and for the prompt construction we adopted a vanilla structure setup; the prompt is written in Italian. Additionally, it includes instructions on how to structure the output sentence, explicitly asking the models to generate output in the format [subject] [are/do] [predicate]. The knowledge-enhanced approach incorporates a prompt containing information about the target entity from the KG. For each target, we associate the relevant retrieved stereotypes. The full prompt is presented in Appendix A. The output produced by the LLM was preprocessed before the evaluation, removing generic elements provided by the LLM, such as the usual formulaic closing statements (e.g., asking if it can assist further).

4.3. Evaluation

For the evaluation phase, we leverage BLEU [41], BERTScore [42] and ROUGE [43]. BLEU measures how many n-grams in the generated text appear in the reference text, focusing on precision and penalizing very short outputs. ROUGE focuses on recall, checking how many of the n-grams or sequences of the reference text appear in the generated text, and is often used for summarization. BERTScore compares the generated and reference texts using deep contextual embeddings from BERT, capturing semantic similarity beyond exact word matches.

Recent studies [44, 45] have highlighted the limitations of automated evaluation methods, and some scholars [46, 47] are beginning to emphasize the potential of hybrid approaches. For this reason, and aware that stereotypes are characterized by high subjectivity and culture-specific variation, we also conducted a human-based evaluation to better understand the reliability of the metrics used in such contexts. To this end, we designed a twofold manual assessment.

First, aiming at replicating the automatic evaluation, we compared the generated (baseline and KG-enhanced) outputs with the gold standard, assessing semantic and syntactic similarity on a scale from 1 to 5. The annotators were presented with the following questions: (Q1) To what extent, on a scale from 1 to 5, is the baseline output semantically and syntactically comparable to the gold standard? (Q2) To what extent, on a scale from 1 to 5, is the KG-enhanced output semantically and syntactically comparable to the gold standard?

Then, aware that stereotypes are not universally shared, and that some are more prevalent or culturally dependent than others [48], we aimed to evaluate the extent to which the generated stereotype might be culturally recognizable from our own perspective as white Italian researchers aged between 25 and 30. The evaluation of generated stereotypes was conducted only on content produced by the baseline model, as the KG-enhanced method provides the model with additional contextually relevant information. Annotators were asked to assess whether, in their own perspective, the generated stereotype reflects commonly held beliefs or societal biases (Q3). For example, the stereotype "gli avvocati pagano i migranti" ("Lawyers pay the migrants") was judged unrealistic by all three annotators. In contrast, "gli immigrati delinquono" ("Immigrants commit crimes") received two positive evaluations out of three, suggesting that this stereotype may reflect a commonly held bias in the Italian context. The human evaluation was conducted by three annotators on a subset of 50 sentences. An example of the conducted manual evaluation is presented in Table 1.

Table 1 (cell layout only partially recoverable from the extracted text)
Gold: I rom sono truffatori
Baseline: I rom sono falsi invalidi per commuovere.
KG: rom_sinti sono ladri
Further entries: i migranti non sono profughi; gli avvocati pagano i migranti; gli immigrati sono criminali; i migranti sono criminali; gli immigrati sono violenti; gli immigrati delinquono
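As an illustration only, the three steps of this section (knowledge retrieval, prompt context construction, lexical evaluation) can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the implementation used in the experiments: all function names are hypothetical, the example URI is a placeholder, the query template still needs PREFIX declarations and a SPARQL engine to be executed, and the two metrics are simplified stand-ins (clipped unigram precision without brevity penalty, and the recall side of ROUGE-L) rather than the full BLEU and ROUGE packages. BERTScore is omitted because it requires a pretrained model.

```python
import random

# Query template from Section 4.1. The doubled braces become literal braces
# once str.format() fills in {target_uri}. PREFIX declarations are omitted.
QUERY_TEMPLATE = """
SELECT ?s ?stereotype
WHERE {{
  ?s a dul:Situation ;
     dul:hasTarget <{target_uri}> ;
     ster:hasStereotype ?stereotype .
}}
"""


def build_query(target_uri):
    """Fill the target URI into the SPARQL template (hypothetical helper)."""
    return QUERY_TEMPLATE.format(target_uri=target_uri)


def sample_stereotypes(stereotypes, k=20, seed=0):
    """Keep at most k distinct stereotypes per target, sampled at random."""
    unique = sorted(set(stereotypes))
    if len(unique) <= k:
        return unique
    return random.Random(seed).sample(unique, k)


def build_context(target, stereotypes):
    """KG-method context as a list of triples; the baseline uses []."""
    return [[target, "hasStereotype", s] for s in stereotypes]


def tokens(text):
    """Lowercase whitespace tokenization (an assumption, not the paper's)."""
    return text.lower().replace(".", " ").split()


def unigram_precision(candidate, reference):
    """Clipped unigram precision: share of candidate tokens found in the reference."""
    cand, ref = tokens(candidate), tokens(reference)
    if not cand:
        return 0.0
    remaining = {}
    for t in ref:
        remaining[t] = remaining.get(t, 0) + 1
    hits = 0
    for t in cand:
        if remaining.get(t, 0) > 0:
            hits += 1
            remaining[t] -= 1
    return hits / len(cand)


def rouge_l_recall(candidate, reference):
    """Longest common subsequence length divided by reference length."""
    cand, ref = tokens(candidate), tokens(reference)
    if not ref:
        return 0.0
    prev = [0] * (len(cand) + 1)
    for r in ref:
        cur = [0]
        for j, c in enumerate(cand, start=1):
            cur.append(prev[j - 1] + 1 if r == c else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1] / len(ref)
```

With these simplified definitions, comparing "gli immigrati sono criminali" against the reference "i migranti sono criminali" gives 0.5 for both scores, since only "sono" and "criminali" match exactly. Near-synonyms such as "immigrati" and "migranti" contribute nothing, which is the kind of lexical mismatch that motivates complementing n-gram metrics with an embedding-based score and human judgment.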

5. Results

In the next sections the experiment results are provided. While automated methods are efficient, they often lack precision. In contrast, human evaluation offers greater contextual understanding but is time-consuming and costly. To balance accuracy and efficiency, we applied an automatic method to the full dataset and selected a smaller subset for manual evaluation.

5.1. Computer-Based Analysis

Table 2 presents the results of the generation task, comparing the three models across the two approaches, i.e. baseline versus knowledge graph enhanced. The results show that adding the information from the KG improves the performance of all three models, LLaMA3.1, gemma2, and Mistral7B, across BLEU, ROUGE, and BERT-based scores. Gemma2 benefits the most, with its BLEU score more than doubling and a large gain in ROUGE. LLaMA3.1 and Mistral7B also show consistent, though smaller, improvements. The BERT-based scores indicate better semantic relevance with the KG. Overall, the KG helps the models produce more accurate and meaningful results.

5.2. Human-based Analysis

The annotators were provided with answers from both the baseline and the KG-enhanced method. Each answer was evaluated on the basis of its similarity to the gold standard; the normalized results are presented in Table 3. Furthermore, for the baseline generation only, annotators were asked to assess whether the stereotypes reflect commonly held beliefs or communal biases. LLaMA 3.1 obtains the highest average scores for both baseline and KG-enhanced outputs, demonstrating strong overall performance. Gemma 2 shows lower results across all metrics, while Mistral7B performs the lowest on both baseline and KG averages. Human evaluation further confirms that incorporating knowledge from the graph improves model performance across all models and annotators. In addition, the variation in annotators' scores highlights the subjective nature of the task and the challenge of achieving consistent judgments. Annotator 2, for example, generally rates outputs higher, particularly for KG-enhanced responses, while Annotator 3 is more critical. Human-evaluated results confirm the trends observed in computer-based scores (for all the models and the annotators the scores are higher in the case of the KG-enhanced approach), demonstrating how our method improves the model's ability to explicitly address implicit hate speech and suggesting that automatic measures can be informative for this type of task.

Regarding the assessment of the generated stereotypes, the human evaluation reveals diverging tendencies: LLaMA shows the highest average scores across the three annotators, and the value appears to be especially high according to Annotators 1 and 3. Gemma2 shows a similar tendency, especially regarding Annotators 2 and 3. Finally, Mistral tends to have an overall lower score on stereotype soundness, suggesting that it may produce content that is less biased or not realistic.

5.3. Human-based vs Computer-based metric

To better understand the relationship between automatic metrics and human judgment, we compared the results of BLEU, ROUGE and BERTScore with human evaluation over a sample of 50 sentences, as seen in Figure 3. The three plots help identify which metric aligns more closely with human evaluation.

[Figure 3: (a) BLEU scores compared to all annotators; (b) ROUGE-L scores compared to all annotators; (c) BERTScore compared to all annotators]

From the plots, it is evident that the BERTScore metric (shown in the third plot) correlates more consistently with the annotators' evaluation, suggesting it is a more reliable indicator of quality for this task. This is due to the nature of BERTScore, which leverages contextual embeddings to measure similarity on a semantic level. Conversely, BLEU and ROUGE metrics (depicted in the first and second plots, respectively), which operate more on the lexical-syntactic level, show more variability and several limitations in accurately matching human judgment.

Understanding the relationship between automatic and manual assessment is crucial for contextualizing the values obtained from each metric and evaluating model performance in a meaningful way. The comparison also helps to understand which metrics are more robust and reliable, especially for tasks requiring deep contextual and pragmatic understanding.

5.4. Error Analysis

To gain deeper insight into the functionality and limitations of our approach, and to identify areas for potential future improvements, we conducted an error analysis on the tweets where the KG-enhanced method showed the lowest performance. Overall we observed that errors frequently occurred when the input contained named entities or subjects that differed from the primary target. For example, in the tweet:

    Finanzia l'invasione degli immigrati: ecco la prova. La vergogna di George Soros, "padrone" d'Italia.
    English: "He funds the immigrant invasion: here is the proof. The shame of George Soros, the 'master' of Italy."

the KG-enhanced output was: "George Soros finanzia l'invasione degli immigrati" (English: "George Soros funds the immigrant invasion"). While not conceptually incorrect, this differs from the gold standard: "i migranti vogliono invadere l'Italia" (English: "The migrants want to invade Italy."). A similar issue occurred in the tweet:

    Che senso ha ministro Trenta rispettare chi non rispetta noi? Che senso ha difendere la loro cultura o presunta cultura quando essi disprezzano la nostra? La ministra Trenta contro Salvini: sbagliato dire che l'Islam è terrorismo
    English: "What's the point, minister Trenta, of respecting those who don't respect us? What's the point of defending their culture or so-called culture when they despise ours? Minister Trenta against Salvini: it's wrong to say that Islam is terrorism"

The KG-enhanced output was "la ministra Trenta disprezza la cultura italiana." (English: "Minister Trenta despises Italian culture."), whereas the gold standard was: "i musulmani vanno contro i valori dell'Occidente" (English: "Muslims go against Western values"). In other cases, when the model encounters a target associated with a high number of stereotypes, it tends to concatenate many of them into a generic and incoherent output.

In some cases, both the baseline and the KG-enhanced approaches struggle to recognize irony and fail to produce a reliable underlying stereotype. For example, consider the following sentence:

    #Dimartedi Stasera indottrinamento pro Europa. Alla bisogna sono benvenuti anche gli stranieri. Bravo #Floris, vai a cager
    English: "#dimartedi tonight: pro-Europe indoctrination. If needed, even foreigners are welcome. Well done #Floris, go to hell."

Both the baseline and the KG-enhanced approaches generate "gli stranieri sono benvenuti" (English: "Foreigners are welcome"), failing to detect the subtle irony in the original message.

Finally, we observed challenges in tweets with complex hypotactic structures and multiple subjects. In such cases, models often fail to correctly identify the primary target and to produce relevant output. Furthermore, the KG-enhanced method tends to generate overly long responses in these situations, which can reduce the coherence and precision of the generated content. In summary, the worst-performing examples often occur because the model misidentifies the target of the hate tweet, leading to reduced accuracy. However, in many cases, the model still manages to extract a correct implicit message, which, while different from the gold standard, is present in the tweet. In such cases, the prediction is valid, but the reference annotation fails to recognize it as correct.

6. Conclusion

In this work, we aim to investigate whether large language models are able to uncover implicit stereotypes embedded in hate speech messages. This task is important as it helps uncover the subtle content of hate speech messages and supports hate speech detection models in identifying abusive language. Specifically, we explore the role that additional information from a knowledge graph may play in the understanding and generation of underlying stereotypes. We compare a baseline few-shot approach with a knowledge-enhanced method, leveraging different LLMs. We observed that prompts enhanced with additional information outperformed the baseline approach. To better assess the reliability of the automatic evaluation metrics, we also conducted a manual evaluation, replicating the task performed by the automatic metrics. The human evaluation confirmed the results, showing higher scores for the knowledge graph-enhanced approach. While the manual assessment was aligned with the automated results, we observed a high degree of variability in the scores. This suggests that evaluating such generated content is inherently subjective and can vary based on the annotators' culture, age, or beliefs. These findings highlight the importance of contextualizing evaluation metrics and recognizing that they may carry biases or oversimplify complex phenomena. From the error analysis, we observed that the KG-enhanced approach occasionally struggles to manage the quantity of information provided, suggesting that further studies are needed to better understand the extent to which such models can effectively integrate additional knowledge.

To sum up, the findings of this research suggest that knowledge graph-based approaches are highly promising, even in the hate speech domain, where they remain largely underexplored.

7. Limitation and Future Work

In this work we focused on the integration of stereotypes, retrieving targets from the gold standard. This allows us to concentrate the analysis on the knowledge insertion process within the LLM, minimizing the introduction of noise. As future work, we intend to test the approach using a state-of-the-art target detection model. Although this may introduce errors due to target misclassifications, it would enable full autonomy for the proposed method and enhance its applicability in real-world scenarios. Target detection methods can also return multiple potential targets in cases of uncertainty, providing a fuller stereotype context for posts that may involve more than one target. Since we noticed that different stereotypes are associated with the same target, as future work we may consider an approach based on semantic similarity to select the most contextually relevant stereotypes. This approach could offer a more focused context for the prompt and reduce the likelihood of model misunderstandings. During the error analysis phase, we identified errors potentially caused by the 'lost-in-the-middle' phenomenon. Future work should explore in greater depth how models manage different quantities of input information. Finally, it is important to highlight that the manual evaluation we conducted, particularly regarding the cultural shareability of the generated stereotypes, is inherently biased and reflects the perspectives of the researchers involved in this study. As future work, it would be interesting to carry out a large-scale, perspectivist survey to explore the diversity of opinions on stereotypes and to investigate the dominant worldview conveyed by different large language models.
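Purely as an illustration of the similarity-based selection proposed above as future work, the following sketch ranks the stereotypes retrieved for a target by their lexical overlap with the post and keeps the top k. It rests on several assumptions: the function names are hypothetical, the example post is invented, and Jaccard overlap on tokens is a deliberately simple stand-in for a semantic similarity measure (an embedding-based score would be the more natural choice in practice).

```python
def token_set(text):
    """Lowercase tokens; hyphens in KG labels are treated as spaces."""
    return set(text.lower().replace("-", " ").split())


def jaccard(a, b):
    """Jaccard overlap between the token sets of two strings."""
    sa, sb = token_set(a), token_set(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def select_relevant(post, stereotypes, k=3):
    """Return the k stereotypes most similar to the post (ties keep input order)."""
    ranked = sorted(stereotypes, key=lambda s: jaccard(post, s), reverse=True)
    return ranked[:k]


# Hypothetical post; the stereotype labels are taken from the Section 4.1 example.
candidates = ["sono-irregolari", "non-rispettano-legge", "spacciano"]
top = select_relevant("gli immigrati spacciano droga", candidates, k=1)
```

Here `top` contains only "spacciano", the single label sharing a token with the post. Such a filter would replace the random sampling of 20 stereotypes per target with a post-specific selection.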

Ethical Considerations

We acknowledge that when dealing with hate speech, particularly stereotypes targeting minorities, it is essential to be mindful of the risk of introducing bias or unintentionally amplifying hateful content. We made efforts to control and reduce the presence of bias and to remain aware of its potential introduction. During the experimental phase, we prompted LLMs to generate implied stereotypes, which in some cases resulted in the generation of hateful or offensive content. The generated hateful content is intended solely to remain within the context of this experimental research. Its occurrence also provides additional insights into how LLMs can produce harmful language despite safety filters.

(baseline: context = ∅, KG-method: context = [[target, hasStereotype, stereotype], ...])

// Examples:
post = "a roma nuovi campi nomadi e gli italiani abbandonano le abitazioni. e noi abbiamo gente come la boldrini che difende tutto questo."
implicit_message = i rom sono pericolosi
post = "Papampero a San Vittore e festa del Pd? Te credo! I detenuti votano tutti a sinistra, pure rom e islamici."
implicit_message = la sinistra protegge i criminali stranieri

// Task: Extract from the following post

Declaration on Generative AI
