Beyond Headlines: A Corpus of Femicides News Coverage in Italian Newspapers

Eleonora Cappuccio¹,²,³,*,†, Benedetta Muscato¹,⁴,†, Laura Pollacci¹,², Marta Marchiori Manerba¹,², Clara Punzi¹,⁴, Chandana Sree Mala¹,⁴, Margherita Lalli⁴, Gizem Gezici³, Michela Natilli² and Fosca Giannotti⁴

¹ Università di Pisa, Pisa, Italy
² ISTI-CNR, Pisa, Italy
³ Università degli Studi di Bari Aldo Moro, Bari
⁴ Scuola Normale Superiore, Pisa, Italy

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04–06, 2024, Pisa, Italy
* Corresponding author.
† These authors contributed equally.
eleonora.cappuccio@phd.unipi.it (E. Cappuccio); benedetta.muscato@sns.it (B. Muscato)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

Abstract
How newspapers cover news significantly impacts how facts are understood, perceived, and processed by the public. This is especially crucial when serious crimes are reported, e.g., in the case of femicides, where the description of the perpetrator and the victim builds a strong, often polarized opinion of this severe societal issue. This paper presents FMNews, a new dataset of articles reporting femicides extracted from Italian newspapers. Our core contribution aims to promote the development of a deeper framing and awareness of the phenomenon through an original resource available and accessible to the research community, facilitating further analyses on the topic. The paper also provides a preliminary study of the resulting collection through several example use cases and scenarios.

Keywords
Italian Dataset, Newspapers, Information Extraction, Information Retrieval, AI for Social Good, Femicides

1. Introduction

How newspapers and journalists present news plays a crucial role in shaping public understanding and perception of information. This is especially important when reporting serious crimes, such as femicides, where descriptions of the perpetrator and victim can create polarized opinions influencing readers' perceptions and interpretations of the event. According to Bouzerdan and Whitten-Woodring [1], news media often report incidents of women's homicides in a sensationalised manner, treating these crimes as isolated events rather than situating them within the bigger framework of violence against women. This narrative defies the global demands of human rights organisations to acknowledge and address this phenomenon as demanded by its intricate dynamics. Numerous countries have followed such recommendations only partially through the formal adoption of specific terminology such as femicide and feminicide in legal frameworks and public discourse. The two terms have related but distinct nuances of meaning. Femicide, a criminological concept initially coined in English by the feminist criminologist Diana H. Russell [2], denotes the murder of women by males due to their gender. Subsequently, the term femicide, translated in Castilian as femicidio or feminicide by the anthropologist Marcela Lagarde to attract political attention on the dire situation faced by women in Mexico [3], has gained global traction with varying interpretations, yet consistently denotes a patriarchal impetus behind homicides and other forms of male violence against women, primarily emphasising the sociological dimensions of abuse and the socio-political ramifications of the phenomenon. In the Italian language, the term femminicidio has been almost exclusively adopted, as evidenced by a Google Trends analysis comparing the search terms "femicidio" and "femminicidio" to queries regarding "femicide"¹.

¹ The conducted analysis included news web searches in Italy since 2022, i.e., since the service implemented an enhanced data collection methodology.

An analysis of the phenomenon of femicide in the Italian context and, in particular, a linguistic investigation of it, are particularly relevant. Feminicide, a term used by the feminist movement in Italy since 2005, gained prominence in the media in 2011, especially thanks to the works of Barbara Spinelli [4]. The CEDAW Committee², based on data from the Shadow Report on the Implementation of CEDAW in Italy, addressed recommendations to the Italian government on feminicide in its Concluding Observations. This was the first time the committee addressed a European state on feminicide, a category previously reserved for warnings to Central American countries. The challenges in accurately contextualising feminicide in Italy also stem from a prolonged absence of official data, resulting in sensationalism and the perception of a dramatic rise in the crime. This may induce an emergency narrative that obscures the inherent structural dimensions of the phenomenon, thereby undermining the very essence of the term [5]. Media interpretations are essential for shaping a shared understanding across a vast audience, such as a whole country; hence, the examination of media discourse emerges as a significant analytical instrument on top of statistical evaluation of femicide data to understand the achievements and directions of state intervention towards the substantial granting of women's right to life [6].

² Committee on the Elimination of Discrimination Against Women.

In this regard, Aldrete and Fernández-Ardèvol [7] showed that there is a large body of empirical studies on femicide discourse across different socio-cultural contexts, which often justify the perpetrator's actions. Given the complexity of the phenomenon, a comprehensive investigation could be achieved by integrating media analysis with external data, such as demographics and current events, bringing together researchers from different fields like computer science, social sciences, and complex systems science. The lack of accessible and relevant data specific to the socio-cultural contexts where femicide is notably prevalent, such as Italy, makes the task particularly challenging [8].

This paper presents FMNews, a new dataset of articles reporting femicides extracted from Italian newspapers³. We conduct a preliminary analysis of the resulting collection through several example use cases and scenarios. The primary contribution is to deepen understanding and awareness of femicide from a socio-technical perspective. We seek to examine how prominent Italian news sources report on the issue in connection to the shaping of public perception, while also offering an innovative and accessible resource to facilitate future investigation within the research community. Furthermore, this study was designed to enable a multifaceted investigation covering the following three dimensions:

• Geographical, with the aim to explore potential variations in framing between local and national media outlets. Indeed, previous research has shown that Italian local daily newspapers often suppress the agency of the perpetrator, portraying the events as mere occurrences [9]. We selected newspapers reporting news at both the national⁴ and local⁵ level, with local editions spanning across the whole Italian territory.
• Political, which was ensured by choosing national newspapers with varying political leanings.
• Temporal, where the time frame of national newspapers extends from November 2009 to February 2024, whilst that of the local ones ranges from November 2010 to February 2024⁶.

³ The choice of newspapers was dictated by the circulation volume released by Audipress, a company that collects data on the reading habits of daily and periodical press in Italy: https://audipress.it/quotidiani/.
⁴ The selected national newspapers are the following: Corriere della Sera, La Repubblica, La Stampa, Il Fatto Quotidiano, Il Giornale and Il Post.
⁵ The selected local newspapers are the local editions of the CityNews group, which cover the following cities: Agrigento, Ancona, Arezzo, Avellino, Bari, Bologna, Brescia, Brindisi, Caserta, Catania, Cesena, Chieti, Como, Ferrara, Firenze, Foggia, Forlì, Frosinone, Genova, Pescara, Piacenza, Latina, Lecce, Lecco, Livorno, Messina, Milano, Modena, Monza, Napoli, Novara, Padova, Palermo, Parma, Perugia, Pisa, Pordenone, Ravenna, Reggio, Rimini, Roma, Salerno, Sondrio, Terni, Torino, Trento, Treviso, Trieste, Udine, Venezia, Verona, Vicenza, Viterbo.
⁶ In Fig. 3 in the Appendix, we report the distribution of articles across time.

2. Related Work

According to frame analysis, the ways in which newspapers cover news significantly impact how facts are understood, perceived, and processed by the public [10, 11]. Framing narratives means strategically including or omitting elements (such as problem definitions, explanations and evaluations) of a given situation in a communicative text [12, 13, 14]. This process aims to advocate for specific interpretations, assess moral responsibilities of individuals involved and propose solutions while also eliciting nuanced emotional responses from the audience, thereby affecting their perceptions and attitudes. It is worth noting that in the case of news articles, media framing can be seen as a demonstration of political power [10], influencing which actors or interests are involved in shaping narratives, often unnoticed by the audience [11].

The process of news framing becomes especially crucial when reporting serious crimes, such as femicides, as understanding femicide requires analyzing its evolution from both statistical and social perspectives, as discussed in the Manifesto delle Giornaliste e dei Giornalisti per il Rispetto e la Parità di Genere nell'Informazione⁷ (Manifesto of Journalists for Respect and Gender Equality in News Reporting, our translation).

⁷ https://www.sindacatogiornalistiveneto.it/wp-content/uploads/2020/12/MANIFESTO-DI-VENEZIA.pdf.

The acknowledged impact of language on how readers perceive information has prompted researchers to explore how the language surrounding femicide has changed and how this influences individuals' responsibility perception [15], which can vary based on the way femicides are reported [1, 16, 9, 17]. Moreover, an initiative by the University of Bologna seeks to identify the main discursive features employed in discussions about femicide in public spaces, including media and legal speech⁸.

⁸ https://site.unibo.it/osservatorio-femminicidio/it.

Recognizing the significant role of linguistic expression in depicting incidents of gender-based violence, previous research has explored various NLP techniques. These studies aim to discern how NLP models can effectively predict and analyze human perception judgments concerning the sensitive issue of gender-based violence events. Following previous works on the impact of specific grammatical constructions and semantic frames [18] in describing the same event but with various nuances, Minnema et al. [19] introduced the first multilingual tool, based on Frame Semantics and Cognitive Linguistics, for detecting the focus or perspective depicted in an event, called Socio Fillmore. Furthermore, building on the linguistic analysis provided by Socio Fillmore, Minnema et al. [20] demonstrated that various linguistic choices trigger different perceptions of responsibility, which can be modeled automatically. As a result, their series of regression models revealed that these distinct linguistic choices significantly influence human perceptions of responsibility. Additionally, to promote awareness of perspective-based writing, Minnema et al. [21] introduced the novel task of responsibility perspective transfer. The task involves the automatic rewriting of descriptions of gender-based violence to alter the perceived level of blame attributed to the perpetrator. Both works leveraged one of the limited resources available for the Italian community, the RAI Femicide Corpus, a collection of 2,734 news articles covering 937 confirmed femicide cases that occurred in Italy between 2015 and 2017 [22]. Additional online resources, both official and unofficial, containing further statistics on the phenomenon of femicide in Italy are listed in Appendix A.

3. FMNews Corpus

The main contribution brought by this paper is the production of two datasets derived from Italian newspapers: the FMNews⁹ corpus. The corpus consists of the following components: FMNews-Nat, reporting data from national newspapers, and FMNews-Loc, which gathers articles from local newspapers in 53 Italian cities.

⁹ The collection can be accessed for research purposes by requesting it by email from the authors.

3.1. Data Extraction

Despite the heterogeneous HTML structures of the newspapers involved, it was feasible to generalise the data extraction process via the open source Python libraries Selenium¹⁰ and Beautiful Soup¹¹. Data scraping was performed in two consecutive phases. Firstly, a comprehensive list of article links was extracted by querying the internal search engine of the newspaper websites with the keywords femminicidio, femminicidi, femminicida: the first word stands for the Italian term "femicide", the second is its plural form, and the third indicates the "person who commits a femicide". The keywords were selected to concentrate our analysis on the media's representation and discourse surrounding this phenomenon. This choice intentionally excludes articles that discuss such crimes in general terms, allowing for a more focused examination of the femicide narratives. In the second phase, the web pages corresponding to such links were scraped to extract the text of the articles and other metadata to build the raw version of the dataset.

¹⁰ https://selenium-python.readthedocs.io/.
¹¹ https://www.crummy.com/software/BeautifulSoup/bs4/doc/.

3.2. Data Cleaning

We implemented a supervised and semi-supervised data cleaning process, consisting of two phases, to prepare the data. In the first step, the same pipeline was applied to both FMNews-Nat and FMNews-Loc. We initially removed all duplicate articles from the collected data, i.e., those with identical texts (title and body), metadata (e.g., date), and source publication. Additionally, we converted the dates into the format yyyy-mm-dd and removed articles where at least one of the following elements was missing: publication date, title, or body. Despite the removal of duplicates, certain articles had identical text bodies, albeit with minor variations primarily due to special character encoding (e.g., accents and apostrophes) or differences in web crawling (e.g., one article included the website menu or footer while the other did not). To address this issue, we implemented a method to identify and handle articles with identical or highly similar text bodies sharing the same title. In detail, we first employed a TF-IDF¹² vectorizer to convert the raw text data into numerical vectors and then used them to compute the cosine similarities between all pairs of texts in the dataset. For more details on the parameters and thresholds employed, we refer to Appendix B. Finally, we utilized Beautiful Soup to remove any HTML tags that could have been mistakenly included in the article body during the collection phase.

¹² Term Frequency-Inverse Document Frequency, in short TF-IDF, is a measure of the importance of a word to a document in a collection or corpus [23].

The second step of the data cleaning process entailed supervised cleaning of the article texts and headlines. The article texts from national newspapers in FMNews-Nat displayed various noise patterns specific to each news media outlet. To address this issue, we manually created a list of replacements for each outlet, employing regular expressions for targeted removal of articles or specific sub-strings from article titles or bodies (we refer to Appendix B for additional details). In this stage, we also excluded articles whose text bodies did not contain information directly related to femicides, such as television programme listings or podcast episode agendas.

On the other hand, the articles from local newspapers in FMNews-Loc exhibited minimal noise within their text. Therefore, the data preparation phase focused on poorly encoded symbols and domain-specific substrings such as copyright indications and external contributions, e.g., government press releases. Unlike for national newspapers, this ad-hoc cleaning did not result in data loss for these publications.

3.3. Final Dataset

Table 1 provides a detailed explanation of the data format for both datasets after the completion of the data preparation process. The number of entries for the two datasets is shown in Table 2. The table also shows the number of articles after the two steps of data cleaning detailed in Appendix B.

Table 1: Description of the FMNews Corpus.

Column      Description
Url         URL of the original newspaper article
Title       Title of the article
Text        Main section of the newspaper article
Newspaper   Name of the media outlet where the article was published. In FMNews-Loc, it reports the name of the city to which the local edition refers.
Keyword     Keyword used to collect the article
Date        Publication date of the article in the format yyyy-mm-dd

Table 2: Dimensions of the dataset in terms of number of articles from national news outlets (FMNews-Nat) and local newspaper editions (FMNews-Loc).

Dataset       Raw Data   Step I   Step II
FMNews-Nat    12,790     7,511    7,443
FMNews-Loc    8,397      7,728    7,728

The analysis of FMNews-Nat after the last cleaning steps reveals the following summary statistics. The dataset covers a time span of 14 years, from November 2009 to February 2024. Regarding the distribution of articles across different newspapers in FMNews-Nat: Il Fatto Quotidiano has the largest number of articles, with a total of 2,861, followed by La Repubblica with 2,837 articles. Corriere della Sera is next, with a total of 968 articles. La Stampa has a more limited presence, with 292 articles. Il Post contributes 244 articles, and Il Giornale has the fewest entries in this set, with 241 articles. For FMNews-Loc, the time span after data cleaning ranges from November 2010 to February 2024.

4. Use Cases and Scenarios

Since the two datasets share the same structure and we are interested in studying the phenomenon of femicide from both a national and local perspective, the analyses exemplified in the following were conducted on both datasets without distinction. After a textual analysis based on tokenization, removal of stopwords, extraction of lemmas and a straightforward assessment of the lexical diversity (as detailed in Appendix C), we adopted a keyword extraction method to uncover relevant patterns in the documents.

Keyword Extraction. According to Firoozeh et al. [24], specific criteria must be met for keywords to meet eligibility standards. In our case study, we emphasize the importance of keywords that show representativity and exhaustivity, aiming for terms that capture significant rather than marginal aspects of the subject matter. To assess the significance of words within our collection of documents, a standard approach involves the Term Frequency - Inverse Document Frequency (TF-IDF).

For a deeper analysis, we calculate TF-IDF for each news outlet. We utilize spaCy's Italian pipeline to preprocess texts by tokenizing, lemmatizing, and selecting only lemmas that are full words from specific part-of-speech classes (nouns, adjectives, verbs). By focusing only on content lemmas and excluding function words (like articles and prepositions), we eliminate noise and improve accuracy in analyzing relationships between documents and word relevance. The lists of lemmas do not include words containing numbers or Italian stopwords obtained from NLTK and spaCy, with additional crawling-dependent stopwords such as "it," "https," "min," and the names of months. Also, we preserve multi-word expressions identified by the lemmatizer by concatenating them to treat them as unique words during TF-IDF calculation. Articles are then grouped by news outlet, each acting as a single document for the TF-IDF computation. We use the TF-IDF Vectorizer from the scikit-learn¹³ library to transform the lemmatized tokens into numerical features that reflect their importance within the text. Thus, TF-IDF measures the significance of terms concerning the news outlets.

¹³ https://scikit-learn.org/stable/index.html.

Fig. 1 illustrates the most relevant keywords extracted from FMNews-Nat by news outlet. As expected, terms like "woman," "violence," and "kill" (along with "femicide") are central to the narrative of femicide and are common across all outlets. Other keywords vary in relevance among multiple outlets; for example, "son" appears in all outlets except Il Post. Specific keywords are unique to one or two outlets: "gender," "right," and "sexual" appear only in Il Post; "family" is relevant in Corriere della Sera and La Stampa; and "man" is found in Il Post and Il Giornale. Due to the number of local news outlets in FMNews-Loc (50), Fig. 2 shows the top 20 keywords with the highest average TF-IDF, calculated as the mean of the TF-IDF values of the terms with respect to the news outlets. As expected, the highest ranks are occupied by the same relevant keywords found in national news outlets, such as "woman," "violence," "victim," and "femicide". Additionally, some keywords relevant to specific national news outlets show high relevance for local media, although with lower average TF-IDFs, such as "gender". Conversely, the distribution reveals previously unseen keywords, such as "young," "school," and "association".

Figure 1: Top 10 keywords in descending order for each news outlet in FMNews-Nat. Panels: (a) Il Post, (b) Corriere della Sera, (c) Il Giornale, (d) Fatto Quotidiano, (e) La Repubblica, (f) La Stampa.

Figure 2: Top 20 Keywords by average TF-IDF in FMNews-Loc.

Semantic Vector Extraction. For an additional layer of analysis, we chose to train a word embedding model to explore semantic relationships among words. This model represents words as continuous space vectors, where the proximity of vectors indicates the semantic similarity between the words they represent: closer vectors
Table 3 Most similar word embeddings to (a) "uccidere" (to kill) in FMNews-Nat (b) "vittima" (victim) in FMNews-Loc Word Similarity score Word Similarity score 𝑎𝑚𝑚𝑎𝑧𝑧𝑎𝑟𝑒 (to murder) 0.77 𝑟𝑎𝑔𝑎𝑧𝑧𝑎 (girl) 0.69 𝑢𝑐𝑐𝑖𝑑𝑒𝑟𝑙𝑎 (to kill - her) 0.71 𝑔𝑖𝑜𝑣𝑎𝑛𝑒 (young) 0.69 𝑎𝑚𝑚𝑎𝑧𝑧𝑎𝑡𝑜 (murdered - him) 0.66 𝑑𝑜𝑛𝑛𝑎 (woman) 0.67 𝑢𝑐𝑐𝑖𝑠𝑜 (killed - him) 0.66 𝑚𝑎𝑑𝑟𝑒 (mother) 0.67 𝑠𝑢𝑖𝑐𝑖𝑑𝑎𝑟𝑠𝑖 (to commit suicide) 0.63 𝑓 𝑖𝑔𝑙𝑖𝑎 (daughter) 0.64 𝑠𝑡𝑟𝑎𝑛𝑔𝑜𝑙𝑎𝑡𝑜 (strangled - him) 0.62 𝑠𝑐𝑜𝑚𝑝𝑎𝑟𝑠𝑎 (disappearance) 0.62 𝑓 𝑢𝑟𝑖𝑎 (fury) 0.60 𝑢𝑐𝑐𝑖𝑠𝑎 (killed - her) 0.62 𝑓 𝑢𝑐𝑖𝑙𝑒 (rifle) 0.59 26𝑒𝑛𝑛𝑒 (26 years old) 0.61 𝑠𝑝𝑎𝑟𝑎𝑟𝑒 (to shoot) 0.59 𝑚𝑎𝑠𝑠𝑎𝑐𝑟𝑎𝑡𝑎 (massacred - her) 0.59 𝑎𝑐𝑐𝑜𝑙𝑡𝑒𝑙𝑙𝑎𝑡𝑜 (stabbed - him) 0.59 𝑝𝑜𝑣𝑒𝑟𝑎 (poor) 0.59 correspond to words with more similar meanings. We would expect, nearly all terms are associated and high- employed Word2Vec (W2V) [25], which operates by light that the victim is a woman. In this regard, a draw- mapping words to high-dimensional vectors within a back to consider is that the specific selection of the terms given vocabulary. This mapping is designed to represent used for the data collection query may have hindered semantic relationships between words in the vectorial our analysis from uncovering insights about homicides space. W2V has been implemented through Gensim14 , committed against individuals who do not identify as a powerful tool set for NLP tasks. A key parameter in woman or fit into the traditional gender binary. Indeed, W2V is the "window", i.e., the number of context words the discussion around gender-based violence in Italy is to be considered, which we defined as 10 to consider still predominantly centred on women, while other gen- a contextual window that extends neither too far nor ders remain significantly neglected15 . too close to the current word, thereby striking a balance between contextual relevance and computational effi- ciency. To discover the semantic associations within our 5. 
Conclusion dataset, we leveraged the "most similar" method from In this contribution, we provided a novel dataset concern- Gensim, which computes the cosine similarity between ing the critical issue of femicide in Italy. Considering the word vectors to identify words with the closest seman- absence of resources for conducting in-depth analyses on tic proximity. For both datasets the size of the training the subject, our intent was to bridge this gap and provide embeddings for the W2D model is fixed to 100 while an original perspective for understanding and raising the vocabulary size change accordingly to the dataset, in awareness about this severe phenomenon. FMNews-Nat is 6809, in FMNews-Loc is 6064. As suggested by Dobbe et al. [26], proposing a con- In FMNews-Nat, the word "donna" (woman) yielded tribution within the Machine Learning domain respon- semantically related terms such as "vittima" (victim) and sibly and consciously means foremost acknowledging "prostituta" (whore). The term "femminicidio" (femicide) our own biases. In particular, we are referring to both elicited associations like "violenza" (violence), "impres- the newspaper selection and choice of the terms used to sionante" (impressive), and "dramma" (drama). In Table extract the data, that certainly shaped the results (all de- 3a, the analysis of "uccidere" (to kill) encompasses related sign choices are justified in detail in Section 3). A future terms such as "ammazzare" (to murder), "ucciderla" (to kill outlook concerns the investigation of how both victims her), "ammazzato" (murdered, masculine form), "ucciso" and perpetrators are framed from a linguistic perspective. (killed, masculine form), "suicidarsi" (to commit suicide), Further analyses could regard identifying temporal and and "strangolato" (strangled, masculine form). 
These geographical patterns arising from media attention man- terms may collectively pertain to the perpetrator’s ac- ifested through the coverage of femicides and comparing tions against the victim. Fig. 5 in the Appendix provides the framing of these events with the political leaning of a comprehensive overview of word vectors closely asso- the respective newspapers. ciated with the previously extracted keywords, which were identified as the most significant in FMNews-Nat. 15 As a matter of fact, there is no official collection of statistics In Table 3b, the words correlated in meaning to "vit- regarding this specific kind of event. The only organisation that records the gender of the victims in its database is the Ob- tima" (victim) in FMNews-Loc are presented. As we servatory Femicides Lesbicides Transcides managed by Non una di meno, the Italian section of movement Ni una menos (https: 14 https://pypi.org/project/gensim/. //osservatorionazionale.nonunadimeno.net/). Cappuccio, Muscato et al. Acknowledgments (1993) 51–58. doi:10.1111/j.1460-2466.1993. tb01304.x. This work has been supported by the European Union [11] J. James W.Tankard, The empirical approach to the under ERC-2018-ADG GA 834756 (XAI), by HumanE-AI- study of media framing, in: S. D. Reese, J. Gandy, Net GA 952026, by the Partnership Extended PE00000013 A. E. Grant (Eds.), Framing public life, Taylor & - “FAIR - Future Artificial Intelligence Research” - Spoke 1 Francis, Philadelphia, PA, 2001. “Human-centered AI”, and by SoBigData.it that receives [12] M. Edelman, Contestable categories and funding from European Union – NextGenerationEU – public opinion, Political Communication 10 National Recovery and Resilience Plan (Piano Nazionale (1993) 231–242. doi:10.1080/10584609.1993. di Ripresa e Resilienza, PNRR) – Project: “SoBigData.it 9962981. – Strengthening the Italian RI for Social Mining and Big [13] D. Kahneman, A. Tversky, Choices, values, and Data Analytics” – Prot. IR0000013 – Avviso n. 
3264 del 28/12/2021.
frames., American Psychologist 39 (1984) 341–350. doi:10.1037/0003-066x.39.4.341.

References

[1] C. Bouzerdan, J. Whitten-Woodring, Killings in context: An analysis of the news framing of femicide, Human Rights Review 19 (2018) 211–228.
[2] J. Radford, D. Russell, Femicide: The Politics of Woman Killing, Post-Contemporary Interventions, Twayne, 1992.
[3] M. M. L. y de los Ríos, Por la vida y la libertad de las mujeres: fin al feminicidio, Cámara de Diputados del Congreso de la Unión, LIX Legislatura, Comisión Especial para Conocer y Dar Seguimiento a las Investigaciones Relacionadas con los Feminicidios en la República Mexicana y a la Procuración de Justicia Vinculada, 2006.
[4] B. Spinelli, Femminicidio: dalla denuncia sociale al riconoscimento giuridico internazionale, Franco Angeli, 2008.
[5] B. Spinelli, L’italia rispetta la CEDAW? il femminicidio in italia alla luce delle raccomandazioni delle nazioni unite, in: I. Corti (Ed.), Universo femminile. La CEDAW tra diritto e politiche, eum edizioni università di Macerata, 2012.
[6] S. Abis, P. Orrù, et al., Il femminicidio nella stampa italiana: un’indagine linguistica, gender/sexuality/italy 3 (2016) 18–33.
[7] M. Aldrete, M. Fernández-Ardèvol, Framing femicide in the news, a paradoxical story: A comprehensive analysis of thematic and episodic frames, Crime, Media, Culture (2023) 17416590231199771.
[8] A. Forciniti, E. Zavarrone, Data quality and violence against women: The causes and actors of femicide, Social Indicators Research (2023) 1–25.
[9] C. Meluzzi, E. Pinelli, E. Valvason, C. Zanchi, Responsibility attribution in gender-based domestic violence: A study bridging corpus-assisted discourse analysis and readers’ perception, Journal of pragmatics 185 (2021) 73–92.
[10] R. M. Entman, Framing: Toward clarification of a fractured paradigm, Journal of Communication 43
[14] P. M. Sniderman, R. A. Brody, P. E. Tetlock, Cambridge studies in public opinion and political psychology: Reasoning and choice: Explorations in political psychology, Cambridge University Press, Cambridge, England, 1993.
[15] C. Corradi, C. Marcuello-Servós, S. Boira, S. Weil, Theories of femicide and their significance for social research, Current sociology 64 (2016) 975–995.
[16] J. Fairbairn, C. Boyd, Y. Jiwani, M. Dawson, Changing media representations of femicide as primary prevention, in: The Routledge International Handbook on Femicide and Feminicide, Routledge, 2023, pp. 554–564.
[17] E. Pinelli, C. Zanchi, Gender-based violence in italian local newspapers: How argument structure constructions can diminish a perpetrator’s responsibility, in: Discourse Processes between Reason and Emotion: A Post-disciplinary Perspective, Springer, 2021, pp. 117–143.
[18] G. Minnema, S. Gemelli, C. Zanchi, V. Patti, T. Caselli, M. Nissim, et al., Frame semantics for social nlp in italian: Analyzing responsibility framing in femicide news reports, in: CEUR WORKSHOP PROCEEDINGS, volume 3033, CEUR-WS, 2021, pp. 1–8.
[19] G. Minnema, S. Gemelli, C. Zanchi, T. Caselli, M. Nissim, Sociofillmore: a tool for discovering perspectives, arXiv preprint arXiv:2203.03438 (2022).
[20] G. Minnema, S. Gemelli, C. Zanchi, T. Caselli, M. Nissim, Dead or murdered? predicting responsibility perception in femicide news reports, in: Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Association for Computational Linguistics, Online only, 2022, pp. 1078–1090. URL: https://aclanthology.org/2022.aacl-main.79.
[21] G. Minnema, H. Lai, B. Muscato, M. Nissim, Responsibility perspective transfer for Italian femicide news, in: Findings of the Association for Computational Linguistics: ACL 2023, Association for Computational Linguistics, Toronto, Canada, 2023, pp. 7907–7918. URL: https://aclanthology.org/2023.findings-acl.501.
[22] M. Belluati, Femminicidio, Una lettura tra realtà e interpretazione. Biblioteca di testi e studi. Carocci (2021).
[23] A. Rajaraman, J. D. Ullman, Data mining, in: Mining of Massive Datasets, Cambridge University Press, Cambridge, 2011, pp. 1–17. doi:10.1017/CBO9781139058452.002.
[24] N. Firoozeh, A. Nazarenko, F. Alizon, B. Daille, Keyword extraction: Issues and methods, Natural Language Engineering 26 (2020) 259–291.
[25] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, arXiv preprint arXiv:1301.3781 (2013).
[26] R. Dobbe, S. Dean, T. K. Gilbert, N. Kohli, A broader view on bias in automated decision-making: Reflecting on epistemology and dynamics, CoRR abs/1807.00553 (2018). URL: http://arxiv.org/abs/1807.00553. arXiv:1807.00553.

A. Additional Resources

Official Resources

Official statistics on femicide cases in Italy can be accessed through ISTAT¹⁶ and the Ministry of the Interior through the Department of Public Security website¹⁷. In particular, ISTAT provides data on victims of voluntary homicide, divided by gender, from 1992 to 2020, without additional information. In contrast, the Department of Public Security offers more detailed data covering a limited time range, i.e., from 2002 to 2022: victims are categorized by their relationship to the murderer. These categories include: Partner (husband/wife, domestic partner, boyfriend/girlfriend), Former partner (former husband/wife, former domestic partner, former boyfriend/girlfriend), Other relative, Other acquaintance, Perpetrator unknown to the victim, and Perpetrator unidentified.

Unofficial Resources

Unofficial data and statistics regarding femicides in Italy are also available, typically compiled by non-governmental or grassroots organisations.
One notable example is the open database¹⁸ managed by the Italian activists of Ni una menos¹⁹, an international feminist movement that campaigns against gender-based violence. Although it covers a shorter time frame, this database offers disaggregated and more detailed information than the official statistics. For example, in addition to the names of the victims, the collection also includes important characteristics such as the age and nationality of the individuals involved, the geographical dimension, and the gender of the victim, including non-binary framings. Although these data are not readily accessible, a combined examination of official and unofficial data is essential for a more thorough and comprehensive analysis of femicide in Italy.

16 https://www.istat.it/it/violenza-sulle-donne/il-fenomeno/omicidi-di-donne.
17 https://www.interno.gov.it/it/stampa-e-comunicazione/dati-e-statistiche/omicidi-volontari-e-violenza-genere.
18 https://osservatorionazionale.nonunadimeno.net/anno/.
19 https://nonunadimeno.wordpress.com/.

B. Data Preparation

We applied a supervised and semi-supervised cleaning phase divided into two steps to prepare the data. In the first step, the same pipeline was applied to both datasets, primarily aimed at removing duplicate articles, formatting metadata, and reducing data and metadata sparsity. The second step entailed supervised cleaning of the article texts and headlines. We observed different types of noise in the texts of the national newspapers compared to the local ones. Hence, given that the two datasets are released and usable separately, we implemented a similar pipeline for both datasets, albeit customized for each.

Figure 3: Number of articles throughout the years (2008-2024) for both FMNews-Nat and FMNews-Loc.

Data Preparation - Step I: Cleaning

We first removed all duplicate articles from the collected data (just under 12,800 articles from national newspapers and approximately 8,400 articles from local ones), i.e., those with identical texts (title and body), metadata (e.g., date), and source publication. Additionally, we converted the dates into the yyyy-mm-dd format and removed articles where at least one of the following elements was missing: publication date, title, or body. Despite the removal of exact duplicates, some articles had nearly identical text bodies, differing only in minor variations primarily due to special character encoding (e.g., accents and apostrophes) or differences in web crawling (e.g., one article included the website menu or footer while the other did not). To address this issue, we implemented a method to identify and handle articles with identical or highly similar text bodies, but only if they share the same title. The method relies on cosine similarity to determine whether two texts are the same. In particular, we first employed a TF-IDF vectorizer to convert the raw text data into numerical vectors. These vectors were then used to compute the cosine similarities between all pairs of texts in the dataset. For TF-IDF vectors, whose components are non-negative, cosine similarity produces a value between 0 and 1, where 1 indicates identical texts and values closer to 0 indicate less similar texts. Since text preprocessing had not been performed yet and differences between text bodies could arise solely from symbols, we set a tolerance threshold of 0.89 to determine text equality. If two text bodies had a cosine similarity greater than 0.89, we considered them duplicates and retained only the first occurrence, removing the second found in the dataset. Finally, we utilized Beautiful Soup to remove any HTML tags that could have been mistakenly included in the article body during the collection phase. This step ensured that our text data was free from any undesired HTML tags before further processing or analysis.

Data Preparation - Step II: FMNews-Nat

The article texts from national newspapers displayed various noise patterns specific to each news media outlet. To address this issue, we manually created a list of replacements for each outlet, employing regular expressions for targeted removal of articles or specific sub-strings from article titles or bodies. In particular, the body of articles from Il Post, La Repubblica and Il Fatto Quotidiano included parts of webpage menus and footers, as well as various types of news media outlet sponsorship, such as subscriptions, newsletter sign-ups, and agendas/lists of podcast episodes. On the other hand, articles from Corriere della sera included text substrings associated with the journalistic domain, such as headings containing the name of the correspondent, reporter, or photographer. We observed that the texts of the articles published by Corriere della sera often, but not always, follow a particular structure: "by Author_name Author_surname" (where the author can be a natural person or abbreviations with one dot) or "Editorial team", followed by a city or "online", in either uppercase or lowercase. Occasionally, this structure is followed by another city, for instance, "Bologna Online Editorial Staff". Additionally, this "basic" structure may or may not be followed by "inviato a <(Province)>", or "inviata", "foto di ". We excluded articles whose text bodies did not contain information directly related to femicides, such as television programme listings or podcast episode agendas. Whenever feasible, we retained the article and removed only the irrelevant substrings from its text body, such as menus and footers. The resulting FMNews-Nat dataset includes 7,443 articles: in Fig. 4 we report the distribution of articles by media outlet.

Figure 4: Final number of articles of FMNews-Nat extracted from the national newspapers.

Data Preparation - Step II: FMNews-Loc

The articles from local newspapers exhibited minimal noise within their text. Therefore, the data preparation phase focused on poorly encoded symbols and domain-specific substrings such as copyright indications and external contributions, e.g., government press releases. Unlike for the national newspapers, this ad hoc cleaning did not result in data loss for these publications. Therefore, the resulting FMNews-Loc dataset includes 7,728 articles.

C. Textual Analysis

Although applying NLP models typically requires standardized and structured text, it is important to acknowledge that such preprocessing may result in the loss of some information. We believe it is important to keep track of the elements we manipulate in the texts.

• Emails and URLs. Emails and URLs found within the body of the articles are replaced with a placeholder tag, such as "[[URL]]".
• Uppercase words. Words entirely in uppercase are not replaced or modified, as the text will be normalized in subsequent stages of the work, i.e., converted to lowercase. Uppercase words are generally extracted and saved for further analysis.
• Punctuation, symbols, numbers. Punctuation, symbols, and numbers are removed from the texts.
• Stopwords. We remove the stopwords included in the lists provided by the NLTK²⁰ and spaCy²¹ libraries, along with a brief, manually compiled list of stopwords. This latter list includes domain-specific and context-related keywords, such as "Link Embed", "FOTO", "FOTOGRAMMA". It is important to note that the "ad hoc" stopwords were removed from the non-normalized text to mitigate the impact of stopword removal. Indeed, during the analysis, we observed that some articles from national newspapers contained certain keywords entirely in uppercase to indicate elements attached to the article. Thus, we chose to make the list of stopwords case-sensitive, aiming to avoid removing words within the body of the article.

20 https://www.nltk.org/.
21 https://spacy.io/.

After extracting the features from the raw texts, we proceeded with the following steps. First, we tokenized the body of articles using the spaCy library with the Italian module, selecting only words. Next, we extracted the tokens that are not included in the stopwords. Then, we extracted the lemmas, again excluding stopwords. Finally, we further refined our selection by retaining from the tokens only words belonging to what is commonly referred to as "full" classes of speech, such as nouns, verbs, adjectives, and adverbs. Extracting these "full" words focuses our analysis on the linguistically significant elements of the text, facilitating a more accurate understanding of its semantic content and structure.

Figure 5: Similar word vectors in FMNews-Nat.

After tokenization, removal of stopwords, and extraction of lemmas, we computed the Type-Token Ratio (TTR) for the articles, a measure of the lexical diversity in a text. This is given by the proportion of unique words in a text, or "types", to the total number of words, or "tokens", and reads:

\[ \mathrm{TTR} = \frac{N_{\text{types}}}{N_{\text{tokens}}} \tag{1} \]

where $N_{\text{types}}$ is the number of unique types and $N_{\text{tokens}}$ is the number of tokens in the text. TTR values range from 0 to 1, where a higher value indicates greater lexical variety, whereas a lower value implies more repetition of words in the text. This is a straightforward measure which nevertheless allows us to form an initial assessment of the lexical richness in the narrative surrounding femicides. The newspaper Il Post, along with Il Fatto Quotidiano and La Repubblica, exhibited a notable variation in terms of TTR. While FMNews-Nat shows variation in lexicon usage, FMNews-Loc exhibits a uniformity in language.
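As an illustration of the Type-Token Ratio in Eq. (1), the following minimal Python sketch computes the ratio of unique tokens to total tokens. The function name `type_token_ratio` and the two toy token lists are hypothetical and are not taken from the FMNews corpus; the sketch also assumes that tokens have already been lemmatized and stripped of stopwords, as described above.

```python
def type_token_ratio(tokens):
    """Return N_types / N_tokens for a list of tokens (0.0 for an empty list)."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Hypothetical lemmatized, stopword-free tokens for two very short texts.
repetitive = ["donna", "uccidere", "donna", "uccidere"]
varied = ["donna", "uccidere", "marito", "arrestare"]

print(type_token_ratio(repetitive))  # 0.5: two types over four tokens
print(type_token_ratio(varied))      # 1.0: every token is a distinct type
```

Because TTR tends to decrease as texts grow longer, comparisons are most meaningful between articles of similar length; this caveat is a general property of the measure and not a claim about the analysis reported here.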
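The near-duplicate detection described in Step I of Appendix B (TF-IDF vectors, pairwise cosine similarity, a 0.89 threshold, and comparison restricted to articles sharing the same title) can be sketched as follows. This is a standard-library approximation written for illustration only: the tokenizer, the smoothed IDF formula, the dictionary layout of the articles, and the function names are all assumptions, and the original pipeline may well rely on a library vectorizer with different settings.

```python
import math
import re
from collections import Counter

THRESHOLD = 0.89  # tolerance threshold reported in Appendix B


def tokenize(text):
    # Assumed tokenizer: lowercase runs of word characters.
    return re.findall(r"\w+", text.lower())


def tfidf_vectors(texts):
    # Raw term counts weighted by a smoothed IDF (an assumed variant).
    counts = [Counter(tokenize(t)) for t in texts]
    n_docs = len(counts)
    doc_freq = Counter(term for c in counts for term in c)
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def drop_near_duplicates(articles, threshold=THRESHOLD):
    """Keep the first article among same-title articles with similar bodies."""
    vectors = tfidf_vectors([a["body"] for a in articles])
    kept = []  # indices of retained articles
    for i, article in enumerate(articles):
        duplicate = any(
            articles[j]["title"] == article["title"]
            and cosine(vectors[i], vectors[j]) > threshold
            for j in kept
        )
        if not duplicate:
            kept.append(i)
    return [articles[i] for i in kept]


# Hypothetical records: 0 and 1 differ only in character encoding.
body = ("Ieri sera una donna di quarant'anni è stata uccisa dal marito nella loro "
        "casa a Pisa, secondo quanto riferito dai carabinieri intervenuti sul posto.")
articles = [
    {"id": 0, "title": "Uccisa dal marito", "body": body},
    {"id": 1, "title": "Uccisa dal marito",
     "body": body.replace("è", "e'").replace("'anni", "’anni")},
    {"id": 2, "title": "Processo a marzo",
     "body": "Il processo inizierà a marzo davanti alla corte di assise."},
    {"id": 3, "title": "Altro titolo", "body": body},
]

deduplicated = drop_near_duplicates(articles)
print([a["id"] for a in deduplicated])  # [0, 2, 3]
```

In this sketch each article is compared only with the articles already retained, which is one way of keeping the first occurrence; article 3 survives despite its identical body because its title differs, mirroring the same-title restriction stated in the text.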
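The placeholder substitution and the case-sensitive "ad hoc" stopword removal listed in Appendix C can be sketched in a similar spirit. The regular expressions below are deliberately simple assumptions, the "[[EMAIL]]" tag is a hypothetical counterpart of the "[[URL]]" tag mentioned in the text, and the function name `clean_raw_text` is illustrative; only the three uppercase keywords come from the description above.

```python
import re

# Assumed, simplified patterns; real-world URLs and emails are more varied.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")

# Ad hoc keywords named in the text; matching is case-sensitive on purpose.
AD_HOC_STOPWORDS = ["Link Embed", "FOTO", "FOTOGRAMMA"]
AD_HOC_RE = re.compile(
    r"\b(?:"
    + "|".join(re.escape(s) for s in sorted(AD_HOC_STOPWORDS, key=len, reverse=True))
    + r")\b"
)


def clean_raw_text(text):
    """Replace emails/URLs with tags and drop ad hoc keywords before lowercasing."""
    text = EMAIL_RE.sub("[[EMAIL]]", text)
    text = URL_RE.sub("[[URL]]", text)
    text = AD_HOC_RE.sub(" ", text)
    return " ".join(text.split())  # collapse the whitespace left by removals


raw = ("FOTO La foto della vittima. Scrivi a redazione@example.it oppure visita "
       "https://example.it/articolo per i dettagli. FOTOGRAMMA Link Embed")
print(clean_raw_text(raw))
# La foto della vittima. Scrivi a [[EMAIL]] oppure visita [[URL]] per i dettagli.
```

Running the removal on the non-normalized text is what allows the lowercase word "foto" in the sentence to survive while the uppercase marker "FOTO" is dropped, which is the motivation given for keeping the stopword list case-sensitive.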