Beyond Headlines: A Corpus of Femicides News Coverage in Italian Newspapers

Eleonora Cappuccio (0, 2, 3)
Benedetta Muscato (1, 3)
Laura Pollacci (0, 3)
Marta Marchiori Manerba (0, 3)
Clara Punzi (1, 3)
Chandana Sree Mala (1, 3)
Margherita Lalli
Gizem Gezici
Michela Natilli
Fosca Giannotti (1)

0 ISTI-CNR, Pisa, Italy
1 Scuola Normale Superiore, Pisa, Italy
2 Università degli Studi di Bari Aldo Moro, Bari
3 Università di Pisa, Pisa, Italy

How newspapers cover news significantly impacts how facts are understood, perceived, and processed by the public. This is especially crucial when serious crimes are reported, e.g., in the case of femicides, where the description of the perpetrator and the victim builds a strong, often polarized opinion of this severe societal issue. This paper presents FMNews, a new dataset of articles reporting femicides extracted from Italian newspapers. Our core contribution aims to promote the development of a deeper framing and awareness of the phenomenon through an original resource available and accessible to the research community, facilitating further analyses on the topic. The paper also provides a preliminary study of the resulting collection through several example use cases and scenarios.

Keywords: Italian Dataset, Newspapers, Information Extraction, Information Retrieval, AI for Social Good, Femicides

1. Introduction

How newspapers and journalists present news plays a crucial role in shaping public understanding and perception of information. This is especially important when reporting serious crimes, such as femicides, where descriptions of the perpetrator and victim can create polarized opinions influencing readers' perceptions and interpretations of the event. According to Bouzerdan and Whitten-Woodring [1], news media often report incidents of women's homicides in a sensationalised manner, treating these crimes as isolated events rather than situating them within the bigger framework of violence against women. This narrative defies the global demands of human rights organisations to acknowledge and address this phenomenon as demanded by its intricate dynamics. Numerous countries have followed such recommendations only partially through the formal adoption of specific terminology such as femicide and feminicide in legal frameworks and public discourse. The two terms have related but distinct nuances of meaning. Femicide, a criminological concept initially coined in English by the feminist criminologist Diana H. Russell [2], denotes the murder of women by males due to their gender. Subsequently, the term femicide, translated in Castilian as femicidio or feminicide by the anthropologist Marcela Lagarde to attract political attention on the dire situation faced by women in Mexico [3], has gained global traction with varying interpretations, yet consistently denotes a patriarchal impetus behind homicides and other forms of male violence against women, primarily emphasising the sociological dimensions of abuse and the socio-political ramifications of the phenomenon. In the Italian language, the term femminicidio has been almost exclusively adopted, as evidenced by a Google Trends analysis comparing the search terms "femicidio" and "femminicidio" to queries regarding "femicide"¹.

An analysis of the phenomenon of femicide in the Italian context and, in particular, a linguistic investigation of it, are particularly relevant. Feminicide, a term used by the feminist movement in Italy since 2005, gained prominence in the media in 2011, especially thanks to the works of Barbara Spinelli [4]. The CEDAW Committee², based on data from the Shadow Report on the Implementation of CEDAW in Italy, addressed recommendations to the Italian government on feminicide in its Concluding Observations. This was the first time the committee addressed a European state on feminicide, a category previously reserved for warnings to Central American countries. The challenges in accurately contextualising feminicide in Italy also stem from a prolonged absence of official data, resulting in sensationalism and the perception of a dramatic rise in the crime. This may induce an emergency narrative that obscures the inherent structural dimensions of the phenomenon, thereby undermining the very essence of the term [5]. Media interpretations are essential for shaping a shared understanding across a vast audience, such as a whole country; hence, the examination of media discourse emerges as a significant analytical instrument on top of statistical evaluation of femicide data to understand the achievements and directions of state intervention towards the substantial granting of women's right to life [6].

In this regard, Aldrete and Fernández-Ardèvol [7] showed that there is a large body of empirical studies on femicide discourse across different socio-cultural contexts, which often justify the perpetrator's actions. Given the complexity of the phenomenon, a comprehensive investigation could be achieved by integrating media analysis with external data, such as demographics and current events, blending together researchers from different fields like computer science, social sciences, and complex systems science. The lack of accessible and relevant data specific to socio-cultural contexts where femicide is notably prevalent, such as in Italy, makes the task particularly challenging [8].

This paper presents FMNews, a new dataset of articles reporting femicides extracted from Italian newspapers³. We conduct a preliminary analysis of the resulting collection through several example use cases and scenarios. The primary contribution is to deepen understanding and awareness of femicide from a socio-technical perspective. We seek to examine how prominent Italian news sources report on the issue in connection to the shaping of public perception, while also offering an innovative and accessible resource to facilitate future investigation within the research community. Furthermore, this study was designed to enable a multifaceted investigation covering the following three dimensions:

• Geographical, with the aim to explore potential variations in framing between local and national media outlets. Indeed, previous research has shown that Italian local daily newspapers often suppress the agency of the perpetrator, portraying the events as mere occurrences [9]. We selected newspapers reporting news at both the national⁴ and local⁵ level, with local editions spanning across the whole Italian territory.
• Political, which was granted by choosing national newspapers with varying political leanings.
• Temporal, where the time frame of national newspapers extends from November 2009 to February 2024, whilst that of the local ones ranges from November 2010 to February 2024⁶.

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy
* Corresponding author.
† These authors contributed equally.
eleonora.cappuccio@phd.unipi.it (E. Cappuccio); benedetta.muscato@sns.it (B. Muscato)
Attribution 4.0 International (CC BY 4.0).

¹ The conducted analysis included news web searches in Italy since 2022, i.e., since when the service implemented an enhanced data collection methodology.
³ The choice of newspapers was dictated by the circulation volume released by Audipress, a company that collects data on the reading habits of daily and periodical press in Italy: https://audipress.it/quotidiani/.
⁴ The selected national newspapers are the following: Corriere della Sera, La Repubblica, La Stampa, Il Fatto Quotidiano, Il Giornale and Il Post.
⁵ The selected local newspapers are the local editions of the CityNews group, which cover the following cities: Agrigento, Ancona, Arezzo, Avellino, Bari, Bologna, Brescia, Brindisi, Caserta, Catania, Cesena, Chieti, Como, Ferrara, Firenze, Foggia, Forlì, Frosinone, Genova, Pescara, Piacenza, Latina, Lecce, Lecco, Livorno, Messina, Milano, Modena, Monza, Napoli, Novara, Padova, Palermo, Parma, Perugia, Pisa, Pordenone, Ravenna, Reggio, Rimini, Roma, Salerno, Sondrio, Terni, Torino, Trento, Treviso, Trieste, Udine, Venezia, Verona, Vicenza, Viterbo.
⁶ In Fig. 3 in the Appendix, we report the distribution of articles across time.

2. Related Work

According to frame analysis, the ways in which newspapers cover news significantly impact how facts are understood, perceived, and processed by the public [10, 11]. Framing narratives means strategically including or omitting elements (such as problem definitions, explanations and evaluations) of a given situation in a communicative text [12, 13, 14]. This process aims to advocate for specific interpretations, assess moral responsibilities of individuals involved and propose solutions while also eliciting nuanced emotional responses from the audience, thereby affecting their perceptions and attitudes. It is worth noting that in the case of news articles, media framing can be seen as a demonstration of political power [10], influencing which of the actors or interests involved shape narratives, often unnoticed by the audience [11].

The process of news framing becomes especially crucial when reporting serious crimes, such as femicides, as understanding femicide requires analyzing its evolution from both statistical and social perspectives, as discussed in the Manifesto delle Giornaliste e dei Giornalisti per il Rispetto e la Parità di Genere nell'Informazione⁷ (Manifesto of Journalists for Respect and Gender Equality in News Reporting, our translation).

The acknowledged impact of language on how readers perceive information has prompted researchers to explore how the language surrounding femicide has changed and how this influences individuals' responsibility perception [15], which can vary based on the way femicides are reported [1, 16, 9, 17]. Moreover, an initiative by University of Bologna seeks to identify the main discursive features employed in discussions about femicide in public spaces, including media and legal speech⁸.

Recognizing the significant role of linguistic expression in depicting incidents of gender-based violence, previous research has explored various NLP techniques. These studies aim to discern how NLP models can effectively predict and analyze human perception judgments concerning the sensitive issue of gender-based violence events. Following previous works on the impact of specific grammatical constructions and semantic frames [18] in describing the same event but with various nuances, Minnema et al. [19] introduced the first multilingual tool, based on Frame Semantics and Cognitive Linguistics, for detecting the focus or perspective depicted in an event, called Socio Fillmore. Furthermore, building on the linguistic analysis provided by Socio Fillmore, Minnema et al. [20] demonstrated that various linguistic choices trigger different perceptions of responsibility, which can be modeled automatically. As a result, their series of regression models revealed that these distinct linguistic choices significantly influence human perceptions of responsibility. Additionally, to promote awareness of perspective-based writing, Minnema et al. [21] introduced the novel task of responsibility perspective transfer. The task involves the automatic rewriting of descriptions of gender-based violence to alter the perceived level of blame attributed to the perpetrator. Both works leveraged one of the limited resources available for the Italian community, the RAI Femicide Corpus, a collection of 2,734 news articles covering 937 confirmed femicide cases that occurred in Italy between 2015 and 2017 [22]. Additional online resources, both official and unofficial, containing further statistics on the phenomenon of femicide in Italy are listed in Appendix A.

⁷ https://www.sindacatogiornalistiveneto.it/wp-content/uploads/2020/12/MANIFESTO-DI-VENEZIA.pdf.
⁸ https://site.unibo.it/osservatorio-femminicidio/it.

3. FMNews Corpus

The main contribution brought by this paper is the production of two datasets derived from Italian newspapers: the FMNews⁹ corpus. The corpus consists of the following components: FMNews-Nat, reporting data from national newspapers, and FMNews-Loc, which gathers articles from local newspapers in 53 Italian cities.

3.1. Data Extraction

Despite the heterogeneous HTML structures of the newspapers involved, it was feasible to generalise the data extraction process via the open source Python libraries Selenium¹⁰ and Beautiful Soup¹¹. Data scraping was performed in two subsequent phases. Firstly, a comprehensive list of article links was extracted by querying the internal search engine of the newspaper websites with the keywords femminicidio, femminicidi, femminicida: the first word stands for the Italian term "femicide", the second is its plural form, and the third indicates the "person who commits a femicide". The keywords were selected to concentrate our analysis on the media's representation and discourse surrounding this phenomenon. This choice intentionally excludes articles that discuss such crimes in general terms, allowing for a more focused examination of the femicide narratives. In the second phase, the web pages corresponding to such links were scraped to extract the text of the articles and other metadata to build the raw version of the dataset.

3.2. Data Cleaning

We implemented a supervised and semi-supervised data cleaning process, consisting of two phases, to prepare the data. In the first step, the same pipeline was applied to both FMNews-Nat and FMNews-Loc. We initially removed all duplicate articles from the collected data, i.e., those with identical texts (title and body), metadata (e.g., date), and source publication. Additionally, we converted the dates into the format of yyyy-mm-dd and removed articles where at least one of the following elements was missing: publication date, title, or body. Despite the removal of duplicates, certain articles had identical text bodies, albeit with minor variations primarily due to special character encoding (e.g., accents and apostrophes) or differences in web crawling (e.g., one article included the website menu or footer while the other did not). To address this issue, we implemented a method to identify and handle articles with identical or highly similar text bodies sharing the same title. In detail, we first employed a TF-IDF¹² vectorizer to convert the raw text data into numerical vectors and then used them to compute the cosine similarities between all pairs of texts in the dataset. For more details on the parameters and thresholds employed, we refer to Appendix B. Finally, we utilized Beautiful Soup to remove any HTML tags that could have been mistakenly included in the article body during the collection phase.

The second step of the data cleaning process entailed supervised cleaning of the article texts and headlines. The article texts from national newspapers in FMNews-Nat displayed various noise patterns specific to each news media outlet. To address this issue, we manually created a list of replacements for each outlet, employing regular expressions for targeted removal of articles or specific sub-strings from article titles or bodies (we refer to Appendix B for additional details). In this stage, we also excluded articles whose text bodies did not contain information directly related to femicides, such as television programme listings or podcast episode agendas.

On the other hand, the articles from local newspapers in FMNews-Loc exhibited minimal noise within their text. Therefore, the data preparation phase focused on poorly encoded symbols and domain-specific substrings such as copyright indications and external contributions, e.g., government press releases. Unlike national newspapers, for journalistic publications, this ad-hoc cleaning did not result in data loss.

⁹ The collection can be accessed for research purposes by requesting it by email from the authors.
¹⁰ https://selenium-python.readthedocs.io/.
¹¹ https://www.crummy.com/software/BeautifulSoup/bs4/doc/.
¹² Term Frequency-Inverse Document Frequency, in short TF-IDF, is a measure of the importance of a word to a document in a collection or corpus [23].

3.3. Final Dataset

Column       Description
Url          URL of the original newspaper article
Title        Title of the article
Text         Main section of the newspaper article
Newspaper    Name of the media outlet where the article was published. In FMNews-Loc, it reports the name of the city to which the local edition refers.
Keyword      Keyword used to collect the article
Date         Publication date of the article in the format yyyy-mm-dd

Il Fatto Quotidiano has the largest number of articles, with a total of 2,861, followed by La Repubblica with 2,837 articles. Corriere is next, with a total of 968 articles. La Stampa has a more limited presence, with 292 articles. Il Post contributes 244 articles, and Il Giornale has the fewest entries in this set, with 241 articles. For FMNews-Loc, the time span after data cleaning ranges from November 2010 to February 2024.

4. Use Cases and Scenarios

Since the two datasets share the same structure and we are interested in studying the phenomenon of femicide from both a national and local perspective, the analyses exemplified in the following were conducted on both datasets without distinction. After a textual analysis based on tokenization, removal of stopwords, extraction of lemmas and a straightforward assessment of the lexical diversity (as detailed in Appendix C), we approached a viable keyword extraction method to uncover relevant patterns in the documents.

Keyword Extraction

According to Firoozeh et al. [24], specific criteria must be met for keywords to meet eligibility standards. In our case study, we emphasize the importance of keywords that show representativity and exhaustivity, aiming for terms that capture significant rather than marginal aspects of the subject matter. To assess the significance of words within our collection of documents, a standard approach involves the Term Frequency - Inverse Document Frequency (TF-IDF).

For a deeper analysis, we calculate TF-IDF for each news outlet. We utilize Spacy's Italian pipeline to preprocess texts by tokenizing, lemmatizing, and selecting only lemmas that are full words from specific part-of-speech classes (nouns, adjectives, verbs). By focusing only on content lemmas and excluding function words (like articles and prepositions), we eliminate noise and improve accuracy in analyzing relationships between documents and word relevance. The lists of lemmas do not include words containing numbers or Italian stopwords obtained from Nltk and Spacy, with additional crawling-dependent stopwords such as "it," "https," "min," and the names of months. Also, we preserve multi-word expressions identified by the lemmatizer by concatenating them to treat them as unique words during TF-IDF calculation. Articles are then grouped by news outlet, each acting as a single document for the TF-IDF computation.

We use the TF-IDF Vectorizer from the scikit-learn¹³ library to transform the lemmatized tokens into numerical features that reflect their importance within the text.
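As a rough illustration of the per-outlet computation, the sketch below merges the lemmas of all articles from the same outlet into one document and scores each lemma with a plain TF-IDF. It is a standard-library stand-in, not the scikit-learn vectorizer used for the corpus: the smoothed IDF variant, the toy lemmas, the outlet labels, and the helper names `tfidf_by_outlet` and `top_keywords` are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def tfidf_by_outlet(articles):
    """Score lemmas per outlet, treating each outlet as one document.

    `articles` is an iterable of (outlet, lemmas) pairs, where `lemmas`
    is a list of already lemmatized content words (hypothetical input).
    """
    # Merge the lemma counts of all articles published by the same outlet.
    counts = defaultdict(Counter)
    for outlet, lemmas in articles:
        counts[outlet].update(lemmas)

    n_docs = len(counts)
    # Document frequency: in how many outlets each lemma occurs.
    doc_freq = Counter()
    for lemma_counts in counts.values():
        doc_freq.update(lemma_counts.keys())

    scores = {}
    for outlet, lemma_counts in counts.items():
        total = sum(lemma_counts.values())
        scores[outlet] = {
            # Assumed smoothed IDF: ln((1 + N) / (1 + df)) + 1.
            lemma: (freq / total)
            * (math.log((1 + n_docs) / (1 + doc_freq[lemma])) + 1)
            for lemma, freq in lemma_counts.items()
        }
    return scores


def top_keywords(scores, outlet, k=3):
    """Return the k highest-scoring lemmas for one outlet."""
    ranked = sorted(scores[outlet].items(), key=lambda item: (-item[1], item[0]))
    return [lemma for lemma, _ in ranked[:k]]


# Toy lemmas, invented for the example.
toy_articles = [
    ("Outlet A", ["donna", "violenza", "genere", "genere"]),
    ("Outlet A", ["donna", "diritto"]),
    ("Outlet B", ["donna", "violenza", "famiglia", "famiglia"]),
]
toy_scores = tfidf_by_outlet(toy_articles)
print(top_keywords(toy_scores, "Outlet A"))
```

Lemmas shared by every outlet receive the lowest IDF, so outlet-specific terms rise in the ranking even when their raw counts match those of common terms.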

[Figure 1: most relevant keywords extracted from FMNews-Nat by news outlet. Panels: (a) Il Post, (b) Corriere della Sera, (c) Il Giornale, (d) Fatto Quotidiano, (e) La Repubblica, (f) La Stampa.]

Thus, TF-IDF measures the significance of terms concerning the news outlets. Fig. 1 illustrates the most relevant keywords extracted from FMNews-Nat by news outlet.

As expected, terms like "woman," "violence," and "kill" (along with "femicide") are central to the narrative of femicide and are common across all outlets. Other keywords vary in relevance among multiple outlets; for example, "son" appears in all outlets except Il Post. Specific keywords are unique to one or two outlets: "gender," "right," and "sexual" appear only in Il Post; "family" is relevant in Corriere della Sera and La Stampa; and "man" is found in Il Post and Il Giornale. Due to the number of local news outlets in FMNews-Loc (50), Fig. 2 shows the top 20 keywords with the highest average TF-IDF, calculated as the mean of the TF-IDF values of the terms with respect to the news outlets. As expected, the highest ranks are occupied by the same relevant keywords found in national news outlets, such as "woman," "violence," "victim," and "femicide". Additionally, some keywords relevant to specific national news outlets show high relevance for local media, although with lower average TF-IDFs, such as "gender". Conversely, the distribution reveals previously unseen keywords, such as "young," "school," and "association".

Semantic Vector Extraction

For an additional layer of analysis, we chose to train a word embedding model to explore semantic relationships among words. This model represents words as continuous space vectors, where the proximity of vectors indicates the semantic similarity between the words they represent: closer vectors correspond to words with more similar meanings. We employed Word2Vec (W2V) [25], which operates by mapping words to high-dimensional vectors within a given vocabulary. This mapping is designed to represent semantic relationships between words in the vectorial space. W2V has been implemented through Gensim¹⁴, a powerful tool set for NLP tasks. A key parameter in W2V is the "window", i.e., the number of context words to be considered, which we defined as 10 to consider a contextual window that extends neither too far nor too close to the current word, thereby striking a balance between contextual relevance and computational efficiency. To discover the semantic associations within our dataset, we leveraged the "most similar" method from Gensim, which computes the cosine similarity between word vectors to identify words with the closest semantic proximity. For both datasets, the size of the embeddings for the W2V model is fixed to 100, while the vocabulary size changes according to the dataset: it is 6809 in FMNews-Nat and 6064 in FMNews-Loc.

In FMNews-Nat, the word "donna" (woman) yielded semantically related terms such as "vittima" (victim) and "prostituta" (whore). The term "femminicidio" (femicide) elicited associations like "violenza" (violence), "impressionante" (impressive), and "dramma" (drama). In Table 3a, the analysis of "uccidere" (to kill) encompasses related terms such as "ammazzare" (to murder), "ucciderla" (to kill her), "ammazzato" (murdered, masculine form), "ucciso" (killed, masculine form), "suicidarsi" (to commit suicide), and "strangolato" (strangled, masculine form). These terms may collectively pertain to the perpetrator's actions against the victim. Fig. 5 in the Appendix provides a comprehensive overview of word vectors closely associated with the previously extracted keywords, which were identified as the most significant in FMNews-Nat.

In Table 3b, the words correlated in meaning to "vittima" (victim) in FMNews-Loc are presented. As we would expect, nearly all terms are associated and highlight that the victim is a woman. In this regard, a drawback to consider is that the specific selection of the terms used for the data collection query may have hindered our analysis from uncovering insights about homicides committed against individuals who do not identify as women or fit into the traditional gender binary. Indeed, the discussion around gender-based violence in Italy is still predominantly centred on women, while other genders remain significantly neglected¹⁵.

5. Conclusion

In this contribution, we provided a novel dataset concerning the critical issue of femicide in Italy. Considering the absence of resources for conducting in-depth analyses on the subject, our intent was to bridge this gap and provide an original perspective for understanding and raising awareness about this severe phenomenon.

As suggested by Dobbe et al. [26], proposing a contribution within the Machine Learning domain responsibly and consciously means foremost acknowledging our own biases. In particular, we are referring to both the newspaper selection and the choice of the terms used to extract the data, which certainly shaped the results (all design choices are justified in detail in Section 3). A future outlook concerns the investigation of how both victims and perpetrators are framed from a linguistic perspective. Further analyses could regard identifying temporal and geographical patterns arising from media attention manifested through the coverage of femicides and comparing the framing of these events with the political leaning of the respective newspapers.

¹⁴ https://pypi.org/project/gensim/.
¹⁵ As a matter of fact, there is no official collection of statistics regarding this specific kind of event. The only organisation that records the gender of the victims in its database is the Observatory Femicides Lesbicides Transcides managed by Non una di meno, the Italian section of the movement Ni una menos (https://osservatorionazionale.nonunadimeno.net/).

Acknowledgments

This work has been supported by the European Union under ERC-2018-ADG GA 834756 (XAI), by HumanE-AINet GA 952026, by the Partnership Extended PE00000013 - “FAIR - Future Artificial Intelligence Research” - Spoke 1 “Human-centered AI”, and by SoBigData.it that receives funding from European Union – NextGenerationEU – National Recovery and Resilience Plan (Piano Nazionale di Ripresa e Resilienza, PNRR) – Project: “SoBigData.it – Strengthening the Italian RI for Social Mining and Big Data Analytics” – Prot. IR0000013 – Avviso n. 3264 del 28/12/2021.

Computational Linguistics, Toronto, Canada, 2023, pp. 7907–7918. URL: https://aclanthology.org/2023.findings-acl.501.
[22] M. Belluati, Femminicidio, Una lettura tra realtà e interpretazione. Biblioteca di testi e studi. Carocci (2021).
[23] A. Rajaraman, J. D. Ullman, Data mining, in: Mining of Massive Datasets, Cambridge University Press, Cambridge, 2011, pp. 1–17. doi:10.1017/CBO9781139058452.002.
[24] N. Firoozeh, A. Nazarenko, F. Alizon, B. Daille, Keyword extraction: Issues and methods, Natural Language Engineering 26 (2020) 259–291.
[25] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, arXiv preprint arXiv:1301.3781 (2013).
[26] R. Dobbe, S. Dean, T. K. Gilbert, N. Kohli, A broader view on bias in automated decision-making: Reflecting on epistemology and dynamics, CoRR abs/1807.00553 (2018). URL: http://arxiv.org/abs/1807.00553. arXiv:1807.00553.

A. Additional Resources

Official Resources

Official statistics on femicide cases in Italy can be accessed through ISTAT¹⁶ and the Ministry of the Interior through the Department of Public Security website¹⁷. In particular, ISTAT provides data on victims of voluntary homicide, divided by gender, from 1992 to 2020, without additional information. In contrast, the Department of Public Security offers more detailed data covering a limited time range, i.e., from 2002 to 2022: victims are categorized by their relationship to the murderer. These categories include: Partner (husband/wife, domestic partner, boyfriend/girlfriend), Former partner (former husband/wife, former domestic partner, former boyfriend/girlfriend), Other relative, Other acquaintance, Perpetrator unknown to the victim, and Perpetrator unidentified.

Unofficial Resources

Unofficial data and statistics regarding femicides in Italy are also available, typically compiled by non-governmental or grassroots organisations. One notable example is the open database¹⁸ managed by the Italian activists of Ni una menos¹⁹, an international feminist movement that campaigns against gender-based violence.

Although it covers a shorter time frame, this database offers disaggregated and more detailed information than the official statistics. For example, in addition to the names of the victims, the collection also includes important characteristics such as the age and nationality of the individuals involved, the geographical dimension, and the gender of the victim, including non-binary framings.

While not readily accessible, a combined examination of both official and non-official data is essential for a more thorough and comprehensive analysis of the issues of femicide in Italy.

B. Data Preparation

We applied a supervised and semi-supervised cleaning phase divided into two steps to prepare the data. In the first step, the same pipeline was applied to both datasets, primarily aimed at removing duplicate articles, formatting metadata, and reducing data and metadata sparsity.
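A minimal sketch of this first step is given below, under several assumptions. The 0.89 similarity threshold and the same-title restriction follow the description of Step I in this appendix; the input date format, the dictionary field names, the tokenisation, and the hand-rolled TF-IDF (a standard-library stand-in for a library vectorizer) are hypothetical choices made only to keep the example self-contained.

```python
import math
import re
from collections import Counter
from datetime import datetime


def normalise_date(raw, fmt="%d/%m/%Y"):
    """Convert a scraped date to yyyy-mm-dd (the input format is assumed)."""
    return datetime.strptime(raw, fmt).strftime("%Y-%m-%d")


def tfidf_vectors(texts):
    """Build simple TF-IDF vectors (as dicts) from raw text bodies."""
    tokenised = [Counter(re.findall(r"\w+", text.lower())) for text in texts]
    n_docs = len(tokenised)
    doc_freq = Counter(term for counts in tokenised for term in counts)
    return [
        {
            term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, tf in counts.items()
        }
        for counts in tokenised
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def clean(articles, threshold=0.89):
    """Drop incomplete articles and same-title near-duplicates."""
    # Remove articles where date, title, or body is missing.
    complete = [
        a for a in articles if a.get("date") and a.get("title") and a.get("text")
    ]
    vectors = tfidf_vectors([a["text"] for a in complete])
    kept, kept_idx = [], []
    for i, article in enumerate(complete):
        # Compare only with already retained articles sharing the same title,
        # so the first occurrence is the one that survives.
        is_duplicate = any(
            complete[j]["title"] == article["title"]
            and cosine(vectors[i], vectors[j]) > threshold
            for j in kept_idx
        )
        if not is_duplicate:
            kept.append({**article, "date": normalise_date(article["date"])})
            kept_idx.append(i)
    return kept


# Toy records, invented for the example.
body = "La donna è stata uccisa dal marito nella loro casa a Pisa ieri sera"
toy = [
    {"title": "Titolo", "text": body, "date": "03/02/2024"},
    {"title": "Titolo", "text": body + " Menu", "date": "03/02/2024"},
    {"title": "Altro titolo", "text": "Processo rinviato", "date": "10/11/2010"},
    {"title": "Senza testo", "text": "", "date": "01/01/2020"},
]
cleaned = clean(toy)
```

In this sketch exact duplicates need no separate pass, since identical bodies under the same title have a cosine similarity of 1 and are caught by the same threshold test.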

The second step entailed supervised cleaning of the article texts and headlines. We observed diferent types of noise in the texts of the national newspapers compared 16https://www.istat.it/it/violenza-sulle-donne/il-fenomeno/

omicidi-di-donne. 17https://www.interno.gov.it/it/stampa-e-comunicazione/

dati-e-statistiche/omicidi-volontari-e-violenza-genere.
18 https://osservatorionazionale.nonunadimeno.net/anno/.
19 https://nonunadimeno.wordpress.com/.

to the local ones. Hence, given that the two datasets are released and usable separately, we implemented a similar pipeline for both datasets, albeit customized for each.

Data Preparation - Step I: Cleaning

We first removed all duplicate articles from the collected data (just under 12,800 articles from national newspapers and approximately 8,400 articles from local ones), i.e., those with identical texts (title and body), metadata (e.g., date), and source publication. Additionally, we converted the dates into the format of yyyy-mm-dd and removed articles where at least one of the following elements was missing: publication date, title, or body. Despite the removal of duplicates, some articles had identical text bodies, albeit with minor variations primarily due to special character encoding (e.g., accents and apostrophes) or differences in web crawling (e.g., one article included the website menu or footer while the other did not). To address this issue, we implemented a method to identify and handle articles with identical or highly similar text bodies, but only if they share the same title. The method relies on cosine similarity to determine whether two texts are the same. In particular, we first employed a TF-IDF vectorizer to convert the raw text data into numerical vectors. These vectors were then used to compute the cosine similarities between all pairs of texts in the dataset. Cosine similarity produces a value between 0 and 1, where 1 indicates identical texts and values closer to 0 indicate less similar texts. Since text preprocessing had not been performed yet and differences between text bodies could solely arise from symbols, we set a tolerance threshold of 0.89 to determine text equality. If two text bodies had a cosine similarity greater than 0.89, we considered them duplicates and retained only the first occurrence, removing the second found in the dataset. Finally, we utilized Beautiful Soup to remove any HTML tags that could have been mistakenly included in the article body during the collection phase. This step ensured that our text data was free from any undesired HTML tags before further processing or analysis.

Data Preparation - Step II: FMNews-Nat

The article texts from national newspapers displayed various noise patterns specific to each news media outlet. To address this issue, we manually created a list of replacements for each outlet, employing regular expressions for targeted removal of articles or specific sub-strings from article titles or bodies. In particular, the body of articles from Il Post, La Repubblica and Il Fatto Quotidiano included parts of webpage menus and footers, as well as various types of news media outlet sponsorship, such as subscriptions, newsletter sign-ups, and agendas/lists of podcast episodes. On the other hand, articles from Corriere della sera included text substrings associated with the journalistic domain, such as headings containing the name of the correspondent, reporter, or photographer. We observed that the texts of the articles published by Corriere della sera often, but not always, follow a particular structure: "by Author_name Author_surname" (where <Author_name Author_surname> can be a natural person or abbreviations with one dot) or "Editorial team", followed by a city or "online", in either uppercase or lowercase. Occasionally, this structure is followed by another city, for instance, "Bologna Online Editorial Staff". Additionally, this "basic" structure may or may not be followed by "inviato a <City> <(Province)>", or "inviata", "foto di <Author_name Author_surname>". We generally excluded articles whose text bodies did not contain information directly related to femicides, such as television programme listings or podcast episode agendas. We retained the article whenever feasible, removing irrelevant substrings from the text bodies, such as menus and footers. The resulting FMNews-Nat dataset includes 7,443 articles; Fig. 4 reports the distribution of articles by media outlet.
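The near-duplicate filter of Step I can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used to build the dataset: the TF-IDF weighting is a small hand-written stand-in for a library vectorizer, and the function names and the `title`/`body` dictionary keys are hypothetical. Only the 0.89 threshold, the same-title condition, and the keep-first rule come from the description above.

```python
import math
import re
from collections import Counter, defaultdict


def tfidf_vectors(texts):
    """Sparse TF-IDF vectors (term -> weight) fitted on the whole collection.
    Simplified stand-in for a library vectorizer (assumption)."""
    counts = [Counter(re.findall(r"\w+", text.lower())) for text in texts]
    n_docs = len(counts)
    # Document frequency: each term is counted once per document.
    doc_freq = Counter(term for c in counts for term in c)
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def drop_near_duplicates(articles, threshold=0.89):
    """Keep the first article among same-title articles whose bodies have
    cosine similarity above `threshold`; later ones are dropped."""
    vectors = tfidf_vectors([a["body"] for a in articles])
    kept_by_title = defaultdict(list)  # title -> indices of retained articles
    kept = []
    for idx, article in enumerate(articles):
        same_title = kept_by_title[article["title"]]
        if any(cosine(vectors[idx], vectors[j]) > threshold for j in same_title):
            continue  # near-duplicate of an earlier article with the same title
        same_title.append(idx)
        kept.append(article)
    return kept
```

Unlike the all-pairs computation described in the text, this sketch only compares bodies that share a title, which gives the same decisions under the same-title rule while avoiding a full similarity matrix.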

Data Preparation - Step II: FMNews-Loc

The articles from local newspapers exhibited minimal noise within their text. Therefore, the data preparation phase focused on poorly encoded symbols and domain-specific substrings such as copyright indications and external contributions, e.g., government press releases. Unlike for national newspapers, this ad hoc cleaning did not result in any data loss for journalistic publications. The resulting FMNews-Loc dataset thus includes 7,728 articles.
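A per-outlet replacement list of the kind used in Step II might look like the sketch below. All patterns are hypothetical, since the paper does not publish the actual lists. The byline rule loosely follows the Corriere della sera structure reported above and assumes that the byline sits on its own first line and that "di" and "Redazione" are the Italian forms behind "by" and "Editorial team". The copyright rule and the outlet keys are invented for illustration.

```python
import re

# Hypothetical cleaning rules per outlet: (compiled pattern, replacement).
CLEANING_RULES = {
    "corriere": [
        # Leading byline line: "di Nome Cognome" (initials such as "M." allowed)
        # or "Redazione", optionally followed by a city or "online", any case.
        (re.compile(r"\A\s*(?:di\s+\w[\w.']*\s+\w[\w.']*|redazione)"
                    r"(?:[ \t]+\w+)?[ \t]*\n", re.IGNORECASE), ""),
        # Photo credit on its own line, e.g. "foto di Nome Cognome".
        (re.compile(r"^[ \t]*foto di [^\n]*\n?", re.IGNORECASE | re.MULTILINE), ""),
    ],
    "local": [
        # Copyright indication (illustrative wording).
        (re.compile(r"©\s*riproduzione riservata\.?", re.IGNORECASE), ""),
    ],
}


def clean_body(body, outlet):
    """Apply the outlet's rules in order; unknown outlets are only stripped."""
    for pattern, replacement in CLEANING_RULES.get(outlet, []):
        body = pattern.sub(replacement, body)
    return body.strip()
```

Rules of this kind are fragile: a short first line that happens to start with "Di" would also be removed, which is one reason such lists have to be compiled and checked manually for each outlet.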

C. Textual Analysis

Although applying NLP models typically requires standardized and structured text, it is important to acknowledge that such preprocessing may result in the loss of some information. We therefore keep track of the elements we manipulate in the texts.

• Emails and URLs. Emails and URLs found within the body of the articles are replaced with a placeholder tag, such as "[[URL]]".
• Uppercase words. Words entirely in uppercase are not replaced or modified, as the text will be normalized in subsequent stages of the work, i.e., converted to lowercase. Uppercase words are extracted and saved for further analysis.
• Punctuation, symbols, numbers. Punctuation, symbols, and numbers are removed from the texts.
• Stopwords. We remove the stopwords included in the lists provided by the NLTK (footnote 20) and spaCy (footnote 21) libraries, along with a brief, manually compiled list of stopwords. This latter list includes domain-specific and context-related keywords, such as "Link Embed", "FOTO", "FOTOGRAMMA". The "ad hoc" stopwords were removed from the non-normalized text to mitigate the impact of stopword removal. Indeed, during the analysis, we observed that some articles from national newspapers contained certain keywords entirely in uppercase to indicate elements attached to the article. Thus, we compiled this list of stopwords to be case-sensitive, aiming to avoid removing ordinary words within the body of the article.
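The bullet points above can be combined into a single normalization step, sketched below under several assumptions: the regular expressions, the "[[EMAIL]]" tag (the text only gives "[[URL]]" as an example), the function name, and the order of operations are illustrative, and the ad hoc stopword list contains only the three examples quoted in the text. Library stopword lists are left out to keep the sketch self-contained.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# Case-sensitive ad hoc stopwords (only the examples quoted in the text).
AD_HOC_RE = re.compile(r"\b(?:Link Embed|FOTOGRAMMA|FOTO)\b")
WORD_RE = re.compile(r"[^\W\d_]+")  # runs of letters, no digits or underscores
TOKEN_RE = re.compile(r"\[\[(?:URL|EMAIL)\]\]|[^\W\d_]+")


def normalize(text):
    """Return (normalized_text, uppercase_words) for one article body."""
    # 1. Remove ad hoc stopwords from the raw text, respecting case.
    text = AD_HOC_RE.sub(" ", text)
    # 2. Save fully uppercase words, ignoring anything inside emails and URLs.
    bare = URL_RE.sub(" ", EMAIL_RE.sub(" ", text))
    uppercase_words = [w for w in WORD_RE.findall(bare) if len(w) > 1 and w.isupper()]
    # 3. Replace emails and URLs with placeholder tags.
    tagged = URL_RE.sub(" [[URL]] ", EMAIL_RE.sub(" [[EMAIL]] ", text))
    # 4. Keep placeholders and alphabetic words only (this drops punctuation,
    #    symbols, and numbers), then lowercase the words.
    tokens = [t if t.startswith("[[") else t.lower() for t in TOKEN_RE.findall(tagged)]
    return " ".join(tokens), uppercase_words
```

Removing the ad hoc stopwords before lowercasing is what makes the case-sensitive list meaningful: "FOTO" as an attachment marker is removed, while a lowercase "foto" inside a sentence is left in place.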

After extracting the features from the raw texts, we proceeded with the following steps. First, we tokenized the body of articles using the spaCy library with the Italian module, selecting only words. Next, we extracted the tokens that are not included in the stopwords. Then, we extracted the lemmas, again excluding stopwords. Finally, we further refined our selection by retaining from the tokens only words belonging to what are commonly referred to as "full" parts of speech, such as nouns, verbs, adjectives, and adverbs. Extracting "full" words focuses the analysis on linguistically significant units, facilitating a more accurate understanding of the semantic content and structure of the text.
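The selection of tokens, lemmas, and "full" words can be sketched on any objects that expose spaCy-style token attributes (`text`, `lemma_`, `pos_`, `is_alpha`). To stay self-contained, the example below defines a small stand-in `Token` type instead of loading a model. The function name, the returned keys, and the exact tag set (Universal POS labels NOUN, VERB, ADJ, ADV) are assumptions; the text does not say, for instance, whether proper nouns or auxiliaries are included.

```python
from collections import namedtuple

# Stand-in for a spaCy token (assumption); real tokens expose the same attributes.
Token = namedtuple("Token", "text lemma_ pos_ is_alpha")

# Assumed set of "full" parts of speech, using Universal POS labels.
FULL_POS = {"NOUN", "VERB", "ADJ", "ADV"}


def extract_features(tokens, stopwords):
    """Return non-stopword tokens, their lemmas, and the 'full' words."""
    words = [t for t in tokens if t.is_alpha]  # keep words only
    content = [t for t in words if t.text.lower() not in stopwords]
    return {
        "tokens": [t.text.lower() for t in content],
        "lemmas": [t.lemma_.lower() for t in content
                   if t.lemma_.lower() not in stopwords],
        "full_words": [t.text.lower() for t in content if t.pos_ in FULL_POS],
    }
```

In practice `tokens` would be the document returned by a spaCy Italian pipeline (for example `it_core_news_sm`; the source does not state which model was used), and `stopwords` the union of the library lists and the manually compiled one.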

After tokenization, removal of stopwords, and extraction of lemmas, we computed the Type-Token Ratio (TTR) for the articles, a measure of the lexical diversity in a text. This is given by the proportion of unique words in a text, or "types", to the total number of words, or "tokens", and reads:

$$\mathrm{TTR} = \frac{\mathit{types}}{\mathit{tokens}} \tag{1}$$

20 https://www.nltk.org/.
21 https://spacy.io/.

Here, types is the number of unique types and tokens is the number of tokens in the text. TTR values range from 0 to 1, where a higher value indicates greater lexical variety, whereas a lower value implies more repetition of words in the text. This is a straightforward measure which nevertheless allows us to form an initial assessment of the lexical richness of the narrative surrounding femicides. Il Post, Il Fatto Quotidiano, and La Repubblica exhibited notable variation in TTR. Overall, FMNews-Nat shows variation in lexical usage, whereas FMNews-Loc exhibits uniformity in language.
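As a small worked example of the Type-Token Ratio, a lemma list with 4 tokens and 3 distinct types gives 3 / 4 = 0.75. The helper below is a minimal sketch; the function name and the convention of returning 0.0 for an empty text are assumptions.

```python
def type_token_ratio(tokens):
    """Number of unique tokens (types) divided by the total number of tokens."""
    tokens = list(tokens)
    if not tokens:
        return 0.0  # convention chosen here for empty texts (assumption)
    return len(set(tokens)) / len(tokens)


lemmas = ["donna", "uccidere", "donna", "casa"]
print(type_token_ratio(lemmas))  # 3 types / 4 tokens = 0.75
```

Because longer texts tend to repeat words, TTR generally decreases with text length, so comparisons across outlets are most informative when article lengths are similar.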

[1] Bouzerdan, Whitten-Woodring, Killings in context: An analysis of the news framing of femicide, Human Rights Review 19 (2018) 211-228.

[2] Radford, Russell, Femicide: The Politics of Woman Killing, Post-Contemporary Interventions, Twayne, 1992.

[3] M. M. L. y de los Ríos, Por la vida y la libertad de las mujeres: fin al feminicidio, Cámara de Diputados del Congreso de la Unión, LIX Legislatura, Comisión Especial para Conocer y Dar Seguimiento a las Investigaciones Relacionadas con los Feminicidios en la República Mexicana y a la Procuración de Justicia Vinculada, 2006.

[4] Spinelli, Femminicidio: dalla denuncia sociale al riconoscimento giuridico internazionale, Franco Angeli, 2008.

[5] Spinelli, L'italia rispetta la CEDAW? il femminicidio in italia alla luce delle raccomandazioni delle nazioni unite, in: I. Corti (Ed.), Universo femminile. La CEDAW tra diritto e politiche, eum edizioni università di Macerata, 2012.

[6] Abis, Orrù, et al., Il femminicidio nella stampa italiana: un'indagine linguistica, gender/sexuality/italy 3 (2016) 18-33.

[7] Aldrete, Fernández-Ardèvol, Framing femicide in the news, a paradoxical story: A comprehensive analysis of thematic and episodic frames, Crime, Media, Culture (2023) 17416590231199771.

[8] Forciniti, E. Zavarrone, Data quality and violence against women: The causes and actors of femicide, Social Indicators Research (2023) 1-25.

[9] Meluzzi, Pinelli, Valvason, Zanchi, Responsibility attribution in gender-based domestic violence: A study bridging corpus-assisted discourse analysis and readers' perception, Journal of pragmatics 185 (2021) 73-92.

[10] R. M. Entman, Framing: Toward clarification of a fractured paradigm, Journal of Communication 43 (1993) 51-58. doi:10.1111/j.1460-2466.1993.tb01304.x.

[11] J. James W. Tankard, The empirical approach to the study of media framing, in: S. D. Reese, J. Gandy, A. E. Grant (Eds.), Framing public life, Taylor & Francis, Philadelphia, PA, 2001.

[12] Edelman, Contestable categories and public opinion, Political Communication 10 (1993) 231-242. doi:10.1080/10584609.1993.9962981.

[13] Kahneman, Tversky, Choices, values, and frames., American Psychologist 39 (1984) 341-350. doi:10.1037/0003-066x.39.4.341.

[14] P. M. Sniderman, R. A. Brody, P. E. Tetlock, Cambridge studies in public opinion and political psychology: Reasoning and choice: Explorations in political psychology, Cambridge University Press, Cambridge, England, 1993.

[15] Corradi, Marcuello-Servós, Boira, Weil, Theories of femicide and their significance for social research, Current sociology 64 (2016) 975-995.

[16] Fairbairn, Boyd, Jiwani, Dawson, Changing media representations of femicide as primary prevention, in: The Routledge International Handbook on Femicide and Feminicide, Routledge, 2023, pp. 554-564.

[17] Pinelli, Zanchi, Gender-based violence in italian local newspapers: How argument structure constructions can diminish a perpetrator's responsibility, in: Discourse Processes between Reason and Emotion: A Post-disciplinary Perspective, Springer, 2021, pp. 117-143.

[18] Minnema, Gemelli, Zanchi, Patti, Caselli, Nissim, et al., Frame semantics for social nlp in italian: Analyzing responsibility framing in femicide news reports, in: CEUR WORKSHOP PROCEEDINGS, volume 3033, CEUR-WS, 2021, pp. 1-8.

[19] Minnema, Gemelli, Zanchi, Caselli, Nissim, Sociofillmore: a tool for discovering perspectives, arXiv preprint arXiv:2203.03438 (2022).

[20] Minnema, Gemelli, Zanchi, Caselli, Nissim, Dead or murdered? predicting responsibility perception in femicide news reports, in: Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Association for Computational Linguistics, Online only, 2022, pp. 1078-1090. URL: https://aclanthology.org/2022.aacl-main.79.

[21] Minnema, Lai, Muscato, Nissim, Responsibility perspective transfer for Italian femicide news, in: Findings of the Association for Computational Linguistics: ACL 2023, Association for