Beyond Headlines: A Corpus of Femicides News Coverage in Italian Newspapers

Eleonora Cappuccio¹,²,³,*,†, Benedetta Muscato¹,⁴,†, Laura Pollacci¹,², Marta Marchiori Manerba¹,², Clara Punzi¹,⁴, Chandana Sree Mala¹,⁴, Margherita Lalli⁴, Gizem Gezici³, Michela Natilli² and Fosca Giannotti⁴

¹ Università di Pisa, Pisa, Italy
² ISTI-CNR, Pisa, Italy
³ Università degli Studi di Bari Aldo Moro, Bari
⁴ Scuola Normale Superiore, Pisa, Italy


                                                Abstract
                                                How newspapers cover news significantly impacts how facts are understood, perceived, and processed by the public. This is
                                                especially crucial when serious crimes are reported, e.g., in the case of femicides, where the description of the perpetrator and
                                                the victim builds a strong, often polarized opinion of this severe societal issue. This paper presents FMNews, a new dataset of
                                                articles reporting femicides extracted from Italian newspapers. Our core contribution aims to promote the development of
                                                a deeper framing and awareness of the phenomenon through an original resource available and accessible to the research
                                                community, facilitating further analyses on the topic. The paper also provides a preliminary study of the resulting collection
                                                through several example use cases and scenarios.

                                                Keywords
                                                Italian Dataset, Newspapers, Information Extraction, Information Retrieval, AI for Social Good, Femicides


1. Introduction

How newspapers and journalists present news plays a crucial role in shaping public understanding and perception of information. This is especially important when reporting serious crimes, such as femicides, where descriptions of the perpetrator and victim can create polarized opinions influencing readers’ perceptions and interpretations of the event. According to Bouzerdan and Whitten-Woodring [1], news media often report incidents of women’s homicides in a sensationalised manner, treating these crimes as isolated events rather than situating them within the bigger framework of violence against women. This narrative defies the global demands of human rights organisations to acknowledge and address this phenomenon as its intricate dynamics require. Numerous countries have followed such recommendations only partially through the formal adoption of specific terminology such as femicide and feminicide in legal frameworks and public discourse. The two terms have related but distinct nuances of meaning. Femicide, a criminological concept initially coined in English by the feminist criminologist Diana H. Russell [2], denotes the murder of women by males due to their gender. Subsequently, the term femicide, translated in Castilian as femicidio or feminicide by the anthropologist Marcela Lagarde to attract political attention on the dire situation faced by women in Mexico [3], has gained global traction with varying interpretations, yet consistently denotes a patriarchal impetus behind homicides and other forms of male violence against women, primarily emphasising the sociological dimensions of abuse and the socio-political ramifications of the phenomenon. In the Italian language, the term femminicidio has been almost exclusively adopted, as evidenced by a Google Trends analysis comparing the search terms "femicidio" and "femminicidio" to queries regarding "femicide"¹.

An analysis of the phenomenon of femicide in the Italian context and, in particular, a linguistic investigation of it, are particularly relevant. Feminicide, a term used by the feminist movement in Italy since 2005, gained prominence in the media in 2011, especially thanks to the works of Barbara Spinelli [4]. The CEDAW Committee², based on data from the Shadow Report on the Implementation of CEDAW in Italy, addressed recommendations to the Italian government on feminicide in its Concluding Observations. This was the first time the committee addressed a European state on feminicide, a category previously reserved for warnings to Central

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy
* Corresponding author.
† These authors contributed equally.
eleonora.cappuccio@phd.unipi.it (E. Cappuccio); benedetta.muscato@sns.it (B. Muscato)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
¹ The conducted analysis included news web searches in Italy since 2022, i.e., since the service implemented an enhanced data collection methodology.
² Committee on the Elimination of Discrimination Against Women.


CEUR Workshop Proceedings, ceur-ws.org, ISSN 1613-0073
Cappuccio, Muscato et al.


American countries. The challenges in accurately contextualising feminicide in Italy also stem from a prolonged absence of official data, resulting in sensationalism and the perception of a dramatic rise in the crime. This may induce an emergency narrative that obscures the inherent structural dimensions of the phenomenon, thereby undermining the very essence of the term [5]. Media interpretations are essential for shaping a shared understanding across a vast audience, such as a whole country; hence, the examination of media discourse emerges as a significant analytical instrument on top of statistical evaluation of femicide data to understand the achievements and directions of state intervention towards the substantial granting of women’s right to life [6].

In this regard, Aldrete and Fernández-Ardèvol [7] showed that there is a large body of empirical studies on femicide discourse across different socio-cultural contexts, which often justify the perpetrator’s actions. Given the complexity of the phenomenon, a comprehensive investigation could be achieved by integrating media analysis with external data, such as demographics and current events, bringing together researchers from different fields like computer science, social sciences, and complex systems science. The lack of accessible and relevant data specific to socio-cultural contexts where femicide is notably prevalent, such as Italy, makes the task particularly challenging [8].

This paper presents FMNews, a new dataset of articles reporting femicides extracted from Italian newspapers³. We conduct a preliminary analysis of the resulting collection through several example use cases and scenarios. The primary contribution is to deepen understanding and awareness of femicide from a socio-technical perspective. We seek to examine how prominent Italian news sources report on the issue in connection to the shaping of public perception, while also offering an innovative and accessible resource to facilitate future investigation within the research community. Furthermore, this study was designed to enable a multifaceted investigation covering the following three dimensions:

         • Geographical, with the aim to explore potential variations in framing between local and national media outlets. Indeed, previous research has shown that Italian local daily newspapers often suppress the agency of the perpetrator, portraying the events as mere occurrences [9]. We therefore selected newspapers reporting news at both the national⁴ and local⁵ level, with local editions spanning across the whole Italian territory.
         • Political, which was ensured by choosing national newspapers with varying political leanings.
         • Temporal, where the time frame of national newspapers extends from November 2009 to February 2024, whilst that of the local ones ranges from November 2010 to February 2024⁶.

2. Related Work

According to frame analysis, the ways in which newspapers cover news significantly impact how facts are understood, perceived, and processed by the public [10, 11]. Framing narratives means strategically including or omitting elements (such as problem definitions, explanations and evaluations) of a given situation in a communicative text [12, 13, 14]. This process aims to advocate for specific interpretations, assess moral responsibilities of individuals involved and propose solutions while also eliciting nuanced emotional responses from the audience, thereby affecting their perceptions and attitudes. It is worth noting that in the case of news articles, media framing can be seen as a demonstration of political power [10], influencing which of the actors or interests involved shape narratives, often unnoticed by the audience [11].

The process of news framing becomes especially crucial when reporting serious crimes, such as femicides, as understanding femicide requires analyzing its evolution from both statistical and social perspectives, as discussed in the Manifesto delle Giornaliste e dei Giornalisti per il Rispetto e la Parità di Genere nell’Informazione⁷ (Manifesto of Journalists for Respect and Gender Equality in News Reporting, our translation).

The acknowledged impact of language on how readers perceive information has prompted researchers to explore how the language surrounding femicide has changed and how this influences individuals’ responsibility perception [15], which can vary based on the way femicides are reported [1, 16, 9, 17]. Moreover, an initiative by the University of Bologna seeks to identify the main discursive features employed in discussions about femicide in public spaces, including media and legal speech⁸.

Recognizing the significant role of linguistic expression in depicting incidents of gender-based violence, previous research has explored various NLP techniques. These studies aim to discern how NLP models can effectively predict and analyze human perception judgments concerning the sensitive issue of gender-based violence events. Following previous works on the impact of specific grammatical constructions and semantic frames [18] in describing the same event but with various nuances, Minnema et al. [19] introduced the first multilingual tool, based on Frame Semantics and Cognitive Linguistics, for detecting the focus or perspective depicted in an event, called SocioFillmore. Furthermore, building on the linguistic analysis provided by SocioFillmore, Minnema et al. [20] demonstrated that various linguistic choices trigger different perceptions of responsibility, which can be modeled automatically. As a result, their series of regression models revealed that these distinct linguistic choices significantly influence human perceptions of responsibility. Additionally, to promote awareness of perspective-based writing, Minnema et al. [21] introduced the novel task of responsibility perspective transfer. The task involves the automatic rewriting of descriptions of gender-based violence to alter the perceived level of blame attributed to the perpetrator. Both works leveraged one of the limited resources available for the Italian community, the RAI Femicide Corpus, a collection of 2,734 news articles covering 937 confirmed femicide cases that occurred in Italy between 2015 and 2017 [22]. Additional online resources, both official and unofficial, containing further statistics on the phenomenon of femicide in Italy are listed in Appendix A.

3. FMNews Corpus

The main contribution of this paper is the FMNews⁹ corpus, which comprises two datasets derived from Italian newspapers: FMNews-Nat, reporting data from national newspapers, and FMNews-Loc, which gathers articles from local newspapers in 53 Italian cities.

3.1. Data Extraction

Despite the heterogeneous HTML structures of the newspapers involved, it was feasible to generalise the data extraction process via the open source Python libraries Selenium¹⁰ and Beautiful Soup¹¹. Data scraping was performed in two consecutive phases. First, a comprehensive list of article links was extracted by querying the internal search engine of the newspaper websites with the keywords femminicidio, femminicidi, femminicida: the first word stands for the Italian term "femicide", the second is its plural form, and the third indicates the "person who commits a femicide". The keywords were selected to concentrate our analysis on the media’s representation and discourse surrounding this phenomenon. This choice intentionally excludes articles that discuss such crimes in general terms, allowing for a more focused examination of the femicide narratives. In the second phase, the web pages corresponding to such links were scraped to extract the text of the articles and other metadata to build the raw version of the dataset.

3.2. Data Cleaning

We implemented a supervised and semi-supervised data cleaning process, consisting of two phases, to prepare the data. In the first step, the same pipeline was applied to both FMNews-Nat and FMNews-Loc. We initially removed all duplicate articles from the collected data, i.e., those with identical texts (title and body), metadata (e.g., date), and source publication. Additionally, we converted the dates into the format yyyy-mm-dd and removed articles where at least one of the following elements was missing: publication date, title, or body. Despite the removal of duplicates, certain articles had identical text bodies, albeit with minor variations primarily due to special character encoding (e.g., accents and apostrophes) or differences in web crawling (e.g., one article included the website menu or footer while the other did not). To address this issue, we implemented a method to identify and handle articles with identical or highly similar text bodies sharing the same title. In detail, we first employed a TF-IDF¹² vectorizer to convert the raw text data into numerical vectors and then used them to compute the cosine similarities between all pairs of texts in the dataset. For more details on the parameters and thresholds employed, we refer to Appendix B. Finally, we utilized Beautiful Soup to remove any HTML tags that could have been mistakenly included in the article body during the collection phase.

The second step of the data cleaning process entailed supervised cleaning of the article texts and headlines. The article texts from national newspapers in FMNews-Nat displayed various noise patterns specific to each news media outlet. To address this issue, we manually created

³ The choice of newspapers was dictated by the circulation volume released by Audipress, a company that collects data on the reading habits of daily and periodical press in Italy: https://audipress.it/quotidiani/.
⁴ The selected national newspapers are the following: Corriere della Sera, La Repubblica, La Stampa, Il Fatto Quotidiano, Il Giornale and Il Post.
⁵ The selected local newspapers are the local editions of the CityNews group, which cover the following cities: Agrigento, Ancona, Arezzo, Avellino, Bari, Bologna, Brescia, Brindisi, Caserta, Catania, Cesena, Chieti, Como, Ferrara, Firenze, Foggia, Forlì, Frosinone, Genova, Pescara, Piacenza, Latina, Lecce, Lecco, Livorno, Messina, Milano, Modena, Monza, Napoli, Novara, Padova, Palermo, Parma, Perugia, Pisa, Pordenone, Ravenna, Reggio, Rimini, Roma, Salerno, Sondrio, Terni, Torino, Trento, Treviso, Trieste, Udine, Venezia, Verona, Vicenza, Viterbo.
⁶ In Fig. 3 in the Appendix, we report the distribution of articles across time.
⁷ https://www.sindacatogiornalistiveneto.it/wp-content/uploads/2020/12/MANIFESTO-DI-VENEZIA.pdf.
⁸ https://site.unibo.it/osservatorio-femminicidio/it.
⁹ The collection can be accessed for research purposes by requesting it by email from the authors.
¹⁰ https://selenium-python.readthedocs.io/.
¹¹ https://www.crummy.com/software/BeautifulSoup/bs4/doc/.
¹² Term Frequency-Inverse Document Frequency, in short TF-IDF, is a measure of the importance of a word to a document in a collection or corpus [23].
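As an illustration of the near-duplicate step described in Section 3.2 (TF-IDF vectors compared by cosine similarity among articles sharing the same title), a minimal plain-Python sketch is given below. It is not the authors' pipeline: the function names, the whitespace tokenisation, and the threshold value are assumptions made for this example, and the parameters actually used are those reported in Appendix B. The weighting follows the smoothed IDF that scikit-learn documents as its default.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_vectors(texts):
    """Return one sparse TF-IDF vector (dict term -> weight) per text.

    Assumption: naive lower-casing and whitespace tokenisation.
    IDF is the smoothed variant ln((1 + n) / (1 + df)) + 1.
    """
    tokenized = [text.lower().split() for text in texts]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in term_freq.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def near_duplicate_pairs(articles, threshold=0.9):
    """Return index pairs (i, j), i < j, of same-title articles whose
    bodies have cosine similarity >= threshold.

    `articles` is a list of dicts with 'title' and 'text' keys.
    The default threshold is a placeholder, not the paper's value.
    """
    vectors = tfidf_vectors([article["text"] for article in articles])
    pairs = []
    for i, j in combinations(range(len(articles)), 2):
        same_title = articles[i]["title"] == articles[j]["title"]
        if same_title and cosine(vectors[i], vectors[j]) >= threshold:
            pairs.append((i, j))
    return pairs
```

Comparing only articles that share a title keeps the number of pairwise comparisons small; which copy of a flagged pair is kept is a separate decision that this sketch leaves open.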


a list of replacements for each outlet, employing regular expressions for targeted removal of articles or specific sub-strings from article titles or bodies (we refer to Appendix B for additional details). In this stage, we also excluded articles whose text bodies did not contain information directly related to femicides, such as television programme listings or podcast episode agendas.

On the other hand, the articles from local newspapers in FMNews-Loc exhibited minimal noise within their text. Therefore, the data preparation phase focused on poorly encoded symbols and domain-specific substrings such as copyright indications and external contributions, e.g., government press releases. Unlike for the national newspapers, this ad-hoc cleaning did not result in any data loss for the local publications.

3.3. Final Dataset

Table 1 provides a detailed explanation of the data format for both datasets after the completion of the data preparation process. The number of entries for the two datasets is shown in Table 2. The table also shows the number of articles after each of the two data cleaning steps detailed in Appendix B.

     Column       Description
     Url          URL of the original newspaper article
     Title        Title of the article
     Text         Main section of the newspaper article
     Newspaper    Name of the media outlet where the article was published. In FMNews-Loc, it reports the name of the city to which the local edition refers.
     Keyword      Keyword used to collect the article
     Date         Publication date of the article in the format yyyy-mm-dd

Table 1
Description of the FMNews Corpus.

     Dataset        Raw Data    Step I    Step II
     FMNews-Nat     12,790      7,511     7,443
     FMNews-Loc     8,397       7,728     7,728

Table 2
Dimensions of the dataset in terms of number of articles from national news outlets (FMNews-Nat) and local newspaper editions (FMNews-Loc).

The analysis of FMNews-Nat after the last cleaning steps reveals the following summary statistics. The dataset covers a time span of 14 years, from November 2009 to February 2024. Regarding the distribution of articles across different newspapers in FMNews-Nat: Il Fatto Quotidiano has the largest number of articles, with a total of 2,861, followed by La Repubblica with 2,837 articles. Corriere della Sera is next, with a total of 968 articles. La Stampa has a more limited presence, with 292 articles. Il Post contributes 244 articles, and Il Giornale has the fewest entries in this set, with 241 articles. For FMNews-Loc, the time span after data cleaning ranges from November 2010 to February 2024.

4. Use Cases and Scenarios

Since the two datasets share the same structure and we are interested in studying the phenomenon of femicide from both a national and local perspective, the analyses exemplified in the following were conducted on both datasets without distinction. After a textual analysis based on tokenization, removal of stopwords, extraction of lemmas and a straightforward assessment of the lexical diversity (as detailed in Appendix C), we applied a keyword extraction method to uncover relevant patterns in the documents.

Keyword Extraction According to Firoozeh et al. [24], specific criteria must be met for keywords to meet eligibility standards. In our case study, we emphasize the importance of keywords that show representativity and exhaustivity, aiming for terms that capture significant rather than marginal aspects of the subject matter. To assess the significance of words within our collection of documents, a standard approach involves the Term Frequency - Inverse Document Frequency (TF-IDF).

For a deeper analysis, we calculate TF-IDF for each news outlet. We utilize spaCy’s Italian pipeline to preprocess texts by tokenizing, lemmatizing, and selecting only lemmas that are full words from specific part-of-speech classes (nouns, adjectives, verbs). By focusing only on content lemmas and excluding function words (like articles and prepositions), we eliminate noise and improve accuracy in analyzing relationships between documents and word relevance. The lists of lemmas do not include words containing numbers or Italian stopwords obtained from NLTK and spaCy, with additional crawling-dependent stopwords such as "it," "https," "min," and the names of months. Also, we preserve multi-word expressions identified by the lemmatizer by concatenating them, so that they are treated as single terms during TF-IDF calculation. Articles are then grouped by news outlet, each acting as a single document for the TF-IDF computation. We use the TF-IDF vectorizer from the scikit-learn¹³ library to transform the lemmatized tokens into numerical features that reflect their importance within the text.

¹³ https://scikit-learn.org/stable/index.html.
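The grouping described above, where all articles of one outlet are merged into a single document before TF-IDF is computed, can be sketched in plain Python as follows. This is a hedged illustration rather than the authors' code: the function names and the toy input format are assumptions, lemmatisation and stopword removal are taken as already done, and the weighting reproduces the smoothed IDF and L2 normalisation that scikit-learn documents as its defaults.

```python
import math
from collections import Counter


def tfidf_by_outlet(articles):
    """Compute TF-IDF with one merged document per news outlet.

    `articles` is an iterable of (outlet_name, list_of_lemmas) pairs;
    the lemmas are assumed to be already filtered content words.
    Returns {outlet: {lemma: weight}} with L2-normalised weights.
    """
    # Merge all lemmas of the same outlet into a single count vector.
    counts = {}
    for outlet, lemmas in articles:
        counts.setdefault(outlet, Counter()).update(lemmas)

    # Document frequency: in how many outlets each lemma occurs.
    n_docs = len(counts)
    doc_freq = Counter()
    for outlet_counts in counts.values():
        doc_freq.update(outlet_counts.keys())

    weights = {}
    for outlet, outlet_counts in counts.items():
        raw = {
            lemma: tf * (math.log((1 + n_docs) / (1 + doc_freq[lemma])) + 1.0)
            for lemma, tf in outlet_counts.items()
        }
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        weights[outlet] = {lemma: w / norm for lemma, w in raw.items()}
    return weights


def top_keywords(weights, k=10):
    """Return the k highest-weighted lemmas per outlet, ties broken alphabetically."""
    return {
        outlet: sorted(w.items(), key=lambda item: (-item[1], item[0]))[:k]
        for outlet, w in weights.items()
    }
```

Because each outlet is a single document, a lemma used by every outlet receives the lowest possible IDF, so lemmas concentrated in few outlets rise in that outlet's ranking.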


Thus, TF-IDF measures the significance of terms with respect to each news outlet. Fig. 1 illustrates the most relevant keywords extracted from FMNews-Nat by news outlet. As expected, terms like "woman," "violence," and "kill" (along with "femicide") are central to the narrative of femicide and are common across all outlets. Other keywords vary in relevance among multiple outlets; for example, "son" appears in all outlets except Il Post. Specific keywords are unique to one or two outlets: "gender," "right," and "sexual" appear only in Il Post; "family" is relevant in Corriere della Sera and La Stampa; and "man" is found in Il Post and Il Giornale. Due to the number of local news outlets in FMNews-Loc (50), Fig. 2 shows the top 20 keywords with the highest average TF-IDF, calculated as the mean of the TF-IDF values of the terms with respect to the news outlets. As expected, the highest ranks are occupied by the same relevant keywords found in national news outlets, such as "woman," "violence," "victim," and "femicide". Additionally, some keywords relevant only to specific national news outlets, such as "gender," are also relevant for local media, although with lower average TF-IDF values. Conversely, the distribution reveals previously unseen keywords, such as "young," "school," and "association".

Figure 1: Top 10 keywords in descending order for each news outlet in FMNews-Nat. Panels: (a) Il Post, (b) Corriere della Sera, (c) Il Giornale, (d) Fatto Quotidiano, (e) La Repubblica, (f) La Stampa.

Figure 2: Top 20 Keywords by average TF-IDF in FMNews-Loc.

Semantic Vector Extraction For an additional layer of analysis, we chose to train a word embedding model to explore semantic relationships among words. This model represents words as vectors in a continuous space, where the proximity of vectors indicates the semantic similarity between the words they represent: closer vectors


Table 3
Most similar word embeddings to (a) "uccidere" (to kill) in FMNews-Nat and (b) "vittima" (victim) in FMNews-Loc.

(a) "uccidere" (to kill) in FMNews-Nat
     Word                                 Similarity score
     ammazzare (to murder)                0.77
     ucciderla (to kill - her)            0.71
     ammazzato (murdered - him)           0.66
     ucciso (killed - him)                0.66
     suicidarsi (to commit suicide)       0.63
     strangolato (strangled - him)        0.62
     furia (fury)                         0.60
     fucile (rifle)                       0.59
     sparare (to shoot)                   0.59
     accoltellato (stabbed - him)         0.59

(b) "vittima" (victim) in FMNews-Loc
     Word                                 Similarity score
     ragazza (girl)                       0.69
     giovane (young)                      0.69
     donna (woman)                        0.67
     madre (mother)                       0.67
     figlia (daughter)                    0.64
     scomparsa (disappearance)            0.62
     uccisa (killed - her)                0.62
     26enne (26 years old)                0.61
     massacrata (massacred - her)         0.59
     povera (poor)                        0.59


correspond to words with more similar meanings. We employed Word2Vec (W2V) [25],
which operates by mapping words to high-dimensional vectors within a given
vocabulary. This mapping is designed to represent semantic relationships between
words in the vector space. W2V has been implemented through Gensim14, a powerful
tool set for NLP tasks. A key parameter in W2V is the "window", i.e., the number
of context words to be considered, which we set to 10 to obtain a contextual
window that extends neither too far nor too close to the current word, thereby
striking a balance between contextual relevance and computational efficiency. To
discover the semantic associations within our dataset, we leveraged the "most
similar" method from Gensim, which computes the cosine similarity between word
vectors to identify words with the closest semantic proximity. For both datasets,
the embedding size of the W2V model is fixed to 100, while the vocabulary size
varies with the dataset: 6809 in FMNews-Nat and 6064 in FMNews-Loc.
   In FMNews-Nat, the word "donna" (woman) yielded semantically related terms
such as "vittima" (victim) and "prostituta" (prostitute). The term "femminicidio"
(femicide) elicited associations like "violenza" (violence), "impressionante"
(impressive), and "dramma" (drama). In Table 3a, the analysis of "uccidere" (to
kill) encompasses related terms such as "ammazzare" (to murder), "ucciderla" (to
kill her), "ammazzato" (murdered, masculine form), "ucciso" (killed, masculine
form), "suicidarsi" (to commit suicide), and "strangolato" (strangled, masculine
form). These terms may collectively pertain to the perpetrator’s actions against
the victim. Fig. 5 in the Appendix provides a comprehensive overview of word
vectors closely associated with the previously extracted keywords, which were
identified as the most significant in FMNews-Nat.
   In Table 3b, the words correlated in meaning to "vittima" (victim) in
FMNews-Loc are presented. As we would expect, nearly all associated terms
highlight that the victim is a woman. In this regard, a drawback to consider is
that the specific selection of the terms used for the data collection query may
have hindered our analysis from uncovering insights about homicides committed
against individuals who do not identify as women or fit into the traditional
gender binary. Indeed, the discussion around gender-based violence in Italy is
still predominantly centred on women, while other genders remain significantly
neglected15.

5. Conclusion

In this contribution, we provided a novel dataset concerning the critical issue
of femicide in Italy. Considering the absence of resources for conducting
in-depth analyses on the subject, our intent was to bridge this gap and provide
an original perspective for understanding and raising awareness about this
severe phenomenon.
   As suggested by Dobbe et al. [26], proposing a contribution within the Machine
Learning domain responsibly and consciously means foremost acknowledging our own
biases. In particular, we are referring to both the newspaper selection and the
choice of the terms used to extract the data, which certainly shaped the results
(all design choices are justified in detail in Section 3). A future outlook
concerns the investigation of how both victims and perpetrators are framed from
a linguistic perspective. Further analyses could identify temporal and
geographical patterns arising from media attention manifested through the
coverage of femicides and compare the framing of these events with the political
leaning of the respective newspapers.

14 https://pypi.org/project/gensim/.
15 As a matter of fact, there is no official collection of statistics regarding
   this specific kind of event. The only organisation that records the gender of
   the victims in its database is the Observatory Femicides Lesbicides Transcides
   managed by Non una di meno, the Italian section of the movement Ni una menos
   (https://osservatorionazionale.nonunadimeno.net/).
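
The "most similar" lookup described in the Semantic Vector Extraction paragraph ranks vocabulary words by the cosine similarity of their vectors to a query word. The following is a minimal pure-Python sketch of that ranking, not the authors' code: the function names `cosine_similarity` and `most_similar` are hypothetical, and the 3-dimensional toy vectors are invented for illustration (the paper's embeddings have 100 dimensions and come from a trained W2V model). To our understanding, the Gensim counterpart would train `Word2Vec` with `vector_size=100` and `window=10` and then query `model.wv.most_similar(...)`.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=10):
    """Return the topn (word, score) pairs closest to `word`, best first."""
    target = vectors[word]
    scored = [
        (other, cosine_similarity(target, vec))
        for other, vec in vectors.items()
        if other != word  # the query word itself is excluded from the ranking
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Toy vectors, invented for illustration only (assumption, not data from the paper).
toy_vectors = {
    "uccidere": [0.9, 0.1, 0.0],
    "ammazzare": [0.8, 0.2, 0.1],
    "sparare": [0.5, 0.5, 0.2],
    "scuola": [0.0, 0.1, 0.9],
}

ranking = most_similar("uccidere", toy_vectors, topn=3)
for neighbour, score in ranking:
    print(f"{neighbour}: {score:.2f}")
```

With real embeddings, the scores printed by such a loop would be the values reported in the "Similarity score" columns of Table 3.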


Acknowledgments

This work has been supported by the European Union under ERC-2018-ADG GA 834756
(XAI), by HumanE-AI-Net GA 952026, by the Partnership Extended PE00000013 -
“FAIR - Future Artificial Intelligence Research” - Spoke 1 “Human-centered AI”,
and by SoBigData.it that receives funding from European Union –
NextGenerationEU – National Recovery and Resilience Plan (Piano Nazionale di
Ripresa e Resilienza, PNRR) – Project: “SoBigData.it – Strengthening the Italian
RI for Social Mining and Big Data Analytics” – Prot. IR0000013 – Avviso n. 3264
del 28/12/2021.

References
 [1] C. Bouzerdan, J. Whitten-Woodring, Killings in context: An analysis of the
     news framing of femicide, Human Rights Review 19 (2018) 211–228.
 [2] J. Radford, D. Russell, Femicide: The Politics of Woman Killing,
     Post-Contemporary Interventions, Twayne, 1992.
 [3] M. M. L. y de los Ríos, Por la vida y la libertad de las mujeres: fin al
     feminicidio, Cámara de Diputados del Congreso de la Unión, LIX Legislatura,
     Comisión Especial para Conocer y Dar Seguimiento a las Investigaciones
     Relacionadas con los Feminicidios en la República Mexicana y a la
     Procuración de Justicia Vinculada, 2006.
 [4] B. Spinelli, Femminicidio: dalla denuncia sociale al riconoscimento
     giuridico internazionale, Franco Angeli, 2008.
 [5] B. Spinelli, L’italia rispetta la CEDAW? il femminicidio in italia alla
     luce delle raccomandazioni delle nazioni unite, in: I. Corti (Ed.),
     Universo femminile. La CEDAW tra diritto e politiche, eum edizioni
     università di Macerata, 2012.
 [6] S. Abis, P. Orrù, et al., Il femminicidio nella stampa italiana:
     un’indagine linguistica, gender/sexuality/italy 3 (2016) 18–33.
 [7] M. Aldrete, M. Fernández-Ardèvol, Framing femicide in the news, a
     paradoxical story: A comprehensive analysis of thematic and episodic
     frames, Crime, Media, Culture (2023) 17416590231199771.
 [8] A. Forciniti, E. Zavarrone, Data quality and violence against women: The
     causes and actors of femicide, Social Indicators Research (2023) 1–25.
 [9] C. Meluzzi, E. Pinelli, E. Valvason, C. Zanchi, Responsibility attribution
     in gender-based domestic violence: A study bridging corpus-assisted
     discourse analysis and readers’ perception, Journal of pragmatics 185
     (2021) 73–92.
[10] R. M. Entman, Framing: Toward clarification of a fractured paradigm,
     Journal of Communication 43 (1993) 51–58.
     doi:10.1111/j.1460-2466.1993.tb01304.x.
[11] J. W. Tankard Jr., The empirical approach to the study of media framing,
     in: S. D. Reese, J. Gandy, A. E. Grant (Eds.), Framing public life,
     Taylor & Francis, Philadelphia, PA, 2001.
[12] M. Edelman, Contestable categories and public opinion, Political
     Communication 10 (1993) 231–242. doi:10.1080/10584609.1993.9962981.
[13] D. Kahneman, A. Tversky, Choices, values, and frames, American
     Psychologist 39 (1984) 341–350. doi:10.1037/0003-066x.39.4.341.
[14] P. M. Sniderman, R. A. Brody, P. E. Tetlock, Cambridge studies in public
     opinion and political psychology: Reasoning and choice: Explorations in
     political psychology, Cambridge University Press, Cambridge, England, 1993.
[15] C. Corradi, C. Marcuello-Servós, S. Boira, S. Weil, Theories of femicide
     and their significance for social research, Current sociology 64 (2016)
     975–995.
[16] J. Fairbairn, C. Boyd, Y. Jiwani, M. Dawson, Changing media
     representations of femicide as primary prevention, in: The Routledge
     International Handbook on Femicide and Feminicide, Routledge, 2023,
     pp. 554–564.
[17] E. Pinelli, C. Zanchi, Gender-based violence in italian local newspapers:
     How argument structure constructions can diminish a perpetrator’s
     responsibility, in: Discourse Processes between Reason and Emotion: A
     Post-disciplinary Perspective, Springer, 2021, pp. 117–143.
[18] G. Minnema, S. Gemelli, C. Zanchi, V. Patti, T. Caselli, M. Nissim, et al.,
     Frame semantics for social nlp in italian: Analyzing responsibility
     framing in femicide news reports, in: CEUR Workshop Proceedings, volume
     3033, CEUR-WS, 2021, pp. 1–8.
[19] G. Minnema, S. Gemelli, C. Zanchi, T. Caselli, M. Nissim, Sociofillmore: a
     tool for discovering perspectives, arXiv preprint arXiv:2203.03438 (2022).
[20] G. Minnema, S. Gemelli, C. Zanchi, T. Caselli, M. Nissim, Dead or murdered?
     predicting responsibility perception in femicide news reports, in:
     Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the
     Association for Computational Linguistics and the 12th International Joint
     Conference on Natural Language Processing (Volume 1: Long Papers),
     Association for Computational Linguistics, Online only, 2022,
     pp. 1078–1090. URL: https://aclanthology.org/2022.aacl-main.79.
[21] G. Minnema, H. Lai, B. Muscato, M. Nissim, Responsibility perspective
     transfer for Italian femicide news, in: Findings of the Association for
     Computational Linguistics: ACL 2023, Association for


     Computational Linguistics, Toronto, Canada, 2023, pp. 7907–7918.
     URL: https://aclanthology.org/2023.findings-acl.501.
[22] M. Belluati, Femminicidio, Una lettura tra realtà e interpretazione.
     Biblioteca di testi e studi. Carocci (2021).
[23] A. Rajaraman, J. D. Ullman, Data mining, in: Mining of Massive Datasets,
     Cambridge University Press, Cambridge, 2011, pp. 1–17.
     doi:10.1017/CBO9781139058452.002.
[24] N. Firoozeh, A. Nazarenko, F. Alizon, B. Daille, Keyword extraction: Issues
     and methods, Natural Language Engineering 26 (2020) 259–291.
[25] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word
     representations in vector space, arXiv preprint arXiv:1301.3781 (2013).
[26] R. Dobbe, S. Dean, T. K. Gilbert, N. Kohli, A broader view on bias in
     automated decision-making: Reflecting on epistemology and dynamics, CoRR
     abs/1807.00553 (2018). URL: http://arxiv.org/abs/1807.00553.
     arXiv:1807.00553.


A. Additional Resources

Official Resources

Official statistics on femicide cases in Italy can be accessed through ISTAT16
and the Ministry of the Interior through the Department of Public Security
website17. In particular, ISTAT provides data on victims of voluntary homicide,
divided by gender, from 1992 to 2020, without additional information. In
contrast, the Department of Public Security offers more detailed data covering
a more limited time range, i.e., from 2002 to 2022: victims are categorized by
their relationship to the murderer. These categories include: Partner
(husband/wife, domestic partner, boyfriend/girlfriend), Former partner (former
husband/wife, former domestic partner, former boyfriend/girlfriend), Other
relative, Other acquaintance, Perpetrator unknown to the victim, and Perpetrator
unidentified.

Unofficial Resources

Unofficial data and statistics regarding femicides in Italy are also available,
typically compiled by non-governmental or grassroots organisations. One notable
example is the open database18 managed by the Italian activists of Ni una
menos19, an international feminist movement that campaigns against gender-based
violence. Although it covers a shorter time frame, this database offers
disaggregated and more detailed information than the official statistics. For
example, in addition to the names of the victims, the collection also includes
important characteristics such as the age and nationality of the individuals
involved, the geographical dimension, and the gender of the victim, including
non-binary framings. While not readily accessible, a combined examination of
both official and non-official data is essential for a more thorough and
comprehensive analysis of the issue of femicide in Italy.

16 https://www.istat.it/it/violenza-sulle-donne/il-fenomeno/omicidi-di-donne.
17 https://www.interno.gov.it/it/stampa-e-comunicazione/dati-e-statistiche/omicidi-volontari-e-violenza-genere.
18 https://osservatorionazionale.nonunadimeno.net/anno/.
19 https://nonunadimeno.wordpress.com/.


B. Data Preparation

We applied a supervised and semi-supervised cleaning phase divided into two
steps to prepare the data. In the first step, the same pipeline was applied to
both datasets, primarily aimed at removing duplicate articles, formatting
metadata, and reducing data and metadata sparsity. The second step entailed
supervised cleaning of the article texts and headlines. We observed different
types of noise in the texts of the national newspapers compared


Figure 3: Number of articles throughout the years (2008-2024) for both FMNews-Nat and FMNews-Loc.


to the local ones. Hence, given that the two datasets are released and usable
separately, we implemented a similar pipeline for both datasets, albeit
customized for each.

Data Preparation - Step I: Cleaning

We first removed all duplicate articles from the collected data (just under
12,800 articles from national newspapers and approximately 8,400 articles from
local ones), i.e., those with identical texts (title and body), metadata (e.g.,
date), and source publication. Additionally, we converted the dates into the
yyyy-mm-dd format and removed articles where at least one of the following
elements was missing: publication date, title, or body. Despite the removal of
duplicates, some articles had identical text bodies, albeit with minor
variations primarily due to special character encoding (e.g., accents and
apostrophes) or differences in web crawling (e.g., one article included the
website menu or footer while the other did not). To address this issue, we
implemented a method to identify and handle articles with identical or highly
similar text bodies, but only if they share the same title. The method relies
on cosine similarity to determine whether two texts are the same. In
particular, we first employed a TF-IDF vectorizer to convert the raw text data
into numerical vectors. These vectors were then used to compute the cosine
similarities between all pairs of texts in the dataset. Cosine similarity
produces a value between 0 and 1, where 1 indicates identical texts and values
closer to 0 indicate less similar texts. Since text preprocessing had not been
performed yet and differences between text bodies could solely arise from
symbols, we set a tolerance threshold of 0.89 to determine text equality. If
two text bodies had a cosine similarity greater than 0.89, we considered them
duplicates and retained only the first occurrence, removing the second found in
the dataset. Finally, we utilized Beautiful Soup to remove any HTML tags that
could have been mistakenly included in the article body during the collection
phase. This step ensured that our text data was free from any undesired HTML
tags before further processing or analysis.

Data Preparation - Step II: FMNews-Nat

The article texts from national newspapers displayed various noise patterns
specific to each news media outlet. To address this issue, we manually created
a list of replacements for each outlet, employing regular expressions for
targeted removal of articles or specific substrings from article titles or
bodies. In particular, the body of articles from Il Post, La Repubblica and Il
Fatto Quotidiano included parts of webpage menus and footers, as well as
various types of news media outlet sponsorship, such as subscriptions,
newsletter sign-ups, and agendas/lists of podcast episodes. On the other hand,
articles from Corriere della Sera included text substrings associated with the
journalistic domain, such as headings containing the name of the correspondent,
reporter, or photographer. We observed that the texts of the articles published
by Corriere della Sera often, but not always, follow a particular structure:
"by Author_name Author_surname" (where <Author_name Author_surname> can be a
natural person or abbreviations with one dot) or "Editorial team", followed by
a city or "online", in either uppercase or lowercase. Occasionally, this
structure is followed by another city, for instance, "Bologna Online Editorial
Staff". Additionally, this "basic" structure may or may not be followed by
"inviato a <City> <(Province)>", or "inviata", "foto di <Author_name
Author_surname>". We generally excluded articles whose text bodies did not
contain information directly related to femicides, such as television programme
listings or podcast episode agendas. We retained the article whenever feasible,
removing irrelevant substrings from the text bodies, such as menus and footers.
The resulting FMNews-Nat dataset includes 7,443 articles: in Fig. 4 we report
the distribution of articles by media outlet.

Data Preparation - Step II: FMNews-Loc

The articles from local newspapers exhibited minimal noise within their text.
Therefore, the data preparation phase focused on poorly encoded symbols and
domain-specific substrings such as copyright indications and external
contributions, e.g., government press releases. Unlike for national newspapers,
for these journalistic publications this ad hoc cleaning did not result in data
loss. Therefore, the resulting FMNews-Loc dataset includes 7,728 articles.

Figure 4: Final number of articles of FMNews-Nat extracted from the national
newspapers.


C. Textual Analysis

Although applying NLP models typically requires standardized and structured
text, it is important to acknowledge that such preprocessing may result in the
loss of some information. We believe it is important to keep track of the
elements we manipulate in the texts.

   • Emails and URLs. Emails and URLs found within the body of the articles are
     replaced with a placeholder tag, such as "[[URL]]".
   • Uppercase words. Words entirely in uppercase are not replaced or modified,
     as the text will be normalized in subsequent stages of the work, i.e.,
     converted to lowercase. Uppercase words are extracted and saved for
     further analysis.
   • Punctuation, symbols, numbers. Punctuation, symbols, and numbers are
     removed from the texts.
   • Stopwords. We remove the stopwords included in the lists provided by the
     NLTK20 and spaCy21 libraries, along with a brief, manually compiled list
     of stopwords. This latter list includes domain-specific and
     context-related keywords, such as "Link Embed", "FOTO", "FOTOGRAMMA". It
     is important to note that the "ad hoc" stopwords were removed from the
     non-normalized text to mitigate the impact of stopwords removal. Indeed,
     during the analysis, we observed that some articles from national
     newspapers contained certain keywords entirely in uppercase to indicate
     elements attached to the article. Thus, we chose to compile the list of
     stopwords to be case-sensitive, aiming to avoid removing words within the
     body of the article.

   After extracting the features from the raw texts, we proceeded with the
following steps. First, we tokenized the body of articles using the spaCy
library with the Italian module, selecting only words. Next, we extracted
tokens that are not included in the stopwords. Then, we extracted the lemmas,
again excluding stopwords. Finally, we further refined our selection by
retaining from the tokens only words belonging to what is commonly referred to
as "full" classes of speech, such as nouns, verbs, adjectives, and adverbs.
This process of extracting "full" words aimed to focus our analysis on
linguistically significant elements of the text. This approach allows us to
study meaningful linguistic units, facilitating a more accurate understanding
of the semantic content and structure of the text.
   After tokenization, removal of stopwords, and extraction of lemmas, we
computed the Type-Token Ratio (TTR) for the articles, a measure of the lexical
diversity in a text. This is given by the proportion of unique words in a text,
or "types", to the total number of words, or "tokens", and reads:

$$ TTR = \frac{N_{\text{types}}}{N_{\text{tokens}}} \tag{1} $$

20 https://www.nltk.org/.
21 https://spacy.io/.


Figure 5: Similar word vectors in FMNews-Nat.


where $N_{\text{types}}$ is the number of unique types and $N_{\text{tokens}}$
is the number of tokens in the text. TTR values range from 0 to 1, where a
higher value indicates greater lexical variety, whereas a lower value implies
more repetition of words in the text. This is a straightforward measure which
nevertheless allows us to form an initial assessment of the lexical richness in
the narrative surrounding femicides. The newspaper Il Post, along with Il Fatto
Quotidiano and La Repubblica, exhibited a notable variation in terms of TTR.
While FMNews-Nat shows variation in lexicon usage, FMNews-Loc exhibits a
uniformity in language.
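
As a small worked illustration of Eq. (1), the sketch below computes the TTR of a list of lemmas. It is a minimal example under stated assumptions rather than the authors' implementation: the helper name `type_token_ratio` and the sample lemma list are hypothetical, and returning 0.0 for an empty text is a convention chosen here to avoid division by zero.

```python
def type_token_ratio(tokens):
    """Type-Token Ratio: number of distinct tokens divided by total tokens."""
    if not tokens:
        # Convention assumed for this sketch: an empty text has TTR 0.0.
        return 0.0
    return len(set(tokens)) / len(tokens)


# Hypothetical lemmas of one article after stopword removal and lemmatization.
lemmas = ["donna", "uccidere", "donna", "vittima", "violenza", "donna"]

# 4 distinct types ("donna", "uccidere", "vittima", "violenza") over 6 tokens.
ttr = type_token_ratio(lemmas)
print(f"TTR = {ttr:.3f}")
```

Because TTR tends to decrease as texts get longer, comparing outlets with very different article lengths may call for a length-normalized variant; that consideration is outside what the paper reports.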