=Paper=
{{Paper
|id=Vol-3878/121_calamita_long
|storemode=property
|title=GATTINA - GenerAtion of TiTles for Italian News Articles: A CALAMITA Challenge
|pdfUrl=https://ceur-ws.org/Vol-3878/121_calamita_long.pdf
|volume=Vol-3878
|authors=Maria Francis,Matteo Rinaldi,Jacopo Gili,Leonardo De Cosmo,Sandro Iannaccone,Malvina Nissim,Viviana Patti
|dblpUrl=https://dblp.org/rec/conf/clic-it/FrancisRGCINP24
}}
==GATTINA - GenerAtion of TiTles for Italian News Articles: A CALAMITA Challenge==
Maria Francis 1,2,*,†, Matteo Rinaldi 3,†, Jacopo Gili 3,†, Leonardo De Cosmo 4, Sandro Iannaccone 5, Malvina Nissim 1,‡ and Viviana Patti 3,‡
1 CLCG, University of Groningen
2 University of Trento
3 University of Turin
4 ANSA
5 Galileo
Abstract
We introduce a new benchmark designed to evaluate the ability of Large Language Models (LLMs) to generate Italian-language
headlines for science news articles. The benchmark is based on a large dataset of science news articles obtained from Ansa
Scienza and Galileo, two important Italian media outlets. Effective headline generation requires more than summarizing
article content; headlines must also be informative, engaging, and suitable for the topic and target audience, making automatic
evaluation particularly challenging. To address this, we propose two novel transformer-based metrics to assess headline
quality. We aim for this benchmark to support the evaluation of Italian LLMs and to foster the development of tools to assist
in editorial workflows.
Keywords
CALAMITA Challenge, Italian, Benchmarking, Headline generation, Summarisation, LLMs
CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy
* Corresponding author.
† Shared first authorship.
‡ Shared supervision.
Email: maria.francis@unitn.it (M. Francis); matteo.rinaldi@unito.it (M. Rinaldi); jacopo.gili584@edu.unito.it (J. Gili); leodecosmo@gmail.com (L. D. Cosmo); iannaccone@galileonet.it (S. Iannaccone); m.nissim@rug.nl (M. Nissim); viviana.patti@unito.it (V. Patti)
GitHub: https://github.com/rosakun (M. Francis); https://github.com/mrinaldi97 (M. Rinaldi); https://github.com/Jj-source (J. Gili); https://github.com/malvinanissim (M. Nissim); https://github.com/vivpatti (V. Patti)
ORCID: 0009-0007-7638-9963 (M. Francis); 0009-0004-7488-8855 (M. Rinaldi); 0009-0007-1343-3760 (J. Gili); 0000-0001-5289-0971 (M. Nissim); 0000-0001-5991-370X (V. Patti)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
1. Introduction and Motivation
The title is undoubtedly one of the most important and crucial components of a journalistic article. A good title intrigues the reader, synthesises the news without anticipating its details, encourages further reading, and is simultaneously pleasant to read or hear. Often, the fate of an article is inextricably linked to the quality of its accompanying title: it is not uncommon for inherently interesting, in-depth, and factually correct articles to go unnoticed simply because they are accompanied by an inappropriate or unattractive title. Composing adequate titles is not a simple operation; it requires experience, sensitivity, balance, a sense of measure, and a deep understanding of the readers. There are no precise and inescapable "rules" (save, of course, for the usual deontological norms of pertinence and truth that regulate the journalistic profession), but in fact the operation depends almost exclusively on the author’s expertise and must be evaluated on a case-by-case basis.
Factors that can influence the composition of a title include, for example, the topic and the "tone of voice" of the article (a piece reporting a crime news story, for instance, requires a measured, discreet, and respectful title; conversely, a piece on lifestyle can and should be paired with a lighter, ironic, and more colorful title); the style of the publication hosting the article; the destination format (the same article printed in a paper newspaper and published on an online outlet, for example, typically has two different titles); potential "conflicts" with other titles present on the same page (for instance: repetitions of the same word or phrase, or the enunciation of contradictory concepts); space limitations; prescriptions related to search engine optimisation (for example, the use of a particular word or expression particularly popular at the time of publication, or a specific position of words within the title).
It is in this context that the journalist’s toolkit has recently been enriched with a powerful new tool: Large language models (LLMs) undoubtedly have an important role in the world of journalism, including quality journalism. Although incapable of "understanding" content as a human journalist would, as well as the meaning of
words, LLMs are naturally capable of producing fluent, complex, plausible, and credible texts in a matter of moments. These models not only can improve the efficiency of editorial processes but also offer new creative and innovative possibilities for content creation, including the automatic generation of journalistic headlines. Analysing why it may be useful for journalism to have an LLM capable of generating titles leads us to consider numerous factors, such as time optimisation, content personalization, and the ability to maintain a high level of quality, coherence, and communicative impact. However, these tools also present many limitations and some dangers, particularly the risk of blindly relying on them.
Timing and speed, in particular, are among the great challenges of journalism: being the first to publish a story, especially online, is often essential to attract readers. However, as we have seen, generating effective and incisive titles requires skill and time, which are not always available. An LLM can drastically reduce the time needed to create appropriate titles, for example by suggesting to the author a series of reasoned choices or proposing modifications and corrections to an already written title, always keeping in mind preset criteria such as length, tone, attractiveness, clarity, and the publication’s style. Furthermore, if trained on the corpus of a particular publication, an LLM can suggest titles consistent with its tone of voice and editorial history.
Another important advantage that the use of LLMs can offer is the ability to personalise content for different platforms and audiences. In today’s newsrooms, journalists no longer have to worry only about print media but must also consider the web, social media, newsletters, and other digital distribution platforms. Each platform requires a different type of language, style, and length for titles. For example, a title optimised for Twitter (or X) must be short and incisive, while a title for a news website can be more descriptive. An LLM is capable of generating variants of a title based on the medium of dissemination, allowing newsrooms to adapt their content precisely and in a targeted manner. Moreover, using reader behavioural data, the LLM can generate more attractive titles for specific demographic groups, thus improving the engagement and communicative effectiveness of the news.
With this task, which is developed in the context of the CALAMITA Challenge [1] and which consists in asking an LLM to generate a headline given the corresponding full article, we have a twofold aim.
The first aim is to test and analyse the ability of existing and future LLMs on the task of headline generation in the context of Italian news articles. This would provide a substantial step forward compared to past experiments on headline generation for Italian, which were run by training much smaller sequence-to-sequence models from scratch [2, 3]. We expect that some of the shortcomings of the automatically generated headlines which were observed in previous work, such as lack of fluency and creativity [2], might not affect LLM-based generations.
The second aim is to provide a reliable, high-quality dataset of articles and corresponding headlines in Italian, developed through a direct collaboration of language technology experts and journalists, which can be used and analysed well beyond the CALAMITA challenge. Although similar datasets exist for other languages [4, 5], this resource is still lacking for Italian.
Overall, experimenting with the use of LLMs for title generation can also be considered a first step towards the introduction of more extensive and comprehensive artificial intelligence agents, which assist the journalist in all phases of the creative process, from news research to drafting an outline, to writing the actual piece, and finally to its promotion. Indeed, a close interaction of language models and humans in this task has recently been shown to be key [6].
2. Challenge Description
The task of headline generation has often been treated as equal to an extreme summarization task [3, 7]. However, simply synthesising the content of the article into a brief description is not enough to provide a satisfying title. Additional characteristics such as attractiveness, creativeness, and many others also play a role. Writing appropriate headlines is challenging, even for current state-of-the-art LLMs.
Evaluating LLMs on the task of headline generation for Italian news articles thus serves multiple purposes. On one hand, it tests models’ capacity to properly understand, that is, to reprocess large source texts in a way that is faithful to the content of the text. On the other hand, it acts as a means to assess the performance of LLMs in many complex dimensions, such as attractiveness, creativity, or adherence to tone. Finally, this benchmark could prove useful in practical applications. For instance, it may help guide decisions on whether, and to what extent, a journal should integrate LLMs into its workflow. It may also serve as an effective testbed for future research and development towards effective deployment in real-world scenarios. One such avenue could be the use of prompting to achieve the desired style and tone in generated headlines.
In our challenge, language models are tasked with generating Italian-language headlines based on articles from scientific news journals written in Italian. Our dataset includes original articles from such journals, along with their human-authored titles. Models are provided the complete source text in the prompt, as well as instructions to generate a title that is brief, coherent, and captivating. We guide the model towards the specific editorial style of the media outlet by including a small number of examples of headlines in our prompt. We employ automatic metrics that assess the model’s performance along three dimensions:
1. Coherence with the original article (HA classifier)
2. Alignment with the style of human-written headlines (NS classifier)
3. Similarity between the generated and the gold-standard headline (ROUGE [8], SBERT [9])
However, considering the complexity of the task, we believe that manually reviewing a sample of the generated headlines can offer additional perspectives on the behaviour of the model.
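Of the automatic measures listed above, ROUGE-L is simple enough to sketch directly. The following is a minimal illustration, not the official ROUGE implementation: the whitespace tokenisation, the punctuation stripping, and the balanced F-measure are assumptions made here for brevity, and the function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def tokenize(text):
    """Assumed tokenisation: lowercase, split on whitespace, strip edge punctuation."""
    tokens = (t.strip(".,:;!?") for t in text.lower().split())
    return [t for t in tokens if t]


def rouge_l(candidate, reference):
    """Balanced F-measure over LCS-based precision and recall (an assumed variant)."""
    cand, ref = tokenize(candidate), tokenize(reference)
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = "Una rapina cosmica nell'ammasso di galassie dell'Idra"
candidate = "Rapina cosmica: il furto di gas nell'ammasso dell'Idra"
score = rouge_l(candidate, reference)
```

Because the score depends only on shared tokens in matching order, a headline that paraphrases the reference well can still score low, which is one reason to pair it with embedding-based and learned metrics.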
3. Data description
Our benchmark is based on two datasets consisting of science news articles from two different sources. In each dataset, we provide the full text of the article paired with the original, human-authored headline. Additionally, we include metadata such as link, date, author (if present) and subtitle.
3.1. Origin of data
The data were obtained via web scraping with custom Python scripts. Since links to articles more than a few weeks old are inaccessible on the Ansa website, we collected a large number by downloading the archived "Ansa Scienza" RSS feeds from The Wayback Machine and processing them to remove duplicates and extract links.
3.2. Data format
The data from web scraping were saved in "JSON Lines" (JSONL) format, with each line containing a JSON object with the following fields:
• Title: the title of the article
• Source: the name of the website
• Date: the publishing date of the article
• Author: the author of the article, if present
• URL: the Internet address of the article
• Text: the body of the article
• ID: a unique identifier of the article
3.3. Detailed data statistics
Our dataset consists of 30,461 articles gathered from two sources:
1. "ANSA scienza", the science section of the Italian newspaper "ANSA", from which we obtained 6,889 articles: 649 of these are from 2024, and the others are from the period between 2018 and 2022.
2. The “Galileo” website, from which we sourced 23,572 articles dating from April 1996 to May 2024.
When measured with the “tiktoken o200k_base” tokenizer model, we obtained a total of 21,365,897 tokens for the Galileo dataset (average: 906 tokens per article, maximum: 24,306) and a total of 3,762,539 tokens for the Ansa dataset (average: 546 tokens per article, maximum: 7,600). Figures 1 and 2 depict the distribution of articles by token count in the Galileo and Ansa datasets respectively.
Figure 1: Distribution of articles by token count in the Galileo subset.
Figure 2: Distribution of articles by token count in the Ansa subset.
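As an illustration of the data format and statistics described in this section, the sketch below reads JSON Lines records and computes total, average, and maximum token counts. It is not the authors' scraping or counting code. The exact key spelling, the two sample records, and the whitespace tokenizer standing in for tiktoken are all assumptions. The last two lines re-derive the per-article averages from the reported totals and article counts: 21,365,897 / 23,572 gives about 906 (Galileo) and 3,762,539 / 6,889 gives about 546 (Ansa).

```python
import io
import json

# Field names follow the list in Section 3.2; the exact key spelling is an assumption.
FIELDS = ("Title", "Source", "Date", "Author", "URL", "Text", "ID")


def read_jsonl(stream):
    """Yield one dict per non-empty line of a JSON Lines stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def token_stats(records, count_tokens):
    """Return (total, average, maximum) token counts over the article bodies."""
    counts = [count_tokens(r["Text"]) for r in records]
    return sum(counts), sum(counts) / len(counts), max(counts)


# Two made-up records, for illustration only.
sample = io.StringIO(
    '{"Title": "Titolo A", "Source": "Galileo", "Date": "2024-05-01", '
    '"Author": null, "URL": "https://example.org/a", "Text": "uno due tre", "ID": "a1"}\n'
    '{"Title": "Titolo B", "Source": "Ansa", "Date": "2024-05-02", '
    '"Author": "N. N.", "URL": "https://example.org/b", '
    '"Text": "uno due tre quattro cinque", "ID": "b2"}\n'
)
records = list(read_jsonl(sample))

# Whitespace splitting is a stand-in for the tokenizer used in the paper.
total, average, maximum = token_stats(records, lambda text: len(text.split()))

# Consistency check on the reported corpus figures (total tokens / articles).
galileo_avg = 21_365_897 / 23_572
ansa_avg = 3_762_539 / 6_889
```

Passing the tokenizer in as a function keeps the statistics code independent of any particular tokenizer library.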
3.4. Prompting
Due to the length of each article, the use of task examples in our prompt would be too computationally expensive. Therefore, we test the models in a zero-shot prompting setting. While we do not use any task examples in our prompt, we do provide seven examples of headlines. In this way, the model is given examples of the expected output (a title) rather than examples of the full task (article and title). Professional journalists made a list of 22 headlines that, in their opinion, were representative of a well-made writing process under the three aspects of being captivating, short and informative.
Each time the model is tested, 7 randomly chosen titles from the list are appended to the standard prompt. As a reference, the identifiers of the example headlines are also saved along with the output of the model. See Box 1 for our input prompt.
Prompt for the LLM
Il tuo compito è generare un titolo accattivante e informativo per l’articolo fornito.
Requisiti:
- Titolo breve
- Cattura l’essenza dell’articolo
- Usa un linguaggio vivido e coinvolgente
- Non generare alcun tipo di testo che non sia il titolo dell’articolo
- Usa esclusivamente l’Italiano.
Presta particolare attenzione ai seguenti titoli di
Title 1
Title 2
...
Title 7
Your task is to generate a catchy and informative title for the article provided.
Requirements:
- Short title
- Capture the essence of the article
- Use vivid and engaging language
- Do not generate any type of text other than the title of the article
- Use Italian exclusively.
Pay particular attention to the following example titles and adopt the same style:
Title 1
Title 2
...
Title 7
Box 1: Zero-shot prompt and English translation.
4. Preliminary Evaluation
To get a first impression of LLM performance on our task, we conducted preliminary experiments by manually reviewing headlines generated by several models. Overall, the results were unsatisfactory: while the titles were generally coherent with the articles, they lacked captivation and originality. The majority of the generated headlines followed the same format, built around a colon, leading to repetitive and poorly formulated headlines. Examples of our preliminary results can be found in Table 1 in Appendix A. This behaviour persisted even when the models were explicitly instructed to avoid using colons in the titles, or when examples of titles were given. Out of 3,006 headlines generated by Phi-3.5 Mini-Instruct, 2,940 headlines contained a colon. We obtained similar results using Mistral-7B-Instruct-v0.3, Qwen2-7B-Instruct, gemma-2-9b-it and Italia-9B-Instruct-v0.1. Manual experimentation with the commercial LLMs Claude 3.5 Sonnet (https://www.anthropic.com/news/claude-3-5-sonnet) and ChatGPT 4o (https://openai.com/index/hello-gpt-4o/) yielded the same behaviour:
• Titolo originale: Una rapina cosmica nell’ammasso di galassie dell’Idra
• Claude: Rapina cosmica: il furto di gas nell’ammasso dell’Idra
• ChatGPT: Rapina Cosmica: NGC 3312 Derubata di Gas nell’Ammasso di Galassie dell’Idra
Interestingly, when we asked Claude 3.5 Sonnet to improve our prompt for generating headlines, it added a line to our example prompt that explicitly requested the unwanted behaviour. It appears that LLMs consistently regard this particular structure as the ideal format for a headline.
Given the inherent difficulty of interpreting LLM behaviour, we cannot provide a single reason for their preference for this particular construction. Of course, there might be a large presence of such headlines in the training data, particularly from lower-quality journals. There may also be an influence of Search Engine Optimization (SEO) on the behaviour of the model: giving importance to keywords is a classic SEO technique.
Moreover, we generally noticed a preference toward sentences poor in definite and indefinite articles when compared with human-written headlines.
5. Metrics
Automatically evaluating the quality of generated headlines is a challenging matter because headline quality is inherently subjective, multi-faceted, and context-dependent. Thus, instead of providing a single numeric
value as an overall quality score, headlines should be evaluated along multiple dimensions and subsequently rated for their quality based on specific use cases. To give examples of what others have done: Cafagna et al. [2] evaluate generated headlines based on criteria such as grammatical correctness, topic relevance, attractiveness, and overall appropriateness. Cai et al. [10] assess factors such as factual consistency, relevance, and surface overlap between the generated headline and the article, as well as its alignment with user-specific preferences.
In the aforementioned papers, the headlines were scored by human evaluators. This approach is resource intensive: to account for differences in individual preferences, hiring multiple human evaluators from varying demographic backgrounds is preferred. This does not scale well to the evaluation of multiple models on large-scale benchmarks across multiple studies, making the ability to automatically evaluate the outputs of LLMs essential.
Historically, n-gram overlap metrics like BLEU [11], ROUGE [8], or METEOR [12] have been used to compare generated outputs with reference “gold standard” texts, but these metrics emphasise surface-level matching and are therefore not robust to paraphrasing or other variations in acceptable outputs. Learned metrics such as COMET [13], a metric designed to mimic human quality judgement for machine translations, have been gaining in popularity. These are not easily transferable to other languages or tasks, and learnable metrics designed specifically for Italian headline generation are not available. Additionally, such metrics typically produce a single numerical score of ’quality’. To improve interpretability and ensure contextual flexibility, we would prefer to provide individual scores for each dimension. We train two novel learned metrics for Italian headline generation, but leave others for future work.
We evaluate model performance on our benchmark using four metrics: ROUGE [8], SBERT [9], and two custom metrics, the Headline-Article and Natural-Synthetic classifiers. Within the context of the CALAMITA challenge, the model’s final score will be an aggregate in which all four metrics are weighted equally. Each metric is detailed in the following sections.
5.1. ROUGE
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) [8] is a popular metric used to evaluate automatically generated summarizations. It provides a measure of overlap between generated text and gold-standard references. ROUGE is easily interpretable and allows for easy comparison across many papers due to its widespread use. However, it is not robust to variations in input, making it less suitable for the assessment of tasks involving creativity, such as headline generation. Following others [14], we will evaluate our system outputs using ROUGE-L, which identifies the length of the longest common subsequence between system and reference.
5.2. SBERT
Sentence-BERT, or SBERT [9], is a modification of the BERT network that uses Siamese networks and that can derive semantically meaningful, fixed-size vector embeddings from whole sentences. We use SBERT to compare our generated headlines to the gold-standard ones by comparing their SBERT embeddings using cosine similarity, which we then use directly as the similarity score. SBERT produces more meaningful sentence embeddings compared to BERT, which is not designed for sentence similarity tasks; therefore, cosine similarity with BERT embeddings could produce unwanted and less interpretable results.
5.3. Custom metrics
Given the limitations of the currently available metrics for the headline generation task, we develop two custom metrics employing classifiers based on Transformer [15] models. We trained both classifiers on a subset of the “blogs” section of the “Testimole” dataset (https://huggingface.co/datasets/mrinaldi/TestiMole), which was obtained by web scraping various Italian media sources. Our subset consists of only those parts of the dataset scraped from professional media outlets. The criteria for the selection process, as well as the technical details for each classifier, are in Appendix B.
5.3.1. HA Classifier
Our first classifier is based on the Sentence Transformers [9] architecture, fine-tuned to discriminate between coherent and non-coherent pairs of headlines and articles. A generated headline can score between 0 and 1, representative of the degree of alignment between the headline and the content of the article. Following the work by De Mattei et al. [3], we call this classifier "HA", or Headline-Article.
To train the model, we used a non-finetuned Italian Sentence Bert model (https://huggingface.co/nickprock/sentence-bert-base-italian-xxl-uncased) to compute an embedding for each article. For each article, we then found the article in the dataset with the highest cosine similarity and created a new dataset where each row contains the article (anchor), the original title (positive), and the title of the most similar article (negative). Because the original dataset contained some duplicate items, we filtered out all articles with a cosine similarity score of 1. With this dataset, we were able to use Triplet Loss to train the classifier
to differentiate between coherent and incoherent titles, starting from the assumption that the original title is the one most coherent with the article’s content. We decided to perform a cosine similarity search instead of random shuffling in order to increase the difficulty of the discriminator’s task.
The drawback of this approach is the small context window of the model: all articles were truncated after the first 512 tokens. While it is possible to develop a more complex architecture to account for larger texts, we leave this for future work.
5.3.2. NS Classifier
Our second classifier is called "NS", or Natural-Synthetic. It is a binary regression classifier based on an Italian BERT-base uncased model (https://huggingface.co/dbmdz/bert-base-italian-xxl-uncased), trained to discriminate between human-authored and machine-generated titles. Given a title as input, the classifier outputs a numerical score indicating the likelihood of the title being close to those written by journalists. We believe that similarity to headlines written by journalists may be a useful indicator of the quality and appropriateness of a generated headline.
Using the same subset of Testimole employed for the “HA” classifier, we generated over 90,000 synthetic headlines using LLMs of up to 9 billion parameters. To avoid overfitting our classifier to the specific probability distribution of a single model, we generated synthetic headlines using different models; this process is detailed in Appendix C, along with details about the number of generated headlines per model. The result is a labelled dataset containing original as well as generated headlines.
The advantage of employing a “Natural-Synthetic” classifier is that the training objective is coarse, encouraging the classifier to consider a broad range of aspects that may account for the discrepancy between text generated by machines and by humans.
6. Future works
We see value in future research using classifiers and regressors to assess specific aspects of generated headlines. Such metrics have the potential to capture complex probability distributions over a multitude of dimensions of the data, including dimensions that are not directly interpretable to human observation. For instance, a learned metric that predicts the amount of attention a headline will generate would be highly useful.
Inspired by Generative Adversarial Networks (GANs), we find the employment of classification-based metrics promising for developing a model specialized in headline generation. A discriminator/generator training system would allow us to build a positive feedback loop in which the headline generation system teaches itself to generate good headlines based on the classification of the discriminator. For instance, the model can be trained to ’fool’ the NS discriminator as often as possible while the NS discriminator uses the experience to improve at identifying synthetic data, causing both models to improve simultaneously. This method, for instance, should quickly solve the frequent use of the colon in automatically generated headlines outlined in Section 4.
7. Limitations
Our benchmark is limited to articles and headlines from only two journals, which restricts its representativeness across journalistic domains. As a result, it may not capture the variability present in publications targeting different demographics, covering varied topics, or representing a full spectrum of political perspectives.
In training our classifiers, we took care to prevent data contamination by ensuring non-overlapping splits between training and test sets. Nonetheless, given the public availability of the articles online, there remains a possibility that some test data may indirectly overlap with training data due to external access and prior exposure.
8. Ethical issues
This task is aimed at testing the factual knowledge which LLMs acquire during their training process, whose objective is language modelling. This task should not suggest, or stimulate, that LLMs should commonly be used as knowledge bases or as reliable sources of factual information. The investigation underlying this challenge is research-oriented, aimed at a better understanding of LLMs’ abilities, and at possibly suggesting ways to discern when models might be providing more or less reliable knowledge and possibly making them more transparent in their generated output.
9. Data license and copyright issues
Access to the data is granted for the evaluation, but the data cannot be shared publicly at the moment, also for reasons related to data contamination.
Acknowledgments
The authors would like to thank ANSA Scienza and Galileo, giornale di scienza - http://www.galileonet.it for
their interest in the GATTINA CALAMITA challenge and for the extremely valuable exchange of ideas that allowed us to shape a task of high potential impact in the field of journalism.
References
[1] G. Attanasio, P. Basile, F. Borazio, D. Croce, M. Francis, J. Gili, E. Musacchio, M. Nissim, V. Patti, M. Rinaldi, D. Scalena, CALAMITA: Challenge the Abilities of LAnguage Models in ITAlian, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, December 4 - December 6, 2024, CEUR Workshop Proceedings, CEUR-WS.org, 2024.
[2] M. Cafagna, L. D. Mattei, D. Bacciu, M. Nissim, Suitable doesn’t mean attractive. human-based evaluation of automatically generated headlines, in: R. Bernardi, R. Navigli, G. Semeraro (Eds.), Proceedings of the Sixth Italian Conference on Computational Linguistics, Bari, Italy, November 13-15, 2019, volume 2481 of CEUR Workshop Proceedings, CEUR-WS.org, 2019. URL: https://ceur-ws.org/Vol-2481/paper13.pdf.
[3] L. De Mattei, M. Cafagna, F. Dell’Orletta, M. Nissim, Invisible to people but not to machines: Evaluation of style-aware headline generation in absence of reliable human judgment, in: Proceedings of the Twelfth Language Resources and Evaluation Conference, 2020, pp. 6709–6717.
[4] X. Ao, X. Wang, L. Luo, Y. Qiao, Q. He, X. Xie, Pens: A dataset and generic framework for personalized news headline generation, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021, pp. 82–92.
[5] Y. Liang, N. Duan, Y. Gong, N. Wu, F. Guo, W. Qi, M. Gong, L. Shou, D. Jiang, G. Cao, et al., Xglue: A new benchmark dataset for cross-lingual pre-training, understanding and generation, arXiv preprint arXiv:2004.01401 (2020).
[6] Z. Ding, A. Smith-Renner, W. Zhang, J. Tetreault, A. Jaimes, Harnessing the power of LLMs: Evaluating human-AI text co-creation through the lens of news headline generation, in: H. Bouamor, J. Pino, K. Bali (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, Association for Computational Linguistics, Singapore, 2023, pp. 3321–3339. URL: https://aclanthology.org/2023.findings-emnlp.217. doi:10.18653/v1/2023.findings-emnlp.217.
[7] A. Rush, A neural attention model for abstractive sentence summarization, arXiv Preprint, CoRR, abs/1509.00685 (2015).
[8] C.-Y. Lin, ROUGE: A package for automatic evaluation of summaries, in: Text Summarization Branches Out, Association for Computational Linguistics, Barcelona, Spain, 2004, pp. 74–81. URL: https://aclanthology.org/W04-1013.
[9] N. Reimers, I. Gurevych, Sentence-bert: Sentence embeddings using siamese bert-networks, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2019. URL: https://arxiv.org/abs/1908.10084.
[10] P. Cai, K. Song, S. Cho, H. Wang, X. Wang, H. Yu, F. Liu, D. Yu, Generating user-engaging news headlines, in: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023, pp. 3265–3280.
[11] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a method for automatic evaluation of machine translation, in: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, 2002, pp. 311–318.
[12] A. Lavie, M. J. Denkowski, The meteor metric for automatic evaluation of machine translation, Machine translation 23 (2009) 105–115.
[13] R. Rei, C. Stewart, A. C. Farinha, A. Lavie, Comet: A neural framework for mt evaluation, arXiv preprint arXiv:2009.09025 (2020).
[14] M. Krubiński, P. Pecina, Towards unified uni-and multi-modal news headline generation, in: Findings of the Association for Computational Linguistics: EACL 2024, 2024, pp. 437–450.
[15] A. Vaswani, Attention is all you need, Advances in Neural Information Processing Systems (2017).
[16] M. Rinaldi, Testimole, 2024. URL: https://huggingface.co/datasets/mrinaldi/TestiMole.
A. Examples of Good titles selected by professional journalists
• Nella Via Lattea c’è un oggetto misterioso, è velocissimo
• Nasce il gemello digitale del rischio ambientale in Italia
• I cinque modi in cui il cervello invecchia
• Covid-19, il mistero degli over 90
• A 44 e a 60 anni i due gradini chiave dell’invecchiamento
• Palestra o snack? la scelta dipende da un messaggero chimico
• Dagli stadi alle spiagge, sono i salti a sincronizzare il ballo
• Dalle rose alle melanzane, ecco i geni delle spine
• Così il Covid accelera l’invecchiamento
• Uno zucchero naturale contro la calvizie, bene i test sui topi
• Scoperto nel cervello il circuito dell’effetto placebo
• Pronto il Google Earth del cuore umano
• Una molecola può ringiovanire il sistema immunitario
• Scoperto il dizionario dei sinonimi e contrari del cervello
• Le farfalle nello stomaco non sono solo un modo di dire
• Pronto il primo orologio nucleare, il più preciso del mondo
• Gli uccelli in volo si comportano come gli atomi
• L’Italia ritenta la sfida impossibile della geometria
• Le auto nel traffico come i batteri in cerca di cibo
• Robot come alleati, trovata la chiave per collaborare con gli umani
• Dalle spugne di vetro grattacieli più sottili e resistenti
• L’IA non è razionale, fa ragionamenti non logici
B. Examples of Model Outputs in Preliminary Evaluation
Original Title (Italian) | LLM Generated Title (Italian)
No allo studente flessibile | Riforma Universitaria: Studenti in Rivolta contro la Selettività e la Flessibilità
No to the flexible student | University reform: Students in revolt against selectivity and flexibility
Alzheimer, la proteina della speranza | Nicastrina: nuova chiave per Alzheimer, scoperta in Calabria
Alzheimer, the protein of hope | Nicastrin: new key for Alzheimer’s, discovered in Calabria
Un laser estremo | Euv: Nuova Frontiera nella Manipolazione Molecolare con Luce Ultra-Sottile
An extreme laser | EUV: New frontier in molecular manipulation with ultra-thin light
E’ italiano il primo modello genetico | Creazione topi per studiare il ritardo mentale: scoperta genetica rivoluzionaria
The first genetic model is Italian | Creation of mice to study mental disability: revolutionary genetic discovery
Carta d’identità per le balene | Genetica della balena: Identificazione di specie protette nel commercio illegale
Identity card for whales | Whale genetics: Identification of protected species in illegal trade
Scampati allo shock anafilattico | Scoperto il fattore eNOS nella reazione anafilattica: speranza per una cura
Escaped from anaphylactic shock | eNOS factor discovered in anaphylactic reaction: hope for a cure
La casa in un clic | Domotica: Software unificato per controllare tutti gli elettrodomestici, innovazione Cnr
The house in a click | Home automation: Unified software to control all appliances, Cnr innovation
Deforestazione senza fine | Deforestazione in aumento in Amazzonia: il Brasile reagisce
Endless deforestation | Deforestation increasing in the Amazon: Brazil reacts
"Più impegno contro il fumo" | Bioetica del tabagismo: raccomandazioni per campagne di prevenzione e interventi regolativi
More commitment against smoking | Bioethics of smoking: recommendations for prevention campaigns and regulatory interventions
Una carovana contro l’Oleoducto | Protesta contro l’Oleodotto Crudos Pasos: Ecuador, ambientalismo e scontro politico
A caravan against the pipeline | Protest against the Crudos Pasos pipeline: Ecuador, environmentalism and political clash
Analisi dell’egemonia americana | Egemonia Americana: Stati Fuorilegge e Diritto Internazionale
Analysis of American hegemony | American hegemony: Rogue states and international law
Censura sui dati scientifici | Scienza e Sicurezza: Accademia e Governo Discutono su Divulgazione nei Lavori Scientifici
Censorship on scientific data | Science and security: Academia and government discuss disclosure in scientific works
Siamo tutti africani: ricostruito il Dna degli antenati | DNA dell’Africa: Prove definitive dell’origine umana
We are all Africans: reconstructed the DNA of ancestors | DNA of Africa: Definitive proof of human origin
Rubbia racconta Galileo divulgatore di scienza | Galileo Divulgatore: Scienza e Civiltà dal XVII al Mondo Moderno
Rubbia tells Galileo as a science communicator | Galileo communicator: Science and civilization from the 17th century to the modern world
Galileo Galilei nel pensiero di Thomas Khun | Kuhn e Galileo: Rivoluzione scientifica attraverso la meccanica e l’esperimento
Galileo Galilei in the thought of Thomas Khun | Kuhn and Galileo: Scientific revolution through mechanics and experiment
Nella maglie della lingua | Realtà e Reti: L’Illusione di una Lingua Universale nella Scienza Cognitiva
In the mesh of the language | Reality and networks: The illusion of a universal language in cognitive science
Europa-America, la migrazione dei virus | Scambi e mutua scoperta: Storia sanitaria dell’unificazione globale
Europe-America, the migration of viruses | Exchanges and mutual discovery: Health history of global unification
Infiniti universi allacciati | Inflazione cosmica: un universo di mondi nascosti
Infinite universes connected | Cosmic inflation: a universe of hidden worlds
Table 1
Comparison of Original and LLM Generated Titles with Literal Translations.
C. Composition of the datasets used to train the classifiers
The dataset we used as a source of material for both the NS and HA classifiers is taken from "Testimole" [16], a massive collection of Italian web scraping data that includes a "blogs" subset containing, as of November 2024, more than 2.8 million posts from various online blogs and websites. From the original 2.8 million rows, we obtained a much smaller dataset by keeping only articles from sources that are, in our judgement, more similar to professional media outlets. After this selection process, which yielded a total of 715,335 articles, we filtered out articles written in languages other than Italian by using the "FastText Lang ID" field already present in Testimole. After pruning the foreign-language articles, 293,518 articles remained. Finally, we discarded all the rows whose article was shorter than 350 characters, arriving at a final dataset size of 264,455 articles. In the following sections, this dataset is referred to as "testimole-subset". In order to increase the diversity of data for the HA Classifier, we added to this dataset a collection of 432,000 articles taken from the professional Italian media outlet "Il Fatto Quotidiano": we had to add this source manually because the articles were missing from the original Testimole dataset due to a scraping issue. In the HA Classifier section, we refer to this additional subset as "testimole-subset-auxiliary". Finally, we refer to the small subset of Galileo used in the testing process as "experimental-dataset". The experimental dataset contains 3007 original headlines from "Galileo" and 3007 headlines generated using Phi 3.5 Mini Instruct from the same subset of Galileo’s articles.
D. NS Classifier
For the NS Classifier, we split the testimole-subset dataset into two sets: 60% of the dataset was kept with the original headline ("natural"), while in the remaining 40% the original headline was replaced with a generated one ("synthetic"). The original headline is kept as a reference in a separate column of the dataset. Specifically, we generated 93,921 headlines and kept 132,227 original headlines. There is no contamination between generated and original headlines: no synthetic headlines were generated for headlines that are present in the dataset with the "natural" label. The dataset was then divided into a "test" split (45230 entries, x natural, x synthetic) and a "train" split (180918 entries, 105885 natural, 75033 synthetic). For the generation, we ran Ollama on different models using the same prompt adopted for the evaluation. Table 2 shows the number of generated headlines for each model used.
Model | Count | Percentage
llama3.2:3b-instruct-fp16 | 51886 | 55.24%
qwen2.5:7b-instruct-q8_0 | 18418 | 19.61%
aya:8b-23-q8_0 | 17043 | 18.15%
mistral:7b-instruct-v0.3-q6_K | 6312 | 6.72%
phi3.5:3.8b-mini-instruct-fp16 | 262 | 0.28%
Table 2
Distribution of generated headlines by model
The classifier was created using Hugging Face’s transformers library. We initialized the model using AutoModelForSequenceClassification and trained it using a binary cross-entropy loss function (BCEWithLogitsLoss).
Training was conducted with a batch size of 32, a learning rate of 2 × 10⁻⁵, and a warmup ratio of 0.1 to help stabilize early training. A linear learning rate scheduler and the AdamW optimizer with gradient clipping were employed to manage learning stability. We also implemented early stopping, monitoring the F1 score to save the best model checkpoint and halt training if the model failed to improve over multiple epochs. The resulting model reached 95% accuracy on the test set. Accuracy is measured as the number of correctly predicted labels divided by the total number of examples. The threshold for assigning a positive or negative label was set at 0.5. Using a continuous score instead of the threshold led to the same result; for this reason we decided to keep only accuracy in this report.
After having tested the model, we decided to further train it on the test set in order to have an improved model to be used for the CALAMITA task.
We then tested this further trained model on the smaller "experimental-dataset", containing 3007 natural and 3007 synthetic headlines coming from the Galileo dataset. This evaluation obtained an accuracy of 87%.
While we initially used PyTorch directly to train the experimental versions of the model, we then decided, for simplicity, to adopt the Hugging Face transformers library so that the model could be easily uploaded to the Hugging Face hub. The further trained version of the model is available at: https://huggingface.co/mrinaldi/flash-it-ns-classifier-fpt
E. HA Classifier
In order to build the HA Classifier we first computed, for each article contained in the "testimole-subset" dataset, the embedding of the article’s text using SentenceBert with an Italian model (https://huggingface.co/nickprock/sentence-bert-base-italian-xxl-uncased) and added the embedding to a new column in the dataset. Then, we paired each article (source) of the dataset with the article (target) having the highest cosine similarity between the embeddings. After the pairing, both source and target were marked as "used" so that each article can appear no more than once in the resulting dataset, either as a source or as a target. The resulting dataset (https://huggingface.co/datasets/mrinaldi/flash-it-ha-dataset-cossim) has 6 columns:
• Anchor: the body of the "source" article
• Positive: the original title of the "source" article
• Negative: the original title of the "target" article
• Cosine similarity: the cosine similarity between the source’s and target’s embeddings, computed on their texts
• Url positive: the URL of the source article; it can be used as a key to find the original article in the Testimole dataset
• Url negative: the URL of the target article
Given the procedure employed for generating this dataset, the resulting number of rows is halved: starting from the original 256530 entries in the "testimole-subset" dataset, we obtained 128265 entries, divided into 102600 train entries and 25665 test entries. We believe that pairing by cosine similarity instead of randomly shuffling the articles can improve the performance of the classifier by increasing the difficulty of the task. Results for a classifier trained on randomly paired articles are reported in Table 3 (HA-RANDOM).
The classifier was created using Sentence-BERT, specifically by initializing the model with the SentenceTransformer class from the sentence_transformers library, using a pre-trained Italian model (https://huggingface.co/nickprock/sentence-bert-base-italian-xxl-uncased). To fine-tune this model, we employed a TripletLoss function to enhance similarity-based ranking in embedding space. The triplet loss was the optimal choice given our dataset because it requires an anchor, a positive and a negative example. The goal of the triplet loss is to maximize the distance between the anchor and the negative example while at the same time minimizing the distance between the anchor and the positive example. In this way, we encouraged the formation of meaningful embeddings tailored to minimize the distance between an article and a title coherent with its content, notwithstanding the 512 token length limitation.
Training was conducted over three epochs with a batch size of 64 for training and 16 for evaluation, using a learning rate of 2 × 10⁻⁵ and a warmup ratio of 0.1 to stabilize initial training steps. We used SentenceTransformerTrainingArguments to configure training, applying half-precision floating-point (fp16) to speed up processing. An evaluation was performed every 1,000 steps to monitor model performance, with checkpoints saved periodically to retain the best-performing model. We kept the "margin" value at 5, following the documentation of SentenceBert (https://sbert.net/docs/package_reference/sentence_transformer/losses.html#tripletloss).
The resulting classifier outputs a score representing the alignment between the article and its headline.
After having trained the HA Classifier on the "testimole-subset" dataset, we decided to use an additional dataset (testimole-auxiliary, introduced earlier as "testimole-subset-auxiliary") to further improve the classifier. Testimole-auxiliary, halved due to matching, has 216562 articles, of which 108281 were used as train and 108281 as test. The same procedure used for testimole-subset was applied to testimole-auxiliary. Table 3 sums up the results of the various models on the test datasets.
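The pairing procedure and the triplet objective described in this appendix can be illustrated with a minimal sketch. It is not the authors' code: the function names (`cosine_similarity`, `pair_articles`, `triplet_loss`) are hypothetical, the order in which sources are visited is an assumption, and the toy two-dimensional vectors stand in for the real article embeddings, which are assumed to be non-zero. The loss follows the standard triplet formulation, max(d(anchor, positive) − d(anchor, negative) + margin, 0), with the margin of 5 mentioned in the text.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pair_articles(embeddings):
    """Greedily pair each unused source with its most similar unused target.

    Both members of a pair are marked as used, so every article appears at
    most once (as source or as target) and the number of pairs is roughly
    half the number of articles. Visiting sources in index order is an
    assumption made for this illustration.
    """
    used = set()
    pairs = []
    for src in range(len(embeddings)):
        if src in used:
            continue
        best, best_sim = None, -2.0  # cosine similarity is never below -1
        for tgt in range(len(embeddings)):
            if tgt == src or tgt in used:
                continue
            sim = cosine_similarity(embeddings[src], embeddings[tgt])
            if sim > best_sim:
                best, best_sim = tgt, sim
        if best is None:
            continue  # odd article left over with no free partner
        used.update((src, best))
        pairs.append((src, best, best_sim))
    return pairs


def triplet_loss(d_pos, d_neg, margin=5.0):
    """Standard triplet loss for one (anchor, positive, negative) triplet.

    d_pos: distance between the article (anchor) and its own title.
    d_neg: distance between the article and the paired article's title.
    """
    return max(d_pos - d_neg + margin, 0.0)


if __name__ == "__main__":
    toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    pairs = pair_articles(toy_embeddings)
    print([(s, t) for s, t, _ in pairs])  # → [(0, 1), (2, 3)]
    print(triplet_loss(2.0, 3.0))         # → 4.0
```

Each resulting pair yields one row of the dataset: the source body is the anchor, the source title the positive, and the target title the negative. Because the target is the most similar available article, the negative title is a hard negative, which is the motivation given above for preferring cosine pairing over random shuffling.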
Model name | Model training set | Test set | Correct Triplets | Accuracy | Avg pos. dist. | Avg neg. dist. | Average Margin | ROC AUC
HA-Cossim | "testimole-subset" (Train) | "testimole-subset" (Test) | 21949 | 0.8552 | 0.4 | 0.73 | 0.33 | 0.84
HA-Cossim-FPT | "testimole-subset" (Train+Test) | "testimole-auxiliary" (Test) | 98913 | 0.9135 | 0.37 | 0.72 | 0.35 | 0.89
HA-Cossim-FFPT | "testimole-subset" (Train+Test), "testimole-auxiliary" (Train) | "testimole-auxiliary" (Test) | 106662 | 0.9850 | 0.3 | 0.76 | 0.47 | 0.96
HA-RANDOM | "testimole-subset" (Train) | "testimole-auxiliary" (Test) | 92523 | 0.8545 | 0.24 | 0.40 | 0.16 | 0.8
Table 3
Report of the results obtained by HA Classifier on the test datasets
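As a reading aid for Table 3, the sketch below shows one plausible way the per-triplet columns relate to each other. It assumes, without confirmation from the text, that a triplet counts as correct when the anchor is closer to the positive title than to the negative one, and that the average margin is the average negative distance minus the average positive distance. The function name `triplet_report` and the toy distances are hypothetical.

```python
def triplet_report(pos_dists, neg_dists):
    """Summarise triplet evaluation from per-triplet distances.

    pos_dists[i]: distance between article i and its own title.
    neg_dists[i]: distance between article i and the paired (negative) title.
    A triplet is counted as correct when the positive is strictly closer
    (an assumption about how "Correct Triplets" is defined).
    """
    if len(pos_dists) != len(neg_dists) or not pos_dists:
        raise ValueError("need two non-empty lists of equal length")
    n = len(pos_dists)
    correct = sum(1 for p, q in zip(pos_dists, neg_dists) if p < q)
    avg_pos = sum(pos_dists) / n
    avg_neg = sum(neg_dists) / n
    return {
        "correct_triplets": correct,
        "accuracy": correct / n,
        "avg_pos_dist": avg_pos,
        "avg_neg_dist": avg_neg,
        "avg_margin": avg_neg - avg_pos,
    }


if __name__ == "__main__":
    # Toy distances, not data from the paper.
    report = triplet_report([0.2, 0.5, 0.9, 0.4], [0.7, 0.6, 0.3, 0.8])
    print(report["correct_triplets"], report["accuracy"])  # → 3 0.75

    # Consistency check against the first row of Table 3:
    # 21949 correct triplets out of the 25665 test entries.
    print(round(21949 / 25665, 4))  # → 0.8552
```

Under these assumptions the reported figures are internally consistent: the accuracy column equals correct triplets divided by the size of the test split, and the average margin matches the difference between the two average distances up to rounding.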