=Paper= {{Paper |id=Vol-3878/121_calamita_long |storemode=property |title=GATTINA - GenerAtion of TiTles for Italian News Articles: A CALAMITA Challenge |pdfUrl=https://ceur-ws.org/Vol-3878/121_calamita_long.pdf |volume=Vol-3878 |authors=Maria Francis,Matteo Rinaldi,Jacopo Gili,Leonardo De Cosmo,Sandro Iannaccone,Malvina Nissim,Viviana Patti |dblpUrl=https://dblp.org/rec/conf/clic-it/FrancisRGCINP24 }} ==GATTINA - GenerAtion of TiTles for Italian News Articles: A CALAMITA Challenge== https://ceur-ws.org/Vol-3878/121_calamita_long.pdf
GATTINA - GenerAtion of TiTles for Italian News Articles: A CALAMITA Challenge

Maria Francis (1,2,*,†), Matteo Rinaldi (3,†), Jacopo Gili (3,†), Leonardo De Cosmo (4), Sandro Iannaccone (5), Malvina Nissim (1,‡) and Viviana Patti (3,‡)

1 CLCG, University of Groningen
2 University of Trento
3 University of Turin
4 ANSA
5 Galileo


                                                 Abstract
                                                 We introduce a new benchmark designed to evaluate the ability of Large Language Models (LLMs) to generate Italian-language
                                                 headlines for science news articles. The benchmark is based on a large dataset of science news articles obtained from Ansa
                                                 Scienza and Galileo, two important Italian media outlets. Effective headline generation requires more than summarizing
                                                 article content; headlines must also be informative, engaging, and suitable for the topic and target audience, making automatic
                                                 evaluation particularly challenging. To address this, we propose two novel transformer-based metrics to assess headline
                                                 quality. We aim for this benchmark to support the evaluation of Italian LLMs and to foster the development of tools to assist
                                                 in editorial workflows.

                                                 Keywords
                                                 CALAMITA Challenge, Italian, Benchmarking, Headline generation, Summarisation, LLMs



CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy
* Corresponding author.
† Shared first authorship.
‡ Shared supervision.
Email: maria.francis@unitn.it (M. Francis); matteo.rinaldi@unito.it (M. Rinaldi); jacopo.gili584@edu.unito.it (J. Gili); leodecosmo@gmail.com (L. D. Cosmo); iannaccone@galileonet.it (S. Iannaccone); m.nissim@rug.nl (M. Nissim); viviana.patti@unito.it (V. Patti)
GitHub: https://github.com/rosakun (M. Francis); https://github.com/mrinaldi97 (M. Rinaldi); https://github.com/Jj-source (J. Gili); https://github.com/malvinanissim (M. Nissim); https://github.com/vivpatti (V. Patti)
ORCID: 0009-0007-7638-9963 (M. Francis); 0009-0004-7488-8855 (M. Rinaldi); 0009-0007-1343-3760 (J. Gili); 0000-0001-5289-0971 (M. Nissim); 0000-0001-5991-370X (V. Patti)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction and Motivation

The title is undoubtedly one of the most important and crucial components of a journalistic article. A good title intrigues the reader, synthesises the news without anticipating its details, encourages further reading, and is simultaneously pleasant to read or hear. Often, the fate of an article is inextricably linked to the quality of its accompanying title: it is not uncommon for inherently interesting, in-depth, and factually correct articles to go unnoticed simply because they are accompanied by an inappropriate or unattractive title. Composing adequate titles is not a simple operation; it requires experience, sensitivity, balance, a sense of measure, and a deep understanding of the readers. There are no precise and inescapable "rules" – save, of course, for the usual deontological norms of pertinence and truth that regulate the journalistic profession – but in fact, the operation depends almost exclusively on the author’s expertise and must be evaluated on a case-by-case basis.

Factors that can influence the composition of a title include, for example, the topic and the "tone of voice" of the article (a piece reporting a crime news story, for instance, requires a measured, discreet, and respectful title; conversely, a piece on lifestyle can and should be paired with a lighter, ironic, and more colorful title); the style of the publication hosting the article; the destination format (the same article printed in a paper newspaper and published on an online outlet, for example, typically has two different titles); potential "conflicts" with other titles present on the same page (for instance: repetitions of the same word or phrase, or the enunciation of contradictory concepts); space limitations; prescriptions related to search engine optimisation (for example, the use of a particular word or expression particularly popular at the time of publication, or a specific position of words within the title).

It is in this context that the journalist’s toolkit has recently been enriched with a powerful new tool: Large language models (LLMs) undoubtedly have an important role in the world of journalism, including quality journalism. Although incapable of "understanding" content as a human journalist would, as well as the meaning of




words, LLMs are naturally capable of producing fluent, complex, plausible, and credible texts in a matter of moments. These models can not only improve the efficiency of editorial processes but also offer new creative and innovative possibilities for content creation, including the automatic generation of journalistic headlines. Analysing why it may be useful for journalism to have an LLM capable of generating titles leads us to consider numerous factors, such as time optimisation, content personalization, and the ability to maintain a high level of quality, coherence, and communicative impact. However, these tools also present many limitations and some dangers, particularly the risk of blindly relying on them.

Timing and speed, in particular, are among the great challenges of journalism: being the first to publish a story, especially online, is often essential to attract readers. However, as we have seen, generating effective and incisive titles requires skill and time, which are not always available. An LLM can drastically reduce the time needed to create appropriate titles, for example by suggesting to the author a series of reasoned choices or proposing modifications and corrections to an already written title, always keeping in mind preset criteria such as length, tone, attractiveness, clarity, and the publication’s style. Furthermore, if trained on the corpus of a particular publication, an LLM can suggest titles consistent with its tone of voice and editorial history.

Another important advantage that the use of LLMs can offer is the ability to personalise content for different platforms and audiences. In today’s newsrooms, journalists no longer have to worry only about print media but must also consider the web, social media, newsletters, and other digital distribution platforms. Each platform requires a different type of language, style, and length for titles. For example, a title optimised for Twitter (or X) must be short and incisive, while a title for a news website can be more descriptive. An LLM is capable of generating variants of a title based on the medium of dissemination, allowing newsrooms to adapt their content precisely and in a targeted manner. Moreover, using reader behavioural data, the LLM can generate more attractive titles for specific demographic groups, thus improving the engagement and communicative effectiveness of the news.

With this task, which is developed in the context of the CALAMITA Challenge [1] and which consists in asking an LLM to generate a headline given the corresponding full article, we have a twofold aim.

The first aim is to test and analyse the ability of existing and future LLMs on the task of headline generation in the context of Italian news articles. This would provide a substantial step forward compared to past experiments on headline generation for Italian, which were run by training much smaller sequence-to-sequence models from scratch [2, 3]. We expect that some of the shortcomings of the automatically generated headlines which were observed in previous work, such as lack of fluency and creativity [2], might not affect LLM-based generations.

The second aim is to provide a reliable, high quality dataset of articles and corresponding headlines in Italian, developed through a direct collaboration of language technology experts and journalists, which can be used and analysed well beyond the CALAMITA challenge. Although similar datasets exist for other languages [4, 5], this resource is still lacking for Italian.

Overall, experimenting with the use of LLMs for title generation can also be considered a first step towards the introduction of more extensive and comprehensive artificial intelligence agents, which assist the journalist in all phases of the creative process, from news research to drafting an outline, to writing the actual piece, and finally to its promotion. Indeed, a close interaction of language models and humans in this task has recently been shown to be key [6].

2. Challenge Description

The task of headline generation has often been treated as equal to an extreme summarization task [3, 7]. However, simply synthesising the content of the article into a brief description is not enough to provide a satisfying title. Additional characteristics such as attractiveness, creativeness, and many others also play a role. Writing appropriate headlines is challenging, even for current state-of-the-art LLMs.

Evaluating LLMs on the task of headline generation for Italian news articles thus serves multiple purposes. On one hand, it tests models’ capacity to properly understand, that is, to reprocess large source texts in a way that is faithful to the content of the text. On the other hand, it acts as a means to assess the performance of LLMs in many complex dimensions, such as attractiveness, creativity, or adherence to tone. Finally, this benchmark could prove useful in practical applications. For instance, it may help guide decisions on whether, and to what extent, a journal should integrate LLMs into its workflow. It may also serve as an effective testbed for future research and development towards effective deployment in real-world scenarios. One such avenue could be the use of prompting to achieve the desired style and tone in generated headlines.

In our challenge, language models are tasked with generating Italian-language headlines based on articles from scientific news journals written in Italian. Our dataset includes original articles from such journals, along with their human-authored titles. Models are provided the complete source text in the prompt, as well as instructions to generate a title that is brief, coherent, and captivating. We guide the model towards the specific editorial
style of the media outlet by including a small number of
examples of headlines in our prompt. We employ auto-
matic metrics that assess the model’s performance along
three dimensions:

    1. Coherency with the original article (HA classifier)
    2. Alignment with the style of human written head-
       lines (NS classifier)
    3. Similarity between the generated and the gold-
       standard headline (ROUGE [8], SBERT [9])

  However, considering the complexity of the task, we
believe that manually reviewing a sample of the gener-
ated headlines can offer additional perspectives on the
behaviour of the model.
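
To make the third dimension more concrete, the following is a minimal sketch of the two similarity computations it names: ROUGE-L, which scores the longest common subsequence (LCS) of tokens, and cosine similarity between two embedding vectors. It assumes lowercased whitespace tokenisation and the F1 form of ROUGE-L; the function names are hypothetical, the vectors would in practice come from an SBERT model, and this is not the scoring code used for the benchmark.

```python
import math


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 over lowercased whitespace tokens (a simplifying assumption)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (e.g. sentence embeddings)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


gold = "Una rapina cosmica nell'ammasso di galassie dell'Idra"
generated = "Rapina cosmica: il furto di gas nell'ammasso dell'Idra"
score = rouge_l_f1(generated, gold)
print(round(score, 3))
```

Because the tokenisation is purely whitespace-based, "cosmica:" and "cosmica" count as different tokens, which illustrates how sensitive surface-overlap metrics are to small variations in form.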
3. Data description

Our benchmark is based on two datasets consisting of science news articles from two different sources. In each dataset, we provide the full text of the article paired with the original, human-authored headline. Additionally, we include metadata such as link, date, author (if present) and subtitle.

3.1. Origin of data
The data were obtained via web scraping with custom Python scripts. Since links to articles more than a few weeks old are inaccessible on the Ansa website, we collected a large number by downloading the archived "Ansa Scienza" RSS feeds from The Wayback Machine and processing them to remove duplicates and extract links.

3.2. Data format
The data from web scraping were saved in "JSON Lines" (JSONL) format, with each line containing a JSON object with the following fields:

     • Title: the title of the article
     • Source: the name of the website
     • Date: the publishing date of the article
     • Author: the author of the article, if present
     • URL: the Internet address of the article
     • Text: the body of the article
     • ID: a unique identifier of the article

3.3. Detailed data statistics
Our dataset consists of 30,461 articles gathered from two sources:

    1. "ANSA scienza", the science section of the Italian newspaper "ANSA", from which we obtained 6,889 articles: 649 of which are from 2024, and the others are from a period of time between 2018 and 2022.
    2. The “Galileo” website, from which we sourced 23,572 articles dating from April 1996 to May 2024.

When measured with the “tiktoken o200k_base” tokenizer model, we obtained a total of 21,365,897 tokens for the Galileo dataset (average: 906 tokens per article, maximum: 24,306) and a total of 3,762,539 tokens for the Ansa dataset (average: 546 tokens per article, maximum: 7,600). Figures 1 and 2 depict the distribution of articles by token count in the Galileo and Ansa datasets respectively.

Figure 1: Distribution of articles by token count in the Galileo subset.

Figure 2: Distribution of articles by token count in the Ansa subset.
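
As an illustration of the data format and of how the token statistics above can be derived, the following is a minimal sketch that parses JSONL records and computes the total, average, and maximum token count over the article bodies. The exact key spelling ("Title", "Text", and so on), the date format, and the two sample records are assumptions made for the example. The tokenizer is passed in as a function so that the sketch runs without extra dependencies; `make_tiktoken_counter` is a hypothetical helper showing how the o200k_base encoding could be plugged in when the tiktoken package is installed.

```python
import json
from statistics import mean


def read_jsonl(lines):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


def token_stats(records, count_tokens):
    """Total, average, and maximum token count over the 'Text' field."""
    counts = [count_tokens(rec["Text"]) for rec in records]
    return {
        "articles": len(counts),
        "total": sum(counts),
        "average": mean(counts),
        "maximum": max(counts),
    }


def make_tiktoken_counter(encoding_name="o200k_base"):
    """Token counter backed by tiktoken (optional third-party dependency)."""
    import tiktoken  # imported lazily so the rest of the sketch stays stdlib-only

    enc = tiktoken.get_encoding(encoding_name)
    return lambda text: len(enc.encode(text))


# Two made-up records in the assumed layout (not taken from the dataset).
sample_lines = [
    '{"ID": "demo-1", "Source": "Galileo", "Title": "Titolo uno", '
    '"Date": "2024-05-01", "Author": null, "URL": "https://example.org/1", '
    '"Text": "Una rapina cosmica"}',
    '{"ID": "demo-2", "Source": "Ansa", "Title": "Titolo due", '
    '"Date": "2024-05-02", "Author": "N. N.", "URL": "https://example.org/2", '
    '"Text": "Il telescopio osserva una galassia lontana"}',
]

records = read_jsonl(sample_lines)
# Whitespace splitting stands in for the real tokenizer in this demo.
stats = token_stats(records, count_tokens=lambda text: len(text.split()))
print(stats)
```

To reproduce counts comparable to those reported above, the whitespace counter would be replaced with `make_tiktoken_counter()` and the sample lines with the lines of the actual JSONL file.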
3.4. Prompting
Due to the length of each article, the use of task examples in our prompt would be too computationally expensive. Therefore, we test the models in a zero-shot prompting setting. While we do not use any task examples in our prompt, we do provide seven examples of headlines. In this way, the model is given examples of the expected output (a title) rather than examples of the full task (article and title). Professional journalists made a list of 22 headlines that, in their opinion, were representative of a well-made writing process under the three aspects of being captivating, short and informative.

Each time the model is tested, 7 randomly chosen titles from the list are appended to the standard prompt. As a reference, the identifiers of the example headlines are also saved along with the output of the model. See Box 1 for our input prompt.

    Prompt for the LLM

     Il tuo compito è generare un titolo accattivante e informativo per l’articolo fornito.
     Requisiti:
     - Titolo breve
     - Cattura l’essenza dell’articolo
     - Usa un linguaggio vivido e coinvolgente
     - Non generare alcun tipo di testo che non sia il titolo dell’articolo
     - Usa esclusivamente l’Italiano.
     Presta particolare attenzione ai seguenti titoli di
     Title 1
     Title 2
     ...
     Title 7

     Your task is to generate a catchy and informative title for the article provided.
     Requirements:
     - Short title
     - Capture the essence of the article
     - Use vivid and engaging language
     - Do not generate any type of text other than the title of the article
     - Use Italian exclusively.
     Pay particular attention to the following example titles and adopt the same style:
     Title 1
     Title 2
     ...
     Title 7

   Box 1: Zero-shot prompt and English translation.

4. Preliminary Evaluation
To get a first impression of LLM performance on our task, we conducted preliminary experiments by manually reviewing headlines generated by several models. Overall, the results were unsatisfactory: while the titles were generally coherent with the articles, they were neither captivating nor original. The majority of the generated headlines followed a fixed two-part format split by a colon, leading to repetitive and poorly formulated headlines. Examples of our preliminary results can be found in Table 1 in Appendix A. This behaviour persisted even when the models were explicitly instructed to avoid using colons in the titles, or when examples of titles were given. Out of 3,006 headlines generated by Phi-3.5 Mini-Instruct, 2,940 headlines (about 98%) contained a colon. We obtained similar results using Mistral-7B-Instruct-v0.3, Qwen2-7B-Instruct, gemma-2-9b-it and Italia-9B-Instruct-v0.1. Manual experimentation with the commercial LLMs Claude 3.5 Sonnet (https://www.anthropic.com/news/claude-3-5-sonnet) and ChatGPT 4o (https://openai.com/index/hello-gpt-4o/) yielded the same behaviour:

     • Original title: Una rapina cosmica nell’ammasso di galassie dell’Idra
     • Claude: Rapina cosmica: il furto di gas nell’ammasso dell’Idra
     • ChatGPT: Rapina Cosmica: NGC 3312 Derubata di Gas nell’Ammasso di Galassie dell’Idra

Interestingly, when we asked Claude 3.5 Sonnet to improve our prompt for generating headlines, it added a line to our example prompt explicitly requesting the unwanted behaviour. It appears that LLMs consistently regard this particular structure as the ideal format for a headline.

Given the inherent difficulty of interpreting LLM behaviour, we cannot provide a single reason for their preference for this particular construction. Of course, there might be a large presence of such headlines in the training data, particularly from lower-quality journals. There may also be an influence of Search Engine Optimization (SEO) on the behaviour of the model: giving importance to keywords is a classic SEO technique.

Moreover, we generally noticed a preference toward sentences poor in definite and indefinite articles when compared with human-written headlines.

5. Metrics

Automatically evaluating the quality of generated headlines is a challenging matter because headline quality is inherently subjective, multi-faceted, and context-dependent. Thus, instead of providing a single numeric
value as an overall quality score, headlines should be evaluated along multiple dimensions and subsequently rated for their quality based on specific use cases. For example, Cafagna et al. [2] evaluate generated headlines based on criteria such as grammatical correctness, topic relevance, attractiveness, and overall appropriateness. Cai et al. [10] assess factors such as factual consistency, relevance, and surface overlap between the generated headline and the article, as well as its alignment with user-specific preferences.

In the aforementioned papers, the headlines were scored by human evaluators. This approach is resource intensive: to account for differences in individual preferences, hiring multiple human evaluators from varying demographic backgrounds is preferred. This does not scale well to the evaluation of multiple models on large-scale benchmarks across multiple studies, making the ability to automatically evaluate the outputs of LLMs essential.

Historically, n-gram overlap metrics like BLEU [11], ROUGE [8], or METEOR [12] have been used to compare generated outputs with reference “gold standard” texts, but these metrics emphasise surface-level matching and are therefore not robust to paraphrasing or other variations in acceptable outputs. Learned metrics such as COMET [13], a metric designed to mimic human quality judgement for machine translations, have been gaining in popularity. These are not easily transferable to other languages or tasks, and learned metrics designed specifically for Italian headline generation are not available. Additionally, such metrics typically produce a single numerical score of ’quality’. To improve interpretability and ensure contextual flexibility, we would prefer to provide individual scores for each dimension. We train two novel learned metrics for Italian headline generation, but leave others for future work.

We evaluate model performance on our benchmark using four metrics: ROUGE [8], SBERT [9], and two custom metrics: the Headline-Article and Natural-Synthetic classifiers. Within the context of the CALAMITA challenge, the model’s final score will be an aggregate in which all four metrics are weighted equally. Each metric is detailed in the following section.

5.1. ROUGE
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) [8] is a popular metric used to evaluate automatically generated summarizations. It provides a measure of overlap between generated text and gold-standard references. ROUGE is easily interpretable and allows for easy comparison across many papers due to its widespread use. However, it is not robust to variations in input, making it less suitable for the assessment of tasks involving creativity, such as headline generation. Following others [14], we will evaluate our system outputs using ROUGE-L, which identifies the length of the longest common subsequence between system and reference.

5.2. SBERT
Sentence-BERT, or SBERT [9], is a modification of the BERT network that uses Siamese networks and that can derive semantically meaningful, fixed-size vector embeddings from whole sentences. We use SBERT to compare our generated headlines to the gold-standard ones by comparing their SBERT embeddings using cosine similarity, which we then use directly as the similarity score. SBERT produces more meaningful sentence embeddings compared to BERT, which is not designed for sentence similarity tasks. Therefore, cosine similarity with BERT embeddings could produce unwanted and less interpretable results.

5.3. Custom metrics
Given the limitations of the currently available metrics for the headline generation task, we develop two custom metrics employing classifiers based on Transformer [15] models. We trained both classifiers on a subset of the “blogs” section of the “Testimole” dataset (https://huggingface.co/datasets/mrinaldi/TestiMole), which was obtained by web scraping various Italian media sources. Our subset consists of only those parts of the dataset scraped from professional media outlets. The criteria for the selection process, as well as the technical details for each classifier, are in Appendix B.

5.3.1. HA Classifier
Our first classifier is based on the Sentence Transformers [9] architecture, fine-tuned to discriminate between coherent and non-coherent pairs of headlines and articles. A generated headline can score between 0 and 1, representative of the degree of alignment between the headline and the content of the article. Following the work by De Mattei et al. [3], we call this classifier "HA", or Headline-Article.

To train the model, we used a non-finetuned Italian Sentence Bert model (https://huggingface.co/nickprock/sentence-bert-base-italian-xxl-uncased) to compute an embedding for each article. For each article, we then find the most similar other article in the dataset by cosine similarity, and create a new dataset where each row contains the article (anchor), the original title (positive), and the title of the most similar article (negative). Because the original dataset contained some duplicate items, we filtered out all pairs with a cosine similarity score of 1. With this dataset, we were able to use Triplet Loss to train the classifier
to differentiate between coherent and incoherent titles, starting from the assumption that the original title is the one most coherent with the article’s content. We decided to perform a cosine similarity search instead of random shuffling in order to increase the difficulty of the discriminator’s task.
   The drawback of this approach is the limited context window of the model - all articles were truncated after the first 512 tokens. While it is possible to develop a more complex architecture to account for larger texts, we leave this for future work.

5.3.2. NS Classifier
Our second classifier is called "NS", or Natural-Synthetic. It is a binary regression classifier based on an Italian BERT-base uncased model⁵, trained to discriminate between human-authored and machine-generated titles. Given a title as input, the classifier outputs a numerical score indicating the likelihood of the title being close to those written by journalists. We believe that similarity to headlines written by journalists may be a useful indicator of the quality and appropriateness of a generated headline.
   Using the same subset of Testimole employed for the “HA” classifier, we generated over 90,000 synthetic headlines using LLMs of up to 9 billion parameters. To avoid overfitting our classifier to the specific probability distribution of a single model, we generated synthetic headlines using different models; this process is detailed in Appendix C, along with details about the number of generated headlines per model. The result is a labelled dataset containing original as well as generated headlines.
   The advantage of employing a “Natural-Synthetic” classifier is that the training objective is coarse, encouraging the classifier to consider a broad range of aspects that may account for the discrepancy between text generated by machines and by humans.

⁵ https://huggingface.co/dbmdz/bert-base-italian-xxl-uncased

6. Future works
We see value in future research using classifiers and regressors to assess specific aspects of generated headlines. Such metrics have the potential to capture complex probability distributions over a multitude of dimensions of the data, including dimensions that are not directly interpretable to human observation. For instance, a learned metric that predicts the amount of attention a headline will generate would be highly useful.
   Inspired by Generative Adversarial Networks (GANs), we find the employment of classification-based metrics promising for developing a model specialized in headline generation. A discriminator/generator training system allows us to build a positive feedback loop in which the headline generation system teaches itself to generate good headlines based on the classification of the discriminator. For instance, the model can be trained to ’fool’ the NS discriminator as often as possible while the NS discriminator uses the experience to improve at identifying synthetic data, causing both models to improve simultaneously. This method should, for example, quickly address the frequent use of the colon in automatically generated headlines outlined in Section 4.

7. Limitations
Our benchmark is limited to articles and headlines from only two journals, which restricts its representativeness across journalistic domains. As a result, it may not capture the variability present in publications targeting different demographics, covering varied topics, or representing a full spectrum of political perspectives.
   In training our classifiers, we took care to prevent data contamination by ensuring non-overlapping splits between training and test sets. Nonetheless, given the public availability of the articles online, there remains a possibility that some test data may indirectly overlap with training data due to external access and prior exposure.

8. Ethical issues
This task is aimed at testing the factual knowledge which LLMs acquire during their training process, whose objective is language modelling. This task should not suggest, or stimulate, that LLMs should commonly be used as knowledge bases or as reliable sources of factual information. The investigation underlying this challenge is research-oriented, aimed at a better understanding of LLMs’ abilities, at possibly suggesting ways to discern when models might be providing more or less reliable knowledge, and at possibly making them more transparent in their generated output.

9. Data license and copyright issues
Access to the data is granted for the evaluation but cannot be shared publicly at the moment, also for reasons related to data contamination.

Acknowledgments
The authors would like to thank ANSA Scienza and Galileo, giornale di scienza - http://www.galileonet.it for
their interest in the GATTINA CALAMITA challenge and for the extremely valuable exchange of ideas that allowed us to shape a task of high potential impact in the field of journalism.

References
 [1] G. Attanasio, P. Basile, F. Borazio, D. Croce, M. Francis, J. Gili, E. Musacchio, M. Nissim, V. Patti, M. Rinaldi, D. Scalena, CALAMITA: Challenge the Abilities of LAnguage Models in ITAlian, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, December 4 - December 6, 2024, CEUR Workshop Proceedings, CEUR-WS.org, 2024.
 [2] M. Cafagna, L. D. Mattei, D. Bacciu, M. Nissim, Suitable doesn’t mean attractive. human-based evaluation of automatically generated headlines, in: R. Bernardi, R. Navigli, G. Semeraro (Eds.), Proceedings of the Sixth Italian Conference on Computational Linguistics, Bari, Italy, November 13-15, 2019, volume 2481 of CEUR Workshop Proceedings, CEUR-WS.org, 2019. URL: https://ceur-ws.org/Vol-2481/paper13.pdf.
 [3] L. De Mattei, M. Cafagna, F. Dell’Orletta, M. Nissim, Invisible to people but not to machines: Evaluation of style-aware headline generation in absence of reliable human judgment, in: Proceedings of the Twelfth Language Resources and Evaluation Conference, 2020, pp. 6709–6717.
 [4] X. Ao, X. Wang, L. Luo, Y. Qiao, Q. He, X. Xie, Pens: A dataset and generic framework for personalized news headline generation, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021, pp. 82–92.
 [5] Y. Liang, N. Duan, Y. Gong, N. Wu, F. Guo, W. Qi, M. Gong, L. Shou, D. Jiang, G. Cao, et al., Xglue: A new benchmark dataset for cross-lingual pre-training, understanding and generation, arXiv preprint arXiv:2004.01401 (2020).
 [6] Z. Ding, A. Smith-Renner, W. Zhang, J. Tetreault, A. Jaimes, Harnessing the power of LLMs: Evaluating human-AI text co-creation through the lens of news headline generation, in: H. Bouamor, J. Pino, K. Bali (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, Association for Computational Linguistics, Singapore, 2023, pp. 3321–3339. URL: https://aclanthology.org/2023.findings-emnlp.217. doi:10.18653/v1/2023.findings-emnlp.217.
 [7] A. Rush, A neural attention model for abstractive sentence summarization, arXiv Preprint, CoRR, abs/1509.00685 (2015).
 [8] C.-Y. Lin, ROUGE: A package for automatic evaluation of summaries, in: Text Summarization Branches Out, Association for Computational Linguistics, Barcelona, Spain, 2004, pp. 74–81. URL: https://aclanthology.org/W04-1013.
 [9] N. Reimers, I. Gurevych, Sentence-bert: Sentence embeddings using siamese bert-networks, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2019. URL: https://arxiv.org/abs/1908.10084.
[10] P. Cai, K. Song, S. Cho, H. Wang, X. Wang, H. Yu, F. Liu, D. Yu, Generating user-engaging news headlines, in: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023, pp. 3265–3280.
[11] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a method for automatic evaluation of machine translation, in: Proceedings of the 40th annual meeting of the Association for Computational Linguistics, 2002, pp. 311–318.
[12] A. Lavie, M. J. Denkowski, The meteor metric for automatic evaluation of machine translation, Machine translation 23 (2009) 105–115.
[13] R. Rei, C. Stewart, A. C. Farinha, A. Lavie, Comet: A neural framework for mt evaluation, arXiv preprint arXiv:2009.09025 (2020).
[14] M. Krubiński, P. Pecina, Towards unified uni-and multi-modal news headline generation, in: Findings of the Association for Computational Linguistics: EACL 2024, 2024, pp. 437–450.
[15] A. Vaswani, Attention is all you need, Advances in Neural Information Processing Systems (2017).
[16] M. Rinaldi, Testimole, 2024. URL: https://huggingface.co/datasets/mrinaldi/TestiMole.

A. Examples of Good titles selected by professional journalists
   • Nella Via Lattea c’è un oggetto misterioso, è velocissimo
   • Nasce il gemello digitale del rischio ambientale in Italia
   • I cinque modi in cui il cervello invecchia
   • Covid-19, il mistero degli over 90
   • A 44 e a 60 anni i due gradini chiave dell’invecchiamento
   • Palestra o snack? la scelta dipende da un messaggero chimico
   • Dagli stadi alle spiagge, sono i salti a sincronizzare il ballo
   • Dalle rose alle melanzane, ecco i geni delle spine
   • Così il Covid accelera l’invecchiamento
   • Uno zucchero naturale contro la calvizie, bene i test sui topi
   • Scoperto nel cervello il circuito dell’effetto placebo
   • Pronto il Google Earth del cuore umano
   • Una molecola può ringiovanire il sistema immunitario
   • Scoperto il dizionario dei sinonimi e contrari del cervello
   • Le farfalle nello stomaco non sono solo un modo di dire
   • Pronto il primo orologio nucleare, il più preciso del mondo
   • Gli uccelli in volo si comportano come gli atomi
   • L’Italia ritenta la sfida impossibile della geometria
   • Le auto nel traffico come i batteri in cerca di cibo
   • Robot come alleati, trovata la chiave per collaborare con gli umani
   • Dalle spugne di vetro grattacieli più sottili e resistenti
   • L’IA non è razionale, fa ragionamenti non logici


B. Examples of Model Outputs in Preliminary Evaluation
 Original Title (Italian)                                  LLM Generated Title (Italian)
 No allo studente flessibile                               Riforma Universitaria: Studenti in Rivolta contro la Selettività e la Flessibilità
 No to the flexible student                                University reform: Students in revolt against selectivity and flexibility
 Alzheimer, la proteina della speranza                     Nicastrina: nuova chiave per Alzheimer, scoperta in Calabria
 Alzheimer, the protein of hope                            Nicastrin: new key for Alzheimer’s, discovered in Calabria
 Un laser estremo                                          Euv: Nuova Frontiera nella Manipolazione Molecolare con Luce Ultra-Sottile
 An extreme laser                                          EUV: New frontier in molecular manipulation with ultra-thin light
 E’ italiano il primo modello genetico                     Creazione topi per studiare il ritardo mentale: scoperta genetica rivoluzionaria
 The first genetic model is Italian                        Creation of mice to study mental disability: revolutionary genetic discovery
 Carta d’identità per le balene                            Genetica della balena: Identificazione di specie protette nel commercio illegale
 Identity card for whales                                  Whale genetics: Identification of protected species in illegal trade
 Scampati allo shock anafilattico                          Scoperto il fattore eNOS nella reazione anafilattica: speranza per una cura
 Escaped from anaphylactic shock                           eNOS factor discovered in anaphylactic reaction: hope for a cure
 La casa in un clic                                        Domotica: Software unificato per controllare tutti gli elettrodomestici, innovazione Cnr
 The house in a click                                      Home automation: Unified software to control all appliances, Cnr innovation
 Deforestazione senza fine                                 Deforestazione in aumento in Amazzonia: il Brasile reagisce
 Endless deforestation                                     Deforestation increasing in the Amazon: Brazil reacts
 "Più impegno contro il fumo"                              Bioetica del tabagismo: raccomandazioni per campagne di prevenzione e interventi regolativi
 More commitment against smoking                           Bioethics of smoking: recommendations for prevention campaigns and regulatory interventions
 Una carovana contro l’Oleoducto                           Protesta contro l’Oleodotto Crudos Pasos: Ecuador, ambientalismo e scontro politico
 A caravan against the pipeline                            Protest against the Crudos Pasos pipeline: Ecuador, environmentalism and political clash
 Analisi dell’egemonia americana                           Egemonia Americana: Stati Fuorilegge e Diritto Internazionale
 Analysis of American hegemony                             American hegemony: Rogue states and international law
 Censura sui dati scientifici                              Scienza e Sicurezza: Accademia e Governo Discutono su Divulgazione nei Lavori Scientifici
 Censorship on scientific data                             Science and security: Academia and government discuss disclosure in scientific works
 Siamo tutti africani: ricostruito il Dna degli antenati   DNA dell’Africa: Prove definitive dell’origine umana
 We are all Africans: reconstructed the DNA of ancestors   DNA of Africa: Definitive proof of human origin
 Rubbia racconta Galileo divulgatore di scienza            Galileo Divulgatore: Scienza e Civiltà dal XVII al Mondo Moderno
 Rubbia tells Galileo as a science communicator            Galileo communicator: Science and civilization from the 17th century to the modern world
 Galileo Galilei nel pensiero di Thomas Khun               Kuhn e Galileo: Rivoluzione scientifica attraverso la meccanica e l’esperimento
 Galileo Galilei in the thought of Thomas Khun             Kuhn and Galileo: Scientific revolution through mechanics and experiment
 Nella maglie della lingua                                 Realtà e Reti: L’Illusione di una Lingua Universale nella Scienza Cognitiva
 In the mesh of the language                               Reality and networks: The illusion of a universal language in cognitive science
 Europa-America, la migrazione dei virus                   Scambi e mutua scoperta: Storia sanitaria dell’unificazione globale
 Europe-America, the migration of viruses                  Exchanges and mutual discovery: Health history of global unification
 Infiniti universi allacciati                              Inflazione cosmica: un universo di mondi nascosti
 Infinite universes connected                              Cosmic inflation: a universe of hidden worlds

Table 1
Comparison of Original and LLM Generated Titles with Literal Translations.
C. Composition of the datasets used to train the classifiers
The dataset we used as a source of material for both the NS and HA classifiers is taken from "Testimole" [16], a massive collection of Italian web scraping data that includes a "blogs" subset containing, as of November 2024, more than 2.8 million posts from various online blogs and websites. From the original 2.8 million rows, we obtained a much smaller dataset by keeping only articles coming from sources that are, in our judgement, more similar to professional media outlets. After this selection process, which yielded a total of 715,335 articles, we filtered out articles written in languages other than Italian by using the "FastText Lang ID" field already present in Testimole. After pruning the foreign-language articles, 293,518 articles remained. Finally, we discarded all the rows whose article was shorter than 350 characters, arriving at a final dataset size of 264,455 articles. In the following sections, this dataset is referred to as "testimole-subset". In order to increase the diversity of data for the HA Classifier, we added to this dataset a collection of 432,000 articles taken from the professional Italian media outlet "Il Fatto Quotidiano": we had to add this source manually because the articles were missing from the original Testimole dataset due to a scraping issue. In the section on the HA Classifier, we refer to this additional subset as "testimole-subset-auxiliary". Finally, we refer to the small subset of Galileo used in the testing process as "experimental-dataset". The experimental dataset contains 3007 original headlines from "Galileo" and 3007 headlines generated using Phi 3.5 Mini Instruct from the same subset of Galileo’s articles.

D. NS Classifier
For the NS Classifier, we split the testimole-subset dataset into two sets: 60% of the dataset was kept with the original headline ("natural"), while in the remaining 40% the original headline was substituted with a generated one ("synthetic"). The original headline is kept as a reference in a separate column of the dataset. Specifically, we generated 93,921 headlines and kept 132,227 original headlines. There is no contamination between generated and original headlines: no synthetic headlines were generated for articles whose headline is present in the dataset with the "natural" label. The dataset was then divided into a "test" split (45230 entries, x natural, x synthetic) and a "train" split (180918 entries, 105885 natural, 75033 synthetic). For the generation, we ran Ollama on different models using the same prompt adopted for the evaluation. Table 2 reports the number of generated headlines for each model used.

           Model                              Count    Percentage
           llama3.2:3b-instruct-fp16          51886    55.24%
           qwen2.5:7b-instruct-q8_0           18418    19.61%
           aya:8b-23-q8_0                     17043    18.15%
           mistral:7b-instruct-v0.3-q6_K      6312     6.72%
           phi3.5:3.8b-mini-instruct-fp16     262      0.28%
Table 2
Distribution of generated headlines by model

   The classifier was created using Hugging Face’s transformers library. We initialized the model using AutoModelForSequenceClassification and trained the model using a binary cross-entropy loss function (BCEWithLogitsLoss).
   Training was conducted with a batch size of 32, a learning rate of 2 × 10⁻⁵, and a warmup ratio of 0.1 to help stabilize early training. A linear learning rate scheduler and the AdamW optimizer with gradient clipping were employed to manage learning stability. We also implemented early stopping, monitoring the F1 score to save the best model checkpoint and halt training if the model failed to improve over multiple epochs. The resulting model reached 95% accuracy on the test set. Accuracy is measured as the number of correctly guessed labels divided by the total number of examples. The threshold to decide between a positive and a negative label was set at 0.5. Using a continuous score instead of the threshold led to the same result; for this reason we decided to keep only accuracy in this report.
   After having tested the model, we decided to further train it on the test set in order to have an improved model to be used for the CALAMITA task.
   We then tested this further trained model on the smaller "experimental-dataset" dataset containing 3007 natural and 3007 synthetic headlines coming from the Galileo dataset. This evaluation obtained an accuracy of 87%.
   While initially we directly used PyTorch to train the experimental versions of the model, we then decided for simplicity to adopt the HuggingFace transformers library to easily upload the model to the HuggingFace hub. The further trained version of the model is available at the address: https://huggingface.co/mrinaldi/flash-it-ns-classifier-fpt

E. HA Classifier
In order to build the HA Classifier we first computed, for each article contained in the "testimole-subset" dataset, the embedding of the article’s text using SentenceBert with an Italian model⁶ and added the embedding to a new column in the dataset. Then, we paired each article (source) of the dataset with the article (target) having the highest cosine similarity between the embeddings. After the pairing, both source and target were marked as "used" so that each article can appear no more than once in the resulting dataset, either as a source or as a target.

⁶ https://huggingface.co/nickprock/sentence-bert-base-italian-xxl-uncased
⁷ https://huggingface.co/datasets/mrinaldi/flash-it-ha-dataset-cossim

The resulting dataset⁷ has 6 columns:
        • Anchor: the body of the "source" article
        • Positive: the original title of the "source" article
        • Negative: the original title of the "target" article
        • Cosine similarity: the cosine similarity between the source’s and target’s embeddings computed on their texts
        • Url positive: the URL of the source article, which can be used as a key to find the original article in the Testimole dataset
        • Url negative: the URL of the target article

Given the procedure employed for generating this dataset, the resulting number of rows is halved so that, starting from the original 256530 entries in the "testimole-subset" dataset, we obtained 128265 entries, divided into 102600 train entries and 25665 test entries. We believe that using the cosine similarity instead of randomly shuffling the articles can improve the performance of the classifier by increasing the difficulty of the task. Results for a classifier trained on randomly paired articles are reported in Table 3.
   The classifier was created using Sentence-BERT, specifically by initializing the model with the SentenceTransformer class from the sentence_transformers library, using a pre-trained Italian model⁸. To fine-tune this model, we employed a TripletLoss function to enhance similarity-based ranking in embedding space. The triplet loss was a natural choice given our dataset because it requires an anchor, a positive and a negative example. The goal of the triplet loss is to maximize the distance between the anchor and the negative example while at the same time minimizing the distance between the anchor and the positive example. In this way, we encouraged the formation of meaningful embeddings tailored to minimize the distance between an article and a title coherent with its content, notwithstanding the 512 token length limitation.
   Training was conducted over three epochs with a batch size of 64 for training and 16 for evaluation, using a learning rate of 2 × 10⁻⁵ and a warmup ratio of 0.1 to stabilize initial training steps. We used the SentenceTransformerTrainingArguments to configure training, applying half-precision floating-point (fp16) to speed up processing. An evaluation was performed every 1,000 steps to monitor model performance, with checkpoints saved periodically to retain the best-performing model. We kept the "margin" value at "5" following the documentation of SentenceBert.⁹
   The resulting classifier outputs a score representing the alignment between the article and its headline.
   After having trained the HA Classifier on the "testimole-subset" dataset, we decided to use an additional dataset (testimole-auxiliary) to further improve the classifier. Testimole-auxiliary, halved due to matching, has 216562 articles, of which 108281 were used as train and 108281 as test. The same procedure used for testimole-subset was applied to testimole-auxiliary. Table 3 sums up the results of the various models on the test datasets.

⁸ https://huggingface.co/nickprock/sentence-bert-base-italian-xxl-uncased
⁹ https://sbert.net/docs/package_reference/sentence_transformer/losses.html#tripletloss
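
The pairing procedure and the triplet objective described in this appendix can be illustrated with a minimal NumPy sketch. It is not the authors' code: the function names `pair_by_cosine` and `triplet_loss` are hypothetical, the greedy visiting order and the restriction of targets to not-yet-used articles are assumptions, and so is the choice to drop pairs whose similarity is (numerically) 1 as duplicates. The loss uses Euclidean distance with a margin of 5, which we assume matches the setting mentioned above; the dense similarity matrix is only practical for small inputs.

```python
import numpy as np


def pair_by_cosine(embeddings, eps=1e-6):
    """Greedily pair each unused article with its most similar unused article.

    Returns (source, target, similarity) tuples; every index appears at most once.
    """
    emb = np.asarray(embeddings, dtype=float)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # unit-length rows
    sims = emb @ emb.T                                       # pairwise cosine similarity
    np.fill_diagonal(sims, -np.inf)                          # an article cannot pair with itself
    used = np.zeros(len(emb), dtype=bool)
    pairs = []
    for source in range(len(emb)):
        if used[source]:
            continue
        candidates = np.where(used, -np.inf, sims[source])   # hide already-used targets
        target = int(np.argmax(candidates))
        if not np.isfinite(candidates[target]):
            break                                            # no unused partner is left
        used[source] = used[target] = True
        similarity = float(sims[source, target])
        if similarity >= 1.0 - eps:
            continue                                         # assumed duplicate: drop the pair
        pairs.append((source, target, similarity))
    return pairs


def triplet_loss(anchor, positive, negative, margin=5.0):
    """Hinge-style triplet loss: max(d(a, p) - d(a, n) + margin, 0), Euclidean d (assumed)."""
    d_pos = float(np.linalg.norm(anchor - positive))
    d_neg = float(np.linalg.norm(anchor - negative))
    return max(d_pos - d_neg + margin, 0.0)


# Toy data: articles 0 and 1 are close, as are articles 2 and 3.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
pairs = pair_by_cosine(embeddings)

# Positive at distance 1, negative at distance 3, margin 5: loss is 1 - 3 + 5 = 3.
loss = triplet_loss(np.array([0.0, 0.0]), np.array([0.0, 1.0]), np.array([0.0, 3.0]))
```

In each resulting pair, the source article would supply the anchor and the positive title, and the target article would supply the negative title. Because the negative comes from the most similar article rather than a random one, the triplets are harder to separate, which is the motivation given above for preferring cosine search over random shuffling.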
Model name       Model training set                                               Test set                       Correct triplets   Accuracy   Avg pos. dist.   Avg neg. dist.   Average margin   ROC AUC
HA-Cossim        "testimole-subset" (Train)                                       "testimole-subset" (Test)      21949              0.8552     0.4              0.73             0.33             0.84
HA-Cossim-FPT    "testimole-subset" (Train+Test)                                  "testimole-auxiliary" (Test)   98913              0.9135     0.37             0.72             0.35             0.89
HA-Cossim-FFPT   "testimole-subset" (Train+Test), "testimole-auxiliary" (Train)   "testimole-auxiliary" (Test)   106662             0.9850     0.3              0.76             0.47             0.96
HA-RANDOM        "testimole-subset" (Train)                                       "testimole-auxiliary" (Test)   92523              0.8545     0.24             0.40             0.16             0.8

Table 3
Report of the results obtained by HA Classifier on the test datasets
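
As a reading aid for Table 3, the sketch below shows one plausible way the triplet columns relate to each other. It assumes that a triplet counts as "correct" when the anchor is closer to the positive title than to the negative one, and that the average margin is the average negative distance minus the average positive distance; neither definition is stated explicitly in the text, and the helper name `triplet_report` is hypothetical. The last lines recompute the accuracy column from the "Correct triplets" counts and the test-set sizes given in Appendix E (25665 and 108281).

```python
def triplet_report(pos_dist, neg_dist):
    """Summarise anchor-positive and anchor-negative distances over a set of triplets."""
    if len(pos_dist) != len(neg_dist) or not pos_dist:
        raise ValueError("need two non-empty distance lists of equal length")
    n = len(pos_dist)
    # Assumed definition: a triplet is correct when the positive is strictly closer.
    correct = sum(p < q for p, q in zip(pos_dist, neg_dist))
    avg_pos = sum(pos_dist) / n
    avg_neg = sum(neg_dist) / n
    return {
        "correct_triplets": correct,
        "accuracy": correct / n,
        "avg_pos_dist": avg_pos,
        "avg_neg_dist": avg_neg,
        "avg_margin": avg_neg - avg_pos,
    }


# Toy distances: the second triplet is ranked incorrectly.
report = triplet_report([0.2, 0.5, 0.9], [0.6, 0.4, 1.0])

# (correct triplets, test-set size) taken from Table 3 and Appendix E.
table3 = {
    "HA-Cossim": (21949, 25665),
    "HA-Cossim-FPT": (98913, 108281),
    "HA-Cossim-FFPT": (106662, 108281),
    "HA-RANDOM": (92523, 108281),
}
accuracies = {name: round(c / n, 4) for name, (c, n) in table3.items()}
```

Under these assumptions the recomputed accuracies agree with the table to four decimals, and the reported average margins match the difference between the two average distances up to rounding.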