<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>GATTINA - GenerAtion of TiTles for Italian News Articles: A CALAMITA Challenge</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Maria Francis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matteo Rinaldi</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jacopo Gili</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Leonardo De Cosmo</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sandro Iannaccone</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Malvina Nissim</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Viviana Patti</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CLCG, University of Groningen</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Trento</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Turin</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <abstract>
        <p>We introduce a new benchmark designed to evaluate the ability of Large Language Models (LLMs) to generate Italian-language headlines for science news articles. The benchmark is based on a large dataset of science news articles obtained from Ansa Scienza and Galileo, two important Italian media outlets. Effective headline generation requires more than summarizing article content; headlines must also be informative, engaging, and suitable for the topic and target audience, making automatic evaluation particularly challenging. To address this, we propose two novel transformer-based metrics to assess headline quality. We aim for this benchmark to support the evaluation of Italian LLMs and to foster the development of tools to assist in editorial workflows.</p>
      </abstract>
      <kwd-group>
        <kwd>CALAMITA Challenge</kwd>
        <kwd>Italian</kwd>
        <kwd>Benchmarking</kwd>
        <kwd>Headline generation</kwd>
        <kwd>Summarisation</kwd>
        <kwd>LLMs</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction and Motivation</title>
      <p>The title is undoubtedly one of the most important and crucial components of a journalistic article. A good title intrigues the reader, synthesises the news without anticipating its details, encourages further reading, and is simultaneously pleasant to read or hear. Often, the fate of an article is inextricably linked to the quality of its accompanying title: it is not uncommon for inherently interesting, in-depth, and factually correct articles to go unnoticed simply because they are accompanied by an inappropriate or unattractive title. Composing adequate titles is not a simple operation; it requires experience, sensitivity, balance, a sense of measure, and a deep understanding of the readers. There are no precise and inescapable "rules" – save, of course, for the usual deontological norms of pertinence and truth that regulate the journalistic profession – and in fact the operation depends almost exclusively on the author's expertise and must be evaluated on a case-by-case basis.</p>
      <p>Factors that can influence the composition of a title include, for example, the topic and the "tone of voice" of the article (a piece reporting a crime news story, for instance, requires a measured, discreet, and respectful title; conversely, a piece on lifestyle can and should be paired with a lighter, ironic, and more colorful title); the style of the publication hosting the article; the destination format (the same article printed in a paper newspaper and published on an online outlet, for example, typically has two different titles); potential "conflicts" with other titles present on the same page (for instance, repetitions of the same word or phrase, or the enunciation of contradictory concepts); space limitations; and prescriptions related to search engine optimisation (for example, the use of a particular word or expression particularly popular at the time of publication, or a specific position of words within the title).</p>
      <p>
        It is in this context that the journalist's toolkit has recently been enriched with a powerful new tool: Large Language Models (LLMs) undoubtedly have an important role in the world of journalism, including quality journalism. Although incapable of "understanding" content words, LLMs are naturally capable of producing fluent, complex, plausible, and credible texts in a matter of moments. These models not only can improve the efficiency of editorial processes but also offer new creative and innovative possibilities for content creation, including the automatic generation of journalistic headlines. Analysing why it may be useful for journalism to have an LLM capable of generating titles leads us to consider numerous factors, such as time optimisation, content personalization, and the ability to maintain a high level of quality, coherence, and communicative impact. However, these tools also present many limitations and some dangers, particularly the risk of blindly relying on them.
      </p>
      <p>
        Timing and speed, in particular, are among the great challenges of journalism: being the first to publish a story, especially online, is often essential to attract readers. However, as we have seen, generating effective and incisive titles requires skill and time, which is not always available. An LLM can drastically reduce the time needed to create appropriate titles, for example by suggesting to the author a series of reasoned choices or proposing modifications and corrections to an already written title, always keeping in mind preset criteria such as length, tone, attractiveness, clarity, and the publication's style. Furthermore, if trained on the corpus of a particular publication, an LLM can suggest titles consistent with its tone of voice and editorial history.
      </p>
      <p>
        Another important advantage that the use of LLMs can offer is the ability to personalise content for different platforms and audiences. In today's newsrooms, journalists no longer have to worry only about print media but must also consider the web, social media, newsletters, and other digital distribution platforms. Each platform requires a different type of language, style, and length for titles. For example, a title optimised for Twitter (or X) must be short and incisive, while a title for a news website can be more descriptive. An LLM is capable of generating variants of a title based on the medium of dissemination, allowing newsrooms to adapt their content precisely and in a targeted manner. Moreover, using reader behavioural data, the LLM can generate more attractive titles for specific demographic groups, thus improving the engagement and communicative effectiveness of the news.
      </p>
      <p>
        With this task, which is developed in the context of the CALAMITA Challenge [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and which consists in asking an LLM to generate a headline given the corresponding full article, we have a twofold aim.
      </p>
      <p>
        The first aim is to test and analyse the ability of existing and future LLMs on the task of headline generation in the context of Italian news articles. This would provide a substantial step forward compared to past experiments on headline generation for Italian, which were run by training much smaller sequence-to-sequence models from scratch [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ]. We expect that some of the shortcomings of the automatically generated headlines which were observed in previous work, such as lack of fluency and creativity [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], might not affect LLM-based generations.
      </p>
      <p>
        The second aim is to provide a reliable, high-quality dataset of articles and corresponding headlines in Italian, developed through a direct collaboration of language technology experts and journalists, which can be used and analysed well beyond the CALAMITA challenge. Although similar datasets exist for other languages [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ], this resource is still lacking for Italian.
      </p>
      <p>
        Overall, experimenting with the use of LLMs for title generation can also be considered a first step towards the introduction of more extensive and comprehensive artificial intelligence agents, which assist the journalist in all phases of the creative process, from news research to drafting an outline, to writing the actual piece, and finally to its promotion. Indeed, a close interaction of language models and humans in this task has recently been shown to be key [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
    </sec>
    <sec id="sec-1-1">
      <title>2. Challenge Description</title>
      <p>
        The task of headline generation has often been treated as equal to an extreme summarization task [
        <xref ref-type="bibr" rid="ref3 ref7">3, 7</xref>
        ]. However, simply synthesising the content of the article into a brief description is not enough to provide a satisfying title. Additional characteristics such as attractiveness, creativeness, and many others also play a role. Writing appropriate headlines is challenging, even for current state-of-the-art LLMs.
      </p>
      <p>Evaluating LLMs on the task of headline generation for Italian news articles thus serves multiple purposes. On one hand, it tests models' capacity to properly understand, that is, to reprocess large source texts in a way that is faithful to the content of the text. On the other hand, it acts as a means to assess the performance of LLMs in many complex dimensions, such as attractiveness, creativity, or adherence to tone. Finally, this benchmark could prove useful in practical applications. For instance, it may help guide decisions on whether, and to what extent, a journal should integrate LLMs into its workflow. It may also serve as an effective testbed for future research and development towards effective deployment in real-world scenarios; one such avenue could be the use of prompting to achieve the desired style and tone in generated headlines.</p>
      <p>
        In our challenge, language models are tasked with generating Italian-language headlines based on articles from scientific news journals written in Italian. Our dataset includes original articles from such journals, along with their human-authored titles. Models are provided the complete source text in the prompt, as well as instructions to generate a title that is brief, coherent, and captivating. We guide the model towards the specific editorial style of the media outlet by including a small number of example headlines in our prompt. We employ automatic metrics that assess the model's performance along three dimensions:
1. Coherency with the original article (HA classifier)
2. Alignment with the style of human-written headlines (NS classifier)
3. Similarity between the generated and the gold-standard headline (ROUGE [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], SBERT [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ])
      </p>
      <p>However, considering the complexity of the task, we believe that manually reviewing a sample of the generated headlines can offer additional perspectives on the behaviour of the model.</p>
    </sec>
    <sec id="sec-2">
      <title>3. Data description</title>
      <p>Our benchmark is based on two datasets consisting of science news articles from two different sources. In each dataset, we provide the full text of the article paired with the original, human-authored headline. Additionally, we include metadata such as link, date, author (if present) and subtitle.</p>
      <sec id="sec-2-1">
        <title>3.1. Origin of data</title>
        <sec id="sec-2-1-1">
          <title>The data were obtained via web scraping with custom</title>
          <p>Python scripts. Since links to articles more than a few
weeks old are inaccessible on the Ansa website, we
collected a large number by downloading the archived "Ansa
Scienza" RSS feeds from The Wayback Machine and
processing them to remove duplicates and extact links.</p>
        </sec>
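        <p>For illustration, a minimal sketch of this collection step is given below; it queries the Wayback Machine CDX API for archived snapshots of a feed and extracts deduplicated article links. The feed URL and the field handling are illustrative assumptions, not our exact scripts.</p>
        <preformat>
import requests
import feedparser

# Hypothetical feed address; the real "Ansa Scienza" feed URL may differ.
FEED_URL = "https://www.ansa.it/scienza/rss.xml"

# Ask the Wayback Machine CDX API for archived snapshots of the feed.
cdx = requests.get(
    "https://web.archive.org/cdx/search/cdx",
    params={"url": FEED_URL, "output": "json", "filter": "statuscode:200"},
    timeout=30,
).json()

links = set()  # a set removes duplicate links across snapshots
for row in cdx[1:]:  # the first row of the CDX response is a header
    timestamp, original = row[1], row[2]
    snapshot_url = f"https://web.archive.org/web/{timestamp}/{original}"
    feed = feedparser.parse(snapshot_url)
    links.update(entry.link for entry in feed.entries)

print(f"collected {len(links)} unique article links")
        </preformat>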
      </sec>
      <sec id="sec-2-2">
        <title>3.2. Data format</title>
        <p>The data from web scraping were saved in "JSON Lines"
(JSONL) format, with each line containing a JSON object
with the following fields:
• Title: the title of the article
• Source: the name of the website
• Date: the publishing date of the article
• Author: the author of the article, if present
• URL: the Internet address of the article
• Text: the body of the article
• ID: a unique identifier of the article</p>
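        <p>A file in this format can be consumed with the standard library alone; the sketch below iterates over the lines and indexes records by ID (the file name is a placeholder).</p>
        <preformat>
import json

articles = {}
with open("gattina_articles.jsonl", encoding="utf-8") as f:  # placeholder file name
    for line in f:
        record = json.loads(line)  # one JSON object per line
        # Available fields: Title, Source, Date, Author, URL, Text, ID
        articles[record["ID"]] = record

print(len(articles), "articles loaded")
        </preformat>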
      </sec>
      <sec id="sec-2-3">
        <title>3.3. Detailed data statistics</title>
        <sec id="sec-2-3-1">
          <title>Our dataset consists of 30,461 articles gathered from two sources:</title>
        </sec>
        <sec id="sec-2-3-2">
          <title>When measured with “tiktoken o200k_base” tokenizer</title>
          <p>model, we obtained a total of 21,365,897 tokens for the
Galileo dataset (average: 906 tokens per article,
maximum: 24,306) and a total of 3,762,539 tokens for the
Galileo dataset (average: 546 tokens per article,
maximum: 7,600). Figures 1 and 2 depict the distribution of
articles by token count in the Galileo and Ansa datasets
respectively.</p>
        </sec>
      </sec>
      <sec id="sec-2-4">
        <title>3.4. Prompting</title>
        <sec id="sec-2-4-1">
          <title>Due to the length of each article, the use of task examples</title>
          <p>in our prompt would be too computationally expensive.
Therefore, we test the models in a zero-shot prompting
setting. While we do not use any task examples in our
prompt, we do provide seven examples of headlines. In
this way, the model is given examples of the expected
output (a title) rather than examples of the full task
(article and title). Professional journalists made a list of 22
headlines that, in their opinion, were representative of
a well-made writing process under the three aspects of
being captivating, short and informative.</p>
          <p>Each time the model is tested, 7 randomly chosen titles
from the list are appended to the standard prompt. As a
reference, the identifier of the example headlines is also
saved along with the output of the model. See Box 1 for
our input prompt.</p>
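        <p>A minimal sketch of how such a prompt can be assembled; the variable names and bookkeeping format are illustrative, and the instruction text is the one reported in Box 1:</p>
        <preformat>
import random

INSTRUCTIONS = "Il tuo compito è generare un titolo accattivante ..."  # full text in Box 1

def build_prompt(article_text, example_headlines, k=7):
    """Append k randomly chosen example headlines (from the list of 22) to the prompt.

    Also returns the identifiers of the sampled examples, which we save
    alongside the model output for reference.
    """
    sampled = random.sample(list(example_headlines.items()), k)
    examples = "\n".join(title for _, title in sampled)
    prompt = f"{INSTRUCTIONS}\n{examples}\n\nArticolo:\n{article_text}"
    return prompt, [headline_id for headline_id, _ in sampled]
        </preformat>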
        <sec id="sec-2-4-2">
          <title>Box 1: Zero-shot prompt and English translation</title>
          <p>Prompt for the LLM:</p>
          <p>Il tuo compito è generare un titolo accattivante e informativo per l'articolo fornito.</p>
          <p>Requisiti:
- Titolo breve
- Cattura l'essenza dell'articolo
- Usa un linguaggio vivido e coinvolgente
- Non generare alcun tipo di testo che non sia il titolo dell'articolo
- Usa esclusivamente l'Italiano.</p>
          <p>Presta particolare attenzione ai seguenti titoli di esempio e adotta lo stesso stile:
Title 1
Title 2
...
Title 7</p>
          <p>English translation:</p>
          <p>Your task is to generate a catchy and informative title for the article provided.</p>
          <p>Requirements:
- Short title
- Capture the essence of the article
- Use vivid and engaging language
- Do not generate any type of text other than the title of the article
- Use Italian exclusively.</p>
          <p>Pay particular attention to the following example titles and adopt the same style:
Title 1
Title 2
...
Title 7</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Preliminary Evaluation</title>
      <p>To get a first impression of LLM performance on our task, we conducted preliminary experiments by manually reviewing headlines generated by several models. Overall, the results were unsatisfactory: while the titles were generally coherent with the articles, they lacked captivation and originality. The majority of the generated headlines followed the format &lt;Keywords: explanation&gt;, leading to repetitive and poorly formulated headlines. Examples of our preliminary results can be found in Table 1 in Appendix B. This behaviour persisted even when the models were explicitly instructed to avoid using colons in the titles, or when examples of titles were given. Out of 3,006 headlines generated by Phi-3.5 Mini-Instruct, 2,940 contained a colon. We obtained similar results using Mistral-7B-Instruct-v0.3, Qwen2-7B-Instruct, gemma-2-9b-it and Italia-9B-Instruct-v0.1. Manual experimentation with the commercial LLMs Claude 3.5 Sonnet (https://www.anthropic.com/news/claude-3-5-sonnet) and ChatGPT 4o (https://openai.com/index/hello-gpt-4o/) yielded the same behaviour:
• Original title: Una rapina cosmica nell'ammasso di galassie dell'Idra (A cosmic robbery in the Hydra galaxy cluster)
• Claude: Rapina cosmica: il furto di gas nell'ammasso dell'Idra (Cosmic robbery: the theft of gas in the Hydra cluster)
• ChatGPT: Rapina Cosmica: NGC 3312 Derubata di Gas nell'Ammasso di Galassie dell'Idra (Cosmic Robbery: NGC 3312 Robbed of Gas in the Hydra Galaxy Cluster)</p>
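      <p>The colon pattern is straightforward to quantify; a sketch of the check we ran over the generated headlines:</p>
      <preformat>
generated_headlines = [
    "Rapina Cosmica: NGC 3312 Derubata di Gas",   # in practice, load the
    "Una rapina cosmica nell'ammasso dell'Idra",  # model generations here
]

def colon_rate(headlines):
    """Count and fraction of headlines containing a colon."""
    with_colon = sum(1 for h in headlines if ":" in h)
    return with_colon, with_colon / len(headlines)

# For Phi-3.5 Mini-Instruct we observed 2,940 headlines with a colon out of 3,006.
n, rate = colon_rate(generated_headlines)
print(f"{n} headlines with a colon ({rate:.1%})")
      </preformat>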
      <sec id="sec-3-1">
        <title>Interestingly, when we asked Claude 3.5 Sonnet to</title>
        <p>improve our prompt for generating headlines, it added
the line &lt;Struttura: [Frase d’impatto o dato interessante]:
[Spiegazione o contesto]&gt; to our example prompt,
explicitly requesting the unwanted behaviour. It appears that
LLMs consistently regard this particular structure as the
ideal format for a headline.</p>
        <p>Given the inherent dificulty of interpreting LLM
behaviour, we cannot provide a single reason for their
preference for this particular construction. Of course, there
might be a large presence of such headlines in the
training data, particularly from lower-quality journals. There
may also be an influence of Search Engine Optimizations
(SEO) on the behaviour of the model: Giving importance
to keywords is a classic SEO technique.</p>
        <p>Moreover, we generally noticed a preference toward
sentences poor in determinative and indefinite articles
when compared with human written headlines.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Metrics</title>
      <p>
        Automatically evaluating the quality of generated headlines is challenging because headline quality is inherently subjective, multi-faceted, and context-dependent. Thus, instead of providing a single numeric value as an overall quality score, headlines should be evaluated along multiple dimensions and subsequently rated for their quality based on specific use cases. To give examples of what others have done: Cafagna et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] evaluate generated headlines based on criteria such as grammatical correctness, topic relevance, attractiveness, and overall appropriateness. Cai et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] assess factors such as factual consistency, relevance, and surface overlap between the generated headline and the article, as well as its alignment with user-specific preferences.
      </p>
      <p>In the aforementioned papers, the headlines were scored by human evaluators. This approach is resource-intensive: to account for differences in individual preferences, hiring multiple human evaluators from varying demographic backgrounds is preferred. This does not scale well to the evaluation of multiple models on large-scale benchmarks across multiple studies, making the ability to automatically evaluate the outputs of LLMs essential.</p>
      <p>
        Historically, n-gram overlap metrics like BLEU [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], ROUGE [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], or METEOR [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] have been used to compare generated outputs with reference "gold standard" texts, but these metrics emphasise surface-level matching and are therefore not robust to paraphrasing or other variations in acceptable outputs. Learned metrics such as COMET [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], a metric designed to mimic human quality judgement for machine translation, have been gaining in popularity. These are not easily transferable to other languages or tasks, and learned metrics designed specifically for Italian headline generation are not available. Additionally, such metrics typically produce a single numerical score of "quality". To improve interpretability and ensure contextual flexibility, we would prefer to provide individual scores for each dimension. We train two novel learned metrics for Italian headline generation, but leave others for future work.
      </p>
      <p>
        We evaluate model performance on our benchmark using four metrics: ROUGE [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], SBERT [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], and two custom metrics, the Headline-Article and Natural-Synthetic classifiers. Within the context of the CALAMITA challenge, the model's final score will be an aggregate in which all four metrics are weighted equally. Each metric is detailed below.
      </p>
      <sec id="sec-4-1">
        <title>5.1. ROUGE</title>
        <p>
          ROUGE (Recall-Oriented Understudy for Gisting Evaluation) [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] is a popular metric used to evaluate automatically generated summaries. It provides a measure of overlap between generated text and gold-standard references. ROUGE is easily interpretable and allows for easy comparison across many papers due to its widespread use. However, it is not robust to variations in input, making it less suitable for the assessment of tasks involving creativity, such as headline generation. Following others [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], we will evaluate our system outputs using ROUGE-L, which identifies the length of the longest common subsequence between system and reference.
        </p>
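        <p>As an illustration, ROUGE-L can be computed with the rouge-score package; this is one possible configuration, not necessarily the exact one used in the challenge harness:</p>
        <preformat>
from rouge_score import rouge_scorer

# Stemming is disabled because the bundled stemmer targets English, not Italian.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)

reference = "Una rapina cosmica nell'ammasso di galassie dell'Idra"
generated = "Rapina cosmica: il furto di gas nell'ammasso dell'Idra"

score = scorer.score(reference, generated)["rougeL"]
print(score.precision, score.recall, score.fmeasure)
        </preformat>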
      </sec>
      <sec id="sec-4-2">
        <title>5.2. SBERT</title>
        <p>
          Sentence-BERT, or SBERT [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], is a modification of the BERT network that uses Siamese networks and can derive semantically meaningful, fixed-size vector embeddings from whole sentences. We use SBERT to compare our generated headlines to the gold-standard ones by comparing their SBERT embeddings using cosine similarity, which we then use directly as the similarity score. SBERT produces more meaningful sentence embeddings than BERT, which is not designed for sentence similarity tasks; cosine similarity with BERT embeddings could therefore produce unwanted and less interpretable results.
        </p>
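        <p>Computing this score takes a few lines with the sentence_transformers library; a minimal sketch using the Italian SBERT model referenced in this paper:</p>
        <preformat>
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("nickprock/sentence-bert-base-italian-xxl-uncased")

gold = "Una rapina cosmica nell'ammasso di galassie dell'Idra"
generated = "Rapina cosmica: il furto di gas nell'ammasso dell'Idra"

embeddings = model.encode([gold, generated], convert_to_tensor=True)
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()  # used directly as the score
print(f"SBERT similarity: {similarity:.3f}")
        </preformat>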
      </sec>
      <sec id="sec-4-3">
        <title>5.3. Custom metrics</title>
        <p>
          Given the limitations of the currently available metrics for the headline generation task, we develop two custom metrics employing classifiers based on Transformer [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] models. We trained both classifiers on a subset of the "blogs" section of the "Testimole" dataset (https://huggingface.co/datasets/mrinaldi/TestiMole), which was obtained by web scraping various Italian media sources. Our subset consists of only those parts of the dataset scraped from professional media outlets. The criteria for the selection process, as well as the technical details for each classifier, are in Appendices C, D and E.
        </p>
        <sec id="sec-4-3-1">
          <title>5.3.1. HA Classifier</title>
          <p>
            Our first classifier is based on the Sentence Transformers [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ] architecture, fine-tuned to discriminate between coherent and non-coherent pairs of headlines and articles. A generated headline can score between 0 and 1, representative of the degree of alignment between the headline and the content of the article. Following the work by De Mattei et al. [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ], we call this classifier "HA", or Headline-Article.
          </p>
          <p>To train the model, we used a non-finetuned Italian Sentence-BERT model (https://huggingface.co/nickprock/sentence-bert-base-italian-xxl-uncased) to compute an embedding for each article. We then find the headline of the article in the dataset with the highest cosine similarity, and create a new dataset where each row contains the article (anchor), the original title (positive), and the title of the most similar article (negative). Because the original dataset contained some duplicate items, we filtered out all articles with a cosine similarity score of 1. With this dataset, we were able to use Triplet Loss to train the classifier to differentiate between coherent and incoherent titles, starting from the assumption that the original title is the one most coherent with the article's content. We decided to perform a cosine similarity search instead of random shuffling in order to increase the difficulty of the discriminator's task.</p>
          <p>The drawback of this approach is the low context window of the model: all articles were truncated after the first 512 tokens. While it is possible to develop a more complex architecture to account for larger texts, we leave this for future work.</p>
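          <p>At evaluation time, scoring a headline against its article reduces to a similarity lookup in the fine-tuned embedding space; a minimal sketch, where the checkpoint path is a placeholder and the clamping of cosine similarity to the 0-1 range is our straightforward reading of the description above:</p>
          <preformat>
from sentence_transformers import SentenceTransformer, util

ha_model = SentenceTransformer("path/to/fine-tuned-ha-checkpoint")  # placeholder

def ha_score(article_text, headline):
    """Degree of alignment between a headline and the article content (0 to 1)."""
    # The bi-encoder truncates the article to its first 512 tokens.
    article_emb, headline_emb = ha_model.encode(
        [article_text, headline], convert_to_tensor=True)
    return max(0.0, util.cos_sim(article_emb, headline_emb).item())
          </preformat>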
        </sec>
        <sec id="sec-4-3-2">
          <title>5.3.2. NS Classifier</title>
          <p>Our second classifier is called "NS", or Natural-Synthetic. It is a binary regression classifier based on an Italian BERT-base uncased model (https://huggingface.co/dbmdz/bert-base-italian-xxl-uncased), trained to discriminate between human-authored and machine-generated titles. Given a title as input, the classifier outputs a numerical score indicating the likelihood of the title being close to those written by journalists. We believe that similarity to headlines written by journalists may be a useful indicator of the quality and appropriateness of a generated headline.</p>
          <p>Using the same subset of Testimole employed for the "HA" classifier, we generated over 90,000 synthetic headlines using LLMs of up to 9 billion parameters. To avoid overfitting our classifier to the specific probability distribution of a single model, we generated synthetic headlines using different models; this process is detailed in Appendix D, along with details about the number of generated headlines per model. The result is a labelled dataset containing original as well as generated headlines.</p>
          <p>The advantage of employing a "Natural-Synthetic" classifier is that the training objective is coarse, encouraging the classifier to consider a broad range of aspects that may account for the discrepancy between text generated by machines and by humans.</p>
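          <p>Scoring a title with the NS classifier is a standard single-logit inference pass; a sketch using the Hugging Face transformers API and the further-trained checkpoint released on the Hub (see Appendix D):</p>
          <preformat>
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "mrinaldi/flash-it-nsclassifier-fpt"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

def ns_score(title):
    """Likelihood that a title is close to human-authored (journalist-written) ones."""
    inputs = tokenizer(title, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.sigmoid(logits).squeeze().item()  # single logit, BCE training

print(ns_score("Nella Via Lattea c'è un oggetto misterioso, è velocissimo"))
          </preformat>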
      <sec id="sec-6-1">
        <title>This task is aimed at testing the factual knowledge which</title>
        <p>LLMs acquire during their training process, whose
objective is language modelling. This task should not suggest,
or stimulate, that LLMs should commonly be used as
knowledge bases or as reliable sources of factual
information. The investigation underlying this challenge is
research-oriented, aimed at a better understanding of
LLMs’ abilities, and possibly suggest ways to discern
when models might be providing more or less reliable
knowledge and possibly making them more transparent
in their generated output.</p>
      </sec>
      <sec id="sec-6-2">
        <title>We see value in future research using classifiers and re</title>
        <p>gressors to assess specific aspects of generated headlines.
Such metrics have the potential to capture complex
probability distributions over a multitude of dimensions of
the data, including dimensions that are not directly
interpretable to human observation. For instance, a learned
metric that predicts the amount of attention a headline
will generated would be highly useful.</p>
        <p>Inspired by Generative Adversarial Networks (GANs),
we find the employment of classification-based metrics
promising for developing a model specialized in headline
generation. A discriminator/generator training system</p>
      </sec>
      <sec id="sec-6-3">
        <title>5https://huggingface.co/dbmdz/bert-base-italian-xxl-uncased</title>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>9. Data license and copyright issues</title>
      <sec id="sec-7-1">
        <title>Access to the data is granted for the evaluation but cannot be shared publicly at the moment, also for reasons related to data contamination.</title>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <sec id="sec-8-1">
        <title>The authors would like to thank ANSA Scienza and</title>
        <p>Galileo, giornale di scienza - http:\www.galileonet.it for
their interest in the GATTINA CALAMITA challenge
and for the extremely valuable exchange of ideas that
allowed us to shape a task of high potential impact in the
ifeld of journalism.
A. Examples of Good titles
selected by professional
journalists
• Nella Via Lattea c’è un oggetto misterioso, è
velocissimo
• Nasce il gemello digitale del rischio ambientale
in Italia
• I cinque modi in cui il cervello invecchia
• Covid-19, il mistero degli over 90
• A 44 e a 60 anni i due gradini chiave
dell’invecchiamento
• Palestra o snack? la scelta dipende da un
messaggero chimico
• Dagli stadi alle spiagge, sono i salti a sincronizzare
il ballo
• Dalle rose alle melanzane, ecco i geni delle spine
• Così il Covid accelera l’invecchiamento
• Uno zucchero naturale contro la calvizie, bene i
test sui topi
• Scoperto nel cervello il circuito dell’efetto
placebo
• Pronto il Google Earth del cuore umano
• Una molecola può ringiovanire il sistema
immunitario
• Scoperto il dizionario dei sinonimi e contrari del
cervello
• Le farfalle nello stomaco non sono solo un modo
di dire
• Pronto il primo orologio nucleare, il più preciso
del mondo
• Gli uccelli in volo si comportano come gli atomi
• L’Italia ritenta la sfida impossibile della geometria
• Le auto nel trafico come i batteri in cerca di cibo
• Robot come alleati, trovata la chiave per
collaborare con gli umani
• Dalle spugne di vetro grattacieli più sottili e
resistenti
• L’IA non è razionale, fa ragionamenti non logici</p>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>B. Examples of Model Outputs in</title>
    </sec>
    <sec id="sec-10">
      <title>Preliminary Evaluation</title>
      <p>Original Title (Italian)
No allo studente flessibile
No to the flexible student
Alzheimer, la proteina della speranza
Alzheimer, the protein of hope
Un laser estremo
An extreme laser
E’ italiano il primo modello genetico
The first genetic model is Italian
Carta d’identità per le balene
Identity card for whales
Scampati allo shock anafilattico
Escaped from anaphylactic shock
La casa in un clic
The house in a click
Deforestazione senza fine
Endless deforestation
"Più impegno contro il fumo"
More commitment against smoking
Una carovana contro l’Oleoducto
A caravan against the pipeline
Analisi dell’egemonia americana
Analysis of American hegemony
Censura sui dati scientifici
Censorship on scientific data
Siamo tutti africani: ricostruito il Dna degli
antenati
We are all Africans: reconstructed the DNA
of ancestors
Rubbia racconta Galileo divulgatore di
scienza
Rubbia tells Galileo as a science
communicator
Galileo Galilei nel pensiero di Thomas Khun
Galileo Galilei in the thought of Thomas
Khun
Nella maglie della lingua
In the mesh of the language
Europa-America, la migrazione dei virus
Europe-America, the migration of viruses
Infiniti universi allacciati
Infinite universes connected
LLM Generated Title (Italian)
Riforma Universitaria: Studenti in Rivolta contro la Selettività e la Flessibilità
University reform: Students in revolt against selectivity and flexibility
Nicastrina: nuova chiave per Alzheimer, scoperta in Calabria
Nicastrin: new key for Alzheimer’s, discovered in Calabria
Euv: Nuova Frontiera nella Manipolazione Molecolare con Luce Ultra-Sottile
EUV: New frontier in molecular manipulation with ultra-thin light
Creazione topi per studiare il ritardo mentale: scoperta genetica rivoluzionaria
Creation of mice to study mental disability: revolutionary genetic discovery
Genetica della balena: Identificazione di specie protette nel commercio illegale
Whale genetics: Identification of protected species in illegal trade
Scoperto il fattore eNOS nella reazione anafilattica: speranza per una cura
eNOS factor discovered in anaphylactic reaction: hope for a cure
Domotica: Software unificato per controllare tutti gli elettrodomestici,
innovazione Cnr
Home automation: Unified software to control all appliances, Cnr innovation
Deforestazione in aumento in Amazzonia: il Brasile reagisce
Deforestation increasing in the Amazon: Brazil reacts
Bioetica del tabagismo: raccomandazioni per campagne di prevenzione e
interventi regolativi
Bioethics of smoking: recommendations for prevention campaigns and regulatory
interventions
Protesta contro l’Oleodotto Crudos Pasos: Ecuador, ambientalismo e scontro
politico
Protest against the Crudos Pasos pipeline: Ecuador, environmentalism and
political clash
Egemonia Americana: Stati Fuorilegge e Diritto Internazionale
American hegemony: Rogue states and international law
Scienza e Sicurezza: Accademia e Governo Discutono su Divulgazione nei
Lavori Scientifici
Science and security: Academia and government discuss disclosure in scientific
works
DNA dell’Africa: Prove definitive dell’origine umana
DNA of Africa: Definitive proof of human origin
Galileo Divulgatore: Scienza e Civiltà dal XVII al Mondo Moderno
Galileo communicator: Science and civilization from the 17th century to the
modern world
Kuhn e Galileo: Rivoluzione scientifica attraverso la meccanica e l’esperimento
Kuhn and Galileo: Scientific revolution through mechanics and experiment
Realtà e Reti: L’Illusione di una Lingua Universale nella Scienza Cognitiva
Reality and networks: The illusion of a universal language in cognitive science
Scambi e mutua scoperta: Storia sanitaria dell’unificazione globale
Exchanges and mutual discovery: Health history of global unification
Inflazione cosmica: un universo di mondi nascosti
Cosmic inflation: a universe of hidden worlds</p>
    </sec>
    <sec id="sec-11">
      <title>C. Composition of the datasets used to train the classifiers</title>
      <p>
        The dataset we used as a source of material for both the NS and HA classifiers is taken from "Testimole" [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], a massive collection of Italian web-scraped data that includes a "blogs" subset containing, as of November 2024, more than 2.8 million posts from various online blogs and websites. From the original 2.8 million rows, we obtained a much smaller dataset by filtering articles coming from sources that are, in our judgement, more similar to professional media outlets. After this selection process, which yielded a total of 715,335 articles, we filtered out articles written in languages different from Italian by using the "FastText Lang ID" field already present in Testimole. After the foreign-language pruning, the count was 293,518 articles. Finally, we discarded all the rows whose article was shorter than 350 characters, arriving at a final dataset size of 264,455 articles. In the following sections, this dataset will be referred to as "testimole-subset". In order to increase the diversity of data for the HA Classifier, we added to this dataset a collection of 432,000 articles taken from the professional Italian media outlet "Il Fatto Quotidiano"; we had to add this source manually because the articles were missing from the original Testimole dataset due to a scraping issue. In the section on the HA Classifier, we will refer to this additional subset as "testimole-subset-auxiliary". Finally, we are going to refer to the small subset of Galileo used in the testing process as "experimental-dataset". The experimental dataset contains 3,007 original headlines from "Galileo" and 3,007 headlines generated using Phi 3.5 Mini Instruct from the same subset of Galileo's articles.
      </p>
    </sec>
    <sec id="sec-12">
      <title>D. NS Classifier</title>
      <p>For the NS Classifier, we decided to split the testimole-subset dataset in two sets: 60% of the dataset was kept with the original headline ("natural"), while in the remaining 40% the original headline was substituted with a generated one ("synthetic"). The original headline is kept as a reference in a separate column of the dataset. Specifically, we generated 93,921 headlines and kept 132,227 original headlines. There is no contamination between generated and original headlines: no synthetic headlines were generated for headlines that are present in the dataset with the "natural" label. The dataset was then divided into a "test" split (45,230 entries: 26,342 natural, 18,888 synthetic) and a "train" split (180,918 entries: 105,885 natural, 75,033 synthetic) for training. For the generation, we ran Ollama on different models using the same prompt adopted for the evaluation. Table 2 reports the amount of generated headlines for each model used.</p>
      <p>[Table 2: models used for synthetic headline generation via Ollama: llama3.2:3b-instruct-fp16, qwen2.5:7b-instruct-q8_0, aya:8b-23-q8_0, mistral:7b-instruct-v0.3-q6_K, phi3.5:3.8b-mini-instruct-fp16; the per-model counts did not survive extraction.]</p>
      <p>The classifier was created using Hugging Face's transformers library. We initialized the model using AutoModelForSequenceClassification and trained it using a binary cross-entropy loss function (BCEWithLogitsLoss). Training was conducted with a batch size of 32, a learning rate of 2 × 10⁻⁵, and a warmup ratio of 0.1 to help stabilize early training. A linear learning rate scheduler and the AdamW optimizer with gradient clipping were employed to manage learning stability. We also implemented early stopping, monitoring the F1 score to save the best model checkpoint and halt training if the model failed to improve over multiple epochs. The resulting model obtained 95% accuracy on the test set. Accuracy is measured as the number of correctly guessed labels divided by the total number of examples. The threshold to decide for a positive or negative label was set at 0.5. Using a continuous score instead of the threshold led to the same result; for this reason we kept only accuracy in this report.</p>
      <p>After having tested the model, we decided to further train it on the test set in order to have an improved model to be used for the CALAMITA task. We then tested this further-trained model on the smaller "experimental-dataset", containing 3,007 natural and 3,007 synthetic headlines coming from the Galileo dataset. This evaluation obtained an accuracy of 87%.</p>
      <p>While initially we directly used PyTorch to train the experimental versions of the model, we then decided for simplicity to adopt the Hugging Face transformers library, in order to easily upload the model to the Hugging Face hub. The further-trained version of the model is available at https://huggingface.co/mrinaldi/flash-it-nsclassifier-fpt.</p>
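      <p>The setup described above maps naturally onto the transformers Trainer API; a condensed sketch with the hyperparameters reported here (dataset preparation omitted; the single-logit BCE head and the exact argument names are our reconstruction, not the original script):</p>
      <preformat>
import numpy as np
import torch
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, EarlyStoppingCallback,
                          Trainer, TrainingArguments)

model = AutoModelForSequenceClassification.from_pretrained(
    "dbmdz/bert-base-italian-xxl-uncased", num_labels=1)

class BCETrainer(Trainer):
    # Binary cross-entropy on a single logit (BCEWithLogitsLoss).
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        loss = torch.nn.BCEWithLogitsLoss()(outputs.logits.squeeze(-1), labels.float())
        return (loss, outputs) if return_outputs else loss

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    probs = 1 / (1 + np.exp(-logits.squeeze(-1)))  # sigmoid
    return {"f1": f1_score(labels, probs > 0.5)}   # threshold at 0.5

args = TrainingArguments(
    output_dir="ns-classifier",
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    lr_scheduler_type="linear",
    max_grad_norm=1.0,            # gradient clipping
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",   # early stopping monitors the F1 score
)

trainer = BCETrainer(
    model=model, args=args, compute_metrics=compute_metrics,
    train_dataset=train_dataset, eval_dataset=eval_dataset,  # tokenized splits
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
      </preformat>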
    </sec>
    <sec id="sec-13">
      <title>E. HA Classifier</title>
      <p>In order to build the HA Classifier we first computed, for each article contained in the "testimole-subset" dataset, the embedding of the article's text using SentenceBert with an Italian model (https://huggingface.co/nickprock/sentence-bert-base-italian-xxl-uncased) and added the embedding to a new column in the dataset. Then, we paired each article (source) of the dataset with the article (target) having the highest cosine similarity between the embeddings. After the pairing, both source and target were marked as "used", so that each article can appear no more than once in the resulting dataset, either as a source or as a target. The resulting dataset (https://huggingface.co/datasets/mrinaldi/flash-it-ha-dataset-cossim) has 6 columns:
• Anchor: the body of the "source" article
• Positive: the original title of the "source" article
• Negative: the original title of the "target" article
• Cosine similarity: the cosine similarity between the source's and target's embeddings, computed on their texts
• Url positive: the URL of the source article; it can be used as a key to find the original article in the Testimole dataset
• Url negative: the URL of the target article</p>
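      <p>A compact sketch of this pairing step follows; it is simplified (an exhaustive in-memory similarity matrix), whereas the real pipeline must batch the search over hundreds of thousands of embeddings:</p>
      <preformat>
import torch
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("nickprock/sentence-bert-base-italian-xxl-uncased")
# `articles` is assumed to be a list of records with "Text" and "Title" fields.
embeddings = model.encode([a["Text"] for a in articles], convert_to_tensor=True)

sim = util.cos_sim(embeddings, embeddings)  # pairwise cosine similarities
sim.fill_diagonal_(-1)                      # an article cannot be its own target

used, rows = set(), []
for source in range(len(articles)):
    if source in used:
        continue
    scores = sim[source].clone()
    scores[list(used)] = -1                    # each article is used at most once
    target = int(torch.argmax(scores).item())  # most similar remaining article
    if scores[target] >= 0.9999:               # drop duplicates (cosine similarity of 1)
        continue
    used.update({source, target})
    rows.append({
        "anchor": articles[source]["Text"],
        "positive": articles[source]["Title"],
        "negative": articles[target]["Title"],
        "cosine_similarity": float(scores[target]),
    })
      </preformat>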
      <p>Given the procedure employed for generating this dataset, the resulting number of rows is halved, so that, starting from the original 256,530 entries in the "testimole-subset" dataset, we obtained 128,265 entries, divided into 102,600 train entries and 25,665 test entries. We believe that using the cosine similarity instead of randomly shuffling the articles can improve the performance of the classifier by increasing the difficulty of the task. Results with a classifier trained on randomly paired articles are reported in the table below.</p>
      <p>The classifier was created using SentenceBERT, specifically by initializing the model with the SentenceTransformer class from the sentence_transformers library, using a pre-trained Italian model (https://huggingface.co/nickprock/sentence-bert-base-italian-xxl-uncased). To fine-tune this model, we employed a TripletLoss function to enhance similarity-based ranking in embedding space. The triplet loss was the optimal choice given our dataset, because it requires an anchor, a positive, and a negative example. The goal of the triplet loss is to maximize the distance between the anchor and the negative example while at the same time minimizing the distance between the anchor and the positive example. In this way, we encouraged the formation of meaningful embeddings tailored to minimize the distance between an article and a title coherent with its content, notwithstanding the 512-token length limitation.</p>
      <p>Training was conducted over three epochs with a batch size of 64 for training and 16 for evaluation, using a learning rate of 2 × 10⁻⁵ and a warmup ratio of 0.1 to stabilize initial training steps. We used the SentenceTransformerTrainingArguments class to configure training, applying half-precision floating point (fp16) to speed up processing. An evaluation was performed every 1,000 steps to monitor model performance, with checkpoints saved periodically to retain the best-performing model. We kept the "margin" value at 5, following the documentation of SentenceBert. The resulting classifier outputs a score representing the alignment between the article and its headline.</p>
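      <p>A condensed sketch of this fine-tuning setup with the sentence_transformers v3 training API (dataset loading omitted; train_triplets and eval_triplets are assumed to be datasets.Dataset objects with the anchor/positive/negative columns described above):</p>
      <preformat>
from sentence_transformers import (SentenceTransformer, SentenceTransformerTrainer,
                                   SentenceTransformerTrainingArguments)
from sentence_transformers.losses import TripletLoss

model = SentenceTransformer("nickprock/sentence-bert-base-italian-xxl-uncased")
loss = TripletLoss(model, triplet_margin=5)  # margin kept at 5, per the documentation

args = SentenceTransformerTrainingArguments(
    output_dir="ha-classifier",
    num_train_epochs=3,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    fp16=True,               # half precision to speed up processing
    eval_strategy="steps",
    eval_steps=1000,         # evaluate every 1,000 steps
    save_steps=1000,         # checkpoint periodically, keep the best model
    load_best_model_at_end=True,
)

trainer = SentenceTransformerTrainer(
    model=model, args=args, loss=loss,
    train_dataset=train_triplets, eval_dataset=eval_triplets,
)
trainer.train()
      </preformat>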
      <p>After having trained the HA Classifier on the "testimole-subset" dataset, we decided to use an additional dataset (testimole-subset-auxiliary) to further improve the classifier. Testimole-subset-auxiliary, halved due to the matching, has 216,562 articles, of which 108,281 were used as train and 108,281 as test. The same procedure used for testimole-subset was applied to testimole-subset-auxiliary. The table below sums up the results of the various models on the test datasets.</p>
      <p>[Table: results of the HA Classifier on the test datasets; only scattered values are recoverable: accuracy 0.8552, 0.9135 and 0.9850; ROC AUC; average positive distance 0.73 and 0.72; entry counts 21,949, 98,913 and 106,662.]</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Attanasio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Basile</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Borazio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Croce</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Francis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gili</surname>
          </string-name>
          , E. Musacchio,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nissim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Patti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rinaldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Scalena</surname>
          </string-name>
          ,
          <article-title>CALAMITA: Challenge the Abilities of LAnguage Models in ITAlian</article-title>
          ,
          <source>in: Proceedings of the 10th Italian Conference on Computational Linguistics</source>
          (CLiC-it
          <year>2024</year>
          ), Pisa, Italy, December 4 - December 6,
          <year>2024</year>
          , CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Cafagna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. D.</given-names>
            <surname>Mattei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bacciu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nissim</surname>
          </string-name>
          ,
          <article-title>Suitable doesn't mean attractive. human-based evaluation of automatically generated headlines</article-title>
          , in: R. Bernardi,
          <string-name>
            <given-names>R.</given-names>
            <surname>Navigli</surname>
          </string-name>
          , G. Semeraro (Eds.),
          <source>Proceedings of the Sixth Italian Conference on Computational Linguistics</source>
          , Bari, Italy,
          <source>November 13-15</source>
          ,
          <year>2019</year>
          , volume
          <volume>2481</volume>
          <source>of CEUR Workshop Proceedings</source>
          , CEUR-WS.org,
          <year>2019</year>
          . URL: https://ceur-ws.org/Vol-2481/paper13.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>De Mattei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cafagna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Dell'Orletta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nissim</surname>
          </string-name>
          ,
          <article-title>Invisible to people but not to machines: Evaluation of style-aware headline generation in absence of reliable human judgment</article-title>
          ,
          <source>in: Proceedings of the Twelfth Language Resources and Evaluation Conference</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>6709</fpage>
          -
          <lpage>6717</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>X.</given-names>
            <surname>Ao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <article-title>Pens: A dataset and generic framework for personalized news headline generation</article-title>
          ,
          <source>in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing</source>
          (Volume 1: Long Papers)
          ,
          <year>2021</year>
          , pp.
          <fpage>82</fpage>
          -
          <lpage>92</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Duan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Shou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Cao</surname>
          </string-name>
          , et al.,
          <article-title>Xglue: A new benchmark dataset for cross-lingual pretraining, understanding and generation</article-title>
          , arXiv preprint arXiv:2004.01401 (2020).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Smith-Renner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tetreault</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jaimes</surname>
          </string-name>
          ,
          <article-title>Harnessing the power of LLMs: Evaluating human-AI text co-creation through the lens of news headline generation</article-title>
          , in: H. Bouamor, J. Pino, K. Bali (Eds.),
          <source>Findings of the Association for Computational Linguistics: EMNLP</source>
          <year>2023</year>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          , Singapore,
          <year>2023</year>
          , pp.
          <fpage>3321</fpage>
          -
          <lpage>3339</lpage>
          . URL: https://aclanthology.org/2023.findings-emnlp.217. doi:10.18653/v1/2023.findings-emnlp.217.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Rush</surname>
          </string-name>
          ,
          <article-title>A neural attention model for abstractive sentence summarization</article-title>
          , arXiv Preprint, CoRR, abs/1509.00685 (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>C.-Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>ROUGE: A package for automatic evaluation of summaries</article-title>
          , in: Text Summarization Branches Out, Association for Computational Linguistics, Barcelona, Spain,
          <year>2004</year>
          , pp.
          <fpage>74</fpage>
          -
          <lpage>81</lpage>
          . URL: https://aclanthology.org/W04-1013.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurevych</surname>
          </string-name>
          ,
          <article-title>Sentence-bert: Sentence embeddings using siamese bert-networks</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics</source>
          ,
          <year>2019</year>
          . URL: https://arxiv.org/abs/1908.10084.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>P.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>Generating user-engaging news headlines</article-title>
          ,
          <source>in: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>3265</fpage>
          -
          <lpage>3280</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>K.</given-names>
            <surname>Papineni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Roukos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ward</surname>
          </string-name>
          , W.-J. Zhu,
          <article-title>Bleu: a method for automatic evaluation of machine translation</article-title>
          ,
          <source>in: Proceedings of the 40th annual meeting of the Association for Computational Linguistics</source>
          ,
          <year>2002</year>
          , pp.
          <fpage>311</fpage>
          -
          <lpage>318</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Lavie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Denkowski</surname>
          </string-name>
          ,
          <article-title>The meteor metric for automatic evaluation of machine translation</article-title>
          ,
          <source>Machine translation 23</source>
          (
          <year>2009</year>
          )
          <fpage>105</fpage>
          -
          <lpage>115</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>R.</given-names>
            <surname>Rei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Stewart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Farinha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lavie</surname>
          </string-name>
          ,
          <article-title>Comet: A neural framework for mt evaluation</article-title>
          , arXiv preprint arXiv:2009.09025 (2020).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M.</given-names>
            <surname>Krubiński</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Pecina</surname>
          </string-name>
          ,
          <article-title>Towards unified uni-and multi-modal news headline generation</article-title>
          ,
          <source>in: Findings of the Association for Computational Linguistics: EACL</source>
          <year>2024</year>
          ,
          <year>2024</year>
          , pp.
          <fpage>437</fpage>
          -
          <lpage>450</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>M.</given-names>
            <surname>Rinaldi</surname>
          </string-name>
          , Testimole,
          <year>2024</year>
          . URL: https://huggingface.co/datasets/mrinaldi/TestiMole.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>