<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>GATTINA - GenerAtion of TiTles for Italian News Articles: A CALAMITA Challenge</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Maria Francis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matteo Rinaldi</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jacopo Gili</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Leonardo De Cosmo</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sandro Iannaccone</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Malvina Nissim</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Viviana Patti</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CLCG, University of Groningen</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Trento</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Turin</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <abstract>
        <p>We introduce a new benchmark designed to evaluate the ability of Large Language Models (LLMs) to generate Italian-language headlines for science news articles. The benchmark is based on a large dataset of science news articles obtained from Ansa Scienza and Galileo, two important Italian media outlets. Effective headline generation requires more than summarizing article content; headlines must also be informative, engaging, and suitable for the topic and target audience, making automatic evaluation particularly challenging. To address this, we propose two novel transformer-based metrics to assess headline quality. We aim for this benchmark to support the evaluation of Italian LLMs and to foster the development of tools to assist in editorial workflows.</p>
      </abstract>
      <kwd-group>
        <kwd>CALAMITA Challenge</kwd>
        <kwd>Italian</kwd>
        <kwd>Benchmarking</kwd>
        <kwd>Headline generation</kwd>
        <kwd>Summarisation</kwd>
        <kwd>LLMs</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction and Motivation</title>
      <p>The title is undoubtedly one of the most important and crucial components of a journalistic article. A good title intrigues the reader, synthesises the news without anticipating its details, encourages further reading, and is simultaneously pleasant to read or hear. Often, the fate of an article is inextricably linked to the quality of its accompanying title: it is not uncommon for inherently interesting, in-depth, and factually correct articles to go unnoticed simply because they are accompanied by an inappropriate or unattractive title. Composing adequate titles is not a simple operation; it requires experience, sensitivity, balance, a sense of measure, and a deep understanding of the readers. There are no precise and inescapable "rules" – save, of course, for the usual deontological norms of pertinence and truth that regulate the journalistic profession – and in fact the operation depends almost exclusively on the author's expertise and must be evaluated on a case-by-case basis.</p>
      <p>Factors that can influence the composition of a title include, for example, the topic and the "tone of voice" of the article (a piece reporting a crime news story, for instance, requires a measured, discreet, and respectful title; conversely, a piece on lifestyle can and should be paired with a lighter, ironic, and more colorful title); the style of the publication hosting the article; the destination format (the same article printed in a paper newspaper and published on an online outlet, for example, typically has two different titles); potential "conflicts" with other titles present on the same page (for instance, repetitions of the same word or phrase, or the enunciation of contradictory concepts); space limitations; and prescriptions related to search engine optimisation (for example, the use of a particular word or expression particularly popular at the time of publication, or a specific position of words within the title).</p>
      <p>
        It is in this context that the journalist's toolkit has recently been enriched with a powerful new tool: Large Language Models (LLMs) undoubtedly have an important role in the world of journalism, including quality journalism. Although incapable of "understanding" content words, LLMs are naturally capable of producing fluent, complex, plausible, and credible texts in a matter of moments. These models not only can improve the efficiency of editorial processes but also offer new creative and innovative possibilities for content creation, including the automatic generation of journalistic headlines. Analysing why it may be useful for journalism to have an LLM capable of generating titles leads us to consider numerous factors, such as time optimisation, content personalization, and the ability to maintain a high level of quality, coherence, and communicative impact. However, these tools also present many limitations and some dangers, particularly the risk of blindly relying on them.
      </p>
      <p>
        Timing and speed, in particular, are among the great challenges of journalism: being the first to publish a story, especially online, is often essential to attract readers. However, as we have seen, generating effective and incisive titles requires skill and time, which is not always available. An LLM can drastically reduce the time needed to create appropriate titles, for example by suggesting to the author a series of reasoned choices or proposing modifications and corrections to an already written title, always keeping in mind preset criteria such as length, tone, attractiveness, clarity, and the publication's style. Furthermore, if trained on the corpus of a particular publication, an LLM can suggest titles consistent with its tone of voice and editorial history.
      </p>
      <p>
        Another important advantage that the use of LLMs can offer is the ability to personalise content for different platforms and audiences. In today's newsrooms, journalists no longer have to worry only about print media but must also consider the web, social media, newsletters, and other digital distribution platforms. Each platform requires a different type of language, style, and length for titles. For example, a title optimised for Twitter (or X) must be short and incisive, while a title for a news website can be more descriptive. An LLM is capable of generating variants of a title based on the medium of dissemination, allowing newsrooms to adapt their content precisely and in a targeted manner. Moreover, using reader behavioural data, the LLM can generate more attractive titles for specific demographic groups, thus improving the engagement and communicative effectiveness of the news.
      </p>
      <p>
        With this task, which is developed in the context of the CALAMITA Challenge [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and which consists in asking an LLM to generate a headline given the corresponding full article, we have a twofold aim.
      </p>
      <p>
        The first aim is to test and analyse the ability of existing and future LLMs on the task of headline generation in the context of Italian news articles. This would provide a substantial step forward compared to past experiments on headline generation for Italian, which were run by training much smaller sequence-to-sequence models from scratch [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ]. We expect that some of the shortcomings of the automatically generated headlines which were observed in previous work, such as lack of fluency and creativity [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], might not affect LLM-based generations.
      </p>
      <p>
        The second aim is to provide a reliable, high-quality dataset of articles and corresponding headlines in Italian, developed through a direct collaboration of language technology experts and journalists, which can be used and analysed well beyond the CALAMITA challenge. Although similar datasets exist for other languages [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ], this resource is still lacking for Italian.
      </p>
      <p>
        Overall, experimenting with the use of LLMs for title generation can also be considered a first step towards the introduction of more extensive and comprehensive artificial intelligence agents, which assist the journalist in all phases of the creative process, from news research to drafting an outline, to writing the actual piece, and finally to its promotion. Indeed, a close interaction of language models and humans in this task has recently been shown to be key [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
    </sec>
    <sec id="sec-1-1">
      <title>2. Challenge Description</title>
      <p>
        The task of headline generation has often been treated as equal to an extreme summarization task [
        <xref ref-type="bibr" rid="ref3 ref7">3, 7</xref>
        ]. However, simply synthesising the content of the article into a brief description is not enough to provide a satisfying title. Additional characteristics such as attractiveness, creativeness, and many others also play a role. Writing appropriate headlines is challenging, even for current state-of-the-art LLMs.
      </p>
      <p>Evaluating LLMs on the task of headline generation for Italian news articles thus serves multiple purposes. On one hand, it tests models' capacity to properly understand, that is, to reprocess large source texts in a way that is faithful to the content of the text. On the other hand, it acts as a means to assess the performance of LLMs in many complex dimensions, such as attractiveness, creativity, or adherence to tone. Finally, this benchmark could prove useful in practical applications. For instance, it may help guide decisions on whether, and to what extent, a journal should integrate LLMs into its workflow. It may also serve as an effective testbed for future research and development towards effective deployment in real-world scenarios; one such avenue could be the use of prompting to achieve the desired style and tone in generated headlines.</p>
      <p>
        In our challenge, language models are tasked with generating Italian-language headlines based on articles from scientific news journals written in Italian. Our dataset includes original articles from such journals, along with their human-authored titles. Models are provided the complete source text in the prompt, as well as instructions to generate a title that is brief, coherent, and captivating. We guide the model towards the specific editorial style of the media outlet by including a small number of example headlines in our prompt. We employ automatic metrics that assess the model's performance along three dimensions:
1. Coherency with the original article (HA classifier)
2. Alignment with the style of human-written headlines (NS classifier)
3. Similarity between the generated and the gold-standard headline (ROUGE [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], SBERT [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ])
      </p>
      <p>However, considering the complexity of the task, we believe that manually reviewing a sample of the generated headlines can offer additional perspectives on the behaviour of the model.</p>
    </sec>
    <sec id="sec-2">
      <title>3. Data description</title>
      <p>Our benchmark is based on two datasets consisting of science news articles from two different sources. In each dataset, we provide the full text of the article paired with the original, human-authored headline. Additionally, we include metadata such as link, date, author (if present) and subtitle.</p>
      <sec id="sec-2-1">
        <title>3.1. Origin of data</title>
        <sec id="sec-2-1-1">
          <title>The data were obtained via web scraping with custom</title>
          <p>Python scripts. Since links to articles more than a few
weeks old are inaccessible on the Ansa website, we
collected a large number by downloading the archived "Ansa
Scienza" RSS feeds from The Wayback Machine and
processing them to remove duplicates and extact links.</p>
        </sec>
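        <p>For illustration, a minimal sketch of this collection step is given below; it queries the Wayback Machine CDX API for archived snapshots of a feed and extracts deduplicated article links. The feed URL and the field handling are illustrative assumptions, not our exact scripts.</p>
        <preformat>
import requests
import feedparser

# Hypothetical feed address; the real "Ansa Scienza" feed URL may differ.
FEED_URL = "https://www.ansa.it/scienza/rss.xml"

# Ask the Wayback Machine CDX API for archived snapshots of the feed.
cdx = requests.get(
    "https://web.archive.org/cdx/search/cdx",
    params={"url": FEED_URL, "output": "json", "filter": "statuscode:200"},
    timeout=30,
).json()

links = set()  # a set removes duplicate links across snapshots
for row in cdx[1:]:  # the first row of the CDX response is a header
    timestamp, original = row[1], row[2]
    snapshot_url = f"https://web.archive.org/web/{timestamp}/{original}"
    feed = feedparser.parse(snapshot_url)
    links.update(entry.link for entry in feed.entries)

print(f"collected {len(links)} unique article links")
        </preformat>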
      </sec>
      <sec id="sec-2-2">
        <title>3.2. Data format</title>
        <p>The data from web scraping were saved in "JSON Lines"
(JSONL) format, with each line containing a JSON object
with the following fields:
• Title: the title of the article
• Source: the name of the website
• Date: the publishing date of the article
• Author: the author of the article, if present
• URL: the Internet address of the article
• Text: the body of the article
• ID: a unique identifier of the article</p>
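        <p>A file in this format can be consumed with the standard library alone; the sketch below iterates over the lines and indexes records by ID (the file name is a placeholder).</p>
        <preformat>
import json

articles = {}
with open("gattina_articles.jsonl", encoding="utf-8") as f:  # placeholder file name
    for line in f:
        record = json.loads(line)  # one JSON object per line
        # Available fields: Title, Source, Date, Author, URL, Text, ID
        articles[record["ID"]] = record

print(len(articles), "articles loaded")
        </preformat>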
      </sec>
      <sec id="sec-2-3">
        <title>3.3. Detailed data statistics</title>
        <sec id="sec-2-3-1">
          <title>Our dataset consists of 30,461 articles gathered from two sources:</title>
        </sec>
        <sec id="sec-2-3-2">
          <title>When measured with “tiktoken o200k_base” tokenizer</title>
          <p>model, we obtained a total of 21,365,897 tokens for the
Galileo dataset (average: 906 tokens per article,
maximum: 24,306) and a total of 3,762,539 tokens for the
Galileo dataset (average: 546 tokens per article,
maximum: 7,600). Figures 1 and 2 depict the distribution of
articles by token count in the Galileo and Ansa datasets
respectively.</p>
        </sec>
      </sec>
      <sec id="sec-2-4">
        <title>3.4. Prompting</title>
        <sec id="sec-2-4-1">
          <title>Due to the length of each article, the use of task examples</title>
          <p>in our prompt would be too computationally expensive.
Therefore, we test the models in a zero-shot prompting
setting. While we do not use any task examples in our
prompt, we do provide seven examples of headlines. In
this way, the model is given examples of the expected
output (a title) rather than examples of the full task
(article and title). Professional journalists made a list of 22
headlines that, in their opinion, were representative of
a well-made writing process under the three aspects of
being captivating, short and informative.</p>
          <p>Each time the model is tested, 7 randomly chosen titles
from the list are appended to the standard prompt. As a
reference, the identifier of the example headlines is also
saved along with the output of the model. See Box 1 for
our input prompt.</p>
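        <p>A minimal sketch of how such a prompt can be assembled; the variable names and bookkeeping format are illustrative, and the instruction text is the one reported in Box 1:</p>
        <preformat>
import random

INSTRUCTIONS = "Il tuo compito è generare un titolo accattivante ..."  # full text in Box 1

def build_prompt(article_text, example_headlines, k=7):
    """Append k randomly chosen example headlines (from the list of 22) to the prompt.

    Also returns the identifiers of the sampled examples, which we save
    alongside the model output for reference.
    """
    sampled = random.sample(list(example_headlines.items()), k)
    examples = "\n".join(title for _, title in sampled)
    prompt = f"{INSTRUCTIONS}\n{examples}\n\nArticolo:\n{article_text}"
    return prompt, [headline_id for headline_id, _ in sampled]
        </preformat>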
        <sec id="sec-2-4-2">
          <title>Box 1: Zero-shot prompt and English translation</title>
          <p>Prompt for the LLM:</p>
          <p>Il tuo compito è generare un titolo accattivante e informativo per l'articolo fornito.</p>
          <p>Requisiti:
- Titolo breve
- Cattura l'essenza dell'articolo
- Usa un linguaggio vivido e coinvolgente
- Non generare alcun tipo di testo che non sia il titolo dell'articolo
- Usa esclusivamente l'Italiano.</p>
          <p>Presta particolare attenzione ai seguenti titoli di esempio e adotta lo stesso stile:
Title 1
Title 2
...
Title 7</p>
          <p>English translation:</p>
          <p>Your task is to generate a catchy and informative title for the article provided.</p>
          <p>Requirements:
- Short title
- Capture the essence of the article
- Use vivid and engaging language
- Do not generate any type of text other than the title of the article
- Use Italian exclusively.</p>
          <p>Pay particular attention to the following example titles and adopt the same style:
Title 1
Title 2
...
Title 7</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Preliminary Evaluation</title>
      <p>To get a first impression of LLM performance on our task, we conducted preliminary experiments by manually reviewing headlines generated by several models. Overall, the results were unsatisfactory: while the titles were generally coherent with the articles, they lacked captivation and originality. The majority of the generated headlines followed the format &lt;Keywords: explanation&gt;, leading to repetitive and poorly formulated headlines. Examples of our preliminary results can be found in Table 1 in Appendix B. This behaviour persisted even when the models were explicitly instructed to avoid using colons in the titles, or when examples of titles were given. Out of 3,006 headlines generated by Phi-3.5 Mini-Instruct, 2,940 contained a colon. We obtained similar results using Mistral-7B-Instruct-v0.3, Qwen2-7B-Instruct, gemma-2-9b-it and Italia-9B-Instruct-v0.1. Manual experimentation with the commercial LLMs Claude 3.5 Sonnet (https://www.anthropic.com/news/claude-3-5-sonnet) and ChatGPT 4o (https://openai.com/index/hello-gpt-4o/) yielded the same behaviour:
• Original title: Una rapina cosmica nell'ammasso di galassie dell'Idra (A cosmic robbery in the Hydra galaxy cluster)
• Claude: Rapina cosmica: il furto di gas nell'ammasso dell'Idra (Cosmic robbery: the theft of gas in the Hydra cluster)
• ChatGPT: Rapina Cosmica: NGC 3312 Derubata di Gas nell'Ammasso di Galassie dell'Idra (Cosmic Robbery: NGC 3312 Robbed of Gas in the Hydra Galaxy Cluster)</p>
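      <p>The colon pattern is straightforward to quantify; a sketch of the check we ran over the generated headlines:</p>
      <preformat>
generated_headlines = [
    "Rapina Cosmica: NGC 3312 Derubata di Gas",   # in practice, load the
    "Una rapina cosmica nell'ammasso dell'Idra",  # model generations here
]

def colon_rate(headlines):
    """Count and fraction of headlines containing a colon."""
    with_colon = sum(1 for h in headlines if ":" in h)
    return with_colon, with_colon / len(headlines)

# For Phi-3.5 Mini-Instruct we observed 2,940 headlines with a colon out of 3,006.
n, rate = colon_rate(generated_headlines)
print(f"{n} headlines with a colon ({rate:.1%})")
      </preformat>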
      <sec id="sec-3-1">
        <title>Interestingly, when we asked Claude 3.5 Sonnet to</title>
        <p>improve our prompt for generating headlines, it added
the line &lt;Struttura: [Frase d’impatto o dato interessante]:
[Spiegazione o contesto]&gt; to our example prompt,
explicitly requesting the unwanted behaviour. It appears that
LLMs consistently regard this particular structure as the
ideal format for a headline.</p>
        <p>Given the inherent dificulty of interpreting LLM
behaviour, we cannot provide a single reason for their
preference for this particular construction. Of course, there
might be a large presence of such headlines in the
training data, particularly from lower-quality journals. There
may also be an influence of Search Engine Optimizations
(SEO) on the behaviour of the model: Giving importance
to keywords is a classic SEO technique.</p>
        <p>Moreover, we generally noticed a preference toward
sentences poor in determinative and indefinite articles
when compared with human written headlines.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Metrics</title>
      <p>
        Automatically evaluating the quality of generated headlines is challenging because headline quality is inherently subjective, multi-faceted, and context-dependent. Thus, instead of providing a single numeric value as an overall quality score, headlines should be evaluated along multiple dimensions and subsequently rated for their quality based on specific use cases. To give examples of what others have done: Cafagna et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] evaluate generated headlines based on criteria such as grammatical correctness, topic relevance, attractiveness, and overall appropriateness. Cai et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] assess factors such as factual consistency, relevance, and surface overlap between the generated headline and the article, as well as its alignment with user-specific preferences.
      </p>
      <p>In the aforementioned papers, the headlines were scored by human evaluators. This approach is resource-intensive: to account for differences in individual preferences, hiring multiple human evaluators from varying demographic backgrounds is preferred. This does not scale well to the evaluation of multiple models on large-scale benchmarks across multiple studies, making the ability to automatically evaluate the outputs of LLMs essential.</p>
      <p>
        Historically, n-gram overlap metrics like BLEU [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], ROUGE [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], or METEOR [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] have been used to compare generated outputs with reference "gold standard" texts, but these metrics emphasise surface-level matching and are therefore not robust to paraphrasing or other variations in acceptable outputs. Learned metrics such as COMET [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], a metric designed to mimic human quality judgement for machine translation, have been gaining in popularity. These are not easily transferable to other languages or tasks, and learned metrics designed specifically for Italian headline generation are not available. Additionally, such metrics typically produce a single numerical score of "quality". To improve interpretability and ensure contextual flexibility, we would prefer to provide individual scores for each dimension. We train two novel learned metrics for Italian headline generation, but leave others for future work.
      </p>
      <p>
        We evaluate model performance on our benchmark using four metrics: ROUGE [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], SBERT [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], and two custom metrics, the Headline-Article and Natural-Synthetic classifiers. Within the context of the CALAMITA challenge, the model's final score will be an aggregate in which all four metrics are weighted equally. Each metric is detailed below.
      </p>
      <sec id="sec-4-1">
        <title>5.1. ROUGE</title>
        <p>
          ROUGE (Recall-Oriented Understudy for Gisting Evaluation) [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] is a popular metric used to evaluate automatically generated summaries. It provides a measure of overlap between generated text and gold-standard references. ROUGE is easily interpretable and allows for easy comparison across many papers due to its widespread use. However, it is not robust to variations in input, making it less suitable for the assessment of tasks involving creativity, such as headline generation. Following others [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], we will evaluate our system outputs using ROUGE-L, which identifies the length of the longest common subsequence between system and reference.
        </p>
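        <p>As an illustration, ROUGE-L can be computed with the rouge-score package; this is one possible configuration, not necessarily the exact one used in the challenge harness:</p>
        <preformat>
from rouge_score import rouge_scorer

# Stemming is disabled because the bundled stemmer targets English, not Italian.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)

reference = "Una rapina cosmica nell'ammasso di galassie dell'Idra"
generated = "Rapina cosmica: il furto di gas nell'ammasso dell'Idra"

score = scorer.score(reference, generated)["rougeL"]
print(score.precision, score.recall, score.fmeasure)
        </preformat>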
      </sec>
      <sec id="sec-4-2">
        <title>5.2. SBERT</title>
        <p>
          Sentence-BERT, or SBERT [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], is a modification of the BERT network that uses Siamese networks and can derive semantically meaningful, fixed-size vector embeddings from whole sentences. We use SBERT to compare our generated headlines to the gold-standard ones by comparing their SBERT embeddings using cosine similarity, which we then use directly as the similarity score. SBERT produces more meaningful sentence embeddings than BERT, which is not designed for sentence similarity tasks; cosine similarity with BERT embeddings could therefore produce unwanted and less interpretable results.
        </p>
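        <p>Computing this score takes a few lines with the sentence_transformers library; a minimal sketch using the Italian SBERT model referenced in this paper:</p>
        <preformat>
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("nickprock/sentence-bert-base-italian-xxl-uncased")

gold = "Una rapina cosmica nell'ammasso di galassie dell'Idra"
generated = "Rapina cosmica: il furto di gas nell'ammasso dell'Idra"

embeddings = model.encode([gold, generated], convert_to_tensor=True)
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()  # used directly as the score
print(f"SBERT similarity: {similarity:.3f}")
        </preformat>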
      </sec>
      <sec id="sec-4-3">
        <title>5.3. Custom metrics</title>
        <p>
          Given the limitations of the currently available metrics for the headline generation task, we develop two custom metrics employing classifiers based on Transformer [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] models. We trained both classifiers on a subset of the "blogs" section of the "Testimole" dataset (https://huggingface.co/datasets/mrinaldi/TestiMole), which was obtained by web scraping various Italian media sources. Our subset consists of only those parts of the dataset scraped from professional media outlets. The criteria for the selection process, as well as the technical details for each classifier, are in Appendices C, D and E.
        </p>
        <sec id="sec-4-3-1">
          <title>5.3.1. HA Classifier</title>
          <p>
            Our first classifier is based on the Sentence Transformers [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ] architecture, fine-tuned to discriminate between coherent and non-coherent pairs of headlines and articles. A generated headline can score between 0 and 1, representative of the degree of alignment between the headline and the content of the article. Following the work by De Mattei et al. [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ], we call this classifier "HA", or Headline-Article.
          </p>
          <p>To train the model, we used a non-finetuned Italian Sentence-BERT model (https://huggingface.co/nickprock/sentence-bert-base-italian-xxl-uncased) to compute an embedding for each article. We then find the headline of the article in the dataset with the highest cosine similarity, and create a new dataset where each row contains the article (anchor), the original title (positive), and the title of the most similar article (negative). Because the original dataset contained some duplicate items, we filtered out all articles with a cosine similarity score of 1. With this dataset, we were able to use Triplet Loss to train the classifier to differentiate between coherent and incoherent titles, starting from the assumption that the original title is the one most coherent with the article's content. We decided to perform a cosine similarity search instead of random shuffling in order to increase the difficulty of the discriminator's task.</p>
          <p>The drawback of this approach is the low context window of the model: all articles were truncated after the first 512 tokens. While it is possible to develop a more complex architecture to account for larger texts, we leave this for future work.</p>
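          <p>At evaluation time, scoring a headline against its article reduces to a similarity lookup in the fine-tuned embedding space; a minimal sketch, where the checkpoint path is a placeholder and the clamping of cosine similarity to the 0-1 range is our straightforward reading of the description above:</p>
          <preformat>
from sentence_transformers import SentenceTransformer, util

ha_model = SentenceTransformer("path/to/fine-tuned-ha-checkpoint")  # placeholder

def ha_score(article_text, headline):
    """Degree of alignment between a headline and the article content (0 to 1)."""
    # The bi-encoder truncates the article to its first 512 tokens.
    article_emb, headline_emb = ha_model.encode(
        [article_text, headline], convert_to_tensor=True)
    return max(0.0, util.cos_sim(article_emb, headline_emb).item())
          </preformat>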
        </sec>
        <sec id="sec-4-3-2">
          <title>5.3.2. NS Classifier</title>
          <p>Our second classifier is called "NS", or Natural-Synthetic. It is a binary regression classifier based on an Italian BERT-base uncased model (https://huggingface.co/dbmdz/bert-base-italian-xxl-uncased), trained to discriminate between human-authored and machine-generated titles. Given a title as input, the classifier outputs a numerical score indicating the likelihood of the title being close to those written by journalists. We believe that similarity to headlines written by journalists may be a useful indicator of the quality and appropriateness of a generated headline.</p>
          <p>Using the same subset of Testimole employed for the "HA" classifier, we generated over 90,000 synthetic headlines using LLMs of up to 9 billion parameters. To avoid overfitting our classifier to the specific probability distribution of a single model, we generated synthetic headlines using different models; this process is detailed in Appendix D, along with details about the number of generated headlines per model. The result is a labelled dataset containing original as well as generated headlines.</p>
          <p>The advantage of employing a "Natural-Synthetic" classifier is that the training objective is coarse, encouraging the classifier to consider a broad range of aspects that may account for the discrepancy between text generated by machines and by humans.</p>
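          <p>Scoring a title with the NS classifier is a standard single-logit inference pass; a sketch using the Hugging Face transformers API and the further-trained checkpoint released on the Hub (see Appendix D):</p>
          <preformat>
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "mrinaldi/flash-it-nsclassifier-fpt"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

def ns_score(title):
    """Likelihood that a title is close to human-authored (journalist-written) ones."""
    inputs = tokenizer(title, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.sigmoid(logits).squeeze().item()  # single logit, BCE training

print(ns_score("Nella Via Lattea c'è un oggetto misterioso, è velocissimo"))
          </preformat>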
      <sec id="sec-6-1">
        <title>This task is aimed at testing the factual knowledge which</title>
        <p>LLMs acquire during their training process, whose
objective is language modelling. This task should not suggest,
or stimulate, that LLMs should commonly be used as
knowledge bases or as reliable sources of factual
information. The investigation underlying this challenge is
research-oriented, aimed at a better understanding of
LLMs’ abilities, and possibly suggest ways to discern
when models might be providing more or less reliable
knowledge and possibly making them more transparent
in their generated output.</p>
      </sec>
      <sec id="sec-6-2">
        <title>We see value in future research using classifiers and re</title>
        <p>gressors to assess specific aspects of generated headlines.
Such metrics have the potential to capture complex
probability distributions over a multitude of dimensions of
the data, including dimensions that are not directly
interpretable to human observation. For instance, a learned
metric that predicts the amount of attention a headline
will generated would be highly useful.</p>
        <p>Inspired by Generative Adversarial Networks (GANs),
we find the employment of classification-based metrics
promising for developing a model specialized in headline
generation. A discriminator/generator training system</p>
      </sec>
      <sec id="sec-6-3">
        <title>5https://huggingface.co/dbmdz/bert-base-italian-xxl-uncased</title>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>9. Data license and copyright issues</title>
      <sec id="sec-7-1">
        <title>Access to the data is granted for the evaluation but cannot be shared publicly at the moment, also for reasons related to data contamination.</title>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <sec id="sec-8-1">
        <title>The authors would like to thank ANSA Scienza and</title>
        <p>Galileo, giornale di scienza - http:\www.galileonet.it for
their interest in the GATTINA CALAMITA challenge
and for the extremely valuable exchange of ideas that
allowed us to shape a task of high potential impact in the
ifeld of journalism.
A. Examples of Good titles
selected by professional
journalists
• Nella Via Lattea c’è un oggetto misterioso, è
velocissimo
• Nasce il gemello digitale del rischio ambientale
in Italia
• I cinque modi in cui il cervello invecchia
• Covid-19, il mistero degli over 90
• A 44 e a 60 anni i due gradini chiave
dell’invecchiamento
• Palestra o snack? la scelta dipende da un
messaggero chimico
• Dagli stadi alle spiagge, sono i salti a sincronizzare
il ballo
• Dalle rose alle melanzane, ecco i geni delle spine
• Così il Covid accelera l’invecchiamento
• Uno zucchero naturale contro la calvizie, bene i
test sui topi
• Scoperto nel cervello il circuito dell’efetto
placebo
• Pronto il Google Earth del cuore umano
• Una molecola può ringiovanire il sistema
immunitario
• Scoperto il dizionario dei sinonimi e contrari del
cervello
• Le farfalle nello stomaco non sono solo un modo
di dire
• Pronto il primo orologio nucleare, il più preciso
del mondo
• Gli uccelli in volo si comportano come gli atomi
• L’Italia ritenta la sfida impossibile della geometria
• Le auto nel trafico come i batteri in cerca di cibo
• Robot come alleati, trovata la chiave per
collaborare con gli umani
• Dalle spugne di vetro grattacieli più sottili e
resistenti
• L’IA non è razionale, fa ragionamenti non logici</p>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>B. Examples of Model Outputs in</title>
    </sec>
    <sec id="sec-10">
      <title>Preliminary Evaluation</title>
      <p>Original Title (Italian)
No allo studente flessibile
No to the flexible student
Alzheimer, la proteina della speranza
Alzheimer, the protein of hope
Un laser estremo
An extreme laser
E’ italiano il primo modello genetico
The first genetic model is Italian
Carta d’identità per le balene
Identity card for whales
Scampati allo shock anafilattico
Escaped from anaphylactic shock
La casa in un clic
The house in a click
Deforestazione senza fine
Endless deforestation
"Più impegno contro il fumo"
More commitment against smoking
Una carovana contro l’Oleoducto
A caravan against the pipeline
Analisi dell’egemonia americana
Analysis of American hegemony
Censura sui dati scientifici
Censorship on scientific data
Siamo tutti africani: ricostruito il Dna degli
antenati
We are all Africans: reconstructed the DNA
of ancestors
Rubbia racconta Galileo divulgatore di
scienza
Rubbia tells Galileo as a science
communicator
Galileo Galilei nel pensiero di Thomas Khun
Galileo Galilei in the thought of Thomas
Khun
Nella maglie della lingua
In the mesh of the language
Europa-America, la migrazione dei virus
Europe-America, the migration of viruses
Infiniti universi allacciati
Infinite universes connected
LLM Generated Title (Italian)
Riforma Universitaria: Studenti in Rivolta contro la Selettività e la Flessibilità
University reform: Students in revolt against selectivity and flexibility
Nicastrina: nuova chiave per Alzheimer, scoperta in Calabria
Nicastrin: new key for Alzheimer’s, discovered in Calabria
Euv: Nuova Frontiera nella Manipolazione Molecolare con Luce Ultra-Sottile
EUV: New frontier in molecular manipulation with ultra-thin light
Creazione topi per studiare il ritardo mentale: scoperta genetica rivoluzionaria
Creation of mice to study mental disability: revolutionary genetic discovery
Genetica della balena: Identificazione di specie protette nel commercio illegale
Whale genetics: Identification of protected species in illegal trade
Scoperto il fattore eNOS nella reazione anafilattica: speranza per una cura
eNOS factor discovered in anaphylactic reaction: hope for a cure
Domotica: Software unificato per controllare tutti gli elettrodomestici,
innovazione Cnr
Home automation: Unified software to control all appliances, Cnr innovation
Deforestazione in aumento in Amazzonia: il Brasile reagisce
Deforestation increasing in the Amazon: Brazil reacts
Bioetica del tabagismo: raccomandazioni per campagne di prevenzione e
interventi regolativi
Bioethics of smoking: recommendations for prevention campaigns and regulatory
interventions
Protesta contro l’Oleodotto Crudos Pasos: Ecuador, ambientalismo e scontro
politico
Protest against the Crudos Pasos pipeline: Ecuador, environmentalism and
political clash
Egemonia Americana: Stati Fuorilegge e Diritto Internazionale
American hegemony: Rogue states and international law
Scienza e Sicurezza: Accademia e Governo Discutono su Divulgazione nei
Lavori Scientifici
Science and security: Academia and government discuss disclosure in scientific
works
DNA dell’Africa: Prove definitive dell’origine umana
DNA of Africa: Definitive proof of human origin
Galileo Divulgatore: Scienza e Civiltà dal XVII al Mondo Moderno
Galileo communicator: Science and civilization from the 17th century to the
modern world
Kuhn e Galileo: Rivoluzione scientifica attraverso la meccanica e l’esperimento
Kuhn and Galileo: Scientific revolution through mechanics and experiment
Realtà e Reti: L’Illusione di una Lingua Universale nella Scienza Cognitiva
Reality and networks: The illusion of a universal language in cognitive science
Scambi e mutua scoperta: Storia sanitaria dell’unificazione globale
Exchanges and mutual discovery: Health history of global unification
Inflazione cosmica: un universo di mondi nascosti
Cosmic inflation: a universe of hidden worlds</p>
    </sec>
    <sec id="sec-11">
      <title>C. Composition of the datasets used to train the classifiers</title>
      <p>
        The dataset we used as a source of material for both the NS and HA classifiers is taken from "Testimole" [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], a massive collection of Italian web-scraped data that includes a "blogs" subset containing, as of November 2024, more than 2.8 million posts from various online blogs and websites. From the original 2.8 million rows, we obtained a much smaller dataset by filtering articles coming from sources that are, in our judgement, more similar to professional media outlets. After this selection process, which yielded a total of 715,335 articles, we filtered out articles written in languages different from Italian by using the "FastText Lang ID" field already present in Testimole. After the foreign-language pruning, the count was 293,518 articles. Finally, we discarded all the rows whose article was shorter than 350 characters, arriving at a final dataset size of 264,455 articles. In the following sections, this dataset will be referred to as "testimole-subset". In order to increase the diversity of data for the HA Classifier, we added to this dataset a collection of 432,000 articles taken from the professional Italian media outlet "Il Fatto Quotidiano"; we had to add this source manually because the articles were missing from the original Testimole dataset due to a scraping issue. In the section on the HA Classifier, we will refer to this additional subset as "testimole-subset-auxiliary". Finally, we are going to refer to the small subset of Galileo used in the testing process as "experimental-dataset". The experimental dataset contains 3,007 original headlines from "Galileo" and 3,007 headlines generated using Phi 3.5 Mini Instruct from the same subset of Galileo's articles.
      </p>
    </sec>
    <sec id="sec-12">
      <title>D. NS Classifier</title>
      <p>For the NS Classifier, we decided to split the testimole-subset dataset in two sets: 60% of the dataset was kept with the original headline ("natural"), while in the remaining 40% the original headline was substituted with a generated one ("synthetic"). The original headline is kept as a reference in a separate column of the dataset. Specifically, we generated 93,921 headlines and kept 132,227 original headlines. There is no contamination between generated and original headlines: no synthetic headlines were generated for headlines that are present in the dataset with the "natural" label. The dataset was then divided into a "test" split (45,230 entries: 26,342 natural, 18,888 synthetic) and a "train" split (180,918 entries: 105,885 natural, 75,033 synthetic) for training. For the generation, we ran Ollama on different models using the same prompt adopted for the evaluation. Table 2 reports the amount of generated headlines for each model used.</p>
      <p>[Table 2: models used for synthetic headline generation via Ollama: llama3.2:3b-instruct-fp16, qwen2.5:7b-instruct-q8_0, aya:8b-23-q8_0, mistral:7b-instruct-v0.3-q6_K, phi3.5:3.8b-mini-instruct-fp16; the per-model counts did not survive extraction.]</p>
      <p>The classifier was created using Hugging Face's transformers library. We initialized the model using AutoModelForSequenceClassification and trained it using a binary cross-entropy loss function (BCEWithLogitsLoss). Training was conducted with a batch size of 32, a learning rate of 2 × 10⁻⁵, and a warmup ratio of 0.1 to help stabilize early training. A linear learning rate scheduler and the AdamW optimizer with gradient clipping were employed to manage learning stability. We also implemented early stopping, monitoring the F1 score to save the best model checkpoint and halt training if the model failed to improve over multiple epochs. The resulting model obtained 95% accuracy on the test set. Accuracy is measured as the number of correctly guessed labels divided by the total number of examples. The threshold to decide for a positive or negative label was set at 0.5. Using a continuous score instead of the threshold led to the same result; for this reason we kept only accuracy in this report.</p>
      <p>After having tested the model, we decided to further train it on the test set in order to have an improved model to be used for the CALAMITA task. We then tested this further-trained model on the smaller "experimental-dataset", containing 3,007 natural and 3,007 synthetic headlines coming from the Galileo dataset. This evaluation obtained an accuracy of 87%.</p>
      <p>While initially we directly used PyTorch to train the experimental versions of the model, we then decided for simplicity to adopt the Hugging Face transformers library, in order to easily upload the model to the Hugging Face hub. The further-trained version of the model is available at https://huggingface.co/mrinaldi/flash-it-nsclassifier-fpt.</p>
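      <p>The setup described above maps naturally onto the transformers Trainer API; a condensed sketch with the hyperparameters reported here (dataset preparation omitted; the single-logit BCE head and the exact argument names are our reconstruction, not the original script):</p>
      <preformat>
import numpy as np
import torch
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, EarlyStoppingCallback,
                          Trainer, TrainingArguments)

model = AutoModelForSequenceClassification.from_pretrained(
    "dbmdz/bert-base-italian-xxl-uncased", num_labels=1)

class BCETrainer(Trainer):
    # Binary cross-entropy on a single logit (BCEWithLogitsLoss).
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        loss = torch.nn.BCEWithLogitsLoss()(outputs.logits.squeeze(-1), labels.float())
        return (loss, outputs) if return_outputs else loss

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    probs = 1 / (1 + np.exp(-logits.squeeze(-1)))  # sigmoid
    return {"f1": f1_score(labels, probs > 0.5)}   # threshold at 0.5

args = TrainingArguments(
    output_dir="ns-classifier",
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    lr_scheduler_type="linear",
    max_grad_norm=1.0,            # gradient clipping
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",   # early stopping monitors the F1 score
)

trainer = BCETrainer(
    model=model, args=args, compute_metrics=compute_metrics,
    train_dataset=train_dataset, eval_dataset=eval_dataset,  # tokenized splits
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()
      </preformat>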
    </sec>
    <sec id="sec-13">
      <title>E. HA Classifier</title>
      <p>In order to build the HA Classifier we first computed, for each article contained in the "testimole-subset" dataset, the embedding of the article's text using SentenceBert with an Italian model (https://huggingface.co/nickprock/sentence-bert-base-italian-xxl-uncased) and added the embedding to a new column in the dataset. Then, we paired each article (source) of the dataset with the article (target) having the highest cosine similarity between the embeddings. After the pairing, both source and target were marked as "used", so that each article can appear no more than once in the resulting dataset, either as a source or as a target. The resulting dataset (https://huggingface.co/datasets/mrinaldi/flash-it-ha-dataset-cossim) has 6 columns:
• Anchor: the body of the "source" article
• Positive: the original title of the "source" article
• Negative: the original title of the "target" article
• Cosine similarity: the cosine similarity between the source's and target's embeddings, computed on their texts
• Url positive: the URL of the source article; it can be used as a key to find the original article in the Testimole dataset
• Url negative: the URL of the target article</p>
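      <p>A compact sketch of this pairing step follows; it is simplified (an exhaustive in-memory similarity matrix), whereas the real pipeline must batch the search over hundreds of thousands of embeddings:</p>
      <preformat>
import torch
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("nickprock/sentence-bert-base-italian-xxl-uncased")
# `articles` is assumed to be a list of records with "Text" and "Title" fields.
embeddings = model.encode([a["Text"] for a in articles], convert_to_tensor=True)

sim = util.cos_sim(embeddings, embeddings)  # pairwise cosine similarities
sim.fill_diagonal_(-1)                      # an article cannot be its own target

used, rows = set(), []
for source in range(len(articles)):
    if source in used:
        continue
    scores = sim[source].clone()
    scores[list(used)] = -1                    # each article is used at most once
    target = int(torch.argmax(scores).item())  # most similar remaining article
    if scores[target] >= 0.9999:               # drop duplicates (cosine similarity of 1)
        continue
    used.update({source, target})
    rows.append({
        "anchor": articles[source]["Text"],
        "positive": articles[source]["Title"],
        "negative": articles[target]["Title"],
        "cosine_similarity": float(scores[target]),
    })
      </preformat>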
      <p>Given the procedure employed for generating this dataset, the resulting number of rows is halved, so that, starting from the original 256,530 entries in the "testimole-subset" dataset, we obtained 128,265 entries, divided into 102,600 train entries and 25,665 test entries. We believe that using the cosine similarity instead of randomly shuffling the articles can improve the performance of the classifier by increasing the difficulty of the task. Results with a classifier trained on randomly paired articles are reported in the table below.</p>
      <p>The classifier was created using SentenceBERT, specifically by initializing the model with the SentenceTransformer class from the sentence_transformers library, using a pre-trained Italian model (https://huggingface.co/nickprock/sentence-bert-base-italian-xxl-uncased). To fine-tune this model, we employed a TripletLoss function to enhance similarity-based ranking in embedding space. The triplet loss was the optimal choice given our dataset, because it requires an anchor, a positive, and a negative example. The goal of the triplet loss is to maximize the distance between the anchor and the negative example while at the same time minimizing the distance between the anchor and the positive example. In this way, we encouraged the formation of meaningful embeddings tailored to minimize the distance between an article and a title coherent with its content, notwithstanding the 512-token length limitation.</p>
      <p>Training was conducted over three epochs with a batch size of 64 for training and 16 for evaluation, using a learning rate of 2 × 10⁻⁵ and a warmup ratio of 0.1 to stabilize initial training steps. We used the SentenceTransformerTrainingArguments class to configure training, applying half-precision floating point (fp16) to speed up processing. An evaluation was performed every 1,000 steps to monitor model performance, with checkpoints saved periodically to retain the best-performing model. We kept the "margin" value at 5, following the documentation of SentenceBert. The resulting classifier outputs a score representing the alignment between the article and its headline.</p>
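      <p>A condensed sketch of this fine-tuning setup with the sentence_transformers v3 training API (dataset loading omitted; train_triplets and eval_triplets are assumed to be datasets.Dataset objects with the anchor/positive/negative columns described above):</p>
      <preformat>
from sentence_transformers import (SentenceTransformer, SentenceTransformerTrainer,
                                   SentenceTransformerTrainingArguments)
from sentence_transformers.losses import TripletLoss

model = SentenceTransformer("nickprock/sentence-bert-base-italian-xxl-uncased")
loss = TripletLoss(model, triplet_margin=5)  # margin kept at 5, per the documentation

args = SentenceTransformerTrainingArguments(
    output_dir="ha-classifier",
    num_train_epochs=3,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    fp16=True,               # half precision to speed up processing
    eval_strategy="steps",
    eval_steps=1000,         # evaluate every 1,000 steps
    save_steps=1000,         # checkpoint periodically, keep the best model
    load_best_model_at_end=True,
)

trainer = SentenceTransformerTrainer(
    model=model, args=args, loss=loss,
    train_dataset=train_triplets, eval_dataset=eval_triplets,
)
trainer.train()
      </preformat>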
      <p>After having trained the HA Classifier on the "testimole-subset" dataset, we decided to use an additional dataset (testimole-subset-auxiliary) to further improve the classifier. Testimole-subset-auxiliary, halved due to the matching, has 216,562 articles, of which 108,281 were used as train and 108,281 as test. The same procedure used for testimole-subset was applied to testimole-subset-auxiliary. The table below sums up the results of the various models on the test datasets.</p>
      <p>[Table: results of the HA Classifier on the test datasets; only scattered values are recoverable: accuracy 0.8552, 0.9135 and 0.9850; ROC AUC; average positive distance 0.73 and 0.72; entry counts 21,949, 98,913 and 106,662.]</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Attanasio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Basile</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Borazio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Croce</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Francis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gili</surname>
          </string-name>
          , E. Musacchio,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nissim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Patti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rinaldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Scalena</surname>
          </string-name>
          ,
          <article-title>CALAMITA: Challenge the Abilities of LAnguage Models in ITAlian</article-title>
          ,
          <source>in: Proceedings of the 10th Italian Conference on Computational Linguistics</source>
          (CLiC-it
          <year>2024</year>
          ), Pisa, Italy, December 4 - December 6,
          <year>2024</year>
          , CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Cafagna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. D.</given-names>
            <surname>Mattei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bacciu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nissim</surname>
          </string-name>
          ,
          <article-title>Suitable doesn't mean attractive. human-based evaluation of automatically generated headlines</article-title>
          , in: R. Bernardi,
          <string-name>
            <given-names>R.</given-names>
            <surname>Navigli</surname>
          </string-name>
          , G. Semeraro (Eds.),
          <source>Proceedings of the Sixth Italian Conference on Computational Linguistics</source>
          , Bari, Italy,
          <source>November 13-15</source>
          ,
          <year>2019</year>
          , volume
          <volume>2481</volume>
          <source>of CEUR Workshop Proceedings</source>
          , CEUR-WS.org,
          <year>2019</year>
          . URL: https://ceur-ws.org/Vol-2481/paper13.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>De Mattei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cafagna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Dell'Orletta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nissim</surname>
          </string-name>
          ,
          <article-title>Invisible to people but not to machines: Evaluation of style-aware headline generation in absence of reliable human judgment</article-title>
          ,
          <source>in: Proceedings of the Twelfth Language Resources and Evaluation Conference</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>6709</fpage>
          -
          <lpage>6717</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>X.</given-names>
            <surname>Ao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <article-title>Pens: A dataset and generic framework for personalized news headline generation</article-title>
          ,
          <source>in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing</source>
          (Volume 1: Long Papers)
          ,
          <year>2021</year>
          , pp.
          <fpage>82</fpage>
          -
          <lpage>92</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Duan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Shou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Cao</surname>
          </string-name>
          , et al.,
          <article-title>Xglue: A new benchmark dataset for cross-lingual pretraining, understanding and generation</article-title>
          , arXiv preprint arXiv:2004.01401 (2020).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Smith-Renner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tetreault</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jaimes</surname>
          </string-name>
          ,
          <article-title>Harnessing the power of LLMs: Evaluating human-AI text co-creation through the lens of news headline generation</article-title>
          , in: H. Bouamor, J. Pino, K. Bali (Eds.),
          <source>Findings of the Association for Computational Linguistics: EMNLP</source>
          <year>2023</year>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          , Singapore,
          <year>2023</year>
          , pp.
          <fpage>3321</fpage>
          -
          <lpage>3339</lpage>
          . URL: https://aclanthology.org/2023.findings-emnlp.217. doi:10.18653/v1/2023.findings-emnlp.217.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Rush</surname>
          </string-name>
          ,
          <article-title>A neural attention model for abstractive sentence summarization</article-title>
          , arXiv Preprint, CoRR, abs/1509.00685 (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>C.-Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>ROUGE: A package for automatic evaluation of summaries</article-title>
          , in: Text Summarization Branches Out, Association for Computational Linguistics, Barcelona, Spain,
          <year>2004</year>
          , pp.
          <fpage>74</fpage>
          -
          <lpage>81</lpage>
          . URL: https://aclanthology.org/W04-1013.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurevych</surname>
          </string-name>
          ,
          <article-title>Sentence-bert: Sentence embeddings using siamese bert-networks</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics</source>
          ,
          <year>2019</year>
          . URL: https://arxiv.org/abs/1908.10084.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>P.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>Generating user-engaging news headlines</article-title>
          ,
          <source>in: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>3265</fpage>
          -
          <lpage>3280</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>K.</given-names>
            <surname>Papineni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Roukos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ward</surname>
          </string-name>
          , W.-J. Zhu,
          <article-title>Bleu: a method for automatic evaluation of machine translation</article-title>
          ,
          <source>in: Proceedings of the 40th annual meeting of the Association for Computational Linguistics</source>
          ,
          <year>2002</year>
          , pp.
          <fpage>311</fpage>
          -
          <lpage>318</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Lavie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Denkowski</surname>
          </string-name>
          ,
          <article-title>The meteor metric for automatic evaluation of machine translation</article-title>
          ,
          <source>Machine translation 23</source>
          (
          <year>2009</year>
          )
          <fpage>105</fpage>
          -
          <lpage>115</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>R.</given-names>
            <surname>Rei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Stewart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Farinha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lavie</surname>
          </string-name>
          ,
          <article-title>Comet: A neural framework for mt evaluation</article-title>
          , arXiv preprint arXiv:2009.09025 (2020).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M.</given-names>
            <surname>Krubiński</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Pecina</surname>
          </string-name>
          ,
          <article-title>Towards unified uni-and multi-modal news headline generation</article-title>
          ,
          <source>in: Findings of the Association for Computational Linguistics: EACL</source>
          <year>2024</year>
          ,
          <year>2024</year>
          , pp.
          <fpage>437</fpage>
          -
          <lpage>450</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>M.</given-names>
            <surname>Rinaldi</surname>
          </string-name>
          , Testimole,
          <year>2024</year>
          . URL: https://huggingface.co/datasets/mrinaldi/TestiMole.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>