=Paper= {{Paper |id=Vol-3878/90_main_long |storemode=property |title=To Click It or Not to Click It: An Italian Dataset for Neutralising Clickbait Headlines |pdfUrl=https://ceur-ws.org/Vol-3878/90_main_long.pdf |volume=Vol-3878 |authors=Daniel Russo,Oscar Araque,Marco Guerini |dblpUrl=https://dblp.org/rec/conf/clic-it/0004AG24 }} ==To Click It or Not to Click It: An Italian Dataset for Neutralising Clickbait Headlines== https://ceur-ws.org/Vol-3878/90_main_long.pdf
                                To Click it or not to Click it: An Italian Dataset for
                                Neutralising Clickbait Headlines
                                Daniel Russo1,2,∗ , Oscar Araque3 and Marco Guerini2
                                1
                                  University of Trento, Trento, Italy
                                2
                                  Fondazione Bruno Kessler, Trento, Italy
                                3
                                  Universidad Politécnica de Madrid, Madrid, Spain


                                               Abstract
                                               Clickbait is a common technique aimed at attracting a reader’s attention, although it can result in inaccuracies and lead to
                                               misinformation. This work explores the role of current Natural Language Processing methods to reduce its negative impact.
                                               To do so, a novel Italian dataset is generated, containing manual annotations for classification, spoiling, and neutralisation of
                                               clickbait. Besides, several experimental evaluations are performed, assessing the performance of current language models.
                                               On the one hand, we evaluate the performance in the task of clickbait detection in a multilingual setting, showing that
                                               augmenting the data with English instances largely improves overall performance. On the other hand, the generation tasks of
                                               clickbait spoiling and neutralisation are explored. The latter is a novel task, designed to increase the informativeness of a
                                               headline, thus removing the information gap. This work opens a new research avenue that has been largely uncharted in the
                                               Italian language.

                                               Keywords
                                               clickbait, natural language processing, natural language generation, large language model, language resource



                                1. Introduction                                                                                            Although clickbait headlines are considered one of
                                                                                                                                        the less harmful forms of fake news, as their main goal
                                Accuracy and truthfulness are essential characteristics of                                              is to increase profit by driving traffic to their website
                                journalism. Nevertheless, in an effort to improve revenue,                                              [6, 7], they can sometimes pose a danger, especially when
                                a large number of newspapers and magazines publish                                                      they deal with potentially harmful topics such as health
                                clickbait articles, a viral journalism strategy that seeks to                                           and science. To address this problem, Natural Language
                                attract users to click on a link to a page through tactics                                              Processing techniques have been widely employed to
                                such as sensationalist stories and catchy headlines that                                                detect clickbait headlines, with a particular focus on the
                                act as bait. The use of these tactics harms the quality of                                              English language [8, 9]. Hagen et al. [10] proposed the
                                news pieces and thus hinders the ability of citizens to                                                 clickbait spoiling task, i.e., the generation of a short text
                                obtain reliable and objective information. The literature                                               that satisfies the curiosity induced by a clickbait post.
                                distinguishes between two main types of clickbait. (i)                                                     In light of this, this work addresses the issue of click-
                                Classical clickbait [1] embeds within the headlines infor-                                              bait in the Italian language, studying its characteristics
                                mation gaps, also known as curiosity gaps [2, 3], in order                                              and the possibilities of current technology to reduce its
                                to arouse curiosity in the reader that is forced to access                                              negative impact. In doing so, we have generated a novel
                                the article’s content which is ultimately disappointing.                                                Italian dataset that gathers a large collection of clickbait
                                Classical clickbait usually makes use of hyperbolic lan-                                                articles, which is made public for the community to use 1 .
                                guage, caps lock, demonstrative pronouns and superla-                                                   We named the dataset ClickBaIT. This dataset contains
                                tive to grasp the user’s attention [1, 4, 5]. (ii) Deceptive                                            manually annotated instances as clickbait/non-clickbait,
                                clickbait [5] refers to headlines that resemble traditional                                             as well as manually generated spoilers and neutralised
                                media headlines by offering a summary of the article, still                                             headlines. We have also performed a thorough multi-
                                leading to content that differs from the reader’s expec-                                                lingual evaluation, exploiting the availability of English
                                tations. These headlines promise high news value but                                                    data to complement our dataset in the task of clickbait
                                deliver content with low news value, resulting in reader                                                detection. Finally, this work also explores the use of our
                                disappointment.                                                                                         annotated dataset and large language models to auto-
                                CLiC-it 2024: Tenth Italian Conference on Computational Linguistics,
                                                                                                                                        matically generate both spoilers and, as a novel task, a
                                Dec 04 — 06, 2024, Pisa, Italy                                                                          neutralised version of clickbait headlines. A graphical
                                ∗
                                     Corresponding author.                                                                              illustration of the experimental design is presented in
                                Envelope-Open drusso@fbk.eu (D. Russo); o.araque@upm.es (O. Araque);                                    Figure 1.
                                guerini@fbk.eu (M. Guerini)
                                Orcid 0009-0006-9123-5316 (D. Russo); 0000-0003-3224-0001
                                (O. Araque); 0000-0003-1582-6617 (M. Guerini)
                                         © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License   1
                                         Attribution 4.0 International (CC BY 4.0).                                                         The dataset is available in https://github.com/oaraque/ClickBaIT




CEUR
                  ceur-ws.org
Workshop      ISSN 1613-0073
Proceedings
           1. CLICKBAIT DETECTION                         Quale malattia colpisce                    3. CLICKBAIT NEUTRALISATION
                                                          500mila persone?
                                                          Question
                                                          Rewriting                                                      La psoriasi: una malattia
                                                                                                                         che colpisce circa
                                                                                       La psoriasi                       500mila persone in Italia
 Una malattia che colpisce
 500mila persone               HEADLINE                                             SPOILER                           NEUTRALISED
                                                                                                                      HEADLINE
 HEADLINE

                               HEADLINE                                                 HEADLINE



                                                                    ARTICLE




                                            2. SPOILER GENERATION

Figure 1: The experimental design is depicted, encompassing three tasks: clickbait detection, spoiler generation, and clickbait
neutralisation. The robot icon represents the language model used for either classification or generation. We utilized DistilBERT
and Llama3-8B for task 1, and LLaMAntino-3-8B for tasks 2 and 3. The models were tested for generative tasks using zero-shot,
few-shot, and fine-tuning configurations, except for question rewriting, for which we employed a few-shot approach.



2. Related Work                                                       3. Dataset
The use of clickbait is common in many news outlets,                  3.1. Dataset Creation
and thus it has been extensively studied.
   There are several works that address clickbait detec-              Data were collected from fourteen news websites2 , noto-
tion: Potthast et al. [8] collected a corpus of clickbait             rious for acting as news aggregators, engaging in plagia-
articles, posted by well-known English-speaking newspa-               rism, lacking fact-checking, and using sensational head-
pers on Twitter, and proposed a set of lexical and semantic           lines to draw in readers. In all the websites, articles are
features to be used with a Random Forest classifier. Fol-             labelled according to specific categories; we decided to
lowing the general trend in Natural Language Processing               focus on four macro-categories: health, science, economy,
(NLP) field, clickbait detection has also been explored               and environment. These categories have been selected
using deep learning methods, such as convolutional [11]               to cover some of the most frequent - and potentially
and recurrent [12] neural networks, as well as more re-               hazardous - domains where clickbait is usually found.
cent Transformer-based approaches [9].                                Since the categories varied a lot from website to web-
   Other works leveraged Natural Language Generations                 site, we manually mapped each category into one of the
(NLG) strategies to create a piece of text, the spoiler, com-         four macro categories under analysis. Two annotators,
prising the information needed to fulfil the curiosity gap            knowledgeable in the area, were then provided with the
present in clickbait headlines. This task was proposed by             headlines and the related articles and were asked to la-
Fröbe et al. [13] with the name of spoiling generation. The           bel whether a headline was clickbait. For aiding in this
authors created the Webis Clickbait Spoiling Corpus 2022,             task, we have used as reference the clickbait measure as
and cast spoiler generation as a Question Answering task.             computed by Arthur et al. [21]. Eventually, given the
Eventually, they open the challenge to the community                  clickbait dataset, the two annotators were required to
through a SemEval-2023 shared task [13, 14]. The op-                  extract the gold spoilers from the article’s text and to pro-
timal spoiler generator operates with five independent                duce the neutralised forms for each headline. To this end,
sequence-to-sequence generative models. It selects the                we employed an author reviewer strategy [22]: an LLM
best spoiler through a majority vote, determined by com-              (ChatGPT gpt-3.5-turbo-0125 3 ) was used to generate
paring edit distances among the outputs [15].                         both the spoilers and the neutralised forms (author com-
   Regarding the languages studied, the majority of works             ponent)4 , and the native Italian speaking annotators were
are based on English. Other works were performed in                   asked to manually post-edited the generations (reviewer
Chinese [16], Turkish [17, 18] and Spanish [19, 20]. To               component).5 This procedure was proven to be more
the best of our knowledge, this is the first work that fully          effective and less time-consuming than writing the data
addresses the study of clickbait detection and spoiling               2
                                                                        Essere Informati, TGNewsItalia, Voxnews, DirettaNews, Informati,
in the Italian language. Moreover, we propose a novel
                                                                        Italia, Jeda News, News Cronaca, TG5Stelle, TG24-ore, ByoBlu,
task, i.e., clickbait neutralisation, which aims at filling             Mag24, WorldNotix, lo sapevi che, Fortementein
the curiosity gap by rewriting the headline levering the              3
                                                                        https://chat.openai.com
                                                                      4
information of the spoiler.                                             In Appendix A.3 we provide the prompt employed
                                                                      5
                                                                        Details in Appendix A.2
   Category            Headline                        Article                Clickbait            Spoiler               Neutralised title
Health          Frutto o fiore? gusto-        Tutti la conosciamo, im-          True      La fragola                   Fragola: gustosissima e
                sissima e attraente, una      mancabile sulle nostre                                                   attraente, una celebrità
                celebrità sulle nostre tav-   tavole, celebre in tutto                                                 sulle nostre tavole
                ole, sveliamo chi è           il mondo ma misteriosa
                                              la sua natura, frutto da
                                              gustare o fiore...
Science         Scoperto un metallo che       Il recente esperimento            True      Il platino                   Il metallo che si auto-
                si auto-ripara. Scienziati    ha rivelato un fenomeno                                                  ripara: il platino
                sbalorditi                    straordinario...
Health          Una malattia che colpisce     Parliamo di una malat-            True      La psoriasi colpisce circa   La psoriasi: una malattia
                500mila persone               tia sistemica cronica me-                   500 mila persone             che colpisce circa 500mila
                                              diata dal sistema immu-                                                  persone in Italia
                                              nitario che interessa...
Environment     Zanzare, ecco come elim-      Con l’arrivo del caldo, an-       True      Per eliminare una volta      Zanzare, ecco come elim-
                inarle senza insetticidi      che le zanzare si fanno                     per tutte le zanzare dalla   inarle senza insetticidi:
                                              largo nelle nostre case o                   vostra casa, dovreste ac-    basta acquistare un pip-
                                              nei nostri giardini...                      quistare un pipistrello      istrello

Table 1
An excerpt of the presented dataset showing the most relevant fields. Article bodies are shortened for space reasons. Translated
text can be found in Table 9 (Appendix B).



from scratch [23]. To assess the amount of post-editing                     3.2. Dataset Analysis
required, we employed Human-targeted Translation Edit
                                                                            The complete ClickBaIT dataset consists of 4,144 entries.
Rate [HTER; 24]. HTER quantifies the minimum edit
                                                                            Each entry includes the following fields: (i) source web-
distance, which is the least number of editing operations
                                                                            site, that specifies the source of the article; (ii) publica-
needed, between a machine-generated text and its post-
                                                                            tion date, which is captured from the original source;
edited counterpart. HTER values exceeding 0.4 indicate
                                                                            (iii) headline text; (iv) article text; (v) original URL;
low-quality outputs; under such circumstances, rewrit-
                                                                            (vi) macro category inferred from the original category
ing the text from scratch or extensive post-editing would
                                                                            extracted from the source; (vii) image URL associated
necessitate comparable effort [25].
                                                                            with the article as specified in the source; (viii) clickbait
   The obtained HTER results for the spoiler generation
                                                                            annotation; (ix) the associated spoiler; and (x) the
(0.4) are higher than those computed upon the neutrali-
                                                                            neutralised version of the title.
sation (0.3), in par or slightly lower than the 0.4 thresh-
                                                                                Table 2 shows the main statistics of the final version
old. The high HTER values, especially for the spoiler
                                                                            of the dataset. The golden set is manually annotated and
annotation, can be attributed to the model’s tendency
                                                                            thus contains high-quality information. Additionally, the
to generate spoilers comprising more details than those
                                                                            silver set has been annotated automatically as described
necessary to fill the curiosity gap. While in some cases
                                                                            and therefore contains a larger number of instances.
a simple deletion was sufficient, in others the annotator
                                                                                To gain a deeper understanding of the content of the
had to rewrite the spoiler almost completely. Regarding
                                                                            dataset we have used Variationist [26], a tool that allows
the annotation of the neutralisation texts, the higher re-
                                                                            to inspect useful statistics and patterns in textual data.
sults are a consequence of the spoiler generation, as the
                                                                            Upon inspection of the data, we have detected several
model was required to generate them simultaneously.
                                                                            patterns frequently used for generating the curiosity gap.
   With this, we have generated the golden set of the
                                                                                Of course, one of the most common strategies used in
dataset, in which all the instances were manually anno-
tated. Further details regarding the dataset creation can
be found in Appendix A. To expand this set, we have used                      Set         Clickbait (%)      Non-clickbait (%)         Total
a clickbait classifier (see Sect. 4.1) to automatically detect
clickbait headlines. This new set of data, automatically                      Golden          698 (53%)                  629 (47%)      1,327
                                                                              Silver        1,563 (56%)                1,224 (44%)      2,787
annotated, constitutes the silver set of our dataset. Sev-
eral examples of dataset entries are provided in Table 1.                     Total                2,261                     1,853      4,114

                                                                            Table 2
                                                                            Size of the presented dataset, considering both golden and
                                                                            silver sets.
clickbait headlines is the formulation of a question that       multilingual-cased 7 ) model trained in a multilingual
is later answered in the article, even though sometimes         setting, and (ii) the Llama3-8B language model (meta-
it is not. In the instance “Quanto è green il gas? ” (How       llama/Meta-Llama-3-8B 8 ). The composed dataset has
green is gas? ) the article explains that gas is not consid-    been split into train and test splits, which have been used
ered green. Another frequent strategy we have detected          to fine-tune and evaluate these models, respectively.
is the introduction to the content of the article, which           To assess the effect of using a mixture of both En-
invites the reader to click it: Beve un cucchiaio di aceto di   glish and Italian instances in the dataset, we evaluate
mele nell’acqua tutti i giorni, ecco cosa succede (Drinks a     the performance of the two models in a monolingual
tablespoon of apple cider vinegar in water every day, this      setting (e.g., fine-tuning in Italian and predicting in the
is what happens).                                               same language) as well as the multilingual variant (e.g.,
   Another usual pattern is the reference to enumerations,      fine-tuning in English and Italian text, and predicting on
frequently using round and manageable numbers such as           Italian instances).
10, 8, and 5. This can be done for introducing numbered
content, as in “Le 10 fantasie femminili più segrete” (The      4.2. Spoiler Generation
10 most secret female fantasies), or even to generate a re-
action in the reader: “Hai solo 10 secondi per salvarti. Ecco   The spoiler generation task consists in generating a
cosa devi fare:” (You only have 10 seconds to save yourself.    short message that fulfils the curiosity gap present in
Here’s what you have to do:). Other means can be used           a given clickbait title, by extracting the information from
to make headlines noticeable, such as introducing text          the linked article. To this end, we tested LLaMAntino-
in all caps, using striking vocabulary or even punctua-         3-ANITA-8B-Inst-DPO-ITA (LLaMAntino-3-8B here-
tion marks, as in “[ALLARME] Truffa AUTO USATE, fate            after) [30] on our clickbait dataset. The model was tested
attenzione!” ([ALERT] USED CAR scam, beware!).                  both in in-context learning (zero- and few-shot) and fine-
   See Table 8 (Appendix A.2) for a collection of patterns      tuning settings.
that have been considered during the manual annotation             Building on prior research that frames spoiler genera-
of the dataset. Besides, Appendix B includes a graphical        tion as a Question Answering task [31], we prompt the
summary of the dataset, while its interactive version can       model to rewrite clickbait headlines as questions and ex-
be accessed online.6 Details are provided in Appendix C.        tract the corresponding answers, i.e., the spoilers, from
                                                                the linked articles.

4. Experimental Design                                          4.3. Clickbait Neutralisation
The experimental design comprises three steps: clickbait The best-performing configuration was employed for the
detection, spoiler generation and clickbait neutralisation. neutralisation of the clickbait headlines. To this end,
                                                            we instructed the LLM to perform a style transfer task,
4.1. Clickbait Detection                                    from a clickbait headline style to a more journalistic one,
                                                            while integrating the spoiler information into the original
This is the first and most basic task aimed at addressing headline.
the clickbait phenomenon. To explore the effect of using
additional data in the training process, we use the Webis-
Clickbait-17 [27], an English dataset containing clickbait 5. Results and Discussion
that is also annotated in a binary fashion.
   Following the insights by Araque et al. [28], we use the 5.1. Evaluation Metrics
training on English data to improve the classification of
Italian data. The main idea is to harness the availability Firstly, for the evaluation of the clickbait detection
of large amounts of English data, generating a compound task we use the macro-averaged precision, recall and
dataset with a lower amount of Italian instances. To do f-score. This allows us to assess the performance even
so, a multilingual mixture dataset is created so that 35% in an unbalanced scenario. For the generation tasks, we
of the final dataset comprises Italian instances, while the assessed lexical similarity through ROUGE score [32]
rest are in English.                                        and semantic similarity. For the latter, text embed-
   We model the detection challenge as a binary clas- dings, computed     9
                                                                              using sentence-bert-base-italian-
sification task: clickbait/non-clickbait. To study the xxl-uncased , were compared using cosine similarity.
complexity of the task, we explore two different models
for classification: (i) a DistilBERT [29] (distil-base- 7 https://huggingface.co/distilbert-base-multilingual-cased
                                                                8
                                                                    https://huggingface.co/meta-llama/Meta-Llama-3-8B
                                                                9
                                                                    https://huggingface.co/nickprock/sentence-bert-base-italian-xxl-
6
    https://oaraque.github.io/ClickBaIT/clickbait.html              uncased
                                   zero-shot                     few-shot                        fine-tuning
                            R1      RL     SemSim         R1       RL      SemSim         R1         RL          SemSim
               headlines   0.189   0.157       0.567     0.250    0.221      0.667       0.260      0.234         0.659
               questions   0.271   0.249       0.645     0.286    0.258      0.630       0.250      0.224         0.646

Table 3
LLaMAntino-3-8B results for the spoiler generation task. We report ROUGE 1 and L (R1, RL) and semantic similarity (SemSim).



5.2. Clickbait detection                                      few examples provided in the few-shot approach, which
                                                              make the model aware of the task while allowing more
Table 4 shows the results of the evaluation in the task
                                                              creative outputs (resulting in lower ROUGE scores). Con-
of clickbait classification. As expected, introducing data
                                                              versely, the fine-tuned model learned from the training
instances in English improves the performance in Italian.
                                                              data to adhere more closely to the source article, which
In the case of classification in Italian, we see a staggering
                                                              comes at the expense of producing semantically richer
improvement for the Llama3 model of 8.43 points. This
                                                              responses (evidenced by lower SemSim scores).
further supports previous results [28]. We argue that
                                                                 Interestingly, casting spoiler generation as a question-
augmenting the training set with instances in a diverse
                                                              answering task yields higher results in the zero-shot set-
language is an effective strategy that can be generalised
                                                              ting compared to using headlines as input. However, the
to other tasks.
                                                              results for few-shot and fine-tuning scenarios tend to be
   We also see that the best model for the classification
                                                              on par. This can be explained by the fact that headlines
of clickbait is the one obtained with Llama3, trained with
                                                              may contain multiple gaps that the human-annotated
both English and Italian data. Hence, we use this model
                                                              dataset accounted for, but the non-supervised “question
to predict on the silver set of our dataset.
                                                              generation” module could not fully capture. Generally,
                                                              this approach leads to sufficiently good results; however,
  Test Train         Model          Prec.     Rec. M-F1       we believe that more attention should be given to the
                     DistilBERT      67.15    70.34    66.94  quality of the questions, either through more efficient
          EN
                     Llama3          68.42    66.46    67.18  prompts or with human-generated/curated data.
  EN
                    DistilBERT     70.28   70.14       70.12
          EN+IT
                    Llama3         71.20   71.15       71.15     5.4. Clickbait Neutralisation Results
                    DistilBERT     68,85    70.47      68.65
          IT                                                     In Table 5, we report the results for clickbait neutralisa-
                    Llama3         66.96    67.19      67.07
 IT                                                              tion. For this task, we prompted LLaMAntino-3-8B with
                     DistilBERT      72.87     74.85     71.77   a few-shot approach, employing the spoilers generated
          EN+IT
                     Llama3         76.32 75.51 75.50            with the three configurations of the previous experiments
Table 4                                                          (headlines as input). Using spoilers generated with the
Results for Clickbait detection. The ‘Test’ and ‘Train’ columns fine-tuned models leads to higher results both for lexi-
indicate the languages of the test and train sets, respectively. cal and semantic metrics. Interestingly, scores tend to
                                                                 increase when the training complexity of the input data
                                                                 increases. In Table 6 we report examples of headlines
                                                                 along with their generated spoilers (through the fine-
5.3. Spoiler Generation Results                                  tuned model) and their neutralisation.
Results for the spoiler generation task are reported in Ta-
ble 3. We evaluated the capabilities of LLaMAntino-3-8B                     input data         R1           RL       SemSim
in both in-context learning scenarios (zero- and few-shot)                zero-shot         0.250         0.212       0.675
and through fine-tuning. As inputs, we used clickbait                     few-shot          0.265         0.223       0.706
headlines and questions generated by ChatGPT, instruct-                   fine-tuning       0.286         0.247       0.715
ing the model to execute a Question Answering task for
                                                                 Table 5
the latter. When using headlines as input, few-shot and          Neutralisation generation results. Automatically generated
fine-tuning approaches outperform zero-shot methods.             spoilers from the previous experiments were used as input for
Few-shot approaches demonstrate higher performance               the few-shot generation of the data. We report ROUGE 1 and
in terms of semantic similarity, while fine-tuning exhibits      L (R1 and RL) and the semantic similarity scores.
stronger lexical adherence to the source document, as
reflected in ROUGE scores. This can be attributed to the
            Headline                                         Spoiler                                       Neutralisation
“Juventus in Serie B”: perché c’è    15 punti di penalizzazione                                     Juventus in grave difficoltà: 15
panico tra i tifosi, la scoperta                                                                    punti di penalizzazione e il ris-
delle ultime ore                                                                                    chio di cadere in Serie B
Lutto tremendo nello sport ital-     “Samuel Dilas era un giocatore di pallacanestro che mili-      Tragico decesso del palla-
iano, morto giovanissimo dopo        tava nel Virtus Lumezzane a Brescia, in Serie B” e “aveva      canestrista Samuel Dilas, 24
un malore                            24 anni” e “era alto 206 centimetri” e “nato a Novellara       anni, ex convalescente da
                                     (Reggio Emilia)” e “aveva un padre di nome Torsen, una         polmonite e giocatore della
                                     madre di nome Chiara e una sorella minore di nome Maia”        Virtus Lumezzane
                                     e “era in convalescenza dopo una polmonite” e “era ar-
                                     rivato alla Virtus Lumezzane nella scorsa stagione”.
Un papà si rifiuta di mangiare       Il cameriere ha sbattuto fuori il maleducato padre che         Un cameriere espelle un cliente
accanto a un bambino Down di         voleva essere spostato a causa della presenza di un bam-       maleducato che chiede di es-
5 anni, il cameriere decide di       bino con sindrome di Down.                                     sere spostato per non sedersi
fare questo                                                                                         accanto a un bambino con sin-
                                                                                                    drome di Down.
E’ doloroso e si forma tra le dita   “L’occhio di pernice è causato principalmente dalla pres-      Come rimuovere l’occhio di per-
dei piedi, ecco come rimuoverlo      sione della scarpa che favorisce la formazione di un’ispes-    nice, un problema di pressione
                                     simento di pelle che provoca dolore, in quanto è soggetto      e attrito causato dalle scarpe
                                     all’attrito tra le dita. Per rimuovere l’occhio di pernice
                                     è fondamentale ammorbidire prima la zona interessata
                                     per poi provare a rimuovere l’ispessimento utilizzando
                                     rimedi naturali senza dolore e in modo semplice.”
La chiamano “LA BOMBA” la            Lo zenzero è un rimedio naturale per il trattamento di         Miscela naturale di limone, zen-
miscela che in sole 24-48 ore        tosse, raffreddore e influenza. La miscela limone, zenzero     zero e miele allevia i sintomi di
elimina influenza, raffreddore e     e miele è ideale per alleviare i sintomi delle comuni malat-   tosse, raffreddore e influenza in
tosse                                tie. Basta prendere 2 o 3 cucchiai della miscela naturale,     pochi giorni.
                                     riempire una tazza con acqua calda e lasciare in infusione
                                     per 3 o 4 minuti.
Table 6
Examples of clickbait headlines, along with the automatically generated spoiler and neutralised version.



6. Conclusion                                               considering certain sensitive domains such as health.
                                                            Thus, we hope that this work facilitates future research
This work presents ClickBaIT, a novel Italian dataset on the topic for example, by addressing the link between
for clickbait modelling, as well as a diverse set of ex- clickbait and misinformation, considering both in a uni-
periments to assess the effectiveness of current models fied framework.
for clickbait detection, spoiling and neutralisation. The
dataset includes news articles that have been manually
annotated to indicate the presence of clickbait, spoilers Acknowledgments
associated with clickbait headlines, and their respective
neutral headlines.                                          This work was partly supported by: the AI4TRUST
   The experiments explore the effectiveness of current project - AI-based-technologies for trustworthy solu-
NLP methods for the modelling of clickbait headlines in tions against disinformation (ID: 101070190), the Euro-
Italian through ClickBaIT. The evaluation for clickbait pean Union’s CERV fund under grant agreement No.
detection shows how training data can be augmented in 101143249 (HATEDEMICS), the European Union’s Hori-
a multilingual setting, which leads to classification im- zon Europe research and innovation programme un-
provements that are in line with previous research [28]. der grant agreement No. 101135437 (AI-CODE). Oscar
The generation experiments, for both spoiling and neu- Araque acknowledges the support of the project UNICO
tralisation, evidence that the evaluated model does ben- I+D Cloud - AMOR, financed by the Ministry of Eco-
efit from in-domain knowledge extracted from the pro- nomic Affairs and Digital Transformation, and the Euro-
posed dataset. As seen, these informed generations are pean Union through Next Generation EU; as well as the
more accurate and align better with the golden text.        support of the project CPP2023-010437 financed by the
   Considering the effect of clickbait, we argue that while MCIN / AEI / 10.13039/501100011033 / FEDER, UE.
there are initially harmless articles, lack of accuracy can
have a detrimental effect on readers. This is clear when
References                                                        v1/2022.acl- long.484 .
                                                             [11] A. Agrawal, Clickbait detection using deep learn-
 [1] K. Scott,      You won’t believe what’s in this              ing, in: 2016 2nd International Conference on Next
     paper!      clickbait, relevance and the curios-             Generation Computing Technologies (NGCT), 2016,
     ity gap,       Journal of Pragmatics 175 (2021)              pp. 268–272. doi:10.1109/NGCT.2016.7877426 .
     53–66. URL: https://www.sciencedirect.com/              [12] S. Kaur, P. Kumar, P. Kumaraguru, Detecting
     science/article/pii/S0378216621000229. doi:https:            clickbaits using two-phase hybrid cnn-lstm biterm
     //doi.org/10.1016/j.pragma.2020.12.023 .                     model, Expert Systems with Applications 151 (2020)
 [2] J. N. Blom, K. R. Hansen, Click bait: Forward-               113350. URL: https://www.sciencedirect.com/
     reference as lure in online news headlines,                  science/article/pii/S0957417420301755. doi:https:
     Journal of Pragmatics 76 (2015) 87–100.                      //doi.org/10.1016/j.eswa.2020.113350 .
     URL:       https://www.sciencedirect.com/science/       [13] M. Fröbe, B. Stein, T. Gollub, M. Hagen, M. Pot-
     article/pii/S0378216614002410.            doi:https:         thast, SemEval-2023 task 5: Clickbait spoiling,
     //doi.org/10.1016/j.pragma.2014.11.010 .                     in: A. K. Ojha, A. S. Doğruöz, G. Da San Mar-
 [3] G. Loewenstein, The psychology of curiosity: A               tino, H. Tayyar Madabushi, R. Kumar, E. Sar-
     review and reinterpretation, Psychological Bulletin          tori (Eds.), Proceedings of the 17th Interna-
     116 (1994) 75–98. doi:10.1037/0033- 2909.116.1.              tional Workshop on Semantic Evaluation (SemEval-
     75 .                                                         2023), Association for Computational Linguis-
 [4] K. Scott, R. Jackson, When everything stands out,            tics, Toronto, Canada, 2023, pp. 2275–2286.
     nothing does, Relevance theory, figuration, and              URL: https://aclanthology.org/2023.semeval-1.312.
     continuity in pragmatics 8 (2020) 167–192.                   doi:10.18653/v1/2023.semeval- 1.312 .
 [5] K. Scott, “deceptive” clickbait headlines: Relevance,   [14] A. K. Ojha, A. S. Doğruöz, G. Da San Martino,
     intentions, and lies, Journal of Pragmatics 218              H. Tayyar Madabushi, R. Kumar, E. Sartori (Eds.),
     (2023) 71–82. URL: https://www.sciencedirect.com/            Proceedings of the 17th International Workshop
     science/article/pii/S0378216623002643. doi:https:            on Semantic Evaluation (SemEval-2023), Asso-
     //doi.org/10.1016/j.pragma.2023.10.004 .                     ciation for Computational Linguistics, Toronto,
 [6] S. Zannettou, M. Sirivianos, J. Blackburn,                   Canada, 2023. URL: https://aclanthology.org/2023.
     N. Kourtellis, The web of false information:                 semeval-1.0.
     Rumors, fake news, hoaxes, clickbait, and various       [15] H. Kurita, I. Ito, H. Funayama, S. Sasaki, S. Moriya,
     other shenanigans, J. Data and Information Quality           Y. Mengyu, K. Kokuta, R. Hatakeyama, S. Sone,
     11 (2019). URL: https://doi.org/10.1145/3309699.             K. Inui, TohokuNLP at SemEval-2023 task 5:
     doi:10.1145/3309699 .                                        Clickbait spoiling via simple Seq2Seq generation
 [7] E. Aïmeur, S. Amri, G. Brassard, Fake news, disin-           and ensembling, in: A. K. Ojha, A. S. Doğruöz,
     formation and misinformation in social media: a              G. Da San Martino, H. Tayyar Madabushi, R. Ku-
     review, Social Network Analysis and Mining 13                mar, E. Sartori (Eds.), Proceedings of the 17th
     (2023) 30.                                                   International Workshop on Semantic Evaluation
 [8] M. Potthast, S. Köpsel, B. Stein, M. Hagen, Clickbait        (SemEval-2023), Association for Computational Lin-
     detection, in: Advances in Information Retrieval:            guistics, Toronto, Canada, 2023, pp. 1756–1762.
     38th European Conference on IR Research, ECIR                URL: https://aclanthology.org/2023.semeval-1.243.
     2016, Padua, Italy, March 20–23, 2016. Proceedings           doi:10.18653/v1/2023.semeval- 1.243 .
     38, Springer, 2016, pp. 810–817.                        [16] T. Liu, K. Yu, L. Wang, X. Zhang, H. Zhou,
 [9] P. Rajapaksha, R. Farahbakhsh, N. Crespi, Bert,              X. Wu, Clickbait detection on wechat: A deep
     xlnet or roberta: The best transfer learning                 model integrating semantic and syntactic infor-
     model to detect clickbaits,          IEEE Access 9           mation, Knowledge-Based Systems 245 (2022)
     (2021) 154704–154716. doi:10.1109/ACCESS.2021.               108605. URL: https://www.sciencedirect.com/
     3128742 .                                                    science/article/pii/S0950705122002714. doi:https:
[10] M. Hagen, M. Fröbe, A. Jurk, M. Potthast, Click-             //doi.org/10.1016/j.knosys.2022.108605 .
     bait spoiling via question answering and pas-           [17] Şura Genç, E. Surer, Clickbaittr: Dataset for click-
     sage retrieval,       in: S. Muresan, P. Nakov,              bait detection from turkish news sites and social me-
     A. Villavicencio (Eds.), Proceedings of the 60th             dia with a comparative analysis via machine learn-
     Annual Meeting of the Association for Com-                   ing algorithms, Journal of Information Science 49
     putational Linguistics (Volume 1: Long Pa-                   (2023) 480–499. doi:10.1177/01655515211007746 .
     pers), Association for Computational Linguistics,       [18] A. Geçkil, A. A. Müngen, E. Gündogan, M. Kaya,
     Dublin, Ireland, 2022, pp. 7025–7036. URL: https://          A clickbait detection method on news sites, in:
     aclanthology.org/2022.acl-long.484. doi:10.18653/            2018 IEEE/ACM International Conference on Ad-
     vances in Social Networks Analysis and Min-                     Exploring multifaceted variation and bias in writ-
     ing (ASONAM), 2018, pp. 932–937. doi:10.1109/                   ten language data, arXiv preprint arxiv:2406.17647
     ASONAM.2018.8508452 .                                           (2024). URL: https://arxiv.org/abs/2406.17647.
[19] C. Oliva, I. Palacio-Marín, L. F. Lago-Fernández,          [27] M. Potthast, T. Gollub, K. Komlossy, S. Schuster,
     D. Arroyo, Rumor and clickbait detection by                     M. Wiegmann, E. Garces Fernandez, M. Hagen,
     combining information divergence measures and                   B. Stein, Crowdsourcing a Large Corpus of Clickbait
     deep learning techniques, in: Proceedings of                    on Twitter, in: E. Bender, L. Derczynski, P. Isabelle
     the 17th International Conference on Availabil-                 (Eds.), 27th International Conference on Compu-
     ity, Reliability and Security, ARES ’22, Association            tational Linguistics (COLING 2018), Association
     for Computing Machinery, New York, NY, USA,                     for Computational Linguistics, 2018, pp. 1498–1507.
     2022. URL: https://doi.org/10.1145/3538969.3543791.             URL: https://aclanthology.org/C18-1127/.
     doi:10.1145/3538969.3543791 .                              [28] O. Araque, M. F. L. Corniel, K. Kalimeri, Towards a
[20] I. García-Ferrero, B. Altuna, Noticia: A clickbait              multilingual system for vaccine hesitancy using a
     article summarization dataset in spanish, arXiv                 data mixture approach., in: Proceedings of the 9th
     preprint arXiv:2404.07611 (2024).                               Italian Conference on Computational Linguistics,
[21] T. E. C. L. Arthur, A. T. Cignarella, S. Frenda, M. Lai,        2023.
     M. A. Stranisci, A. Urbinati, et al., Debunker assis-      [29] V. Sanh, L. Debut, J. Chaumond, T. Wolf, Distilbert,
     tant: a support for detecting online misinformation,            a distilled version of bert: smaller, faster, cheaper
     in: Proceedings of the Ninth Italian Conference                 and lighter, arXiv preprint arXiv:1910.01108 (2019).
     on Computational Linguistics (CLiC-it 2023), vol-          [30] M. Polignano, P. Basile, G. Semeraro, Advanced
     ume 3596, Federico Boschetti, Gianluca E. Lebani,               natural-based interaction for the italian language:
     Bernardo Magnini, Nicole Novielli, 2023, pp. 1–5.               Llamantino-3-anita, 2024. arXiv:2405.07101 .
[22] S. S. Tekiroğlu, Y.-L. Chung, M. Guerini, Generat-         [31] M. Woźny, M. Lango, Generating clickbait spoilers
     ing counter narratives against online hate speech:              with an ensemble of large language models, arXiv
     Data and strategies, in: D. Jurafsky, J. Chai,                  preprint arXiv:2405.16284 (2024).
     N. Schluter, J. Tetreault (Eds.), Proceedings of the       [32] C.-Y. Lin, Rouge: A package for automatic eval-
     58th Annual Meeting of the Association for Com-                 uation of summaries, in: Text summarization
     putational Linguistics, Association for Computa-                branches out, 2004, pp. 74–81.
     tional Linguistics, Online, 2020, pp. 1177–1190. URL:
     https://aclanthology.org/2020.acl-main.110. doi:10.
     18653/v1/2020.acl- main.110 .
[23] D. Russo, S. Kaszefski-Yaschuk, J. Staiano,
     M. Guerini,        Countering misinformation via
     emotional response generation, in: H. Bouamor,
     J. Pino, K. Bali (Eds.), Proceedings of the 2023
     Conference on Empirical Methods in Natural Lan-
     guage Processing, Association for Computational
     Linguistics, Singapore, 2023, pp. 11476–11492. URL:
     https://aclanthology.org/2023.emnlp-main.703.
     doi:10.18653/v1/2023.emnlp- main.703 .
[24] M. Snover, B. Dorr, R. Schwartz, L. Micciulla,
     J. Makhoul, A study of translation edit rate with tar-
     geted human annotation, in: Proceedings of the 7th
     Conference of the Association for Machine Trans-
     lation in the Americas: Technical Papers, Associa-
     tion for Machine Translation in the Americas, Cam-
     bridge, Massachusetts, USA, 2006, pp. 223–231. URL:
     https://aclanthology.org/2006.amta-papers.25.
[25] M. Turchi, M. Negri, M. Federico, Coping with the
     subjectivity of human judgements in MT quality
     estimation, in: Proceedings of the Eighth Workshop
     on Statistical Machine Translation, Association for
     Computational Linguistics, Sofia, Bulgaria, 2013, pp.
     240–251. URL: https://aclanthology.org/W13-2231.
[26] A. Ramponi, C. Casula, S. Menini, Variationist:
       scienza       insetti, animali, AI, scienza, smartphone, Spazio, tecnologia, TECNOLOGIE, SCIENZA, ufo, biochimica,
                     eclissi, bomba atomica, terra piatta, idroelettrico, temperatura, coltivazione, robot, fisica quantistica,
                     macchie solari, ricerca, vulcano, titanio, universo, fotovoltaico, intelligenza, iPhone, hacker, microonde,
                     motori di ricerca, onde elettromagnetiche, tecnologia, sole, scienza, radioterapia, pesticidi, armi
                     chimiche, comete, case farmaceutiche, psichiatria, smartphone, formiche, elettrodomestici, solare,
                     macrobiologi, mondo, lampadine a basso consumo, tecnologia, scienze-e-tech, scienza, scienza,
                     innovazione, scienza, tecnologia-2, animali intelligenti, funzione cognitiva, microchip, cani, samsung,
                     wi fi, tecnologia-e-tv, SCIENZE, TECNOLOGIA, bioetica, biologia, fisica, covid, coronavirus
        salute       Salute, CORONAVIRUS, VAIOLO SCIMMIE, TUBERCOLOSI, SALUTE, SCABBIA, AIDS, salute, hiv,
                     cocaina, antidepressivi, veleni, infezioni, carne, tabacco, infibulazione, fluoro, alcool, alimentari, aids,
                     antibatterico, dieta, insetticida, cibo, benessere, farmaci, digitopressione, caffè, sigarette, ministero
                     della salute, autismo, limoni, cure naturali, paracetamolo, cancro, antiossidante, droga, olio, medicina
                     alternativa, fragole, vegetariano, eroina, dislessia, veleno, zenzero, virus, psicologia, biologico, magne-
                     sio, frutta, psicofarmaci, pollo al cloro, fiori di bach, medico, sonno, birra, vitamina e, ulivi, proteine,
                     stress, banana, pensieri negativi, tumori, benzodiazepine, latte, miele, cuore, epilessia, longevità, mari-
                     juana, diabete, sale, ibernazione, vecchiaia, fegato, vegan, prevenzione, dentifricio, cervello, sistema
                     immunitario, sodio, suicidio, rimedi naturali, maltempo, canapa, pillola, mal di gola, depressione,
                     psiche, alimentazione, ebola, aspartame, dentifricio senza fluoro, tiroide, mangiare, cure proibite,
                     Alzheimer, smog, gas, malattie, calamità, mammografia, verdura, aloe, masticazione, farmaco, igiene,
                     batteri, medicina, vitamina c, epatite c, forfora, energia, vaccini, ormoni, flora batterica, sorbitolo,
                     antibiotici, piedi, obesità, arsenico, cortisolo, chemioterapia, contraccezione, Neurotrasmettitori, semi,
                     melograno, celiachia, Coca cola, salute-benessere, salute, salute-e-benessere, bellezza, dimagrante,
                     benessere, salute-benessere, rimedi-naturali, pianeta-mamma, grano antico, acqua ossigenata, alimet-
                     nazione, ansia, dentisti, curcuma, casa-e-cucina, hobby-e-sport, SPORT, crescita-consapevolezza,
                     la-salute-che-viene, sport, stile-di-vita, consigli, lifestyle, pomodori
      ambiente       Cambiamenti climatici, energia, energia elettrica, Natura, AMBIENTE, ECOLOGIA, global warming,
                     geoingegneria, alberi, pianeta terra, natura, inquinamento, mare, terra, manipolazione climatica,
                     clima, rinnovabili, Dissesto idrogeologico, ecologia, ambiente, green, ambiente-attuale, ecologia,
                     salute-benessere, natura, ambiente, METEO, tempesta solare, astronomia, acido
      economia       affari-online, economia, ECONOMIA, consumi-risparmi, microchip r-fid, bollo auto, tasso d’interesse,
                     finanza, bollette, banche, profitto, spese, economia-finanza, economia, economia, economia-dellanima,
                     fisco-e-tasse, economia, economia, economia, economia-e-finanza

Table 7
Split of the categories into the four macro-categories.



A. Dataset Creation Details                              notators received both a score indicating how much the
                                                         headline was clickbait and automatic ChatGPT gpt-3.5-
A.1. Category Assessment                                 turbo-0125 generated suggestions for the spoilers and
                                                         the neutralized versions of the headlines. Below, we have
In Table 7 we report how the heterogeneous categories
                                                         outlined the annotation guidelines that the annotators
scraped directly from the misleading websites were di-
                                                         were to follow.
vided into the four macro-category of scienza (science),
salute (well-being), ambiente (environment), economia
(economy).                                               Clickbait labelling In order to select the clickbait
                                                         headlines present in the scraped data, the annotators
                                                         were provided with specific guidelines. Table 8 provides
A.2. Annotation Guidelines                               the main key points taken into consideration in order to
Three components of our datasets were subject to human label the data.
intervention to: (i) determine if the headline was click-
bait, (ii) identify the related article’s spoiler, that is, the    Spoiler post-editing For the post-editing of the
information required to satisfy the curiosity gap within           spoiler the annotator was required to spot in the headline
the headline, and (iii) revise the headline to include the         the information gap and to check if the generated spoiler
spoiler information, thereby neutralizing it. During all           was providing that information checking the related ar-
three annotation stages, we employed a machine-human               ticle. If the model failed to find the proper spoiler, the
collaboration to expedite the work of annotators. The an-          annotator had to rewrite it sticking as much as possible to
           Characteristic                          Original example (IT)                           Translated example (EN)

Lack of essential information,           “Ora riposa in pace”. Calcio in lutto, morto       “Now rest in peace”. Football in mourning, one
i.e., the subject the article is talk-   uno dei grandi protagonisti dell’Italia            of Italy’s great protagonists dead
ing about
Sensationalist tone                      Fan ubriaca le salta addosso sul palco. La         Drunk fan jumps on her on stage. Her reaction
                                         sua reazione è incredibile e sconvolge tutti i     is incredible and shocks everyone present
                                         presenti
Questions raised but answered            Tratti della nostra colonna: quali sono? Come      Traits of our column: what are they? How to
in the article body                      evitare lesioni?                                   avoid injuries?
Enumeration of elements                  10 cibi per sbarazzarsi del gonfiore di stomaco    10 foods to get rid of bloated stomach and
                                         e pancia                                           tummy
Use of capitalization                    INFARTO: sopravvivere quando si è soli. Hai        HEART ATTACK: surviving when alone. You
                                         solo 10 secondi per salvarti. Ecco cosa devi       only have 10 seconds to save yourself. Here’s
                                         fare:                                              what you have to do:
Introduction of the content              Zanzare, ecco come eliminarle senza insetti-       Mosquitoes, this is how to eliminate them with-
without actually giving the in-          cidi                                               out insecticide
formation
Use of quotations that do not            Omicron, Ilaria Capua: “Ecco perché i vacci-       Omicron, Ilaria Capua: “This is why the vacci-
give information                         nati si infettano di più rispetto a prima”         nated get more infected than before”

Table 8
Key points used for the annotation of the dataset. Please note that some instances can exemplify more than one point.



the document’s text. If the spoiler was correct but added                     The clickbait headline typically omits key
extra info, the annotator had to keep those extra informa-                    information to create a curiosity gap for
tion only if those were essential for having a complete                       the reader. Your task is to extract this
headline. If the spoiler was correct, then the annotator                      missing information, known as a “spoiler,”
could leave it as it was.                                                     from the article’s text. The spoiler can be
                                                                              a single keyword, a short text passage, or a
Neutralised Clickbait Post-Editing The annotator                              list of keywords. Once you have identified
was required to check if the neutralised forms comprises                      the spoiler, rewrite the clickbait headline
both the headline and the spoiler information. If the                         by incorporating this information to elim-
spoiler was very long (e.g., long listing), then the anno-                    inate the curiosity gap. The output must
tator had to summarise the spoiler as much as possible                        be in JSON format and written in Italian.
aiming to embed in the final novel headline enough in-                        The JSON should include two entries: one
formation to reduce or remove the information gap. If                         called “spoiler” that contains the extracted
the model failed at addressing the spoiler information in                     spoiler(s), and another called “new_head-
the neutralised version of the headline, then the anno-                       line” that has the revised headline.
tator had to manually add it. Moreover, the annotator                         Example Input:
was required to remove sensationalist tones as much as
                                                                              Clickbait headline: “Questo attore ha
possible, if this tone was still creating useless curiosity
                                                                              fatto qualcosa di incredibile sul set di un
in the reader.
                                                                              famoso film!” Article: “Durante le riprese
                                                                              del film ‘Il Gladiatore’, l’attore Russell
A.3. Author Component Instruction                                             Crowe ha deciso di fare un gesto di grande
Hereafter, we provide the instruction employed to au-                         generosità donando una parte significa-
tomatically generate spoilers and the neutralised ver-                        tiva del suo stipendio al fondo per i mem-
sions of the clickbait headlines through ChaGPT gpt-                          bri della troupe.”
3.5-turbo-0125 .                                                              Example Output:
                                                                              {“spoiler”: “Russell Crowe ha donato
         I have a clickbait headline and its corre-                           una parte significativa del suo stipen-
         sponding article, both written in Italian.
                                                                                                                                                                                                                                                                                                                                  Search the chart
                                                                                                                                                                                                                                                                                             Top Non-Clickbait   Characteristic

              Non-Clickbait Frequency
                                                                                                                                                                                                                                                                                             sinner              lutto
                                                                                                                                                                                                                                                                 e        la
 Frequent

                                                                                                                                                                                                                                                            perina                           grazie              tumore
                                                                                                                                                                                                                                                                   i        di
                                                                                                                                                                                                                                                       con del l’                            coronavirus         covid
                                                                                                                                                                                                                                                                            il
                                                                                                                                                                                                                                 italia ha           gli            le   question            contro              infarto
                                                                                                                                                                                            coronavirus           un’nel                                                che
                                                                                                                                                                         all’                                                dei                        lo                                   anche               vaccino
                                                  contro                                                                                                                         alla                        su                                                   cosa
                                                                                                                                                                                                                                              ma                                             italia              sintomi
                                        sinner            anche                                                                                                                               rischio                              dell’                lutto
                                                                                                            italiani                                                             d’
                                         grazie                       ospedale                                            come si                                   mondo              covid                    morto                      cancro         dopo più     ecco                  all’                scoperta
                                                                                         così    ’
                                                 sarà                dove                                                     ora                                                                                         ci                                                                 sarà                scoperto
                                                                                                                                                                               in lutto nella         tumore           ai                 sul
                                                              è morto                                                                         euro                    alle                                                                       tutti
                                                    primo l'                                                                                                    vaccino                    addio                                                                                             alla                campionessa
                                                                           nuova                                     3           tra                                                                                                           |
                                                                                         virus                                                                                                               giorno        fa italiano                                                       sui                 muore
                                             sui       storia consigli                                                       suo
                                                                                                                                                               arriva           vita
                                                                                                                                                                                           mi può                                          delle sport
                                                                                                 %          vialli                                                                                                                                                                           governo             cibi
                                        governo                                                                                                                          sta                                                 solo
                                                                             allerta                                                                             moglie                                                                           perché
                                         contro il dall’ europa                gianluca                            italiana nuovo       tumori                                                                                                                                               italiani            segnali
                                             nei                                                                       campionessa                                                                                  prima
                                                                                                                                                                                                                 funziona            10           questo                                     nel                 capitato
                                             potrebbe                                   sempre                                                                                                       questa
                                                             cura                                                                c’       notizia
                                            scienza                 studiocos’ oggi                                                                                                                                   acqua                                                                  primo               allerta
                                         mihajlovic                                                                                                                                                                                                       casa
                                                         cos’ è                                 fanno foto                           tumore   al                                        tutto          quello che                                 mai                                                            benefici
                                                                                                                                                                                                                                      salute
 Average




                                                                   figli           è un’ il suo                      quale                                              sulla
                                                                                                c’ è         farmaci                         muore                                                                 quello                                                                    Top Clickbait       uccide
                                                        olio             mamma                                                                                                                                                  vi                          quali
                                                                                        malore                                                           2naturale                                tempo             allarme                                                                  ecco
                                                        tre                                           giorni      dieta                                                                                                                attenzione                                                                malore
                                                                  l’ annuncio essere        molto                                                            prezzo           ogni                                                                                                           ecco cosa           salvarti
                                                                                                         nell’                                  scoperta                                                                             ecco come           fare
                                                                                            opinioni e             farmacia question                                                                    malattia          dal                                                                quali
                                                               trova la moglie                                                                                     in farmacia                                                                                                                                   sognare
                                                                                               modo                                                   figlio      farmacia                                                                             sintomi
                                                                                          bene                                            video                                                      in casa                   choc                                                          sintomi             succede
                                                                 poi                                                                                                                                                                               quando
                                                                                       capelli      dolore                                          ne                                                         infarto        corpo                                                          fare
                                                          dovete                                                                                                                                                                                                                                                 polmoni
                                                             bere        cellule     parla          la sua           caso     4                                il campione                           sua
                                                                                                                                 benefici    cosa   significa                                               hanno        ed
                                                                                                                                                                                                                                                                                             casa                morto
                                                                         mangiamo due al giorno                  prima di                      significa                                                    5
                                                                                             coming                          fatto                             dai                                                chi                                                                        cosa                malattia
                                                                                                           naturali                         sono le                                  scoperto
                                                                    previene                                                                               nello                                                                                                                             mai
                                                                                                             morte         sapere       incidente                                                                                           quali sono                                                           dovete
                                                                         uno                                                                           le  donne                                  i sintomi           donne                                                                  quando
                                                                                                    causemese                       avere
                                                                                                                                              sotto choc
                                                                                                                                                                                   cui
                                                                                                                                                                                                              hai
                                                                                                                                                                                                                                          se                                                                     previene
                                                                     famiglia
                                                                                                    trucco                    incredibile
                                                                                                                                                sono i      davvero          uomo da non                                soli                                                                 ti                  mangiamo
                                                                                                               rimedi       7 terribile                    di sognare             8                       sotto                                                                              più
                                                                                                                                                          succede
                                                                                                                                                                                          è mai                                        si è                                                                      rimedi
                                                                                                        la malattia                                                                             mai capitato
                                                                                                                           a cui
                                                                                                                                   auto
                                                                                                                                                 è soli                                                                                                                                      quali sono          sottovalutare
                                                                                                                                                 devi fare                           vi è    capitato                                                                                        questo              pulire
 Infrequent




                                                                                                                                 trucchi                            salvarti
                                                                                                                                               ecco perché                                  segnali                                         ti
                                                                                                                                                                                                       cibi                     devi                       ecco cosa                         se                  terribile
                                                                                                                               6 pulire                                             alimenti


                                                                                                                                                                                                                                                                       Clickbait Frequency

                                           Infrequent                                                                                                  Average                                                                                                     Frequent

                                                                                                                                           Non-Clickbait document count: 481; word count: 5,642
                                                                                                                                            Clickbait document count: 846; word count: 10,647



Figure 2: Frequency of words for both clickbait and non-clickbait categories. On the right, most frequent words
for each class, and both (Characteristic). An interactive version of the graph can be accessed at the following link
https://oaraque.github.io/clickIT/clickbait.html



                                        dio al fondo per i membri della troupe”,                                                                                                        B.2. Dataset Excerpt Translation
                                        “new_headline”: “Russell Crowe ha fatto
                                                                                                                                                                                       Table 9 includes the English translations for the Italian
                                        qualcosa di incredibile sul set di ‘Il Gladi-
                                                                                                                                                                                       examples presented in Table 1.
                                        atore’: ha donato una parte significativa
                                        del suo stipendio al fondo per i membri
                                        della troupe”}                                                                                                                                  C. Experimental Design Details
                                        Please ensure the output is formatted in
                                        JSON as specified and that all content is                                                                                                       C.1. Question Generation
                                        in Italian.
                                                                                                                                                                                        Questions were generated with ChatGPT gpt-3.5-
                                        Now do it for the following headline.                                                                                                           turbo-0125 using the following prompt:
                                        Clickbait headline: “{headline}”
                                                                                                                                                                                                            You will be provided with a clickbait head-
                                        Article:“{article}”                                                                                                                                                 line written in Italian. Your task is to gen-
                                                                                                                                                                                                            erate a question that addresses any miss-
                                                                                                                                                                                                            ing or vague information in the headline.
B. Additional Dataset Details                                                                                                                                                                               Here are some examples:
B.1. Dataset Visualisation                                                                                                                                                                                  Headline: Si chiama la benedizione di Dio:
                                                                                                                                                                                                            rimuove l’alta pressione, il diabete e il
Figure 2 shows a frequency-based visualization of the                                                                                                                                                       grasso nel sangue Question: Che cosa
dataset. It considers the frequency of appearance of rel-                                                                                                                                                   viene chiamato ’benedizione di Dio’?
evant uni and bi-grams for both the clickbait and non-
                                                                                                                                                                                                             Headline: “Emorragia cerebrale”. Italia in
clickbait categories. The figure shows common strategies
                                                                                                                                                                                                             apprensione per il suo campione: ricover-
that are frequent in clickbait content, such as the use of
                                                                                                                                                                                                             ato in condizioni gravissime
“ecco cosa” (this is what) or “quali sono” (what are) that
can be seen in the lower right part.                                                                                                                                                                         Question: Chi è il campione?
                                                                                                                                                                                                             Please generate the question in Italian, en-
                                                                                                                                                                                                             suring it seeks to clarify the ambiguous or
                                                                                                                                                                                                             incomplete details present in the headline.
   Category               Headline                      Article                Clickbait           Spoiler               Neutralised title
Health            Fruit or flower? Tasty        We all know it, inevitable       True      The strawberry              Strawberry: tasty and at-
                  and attractive, a celebrity   on our tables, world-                                                  tractive, a celebrity on
                  on our tables, we reveal      famous, but mysterious is                                              our tables
                  who she is                    its nature, fruit to enjoy
                                                or flower to decorate?
Science           Self-repairing metal dis-     The recent experiment re-        True      Platinum                    The metal that repairs it-
                  covered. Scientists as-       vealed an extraordinary                                                self: platinum
                  tounded                       phenomenon...
Health            A disease that affects        We are talking about             True      Psoriasis affects about     Psoriasis: a disease that
                  500,000 people                a chronic immune-                          500,000 people              affects about 500,000 peo-
                                                mediated         systemic                                              ple in Italy
                                                disease that affects about
                                                1.8 million patients...
Environment       Mosquitoes, here’s how        With the arrival of hot          True      To eliminate mosquitoes     Mosquitoes, here’s how
                  to get rid of them with-      weather, mosquitoes also                   from your home once and     to get rid of them with-
                  out insecticides              make their way into our                    for all, you should buy a   out insecticides: just buy
                                                homes or gardens...                        bat                         a bat

Table 9
Translated from the original Italian. An excerpt of the presented dataset showing the most relevant fields. Article bodies are
shortened for space reasons.



C.2. Spoiler Generation                                                      C.3. Fine-Tuning Details
For the zero-shot spoiler generation task we employed                        The LLaMAntino-3-8B [30] model underwent training
the following prompt:                                                        on a single Ampere A40 GPU with 48GB of memory,
                                                                             employing the QLoRA strategy with a low-rank approxi-
          Ti verranno forniti un titolo clickbait e il                       mation of 64, a low-rank adaptation of 16, and a dropout
          suo articolo corrispondente. Il titolo click-                      rate of 0.1. It was set to evaluate every 50 steps, with a
          bait di solito omette, o non esplicita, in-                        batch size of 4, across 3 epochs, using a learning rate of
          formazioni chiave per creare curiosità nel                         10−4 .
          lettore. Estrai dall’articolo le informazioni                         In the clickbait detection experiments, the DistilBERT
          mancanti o vaghe nel titolo che servono                            and Llama3-8b models have been fine-tuned on the same
          per colmare questa curiosità. La risposta                          GPU. The DistilBERT model has been trained on 10
          può essere un messaggio estremamente                               epochs with a learning rate of 2 ⋅ 10−4 . For the Llama3
          coinciso oppure un elenco. Formatta la                             model, we have used QLoRa with the same characteris-
          risposta nel seguente modo. “Risposta:                             tics as described above, trained on two epochs, with a
          ”                                                          learning rate of 2 ⋅ 10−4 .
          Titolo: {headline}
          Articolo: {article}                                                C.4. Neutralised Clickbait Generation
                                                      The following system prompt (enriched with three exam-
  The same instruction was employed with the fine-
                                                      ples) has been utilised with LLaMAntino-3-8B:
tuned model. For few-shot generation of the spoiler,
we enriched the instruction with two examples.                Ti verrano forniti due testi: un titolo click-
  When casting spoiler generation as a Question An-           bait e un testo, chiamato spoiler, che con-
swering task, the following instruction was employed:         tiene le informazioni mancanti nel titolo.
                                                              Il tuo compito è di riscrivere il titolo
       Ti verrà fornita una domanda e un doc-                 clickbait integrando le informazioni dello
       umento. Trova nel documento le infor-                  spoiler. Il nuovo titolo deve essere infor-
       mazioni per rispondere alla domanda. La                mativo, privo di toni sensazionalistici, e
       risposta può essere un messaggio conciso               breve. Se Lo spoiler contine tante infor-
       oppure un elenco. Formatta la risposta                 mazioni, puoi riassumerle in concetti più
       nel seguente modo. “Risposta: ”                generali.
                                                                                    Titolo: {headline}
                                                                                     Spoiler: {spoiler}
D. Ethical Statement
No specific ethical conflicts have been reported during
the development of this work. The dataset was compiled
from publicly available sources. It is important to ac-
knowledge that the examples in this document are not
indicative of the authors’ opinions or beliefs. Addition-
ally, the ideas or assertions contained within these texts
may be misleading or harmful; therefore, the dataset
should be utilized strictly for research purposes.