<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Click it or not to Click it: An Italian Dataset for Neutralising Clickbait Headlines</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Daniel Russo</string-name>
          <email>drusso@fbk.eu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oscar Araque</string-name>
          <email>o.araque@upm.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Guerini</string-name>
          <email>guerini@fbk.eu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CLiC-it 2024: Tenth Italian Conference on Computational Linguistics</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Fondazione Bruno Kessler</institution>
          ,
          <addr-line>Trento</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Universidad Politécnica de Madrid</institution>
          ,
          <addr-line>Madrid</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>University of Trento</institution>
          ,
          <addr-line>Trento</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <abstract>
        <p>Clickbait is a common technique aimed at attracting a reader's attention, although it can result in inaccuracies and lead to misinformation. This work explores the role of current Natural Language Processing methods in reducing its negative impact. To do so, a novel Italian dataset is generated, containing manual annotations for the classification, spoiling, and neutralisation of clickbait. In addition, several experimental evaluations are performed, assessing the performance of current language models. On the one hand, we evaluate performance on the task of clickbait detection in a multilingual setting, showing that augmenting the data with English instances largely improves overall performance. On the other hand, the generation tasks of clickbait spoiling and neutralisation are explored. The latter is a novel task, designed to increase the informativeness of a headline, thus removing the information gap. This work opens a new research avenue that has been largely uncharted in the literature.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>… journalism. Nevertheless, in an effort to improve revenue, a large number of newspapers and magazines publish clickbait articles, a viral journalism strategy that seeks to attract users to click on a link to a page through tactics such as sensationalist stories and catchy headlines that act as bait. The use of these tactics harms the quality of news pieces and thus hinders the ability of citizens to obtain reliable and objective information. The literature distinguishes between two main types of clickbait. (i) Classical clickbait [1] embeds information gaps, also known as curiosity gaps [2, 3], within the headlines in order to arouse curiosity in the reader, who is forced to access the article's content, which is ultimately disappointing. Classical clickbait usually makes use of hyperbolic language, caps lock, demonstrative pronouns and superlatives to grasp the user's attention [1, 4, 5]. (ii) Deceptive clickbait [5] refers to headlines that resemble traditional media headlines by offering a summary of the article, still leading to content that differs from the reader's expectations. These headlines promise high news value but deliver content with low news value, resulting in reader disappointment.</p>
      <p>… English language [8, 9]. Hagen et al. [10] proposed the clickbait spoiling task, i.e., the generation of a short text that satisfies the curiosity induced by a clickbait post.</p>
      <p>In light of this, this work addresses the issue of clickbait in the Italian language, studying its characteristics and the possibilities of current technology to reduce its negative impact. In doing so, we have generated a novel Italian dataset that gathers a large collection of clickbait articles, which is made public for the community to use.<sup>1</sup></p>
      <p>We named the dataset ClickBaIT. This dataset contains instances manually annotated as clickbait/non-clickbait, as well as manually generated spoilers and neutralised headlines. We have also performed a thorough multilingual evaluation, exploiting the availability of English data to complement our dataset in the task of clickbait detection. Finally, this work also explores the use of our annotated dataset and large language models to automatically generate both spoilers and, as a novel task, a neutralised version of clickbait headlines. A graphical illustration of the experimental design is presented in Figure 1.</p>
      <p>[Figure 1: Graphical illustration of the experimental design, from clickbait detection to clickbait neutralisation. Example: the headline "Quale malattia colpisce 500mila persone?" (Which disease affects 500,000 people?) is rewritten as a question, the spoiler "La psoriasi" (Psoriasis) is extracted from the article, and the neutralised headline "La psoriasi: una malattia che colpisce circa 500mila persone in Italia" (Psoriasis: a disease that affects about 500,000 people in Italy) is produced.]</p>
      <p>CEUR Workshop Proceedings (ceur-ws.org). © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p><sup>1</sup>The dataset is available at https://github.com/oaraque/ClickBaIT</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>The use of clickbait is common in many news outlets, and thus it has been extensively studied.</p>
      <p>There are several works that address clickbait detection: Potthast et al. [8] collected a corpus of clickbait articles, posted by well-known English-speaking newspapers on Twitter, and proposed a set of lexical and semantic features to be used with a Random Forest classifier. Following the general trend in the Natural Language Processing (NLP) field, clickbait detection has also been explored using deep learning methods, such as convolutional [11] and recurrent [12] neural networks, as well as more recent Transformer-based approaches [9].</p>
      <p>Other works leveraged Natural Language Generation (NLG) strategies to create a piece of text, the spoiler, comprising the information needed to fulfil the curiosity gap present in clickbait headlines. This task was proposed by Fröbe et al. [13] with the name of spoiling generation. The authors created the Webis Clickbait Spoiling Corpus 2022 and cast spoiler generation as a Question Answering task. Eventually, they opened the challenge to the community through a SemEval-2023 shared task [13, 14]. The optimal spoiler generator operates with five independent sequence-to-sequence generative models; it selects the best spoiler through a majority vote, determined by comparing edit distances among the outputs [15].</p>
      <p>Regarding the languages studied, the majority of works are based on English. Other works were performed in Chinese [16], Turkish [17, 18] and Spanish [19, 20]. To the best of our knowledge, this is the first work that fully addresses the study of clickbait detection and spoiling in the Italian language. Moreover, we propose a novel task, i.e., clickbait neutralisation, which aims at filling the curiosity gap by rewriting the headline leveraging the information of the spoiler.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
      <sec id="sec-3-1">
        <title>3.1. Dataset Creation</title>
        <p>Data were collected from fourteen news websites<sup>2</sup>, notorious for acting as news aggregators, engaging in plagiarism, lacking fact-checking, and using sensational headlines to draw in readers. In all the websites, articles are labelled according to specific categories; we decided to focus on four macro-categories: health, science, economy, and environment. These categories have been selected to cover some of the most frequent, and potentially hazardous, domains where clickbait is usually found. Since the categories varied a lot from website to website, we manually mapped each category into one of the four macro-categories under analysis. Two annotators, knowledgeable in the area, were then provided with the headlines and the related articles and were asked to label whether a headline was clickbait. To aid in this task, we used as reference the clickbait measure computed by Arthur et al. [21]. Eventually, given the clickbait dataset, the two annotators were required to extract the gold spoilers from the article's text and to produce the neutralised forms for each headline. To this end, we employed an author-reviewer strategy [22]: an LLM (ChatGPT gpt-3.5-turbo-0125<sup>3</sup>) was used to generate both the spoilers and the neutralised forms (author component)<sup>4</sup>, and the native Italian-speaking annotators were asked to manually post-edit the generations (reviewer component).<sup>5</sup> This procedure has been proven to be more effective and less time-consuming than writing the data from scratch [23].</p>
        <p>To assess the amount of post-editing required, we employed the Human-targeted Translation Edit Rate [HTER; 24]. HTER quantifies the minimum edit distance, i.e., the least number of editing operations needed, between a machine-generated text and its post-edited counterpart. HTER values exceeding 0.4 indicate low-quality outputs; under such circumstances, rewriting the text from scratch or extensive post-editing would necessitate comparable effort [25].</p>
        <p>The obtained HTER results for the spoiler generation (0.4) are higher than those computed for the neutralisation (0.3), on par with or slightly lower than the 0.4 threshold. The high HTER values, especially for the spoiler annotation, can be attributed to the model's tendency to generate spoilers comprising more details than necessary to fill the curiosity gap. While in some cases a simple deletion was sufficient, in others the annotator had to rewrite the spoiler almost completely. Regarding the annotation of the neutralisation texts, the higher results are a consequence of the spoiler generation, as the model was required to generate them simultaneously.</p>
        <p>With this, we have generated the golden set of the dataset, in which all the instances were manually annotated. Further details regarding the dataset creation can be found in Appendix A. To expand this set, we have used a clickbait classifier (see Sect. 4.1) to automatically detect clickbait headlines. This new set of data, automatically annotated, constitutes the silver set of our dataset. Several examples of dataset entries are provided in Table 1.</p>
        <p>[Table 1: Examples of dataset entries (headline → spoiler → neutralised headline), all labelled as clickbait: "Frutto o fiore? gustosissima e attraente, una celebrità sulle nostre tavole, sveliamo chi è" → "La fragola" → "Fragola: gustosissima e attraente, una celebrità sulle nostre tavole"; "Scoperto un metallo che si auto-ripara. Scienziati sbalorditi" → "Il platino" → "Il metallo che si autoripara: il platino"; "Una malattia che colpisce 500mila persone" → "La psoriasi colpisce circa 500 mila persone" → "La psoriasi: una malattia che colpisce circa 500mila persone in Italia"; "Zanzare, ecco come eliminarle senza insetticidi" → "Per eliminare una volta per tutte le zanzare dalla vostra casa, dovreste acquistare un pipistrello" → "Zanzare, ecco come eliminarle senza insetticidi: basta acquistare un pipistrello". Each entry also includes an excerpt of the linked article.]</p>
        <p><sup>2</sup>Essere Informati, TGNewsItalia, Voxnews, DirettaNews, Informati Italia, Jeda News, News Cronaca, TG5Stelle, TG24-ore, ByoBlu, Mag24, WorldNotix, lo sapevi che, Fortementein</p>
        <p><sup>3</sup>https://chat.openai.com</p>
        <p><sup>4</sup>In Appendix A.3 we provide the prompt employed.</p>
        <p><sup>5</sup>Details in Appendix A.2.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Dataset Analysis</title>
        <p>The complete ClickBaIT dataset consists of 4,114 entries. Each entry includes the following fields: (i) source website, which specifies the source of the article; (ii) publication date, captured from the original source; (iii) headline text; (iv) article text; (v) original URL; (vi) macro-category inferred from the original category extracted from the source; (vii) image URL associated with the article as specified in the source; (viii) clickbait annotation; (ix) the associated spoiler; and (x) the neutralised version of the title.</p>
        <p>Table 2 shows the main statistics of the final version of the dataset. The golden set is manually annotated and thus contains high-quality information. Additionally, the silver set has been annotated automatically as described and therefore contains a larger number of instances.</p>
        <p>[Table 2: Dataset statistics. Golden set: 698 clickbait (53%), 629 non-clickbait (47%), 1,327 total. Silver set: 1,563 clickbait (56%), 1,224 non-clickbait (44%), 2,787 total. Overall: 2,261 clickbait, 1,853 non-clickbait, 4,114 total.]</p>
        <p>To gain a deeper understanding of the content of the dataset we have used Variationist [26], a tool that allows inspecting useful statistics and patterns in textual data. Upon inspection of the data, we have detected several patterns frequently used for generating the curiosity gap. Of course, one of the most common strategies used in clickbait headlines is the formulation of a question that is later answered in the article, even though sometimes it is not. In the instance "Quanto è green il gas?" (How green is gas?) the article explains that gas is not considered green. Another frequent strategy we have detected is the introduction to the content of the article, which invites the reader to click on it: "Beve un cucchiaio di aceto di mele nell'acqua tutti i giorni, ecco cosa succede" (Drinks a tablespoon of apple cider vinegar in water every day, this is what happens).</p>
        <p>Another usual pattern is the reference to enumerations, frequently using round and manageable numbers such as 10, 8, and 5. This can be done to introduce numbered content, as in "Le 10 fantasie femminili più segrete" (The 10 most secret female fantasies), or even to generate a reaction in the reader: "Hai solo 10 secondi per salvarti. Ecco cosa devi fare:" (You only have 10 seconds to save yourself. Here's what you have to do:). Other means can be used to make headlines noticeable, such as introducing text in all caps, using striking vocabulary or even punctuation marks, as in "[ALLARME] Truffa AUTO USATE, fate attenzione!" ([ALERT] USED CAR scam, beware!).</p>
        <p>See Table 8 (Appendix A.2) for a collection of patterns that have been considered during the manual annotation of the dataset. Besides, Appendix B includes a graphical summary of the dataset, while its interactive version can be accessed online.<sup>6</sup> Details are provided in Appendix C.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Design</title>
      <p>The experimental design comprises three steps: clickbait detection, spoiler generation and clickbait neutralisation.</p>
      <sec id="sec-4-1">
        <title>4.1. Clickbait Detection</title>
        <p>This is the first and most basic task aimed at addressing the clickbait phenomenon. To explore the effect of using additional data in the training process, we use Webis-Clickbait-17 [27], an English dataset containing clickbait that is also annotated in a binary fashion.</p>
        <p>Following the insights by Araque et al. [28], we use training on English data to improve the classification of Italian data. The main idea is to harness the availability of large amounts of English data, generating a compound dataset with a lower amount of Italian instances. To do so, a multilingual mixture dataset is created so that 35% of the final dataset comprises Italian instances, while the rest are in English.</p>
        <p>We model the detection challenge as a binary classification task: clickbait/non-clickbait. To study the complexity of the task, we explore two different models for classification: (i) a DistilBERT [29] model (distilbert-base-multilingual-cased<sup>7</sup>) trained in a multilingual setting, and (ii) the Llama3-8B language model (meta-llama/Meta-Llama-3-8B<sup>8</sup>). The composed dataset has been split into train and test splits, which have been used to fine-tune and evaluate these models, respectively.</p>
        <p>To assess the effect of using a mixture of both English and Italian instances in the dataset, we evaluate the performance of the two models in a monolingual setting (e.g., fine-tuning on Italian and predicting in the same language) as well as in the multilingual variant (e.g., fine-tuning on English and Italian text, and predicting on Italian instances).</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Spoiler Generation</title>
        <p>The spoiler generation task consists in generating a short message that fulfils the curiosity gap present in a given clickbait title, by extracting the information from the linked article. To this end, we tested LLaMAntino-3-ANITA-8B-Inst-DPO-ITA (LLaMAntino-3-8B hereafter) [30] on our clickbait dataset. The model was tested both in in-context learning (zero- and few-shot) and fine-tuning settings.</p>
        <p>Building on prior research that frames spoiler generation as a Question Answering task [31], we prompt the model to rewrite clickbait headlines as questions and extract the corresponding answers, i.e., the spoilers, from the linked articles.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Clickbait Neutralisation</title>
        <p>The best-performing configuration was employed for the neutralisation of the clickbait headlines. To this end, we instructed the LLM to perform a style transfer task, from a clickbait headline style to a more journalistic one, while integrating the spoiler information into the original headline.</p>
      </sec>
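      <p>The multilingual data-mixture setup described in Sect. 4.1 (35% Italian instances, the rest English) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name and the toy (headline, label) pairs are our own, standing in for ClickBaIT and Webis-Clickbait-17 instances.</p>

```python
import random

def make_mixture(italian, english, italian_share=0.35, seed=0):
    """Compose a training set in which `italian_share` of the final
    examples are Italian and the remainder are sampled from the
    (much larger) English pool."""
    rng = random.Random(seed)
    # Total size implied by fixing the Italian share of the mixture.
    n_total = int(len(italian) / italian_share)
    n_english = min(n_total - len(italian), len(english))
    mixed = list(italian) + rng.sample(english, n_english)
    rng.shuffle(mixed)  # avoid language-ordered training batches
    return mixed

# Toy (headline, label) pairs standing in for the real datasets.
italian = [("titolo %d" % i, i % 2) for i in range(35)]
english = [("headline %d" % i, i % 2) for i in range(1000)]
mixed = make_mixture(italian, english)
```

      <p>With 35 Italian examples the mixture contains 100 instances, 35 of them Italian; the resulting list would then be split into train and test portions before fine-tuning.</p>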
    </sec>
    <sec id="sec-5">
      <title>5. Results and Discussion</title>
      <sec id="sec-5-1">
        <title>5.1. Evaluation Metrics</title>
        <p>Firstly, for the evaluation of the clickbait detection task we use the macro-averaged precision, recall and F-score. This allows us to assess performance even in an unbalanced scenario. For the generation tasks, we assessed lexical similarity through the ROUGE score [32] and semantic similarity. For the latter, text embeddings, computed using sentence-bert-base-italian-xxl-uncased<sup>9</sup>, were compared using cosine similarity.</p>
        <p><sup>6</sup>https://oaraque.github.io/ClickBaIT/clickbait.html</p>
        <p><sup>7</sup>https://huggingface.co/distilbert-base-multilingual-cased</p>
        <p><sup>8</sup>https://huggingface.co/meta-llama/Meta-Llama-3-8B</p>
        <p><sup>9</sup>https://huggingface.co/nickprock/sentence-bert-base-italian-xxl-uncased</p>
        <p>[Table 3: Results for spoiler generation with headlines and questions as input, reported in terms of ROUGE-1 (R1), ROUGE-L (RL) and semantic similarity (SemSim).]</p>
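      <p>The two families of metrics used for the generation tasks can be sketched as follows. This is a simplified illustration: ROUGE-1 is reduced to whitespace-tokenised unigram overlap, and the cosine function takes plain lists of floats, whereas the paper compares sentence-BERT embeddings.</p>

```python
from collections import Counter
import math

def rouge1_f(candidate, reference):
    """ROUGE-1 F-score: unigram-overlap F1 between a generated text
    and a reference (whitespace tokenisation as a simplification)."""
    c = Counter(candidate.lower().split())
    r = Counter(reference.lower().split())
    overlap = sum((c & r).values())
    if not overlap:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

      <p>A generated spoiler identical to the gold one scores 1.0 on both metrics; lexically divergent but semantically close outputs score low on ROUGE while retaining a high cosine similarity, which is the contrast discussed in Sect. 5.3.</p>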
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Clickbait Detection</title>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Spoiler Generation Results</title>
        <p>Results for the spoiler generation task are reported in
Table 3. We evaluated the capabilities of LLaMAntino-3-8B
in both in-context learning scenarios (zero- and few-shot)
and through fine-tuning. As inputs, we used clickbait
headlines and questions generated by ChatGPT,
instructing the model to execute a Question Answering task for
the latter. When using headlines as input, few-shot and
fine-tuning approaches outperform zero-shot methods.
Few-shot approaches demonstrate higher performance
in terms of semantic similarity, while fine-tuning exhibits
stronger lexical adherence to the source document, as
reflected in ROUGE scores. This can be attributed to the
few examples provided in the few-shot approach, which
make the model aware of the task while allowing more
creative outputs (resulting in lower ROUGE scores).
Conversely, the fine-tuned model learned from the training
data to adhere more closely to the source article, which
comes at the expense of producing semantically richer
responses (evidenced by lower SemSim scores).</p>
        <p>Interestingly, casting spoiler generation as a
question-answering task yields higher results in the zero-shot
setting compared to using headlines as input. However, the
results for few-shot and fine-tuning scenarios tend to be
on par. This can be explained by the fact that headlines
may contain multiple gaps that the human-annotated
dataset accounted for, but the non-supervised “question
generation” module could not fully capture. Generally,
this approach leads to sufficiently good results; however,
we believe that more attention should be given to the
quality of the questions, either through more efficient
prompts or with human-generated/curated data.</p>
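        <p>The question-answering framing evaluated above can be sketched as a prompt-assembly step: the headline is first rewritten as a question, and the model is asked to answer it from the linked article. The instruction wording and field labels below are our illustration, not the exact prompt used in the experiments; the demonstration comes from the dataset examples in Table 1.</p>

```python
def build_spoiler_prompt(headline, article, examples=()):
    """Assemble a few-shot prompt that casts spoiler generation as
    question answering: rewrite the headline as a question, then
    extract the answer (the spoiler) from the article."""
    parts = ["Rewrite the headline as a question, then answer it "
             "using only the article text."]
    for ex in examples:  # few-shot demonstrations from the golden set
        parts.append("Headline: %s\nArticle: %s\nQuestion: %s\nSpoiler: %s"
                     % (ex["headline"], ex["article"],
                        ex["question"], ex["spoiler"]))
    # The target instance ends with an open "Question:" slot for the model.
    parts.append("Headline: %s\nArticle: %s\nQuestion:" % (headline, article))
    return "\n\n".join(parts)

demo = {"headline": "Una malattia che colpisce 500mila persone",
        "article": "Parliamo di una malattia sistemica cronica ...",
        "question": "Quale malattia colpisce 500mila persone?",
        "spoiler": "La psoriasi"}
prompt = build_spoiler_prompt("Frutto o fiore? Sveliamo chi è",
                              "Tutti la conosciamo ...", [demo])
```

        <p>In the zero-shot setting `examples` is empty and only the instruction precedes the target instance, which is where the question-based input helps the most according to the results above.</p>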
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Clickbait Neutralisation Results</title>
        <p>In Table 5, we report the results for clickbait
neutralisation. For this task, we prompted LLaMAntino-3-8B with
a few-shot approach, employing the spoilers generated
with the three configurations of the previous experiments
(headlines as input). Using spoilers generated with the
fine-tuned models leads to higher results both for
lexical and semantic metrics. Interestingly, scores tend to
increase when the training complexity of the input data
increases. In Table 6 we report examples of headlines
along with their generated spoilers (through the
fine-tuned model) and their neutralisation.</p>
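        <p>The few-shot neutralisation step can be sketched in the same prompt-assembly style: the generated spoiler is folded back into the headline under a style-transfer instruction. Wording and field labels are our illustration, not the authors' prompt; the demonstration reuses a dataset example.</p>

```python
def build_neutralisation_prompt(headline, spoiler, examples=()):
    """Few-shot style-transfer prompt: rewrite a clickbait headline
    in a journalistic style while integrating the information
    carried by the spoiler."""
    parts = ["Rewrite the headline in a neutral, journalistic style, "
             "integrating the information given by the spoiler."]
    for ex in examples:  # demonstrations of headline + spoiler -> neutral form
        parts.append("Headline: %s\nSpoiler: %s\nNeutralised: %s"
                     % (ex["headline"], ex["spoiler"], ex["neutralised"]))
    parts.append("Headline: %s\nSpoiler: %s\nNeutralised:"
                 % (headline, spoiler))
    return "\n\n".join(parts)

demo = {"headline": "Una malattia che colpisce 500mila persone",
        "spoiler": "La psoriasi",
        "neutralised": "La psoriasi: una malattia che colpisce "
                       "circa 500mila persone in Italia"}
prompt = build_neutralisation_prompt(
    "Scoperto un metallo che si auto-ripara. Scienziati sbalorditi",
    "Il platino", [demo])
```

        <p>Since the spoiler is an input here, the quality of the neutralised headline depends on the spoiler-generation configuration, which is consistent with the scores reported above improving as the spoilers improve.</p>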
        <p>[Table 5: Clickbait neutralisation results by the spoilers employed as input data. zero-shot: R1 0.250, RL 0.212, SemSim 0.675; few-shot: R1 0.265, RL 0.223, SemSim 0.706; fine-tuning: R1 0.286, RL 0.247, SemSim 0.715.]</p>
        <p>[Table 6: Examples of generated spoilers and neutralised headlines.] Generated spoilers: "L'occhio di pernice è causato principalmente dalla pressione della scarpa che favorisce la formazione di un ispessimento di pelle che provoca dolore, in quanto è soggetto all'attrito tra le dita. Per rimuovere l'occhio di pernice è fondamentale ammorbidire prima la zona interessata per poi provare a rimuovere l'ispessimento utilizzando rimedi naturali senza dolore e in modo semplice." / "Lo zenzero è un rimedio naturale per il trattamento di tosse, raffreddore e influenza. La miscela limone, zenzero e miele è ideale per alleviare i sintomi delle comuni malattie. Basta prendere 2 o 3 cucchiai della miscela naturale, riempire una tazza con acqua calda e lasciare in infusione per 3 o 4 minuti."</p>
        <p>Neutralised headlines: "Juventus in grave difficoltà: 15 punti di penalizzazione e il rischio di cadere in Serie B" / "Tragico decesso del pallacanestrista Samuel Dilas, 24 anni, ex convalescente da polmonite e giocatore della Virtus Lumezzane" / "Un cameriere espelle un cliente maleducato che chiede di essere spostato per non sedersi accanto a un bambino con sindrome di Down." / "Come rimuovere l'occhio di pernice, un problema di pressione e attrito causato dalle scarpe" / "Miscela naturale di limone, zenzero e miele allevia i sintomi di tosse, raffreddore e influenza in pochi giorni."</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>considering certain sensitive domains such as health.</p>
      <p>Thus, we hope that this work facilitates future research
This work presents ClickBaIT, a novel Italian dataset on the topic for example, by addressing the link between
for clickbait modelling, as well as a diverse set of ex- clickbait and misinformation, considering both in a
uniperiments to assess the efectiveness of current models ifed framework.
for clickbait detection, spoiling and neutralisation. The
dataset includes news articles that have been manually
annotated to indicate the presence of clickbait, spoilers Acknowledgments
associated with clickbait headlines, and their respective
neutral headlines. This work was partly supported by: the AI4TRUST</p>
      <p>The experiments explore the efectiveness of current project - AI-based-technologies for trustworthy
soluNLP methods for the modelling of clickbait headlines in tions against disinformation (ID: 101070190), the
EuroItalian through ClickBaIT. The evaluation for clickbait pean Union’s CERV fund under grant agreement No.
detection shows how training data can be augmented in 101143249 (HATEDEMICS), the European Union’s
Horia multilingual setting, which leads to classification im- zon Europe research and innovation programme
unprovements that are in line with previous research [28]. der grant agreement No. 101135437 (AI-CODE). Oscar
The generation experiments, for both spoiling and neu- Araque acknowledges the support of the project UNICO
tralisation, evidence that the evaluated model does ben- I+D Cloud - AMOR, financed by the Ministry of
Ecoefit from in-domain knowledge extracted from the pro- nomic Afairs and Digital Transformation, and the
Europosed dataset. As seen, these informed generations are pean Union through Next Generation EU; as well as the
more accurate and align better with the golden text. support of the project CPP2023-010437 financed by the</p>
      <p>Considering the efect of clickbait, we argue that while MCIN / AEI / 10.13039/501100011033 / FEDER, UE.
there are initially harmless articles, lack of accuracy can
have a detrimental efect on readers. This is clear when
vances in Social Networks Analysis and Min- Exploring multifaceted variation and bias in
writing (ASONAM), 2018, pp. 932–937. doi:10.1109/ ten language data, arXiv preprint arxiv:2406.17647
ASONAM.2018.8508452. (2024). URL: https://arxiv.org/abs/2406.17647.
[19] C. Oliva, I. Palacio-Marín, L. F. Lago-Fernández, [27] M. Potthast, T. Gollub, K. Komlossy, S. Schuster,
D. Arroyo, Rumor and clickbait detection by M. Wiegmann, E. Garces Fernandez, M. Hagen,
combining information divergence measures and B. Stein, Crowdsourcing a Large Corpus of Clickbait
deep learning techniques, in: Proceedings of on Twitter, in: E. Bender, L. Derczynski, P. Isabelle
the 17th International Conference on Availabil- (Eds.), 27th International Conference on
Compuity, Reliability and Security, ARES ’22, Association tational Linguistics (COLING 2018), Association
for Computing Machinery, New York, NY, USA, for Computational Linguistics, 2018, pp. 1498–1507.
2022. URL: https://doi.org/10.1145/3538969.3543791. URL: https://aclanthology.org/C18-1127/.
doi:10.1145/3538969.3543791. [28] O. Araque, M. F. L. Corniel, K. Kalimeri, Towards a
[20] I. García-Ferrero, B. Altuna, Noticia: A clickbait multilingual system for vaccine hesitancy using a
article summarization dataset in spanish, arXiv data mixture approach., in: Proceedings of the 9th
preprint arXiv:2404.07611 (2024). Italian Conference on Computational Linguistics,
[21] T. E. C. L. Arthur, A. T. Cignarella, S. Frenda, M. Lai, 2023.</p>
      <p>M. A. Stranisci, A. Urbinati, et al., Debunker assis- [29] V. Sanh, L. Debut, J. Chaumond, T. Wolf, Distilbert,
tant: a support for detecting online misinformation, a distilled version of bert: smaller, faster, cheaper
in: Proceedings of the Ninth Italian Conference and lighter, arXiv preprint arXiv:1910.01108 (2019).
on Computational Linguistics (CLiC-it 2023), vol- [30] M. Polignano, P. Basile, G. Semeraro, Advanced
ume 3596, Federico Boschetti, Gianluca E. Lebani, natural-based interaction for the italian language:
Bernardo Magnini, Nicole Novielli, 2023, pp. 1–5. Llamantino-3-anita, 2024. arXiv:2405.07101.
[22] S. S. Tekiroğlu, Y.-L. Chung, M. Guerini, Generat- [31] M. Woźny, M. Lango, Generating clickbait spoilers
ing counter narratives against online hate speech: with an ensemble of large language models, arXiv
Data and strategies, in: D. Jurafsky, J. Chai, preprint arXiv:2405.16284 (2024).</p>
      <p>N. Schluter, J. Tetreault (Eds.), Proceedings of the [32] C.-Y. Lin, Rouge: A package for automatic
eval58th Annual Meeting of the Association for Com- uation of summaries, in: Text summarization
putational Linguistics, Association for Computa- branches out, 2004, pp. 74–81.
tional Linguistics, Online, 2020, pp. 1177–1190. URL:
https://aclanthology.org/2020.acl-main.110. doi:10.</p>
      <p>18653/v1/2020.acl- main.110.
[23] D. Russo, S. Kaszefski-Yaschuk, J. Staiano,</p>
      <p>M. Guerini, Countering misinformation via
emotional response generation, in: H. Bouamor,
J. Pino, K. Bali (Eds.), Proceedings of the 2023
Conference on Empirical Methods in Natural
Language Processing, Association for Computational
Linguistics, Singapore, 2023, pp. 11476–11492. URL:
https://aclanthology.org/2023.emnlp-main.703.</p>
      <p>doi:10.18653/v1/2023.emnlp- main.703.
[24] M. Snover, B. Dorr, R. Schwartz, L. Micciulla,</p>
      <p>J. Makhoul, A study of translation edit rate with
targeted human annotation, in: Proceedings of the 7th
Conference of the Association for Machine
Translation in the Americas: Technical Papers,
Association for Machine Translation in the Americas,
Cambridge, Massachusetts, USA, 2006, pp. 223–231. URL:
https://aclanthology.org/2006.amta-papers.25.
[25] M. Turchi, M. Negri, M. Federico, Coping with the
subjectivity of human judgements in MT quality
estimation, in: Proceedings of the Eighth Workshop
on Statistical Machine Translation, Association for
Computational Linguistics, Sofia, Bulgaria, 2013, pp.</p>
      <p>240–251. URL: https://aclanthology.org/W13-2231.
[26] A. Ramponi, C. Casula, S. Menini, Variationist:
</p>
      <p>Table 7. Mapping of the categories scraped from the misleading websites onto the four macro-categories of scienza (science), salute (well-being), ambiente (environment), and economia (economy):</p>
      <p>scienza (science): insetti, animali, AI, scienza, smartphone, Spazio, tecnologia, TECNOLOGIE, SCIENZA, ufo, biochimica, eclissi, bomba atomica, terra piatta, idroelettrico, temperatura, coltivazione, robot, fisica quantistica, macchie solari, ricerca, vulcano, titanio, universo, fotovoltaico, intelligenza, iPhone, hacker, microonde, motori di ricerca, onde elettromagnetiche, tecnologia, sole, scienza, radioterapia, pesticidi, armi chimiche, comete, case farmaceutiche, psichiatria, smartphone, formiche, elettrodomestici, solare, macrobiologi, mondo, lampadine a basso consumo, tecnologia, scienze-e-tech, scienza, scienza, innovazione, scienza, tecnologia-2, animali intelligenti, funzione cognitiva, microchip, cani, samsung, wi fi, tecnologia-e-tv, SCIENZE, TECNOLOGIA, bioetica, biologia, fisica, covid, coronavirus</p>
      <p>salute (well-being): Salute, CORONAVIRUS, VAIOLO SCIMMIE, TUBERCOLOSI, SALUTE, SCABBIA, AIDS, salute, hiv, cocaina, antidepressivi, veleni, infezioni, carne, tabacco, infibulazione, fluoro, alcool, alimentari, aids, antibatterico, dieta, insetticida, cibo, benessere, farmaci, digitopressione, cafè, sigarette, ministero della salute, autismo, limoni, cure naturali, paracetamolo, cancro, antiossidante, droga, olio, medicina alternativa, fragole, vegetariano, eroina, dislessia, veleno, zenzero, virus, psicologia, biologico, magnesio, frutta, psicofarmaci, pollo al cloro, fiori di bach, medico, sonno, birra, vitamina e, ulivi, proteine, stress, banana, pensieri negativi, tumori, benzodiazepine, latte, miele, cuore, epilessia, longevità, marijuana, diabete, sale, ibernazione, vecchiaia, fegato, vegan, prevenzione, dentifricio, cervello, sistema immunitario, sodio, suicidio, rimedi naturali, maltempo, canapa, pillola, mal di gola, depressione, psiche, alimentazione, ebola, aspartame, dentifricio senza fluoro, tiroide, mangiare, cure proibite, Alzheimer, smog, gas, malattie, calamità, mammografia, verdura, aloe, masticazione, farmaco, igiene, batteri, medicina, vitamina c, epatite c, forfora, energia, vaccini, ormoni, flora batterica, sorbitolo, antibiotici, piedi, obesità, arsenico, cortisolo, chemioterapia, contraccezione, Neurotrasmettitori, semi, melograno, celiachia, Coca cola, salute-benessere, salute, salute-e-benessere, bellezza, dimagrante, benessere, salute-benessere, rimedi-naturali, pianeta-mamma, grano antico, acqua ossigenata, alimetnazione, ansia, dentisti, curcuma, casa-e-cucina, hobby-e-sport, SPORT, crescita-consapevolezza, la-salute-che-viene, sport, stile-di-vita, consigli, lifestyle, pomodori</p>
      <p>ambiente (environment): Cambiamenti climatici, energia, energia elettrica, Natura, AMBIENTE, ECOLOGIA, global warming, geoingegneria, alberi, pianeta terra, natura, inquinamento, mare, terra, manipolazione climatica, clima, rinnovabili, Dissesto idrogeologico, ecologia, ambiente, green, ambiente-attuale, ecologia, salute-benessere, natura, ambiente, METEO, tempesta solare, astronomia, acido</p>
      <p>economia (economy): afari-online, economia, ECONOMIA, consumi-risparmi, microchip r-fid, bollo auto, tasso d’interesse, finanza, bollette, banche, profitto, spese, economia-finanza, economia, economia, economia-dellanima, fisco-e-tasse, economia, economia, economia, economia-e-finanza</p>
      <p>
The annotators received both a score indicating how much the headline was clickbait and suggestions automatically generated with ChatGPT (gpt-3.5-turbo-0125) for the spoilers and the neutralised versions of the headlines. Below, we outline the annotation guidelines that the annotators were to follow.</p>
      <p>Clickbait Labelling To select the clickbait headlines in the scraped data, the annotators were provided with specific guidelines. Table 8 lists the main criteria taken into consideration when labelling the data.</p>
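      <p>Some of the surface characteristics in Table 8 (capitalisation, enumeration, questions, quotations) can be flagged automatically as weak signals. The sketch below is purely illustrative: the helper name is ours, and in the dataset the final clickbait label was always assigned by the human annotators.</p>
      <preformat>
```python
import re

# Hypothetical helper (names are ours, not from the paper): flags a few of the
# surface cues listed in Table 8 as weak heuristic signals. The final clickbait
# label in the dataset was always assigned by human annotators.
def surface_cues(headline):
    def bare(word):
        # strip common trailing/leading punctuation from a token
        return word.strip('.,:;!?"')
    return {
        # Use of capitalization, e.g. "INFARTO: sopravvivere quando si è soli"
        "capitalization": any(
            bare(w).isalpha() and bare(w).isupper() and len(bare(w)) != 1
            for w in headline.split()
        ),
        # Enumeration of elements, e.g. "10 cibi per sbarazzarsi del gonfiore"
        "enumeration": bool(re.match(r"\d+\s", headline)),
        # Questions raised in the headline
        "question": "?" in headline,
        # Use of quotations, e.g. "Ora riposa in pace"
        "quotation": '"' in headline or "\u201c" in headline,
    }
```
      </preformat>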
    </sec>
    <sec id="sec-7">
      <title>A. Dataset Creation Details</title>
      <sec id="sec-7-1">
        <title>A.1. Category Assessment</title>
        <p>In Table 7 we report how the heterogeneous categories scraped directly from the misleading websites were divided into the four macro-categories of scienza (science), salute (well-being), ambiente (environment), and economia (economy).</p>
      </sec>
      <sec id="sec-7-2">
        <title>A.2. Annotation Guidelines</title>
        <p>Three components of our datasets were subject to human intervention to: (i) determine whether the headline was clickbait, (ii) identify the related article’s spoiler, that is, the information required to satisfy the curiosity gap opened by the headline, and (iii) revise the headline to include the spoiler information, thereby neutralising it. During all three annotation stages, we employed a machine-human collaboration to expedite the work of the annotators.</p>
        <p>Spoiler Post-Editing For the post-editing of the spoiler, the annotator was required to spot the information gap in the headline and to check, against the related article, whether the generated spoiler provided that information. If the model failed to find the proper spoiler, the annotator had to rewrite it, sticking as closely as possible to the document’s text.</p>
        <sec id="sec-7-2-1">
          <title>Table 8: Clickbait Characteristics</title>
          <p>Each characteristic is paired with an original example (IT) and its translated example (EN):</p>
          <p>Lack of essential information, i.e., the subject the article is talking about: “Ora riposa in pace”. Calcio in lutto, morto uno dei grandi protagonisti dell’Italia (“Now rest in peace”. Football in mourning, one of Italy’s great protagonists dead)</p>
          <p>Sensationalist tone: Fan ubriaca le salta addosso sul palco. La sua reazione è incredibile e sconvolge tutti i presenti (Drunk fan jumps on her on stage. Her reaction is incredible and shocks everyone present)</p>
          <p>Questions raised but answered in the article body: Tratti della nostra colonna: quali sono? Come evitare lesioni? (Traits of our column: what are they? How to avoid injuries?)</p>
          <p>Enumeration of elements: 10 cibi per sbarazzarsi del gonfiore di stomaco e pancia (10 foods to get rid of bloated stomach and tummy)</p>
          <p>Use of capitalization: INFARTO: sopravvivere quando si è soli. Hai solo 10 secondi per salvarti. Ecco cosa devi fare: (HEART ATTACK: surviving when alone. You only have 10 seconds to save yourself. Here’s what you have to do:)</p>
          <p>Introduction of the content without actually giving the information: Zanzare, ecco come eliminarle senza insetticidi (Mosquitoes, this is how to eliminate them without insecticide)</p>
          <p>Use of quotations that do not give information: Omicron, Ilaria Capua: “Ecco perché i vaccinati si infettano di più rispetto a prima” (Omicron, Ilaria Capua: “This is why the vaccinated get more infected than before”)</p>
        </sec>
        <sec id="sec-7-2-3">
          <title>Post-Editing Rules</title>
          <p>If the spoiler was correct but added extra information, the annotator had to keep it only if it was essential for a complete headline. If the spoiler was correct, the annotator could leave it as it was.</p>
          <p>Neutralised Clickbait Post-Editing The annotator was required to check that the neutralised form comprised both the headline and the spoiler information. If the spoiler was very long (e.g., a long list), the annotator had to summarise it as much as possible, aiming to embed enough information in the final headline to reduce or remove the information gap. If the model failed to address the spoiler information in the neutralised version of the headline, the annotator had to add it manually. Moreover, the annotator was required to remove sensationalist tones as much as possible if the tone was still creating needless curiosity in the reader.</p>
        </sec>
      </sec>
      <sec id="sec-7-3">
        <title>A.3. Automatic Component Instruction</title>
        <p>Hereafter, we provide the instruction employed to automatically generate spoilers and the neutralised versions of the clickbait headlines through ChatGPT gpt-3.5-turbo-0125.</p>
        <p>I have a clickbait headline and its
corresponding article, both written in Italian.</p>
        <p>The clickbait headline typically omits key
information to create a curiosity gap for
the reader. Your task is to extract this
missing information, known as a “spoiler,”
from the article’s text. The spoiler can be
a single keyword, a short text passage, or a
list of keywords. Once you have identified
the spoiler, rewrite the clickbait headline
by incorporating this information to
eliminate the curiosity gap. The output must
be in JSON format and written in Italian.</p>
        <p>The JSON should include two entries: one
called “spoiler” that contains the extracted
spoiler(s), and another called
“new_headline” that has the revised headline.</p>
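        <p>Since the instruction constrains the model to this two-entry JSON format, the reply can be validated before being shown to annotators. The following is a minimal sketch; the helper name is ours and is not part of the paper’s pipeline.</p>
        <preformat>
```python
import json

# Hypothetical validation helper (not part of the paper's pipeline): checks
# that a model reply is valid JSON containing exactly the two entries the
# instruction asks for ("spoiler" and "new_headline").
def parse_model_output(reply):
    data = json.loads(reply)
    missing = {"spoiler", "new_headline"} - set(data)
    if missing:
        raise ValueError("missing entries: " + ", ".join(sorted(missing)))
    return data["spoiler"], data["new_headline"]
```
        </preformat>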
        <p>Example Input:
Clickbait headline: “Questo attore ha
fatto qualcosa di incredibile sul set di un
famoso film!” Article: “Durante le riprese
del film ‘Il Gladiatore’, l’attore Russell
Crowe ha deciso di fare un gesto di grande
generosità donando una parte
significativa del suo stipendio al fondo per i
membri della troupe.”
Example Output:
{“spoiler”: “Russell Crowe ha donato una parte significativa del suo stipendio”, “new_headline”: …}</p>
        <p>Dataset statistics: Non-Clickbait, 481 documents (5,642 words); Clickbait, 846 documents (10,647 words).</p>
        <p>Example dataset entries (translated); each entry reports the clickbait headline, an excerpt of the article, the clickbait label, the spoiler, and the neutralised headline. The original table additionally reports a clickbait frequency value (Frequent, Average, or Infrequent) for each entry.</p>
        <p>Headline: Fruit or flower? Tasty and attractive, a celebrity on our tables, we reveal who she is. Article: We all know it, inevitable on our tables, world-famous, but mysterious is its nature, fruit to enjoy or flower to decorate? Clickbait: True. Spoiler: The strawberry. Neutralised: Strawberry: tasty and attractive, a celebrity on our tables.</p>
        <p>Headline: Self-repairing metal discovered. Scientists astounded. Article: The recent experiment revealed an extraordinary phenomenon... Clickbait: True. Spoiler: Platinum. Neutralised: The metal that repairs itself: platinum.</p>
        <p>Headline: A disease that affects 500,000 people. Article: We are talking about a chronic immune-mediated systemic disease that affects about 1.8 million patients... Clickbait: True. Spoiler: Psoriasis affects about 500,000 people. Neutralised: Psoriasis: a disease that affects about 500,000 people in Italy.</p>
        <p>Headline: Mosquitoes, here’s how to get rid of them without insecticides. Article: With the arrival of hot weather, mosquitoes also make their way into our homes or gardens... Clickbait: True. Spoiler: To eliminate mosquitoes from your home once and for all, you should buy a bat. Neutralised: Mosquitoes, here’s how to get rid of them without insecticides: just buy a bat.</p>
      </sec>
      <sec id="sec-7-4">
        <title>C.2. Spoiler Generation</title>
        <p>For the zero-shot spoiler generation task we employed
the following prompt:</p>
        <p>Ti verranno forniti un titolo clickbait e il
suo articolo corrispondente. Il titolo
clickbait di solito omette, o non esplicita,
informazioni chiave per creare curiosità nel
lettore. Estrai dall’articolo le informazioni
mancanti o vaghe nel titolo che servono
per colmare questa curiosità. La risposta
può essere un messaggio estremamente
coinciso oppure un elenco. Formatta la
risposta nel seguente modo. “Risposta:
&lt;output&gt;”
Titolo: {headline}</p>
        <p>Articolo: {article}</p>
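        <p>The {headline} and {article} placeholders are filled per dataset item. A minimal sketch, assuming Python string formatting; the template is abridged here and the helper name is ours:</p>
        <preformat>
```python
# Hypothetical sketch of how the {headline} and {article} placeholders of the
# zero-shot prompt are filled per dataset item; TEMPLATE abridges the Italian
# instruction shown above.
TEMPLATE = (
    "Ti verranno forniti un titolo clickbait e il suo articolo "
    "corrispondente. [...]\n"
    "Titolo: {headline}\n\n"
    "Articolo: {article}"
)

def build_prompt(headline, article):
    return TEMPLATE.format(headline=headline, article=article)
```
        </preformat>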
        <p>The same instruction was employed with the fine-tuned model. For few-shot generation of the spoiler, we enriched the instruction with two examples.</p>
        <p>When casting spoiler generation as a Question
Answering task, the following instruction was employed:
Ti verrà fornita una domanda e un
documento. Trova nel documento le
informazioni per rispondere alla domanda. La
risposta può essere un messaggio conciso
oppure un elenco. Formatta la risposta
nel seguente modo. “Risposta: &lt;output&gt;”</p>
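        <p>Both instructions constrain the model to answer in the fixed format “Risposta: ...”, so the answer can be recovered with a simple prefix match. A hedged sketch (the helper name and fallback behaviour are ours, not the paper’s):</p>
        <preformat>
```python
import re

# Hypothetical post-processing step: both instructions ask the model to answer
# in the fixed format "Risposta: ...", so the answer can be recovered with a
# simple prefix match; we fall back to the raw generation otherwise.
def extract_risposta(generation):
    match = re.search(r"Risposta:\s*(.*)", generation, flags=re.S)
    if match is None:
        return generation.strip()
    return match.group(1).strip()
```
        </preformat>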
      </sec>
      <sec id="sec-7-5">
        <title>C.3. Fine-Tuning Details</title>
        <p>The LLaMAntino-3-8B [30] model was trained on a single Ampere A40 GPU with 48 GB of memory, employing the QLoRA strategy with a low-rank approximation (LoRA rank) of 64, a low-rank adaptation scaling factor (alpha) of 16, and a dropout rate of 0.1. Evaluation was run every 50 steps, with a batch size of 4, across 3 epochs, using a learning rate of 10<sup>−4</sup>.</p>
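        <p>For reference, the reported hyper-parameters can be collected as follows. The key names assume the Hugging Face PEFT/TRL conventions (r, lora_alpha, lora_dropout); this is our reading of the setup, not the authors’ actual training script.</p>
        <preformat>
```python
# The hyper-parameters reported above, collected in one place. The key names
# follow the Hugging Face PEFT/TRL conventions; this is a sketch of our
# reading of the setup, not the authors' actual training script.
QLORA_CONFIG = {
    "r": 64,             # low-rank approximation (LoRA rank)
    "lora_alpha": 16,    # low-rank adaptation scaling factor
    "lora_dropout": 0.1,
}
TRAINING_CONFIG = {
    "eval_steps": 50,
    "per_device_train_batch_size": 4,
    "num_train_epochs": 3,
    "learning_rate": 1e-4,
}
```
        </preformat>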
        <p>In the clickbait detection experiments, the DistilBERT and Llama3-8b models were fine-tuned on the same GPU. The DistilBERT model was trained for 10 epochs with a learning rate of 2 ⋅ 10<sup>−4</sup>. For the Llama3 model, we used QLoRA with the same characteristics as described above, trained for two epochs with a learning rate of 2 ⋅ 10<sup>−4</sup>.</p>
      </sec>
      <sec id="sec-7-6">
        <title>C.4. Neutralised Clickbait Generation</title>
        <p>The following system prompt (enriched with three
examples) has been utilised with LLaMAntino-3-8B:
Ti verrano forniti due testi: un titolo
clickbait e un testo, chiamato spoiler, che
contiene le informazioni mancanti nel titolo.</p>
        <p>Il tuo compito è di riscrivere il titolo
clickbait integrando le informazioni dello
spoiler. Il nuovo titolo deve essere
informativo, privo di toni sensazionalistici, e
breve. Se Lo spoiler contine tante
informazioni, puoi riassumerle in concetti più
generali.</p>
        <p>Titolo: {headline}
Spoiler: {spoiler}</p>
        <p>No specific ethical conflicts have been reported during the development of this work. The dataset was compiled from publicly available sources. It is important to acknowledge that the examples in this document are not indicative of the authors’ opinions or beliefs. Additionally, the ideas or assertions contained within these texts may be misleading or harmful; therefore, the dataset should be utilised strictly for research purposes.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>