<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Twitbaiter: Model of Clickbait Detection in Spanish</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Deyson Gómez Sánchez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jeison D. Jimenez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elizabeth Ruíz Padilla</string-name>
          <email>elizabethrp0818@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jairo E. Serrano</string-name>
          <email>jserrano@utb.edu.co</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Juan C. Martinez-Santos</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Edwin Puertas</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Universidad Tecnológica de Bolívar; School of Engineering</institution>
          ,
          <addr-line>Architecture, and Design; Cartagena de Indias; 130013;</addr-line>
          <country country="CO">Colombia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universidad de Cartagena; Faculty of Humanities; Linguistics and Literature Program; Cartagena de Indias;</institution>
          <addr-line>130001;</addr-line>
          <country country="CO">Colombia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>This paper addresses the binary classification of clickbait in Spanish tweets using three distinct approaches. The datasets were provided as part of the TA1C challenge at IberLEF 2025, and training was structured into pre-training, data preparation, fine-tuning, and evaluation phases. Undersampling and oversampling techniques were used to handle class imbalance, producing a balanced dataset for model training. The three models evaluated were: RoBERTuito with manual features, RoBERTuito with class weighting, and Llama 3.2 - 3B. The results indicated that Model 3, based on Llama 3.2 - 3B, achieved the best performance, with an F1-score of 95.04%, outperforming the other models presented in this work. The model also proved robust in a competitive environment, ranking sixth in the TA1C challenge with a score of 80.115% on the evaluation dataset. While Model 1 showed competitive performance, its reliance on manual features limited its generalization capacity. Model 2 exhibited overfitting, underscoring the importance of improving balancing and generalization techniques. This study demonstrates the effectiveness of advanced architectures like Llama 3.2 for the clickbait detection task and highlights areas for improvement in future implementations.</p>
      </abstract>
      <kwd-group>
        <kwd>Binary classification</kwd>
        <kwd>Clickbait</kwd>
        <kwd>Spanish tweets</kwd>
        <kwd>Detection</kwd>
        <kwd>TA1C challenge</kwd>
        <kwd>RoBERTuito</kwd>
        <kwd>Llama 3.2 - 3B</kwd>
        <kwd>F1-score</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. State of the Art</title>
      <p>
        As part of the background related to the linguistic analysis approach for detecting clickbait in Spanish,
two studies directly addressing this topic have been identified (Robles, 2020; Loayza, 2024) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. These
investigations share a concern with uncovering the linguistic-discursive tendencies used to generate
clickbait in Spanish (hereafter CB) and its pragmatic functionality. The following presents this
background in the order outlined above.
      </p>
      <p>
        Robles S. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] conducts a discourse analysis focused on detecting trends in syntactic categories which,
according to the author, are selected to deploy pragmatic values effectively with the purpose of hooking
and manipulating the reader's will. The article thus reveals a series of recurring linguistic-discursive
configurations that CBs use to achieve this goal. To this end, the author examines 540 headlines obtained
between 2017 and 2019 from media outlets specializing in clickbait (BuzzFeed, HuffPost, and Upsocl).
This work is relevant as it offers an extensive repertoire of the prototypical word classes in CBs and
presents a clear and concise idea of the pragmatic functionality of these syntactic selections.
Loayza E. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] performs a discourse-pragmatic analysis in which, based on the proposals of Austin, Searle,
and their followers, CBs are analyzed as directive illocutionary acts, highlighting the illocutionary
force present in them. After analyzing CBs from videos shared on prominent social media
platforms (Facebook, X, YouTube, and Instagram), collected randomly between 2021 and 2023, the
author identifies the psychoemotional effects that CBs aim to generate and how these are linguistically
designed to attract and retain the reader's attention.
      </p>
      <p>Various studies have addressed the problem of clickbait detection from multiple approaches,
incorporating both traditional feature extraction techniques and advanced deep learning models. The following is
a chronological review of the most representative works in this field, focusing on those related to the
use of Spanish and deep learning for headline classification.</p>
      <p>
        The study by Omidvar et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] was one of the first to apply recurrent neural networks with bidirectional
GRU units for detecting clickbait on Twitter, achieving the best performance in the 2017 Clickbait
Challenge using only the postText field as input. In the same competition, Wiegmann et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] proposed
an alternative approach based on Ridge regression and heuristic manual feature selection, highlighting
that these methods, with proper feature engineering, could still be competitive against neural models.
Subsequently, Rajapaksha et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] evaluated Transfer Learning models such as BERT, XLNet, and
RoBERTa on Twitter headlines, concluding that RoBERTa consistently outperformed the other models,
especially when using hidden outputs and structured fine-tuning strategies. This research laid important
foundations for the use of pre-trained models in clickbait classification tasks.
      </p>
      <p>
        More recently, Broscoteanu et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] presented the RoCliCo corpus for clickbait in Romanian, along
with a RoBERTa model trained with contrastive learning. The approach demonstrated a high capacity
for modeling the semantic relationship between headlines and the body of the text, achieving notable F1
scores, especially in the non-clickbait class. The central idea of measuring semantic similarity between
text components provided an innovative perspective to improve detection accuracy.
Meanwhile, Kydd et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] developed "Deep Breath," a system that integrates machine learning with
browser extensions to alert users about potentially misleading headlines. Although its evaluation
accuracy was limited, the work stands out for its user-centric approach and real-time interaction.
In the Spanish-speaking context, García-Ferrero et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] introduced NoticIA, a dataset for summarizing
articles with clickbait headlines in Spanish, evaluating the performance of LLMs in ultra-summary tasks.
Specifically trained models, such as ClickbaitFighter, outperformed LLMs in a zero-shot setup in terms
of conciseness and accuracy, highlighting the importance of fine-tuning on domain-specific datasets.
The study by Mordecki et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] significantly advances the field by proposing a revised and operational
definition of clickbait based on the concept of curiosity gap. The authors developed TA1C (Te Ahorré
Un Click), the first open-source dataset for Spanish clickbait detection, composed of 3,500 annotated
tweets from 18 major media outlets. The dataset achieves high annotation consistency (Fleiss’ K =
0.825) and supports baseline models reaching 0.84 in F1-score. This work is particularly relevant as it
offers a well-defined framework and benchmark for future models addressing clickbait detection in
Spanish-language social media.
      </p>
      <p>
        Later, Gamage et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] proposed BaitRadar, a multi-model approach for detecting clickbait on YouTube
by combining six sources of video information, achieving 98% accuracy. This study emphasizes the
usefulness of integrating multiple heterogeneous signals to strengthen detection in audiovisual contexts.
Based on the information presented by Broscoteanu et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], this work will implement a RoBERTa-based model for detecting clickbait in Spanish, applying
fine-tuning strategies adjusted to the domain.
Additionally, following the approach of Wiegmann et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], manually selected heuristic features will be
integrated into the model’s input vector to improve its detection capability. Finally, the performance of
at least one LLM will also be evaluated to analyze its effectiveness in comparison to the aforementioned
strategies.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Data</title>
      <p>
        For the development of this work, we used the datasets provided by the organizers of Task 1 in the
TA1C: Clickbait Detection and Spoiling in Spanish competition [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], which is part of IberLEF 2025
[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. This dataset is designed to determine whether the content of a tweet linking to a news article is
clickbait or not, based on the definition of clickbait established by the organizers. Both the tweet and
the linked news article are in Spanish, and the corpus was created with the aim of representing as many
Spanish language varieties as possible, including news from media outlets in 12 different countries, as
well as international sources. Considering that this is a binary classification problem, the distribution
of each class is presented in Table 1.
      </p>
      <sec id="sec-3-1">
        <title>Dataset</title>
        <p>TA1C_dataset_detection_train
TA1C_dataset_detection_dev_gold
TA1C_dataset_detection_test</p>
      </sec>
      <sec id="sec-3-2">
        <title>No (class 0)</title>
        <p>2002
497</p>
      </sec>
      <sec id="sec-3-3">
        <title>Yes (class 1)</title>
        <p>798
203</p>
      </sec>
      <sec id="sec-3-4">
        <title>Total</title>
        <p>2800
700
700</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <p>This study follows a structured methodology consisting of pre-training, data preparation, fine-tuning,
and evaluation phases for the binary classification of clickbait on Twitter. The pipeline used in the
implementation of the models for this study is shown in Figure 1.</p>
      <p>The pipeline comprises several stages. First, data preprocessing is performed to remove as much
information as possible that may introduce noise into the classification model. To this end, URLs,
mentions, emojis, hashtags, etc., were handled. Subsequently, class balancing techniques are applied,
relevant features are extracted, and the model is trained using different techniques depending on the
case.</p>
      <p>To evaluate the model’s performance, the metrics of accuracy, precision, recall, and F1-score were
considered, with the latter being the primary metric. To assess the performance of the fine-tuned model,
the dataset was split into 80% for training and 20% for testing.</p>
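      <p>For concreteness, the evaluation metrics above can be computed directly from raw prediction counts. This is a generic sketch for illustration, not the authors' evaluation code; the function name and toy labels are ours.</p>

```python
# Sketch: accuracy, precision, recall, and F1 for binary classification,
# computed from true/false positive and negative counts.
def binary_metrics(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Toy example: one clickbait tweet missed out of five predictions.
m = binary_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 0])
```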
      <p>[Figure 1. Model pipeline: read and load data; clean data; calculate class weights; tokenization;
regularization; random oversampling; insertion of synonyms; data splitting; model configuration;
model training; model evaluation.]</p>
      <p>We initially approached the clickbait detection task using traditional machine learning techniques.
In particular, we trained an Extra Trees classifier with a resampling strategy that combined random
oversampling and undersampling to address class imbalance. This model achieved an F1-score of 0.416.
We then implemented a CatBoost regressor under similar conditions, which improved performance to
an F1-score of 0.664. However, both results remained below the levels required for a robust solution.
Given these limitations, we shifted our focus to transformer-based models, which have demonstrated
superior performance in natural language processing tasks.</p>
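      <p>A minimal sketch of such a tree-ensemble baseline follows, with synthetic two-feature data standing in for the real tweet representations; the data, threshold, and hyperparameters are illustrative, not those used in the paper.</p>

```python
# Sketch of an Extra Trees baseline with naive minority oversampling
# (synthetic data; the real pipeline used the TA1C tweet features).
import random
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import f1_score

random.seed(0)
# Imbalanced toy data: class 1 ("clickbait") is the minority.
X = [[random.random(), random.random()] for _ in range(300)]
y = [1 if x[0] + x[1] > 1.4 else 0 for x in X]

# Randomly duplicate minority examples until the classes are balanced.
minority = [(x, t) for x, t in zip(X, y) if t == 1]
extra = [random.choice(minority) for _ in range(len(y) - 2 * len(minority))]
X_bal = X + [x for x, _ in extra]
y_bal = y + [t for _, t in extra]

clf = ExtraTreesClassifier(n_estimators=100, random_state=0)
clf.fit(X_bal, y_bal)
f1 = f1_score(y, clf.predict(X))  # fit quality on the original data
```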
      <p>For this study, we selected the three best-performing transformer architectures. The pipeline in Figure 1
is common to the three evaluated models, although each presents specific variations in data processing,
feature extraction, and model architecture. The following section describes the specific details of each
implemented model.</p>
      <sec id="sec-4-1">
        <title>4.1. Model 1: RoBERTuito with Hybrid Features</title>
        <p>In this work, an initial dataset was used, consisting of a training dataset whose structure is detailed
in Table 1. The unequal distribution between the classes was a common challenge in the context of
binary classification, as the minority class, "Clickbait," was less represented compared to the majority
class. To address this imbalance, two main techniques were used: undersampling and oversampling.
The undersampling technique involved reducing the size of the majority class, while oversampling was
implemented to increase the number of samples from the minority class. In this case, oversampling
generated 702 duplicates of the "Clickbait" class, increasing its representation within the dataset. As a
result of these techniques, the final dataset consisted of 3,000 examples, with a balanced 50% distribution
for each class.</p>
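        <p>The combined under/oversampling step can be sketched as below; the <monospace>balance</monospace> helper and its arguments are illustrative, not the paper's actual code. With the Table 1 training distribution and a target of 1,500 examples per class, it reproduces the 702 minority-class duplicates and the final 3,000-example balanced set described above.</p>

```python
import random

def balance(examples, labels, target_per_class, seed=42):
    """Randomly under/oversample each class to target_per_class examples.

    Classes above the target are sampled without replacement
    (undersampling); classes below it are padded with random duplicates
    (oversampling). Sketch only, not the paper's implementation.
    """
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(examples, labels):
        by_class.setdefault(y, []).append(x)
    out = []
    for y, xs in by_class.items():
        if len(xs) >= target_per_class:
            picked = rng.sample(xs, target_per_class)  # undersample
        else:
            picked = xs + [rng.choice(xs)
                           for _ in range(target_per_class - len(xs))]
        out.extend((x, y) for x in picked)
    rng.shuffle(out)
    return out

# Mimic Table 1: 2002 "No" vs. 798 "Clickbait" -> 1500 per class.
data = balance(list(range(2800)), [0] * 2002 + [1] * 798, 1500)
```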
        <p>
          The model was a combination of representations generated by the RoBERTuito uncased model described
by Pérez et al. (2022) [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] and manually extracted features from the text. Specifically, eight numerical
features were extracted based on 128 keywords or phrases associated with clickbait. These words were
determined through the analysis of the works of Loayza Maturrano (2024) and Robles Ávila (2020)
[
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], who identified the most representative terms in the context of clickbait.
        </p>
        <p>The manually extracted features were as follows:
1. Clickbait word count: the number of times a word from the list appears in the text.
2. Clickbait ratio: the percentage of words in the text that belong to the clickbait list.
3. Boolean presence: a binary indicator of whether the text contains at least one clickbait word.
4. Numbers: a binary indicator of whether the text contains numeric digits.
5. Exclamations: a binary indicator of whether the text contains exclamation marks (!).
6. Question marks: a binary indicator of whether the text contains question marks (?).
7. Uppercase letters: the proportion of uppercase letters in the text.
8. Length: the total number of words in the text.</p>
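        <p>The eight features above can be sketched as follows. The short <monospace>CLICKBAIT_WORDS</monospace> list is illustrative only; the real system draws on the 128-term vocabulary from the cited works, and the exact normalization may differ.</p>

```python
import re

# Illustrative cue list; the paper uses 128 keywords/phrases from
# Loayza Maturrano (2024) and Robles Ávila (2020).
CLICKBAIT_WORDS = {"increíble", "impactante", "no creerás", "secreto", "truco"}

def manual_features(text):
    """Sketch of the eight hand-crafted features described above."""
    lowered = text.lower()
    words = lowered.split()
    hits = sum(1 for w in CLICKBAIT_WORDS if w in lowered)
    n_upper = sum(1 for c in text if c.isupper())
    n_alpha = sum(1 for c in text if c.isalpha())
    return [
        hits,                                    # 1. clickbait word count
        hits / len(words) if words else 0.0,     # 2. clickbait ratio
        1 if hits > 0 else 0,                    # 3. boolean presence
        1 if re.search(r"\d", text) else 0,      # 4. contains digits
        1 if "!" in text or "¡" in text else 0,  # 5. exclamation marks
        1 if "?" in text or "¿" in text else 0,  # 6. question marks
        n_upper / n_alpha if n_alpha else 0.0,   # 7. uppercase ratio
        len(words),                              # 8. word count
    ]

feats = manual_features("¡No creerás este truco increíble!")
```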
        <p>Once these eight manual features were extracted, they were combined with the 768 dimensions generated
automatically by the RoBERTuito model. The concatenation of both representations resulted in a final
776-dimensional vector, which was used for the binary classification of the texts.</p>
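        <p>The fusion step then reduces to a simple concatenation. In the sketch below the embedding is a random placeholder for the 768-dimensional RoBERTuito output, since running the encoder is out of scope here; the manual-feature values are illustrative.</p>

```python
import random

random.seed(0)
# Placeholder for the 768-dimensional RoBERTuito sentence embedding
# (produced by the transformer encoder in the real pipeline).
embedding = [random.gauss(0, 1) for _ in range(768)]
manual = [3, 0.6, 1, 0, 1, 0, 0.04, 5]  # the eight hand-crafted features

# Concatenate into the 776-dimensional classifier input vector.
fused = embedding + manual
```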
        <p>The RoBERTuito uncased model corresponds to a variant of the RoBERTa architecture. This model was
pre-trained using a corpus of 500 million Spanish-language tweets, providing it with an excellent ability
to capture specific patterns from the Twitter platform, such as mentions, emojis, hashtags, and diverse
content. Additionally, RoBERTuito has demonstrated its effectiveness in previous classification tasks,
such as hate speech detection in the SemEval 2019 Task 5, HatEval dataset, sentiment and emotion
analysis in the TASS 2020 dataset, and irony detection in the IrosVa 2019 dataset.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Model 2: RoBERTuito + Class Weighting</title>
        <p>
          The dataset used, as in the previous model, consists of the training dataset shown in Table 1. In the
preprocessing stage, as before, the handling of URLs, mentions, emojis, hashtags, and other diverse
content was left to the RoBERTuito uncased tokenizer. To address the class imbalance, class weighting
was used instead of increasing the data through data augmentation techniques. The class weights were
calculated using the compute_class_weight function from sklearn [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], with the "balanced" strategy to adjust to the class distribution in the training set.
        </p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Model 3: Llama 3.2 - 3B</title>
        <p>
          In this study, the meta-llama/Llama-3.2-3B model [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] was fine-tuned. The dataset consisted of the merged training and dev gold sets shown in Table 1.
Data cleaning was carried out through a function covering all the aspects of the text mentioned earlier.
To address the class imbalance, class weights were first calculated for the original dataset,
disproportionately penalizing errors made on the minority class.
        </p>
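        <p>sklearn's "balanced" strategy assigns each class the weight n_samples / (n_classes * n_c), where n_c is that class's count. The helper below reproduces that formula for the Table 1 training distribution; it is a sketch for illustration, not the paper's code.</p>

```python
from collections import Counter

def balanced_class_weights(labels):
    """Reproduce compute_class_weight(class_weight="balanced"):
    weight_c = n_samples / (n_classes * count_c)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}

# Training distribution from Table 1: 2002 "No" vs. 798 "Clickbait".
weights = balanced_class_weights([0] * 2002 + [1] * 798)
```

Errors on the minority class (1) thus weigh roughly 2.5 times more than errors on the majority class (0).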
        <p>Subsequently, random oversampling was used, randomly duplicating 499 examples from the minority
class to raise its representation to 1,500 examples. As a third strategy, semantic data augmentation
was applied based on synonym replacement using WordNet, with a modification rate of 30%, focusing
only on words longer than four letters. This technique generated 600 additional synthetic examples,
increasing the "Clickbait" class to a total of 2,100 entries. The model was quantized to 4-bit (QLoRA)
with alpha=32 over 7 transformer modules for eficiency, with 2.94% of trainable parameters (97M of
3.31B).</p>
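        <p>The synonym-replacement augmentation can be sketched as follows. The tiny synonym table stands in for the WordNet lookup used in the paper, and the function name and example sentence are illustrative.</p>

```python
import random

# Toy synonym table standing in for WordNet (entries are illustrative).
SYNONYMS = {
    "increíble": ["asombroso", "sorprendente"],
    "secreto": ["misterio", "enigma"],
    "descubre": ["revela", "encuentra"],
}

def augment(text, rate=0.3, min_len=5, seed=1):
    """Replace ~rate of eligible words (longer than four letters) with
    a synonym, as in the semantic augmentation step described above."""
    rng = random.Random(seed)
    out = []
    for word in text.split():
        key = word.lower()
        if len(word) >= min_len and key in SYNONYMS and rng.random() < rate:
            out.append(rng.choice(SYNONYMS[key]))
        else:
            out.append(word)
    return " ".join(out)

# rate=1.0 forces every eligible word to be replaced, for demonstration.
new = augment("descubre el secreto increíble de esta dieta", rate=1.0)
```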
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Evaluation</title>
      <p>We evaluated the effectiveness of the models based on the metrics previously mentioned, along with
the training and validation losses. Once the data was read and processed, the class weights obtained
are as stipulated in Table 2.</p>
      <sec id="sec-5-1">
        <title>Model</title>
        <p>Model 2
Model 3</p>
      </sec>
      <sec id="sec-5-2">
        <title>Class</title>
        <p>Class 0
Class 1
Class 0
Class 1
Taking into account this data, it is noted that the model will indeed give more weight to errors in class
1 because it is the class with fewer data. Model 1 was trained for 5 epochs with a decay rate of 0.01,
while models 2 and 3 were trained for 10 epochs with the same decay rate and early stopping with a
patience of 3 evaluations. The training results of the three models are as follows:
The results presented in Table 3 show that the model consistently improved in performance. The
training loss decreased from 0.5271 in the first epoch to 0.0080 in the fifth, while the validation loss
dropped from 0.4502 to 0.3224. Accuracy, recall, and F1 also progressively improved, reaching 94.67%,
94.57%, and 0.9464, respectively, by the end of training. These metrics reflect a good model fit, with
a reduction in both training and validation loss, indicating an improvement in the model’s ability to
generalize.</p>
      </sec>
      <sec id="sec-5-3">
        <title>Feature</title>
        <p>Clickbait words
Has questions
Has numbers
Number of words
Has clickbait
Clickbait ratio
Uppercase ratio
Has exclamations
Additionally, Table 4 shows that the most decisive features for distinguishing clickbait content include
both the frequency of specialized vocabulary words, with an average of 1.8 per text compared to 1.65
in normal texts. Regarding emotional and formatting patterns, there is a higher usage of exclamation
marks (81.8% vs 61.3%) and uppercase letters (4.7% vs 5.4%). In contrast, non-clickbait texts tend to
include more figures (36.2% vs 25.3%), suggesting a more informative and factual approach.
According to the data in Table 5, the training of Model 2 showed consistent improvement in accuracy,
F1score, and recall, with a notable decrease in training loss (from 0.2207 to 0.0001), indicating memorization,
and validation loss (from 0.3293 to 0.6235). The model stopped at step 600 (epoch 8/10). The key metrics
remained stable and high, indicating that the model generalizes well to validation data and has achieved
a good balance between the classes.</p>
        <p>Finally, according to the data in Table 6, Model 3 showed rapid convergence, reducing the training
loss from 0.3419 to nearly zero. The validation loss reached its minimum during the second and third
epochs before showing a slight increase, which may indicate potential overfitting. Nevertheless, the final
validation metrics (F1-score: 94.73%, recall: 96.43%), along with the corresponding accuracy and
precision, demonstrate solid performance and good generalization ability on the validation set.</p>
        <sec id="sec-5-3-1">
          <title>5.1. Model testing</title>
          <p>Once the training was completed, the trained model was evaluated using the test set, corresponding to
20% of the data initially reserved, resulting in the metrics shown in Table 7.</p>
        </sec>
      </sec>
      <sec id="sec-5-4">
        <title>Model</title>
        <p>Model 1
Model 2
Model 3
The performance analysis on the test set reveals that Model 3 (Llama 3.2 – 3B) achieved the best
overall results, with an F1 score of 0.9504, outperforming Models 1 and 2, which achieved values of
0.9464 and 0.9141, respectively. Model 1 (RoBERTuito with Hybrid Features), although not reaching
the performance of Model 3, demonstrated competitive performance (F1 = 0.9464), suggesting that
the combination of RoBERTuito model embeddings provides significant value in detecting clickbait
patterns. This hybrid approach enhances the model’s ability to capture subtle signals that are not easily
representable in the embedding space without the strategy.</p>
        <p>On the other hand, Model 2 (RoBERTuito with class weighting), which used class weighting to mitigate
the dataset imbalance, showed the lowest performance (F1 = 0.9141). Although the weighting technique
aims to favor the minority class during training, in this case it was not as effective as the previous
strategies, possibly because relevant patterns remained underrepresented and were not sufficiently
captured through the adjusted weights.</p>
        <sec id="sec-5-4-1">
          <title>5.2. Competition Evaluation</title>
          <p>The results revealed that the strategies implemented by our team, VerbNex, identified as "gsdeyson,"
ranked 6th among the participating teams in Subtask 1 of the TA1C challenge at IberLEF 2025. The
model's performance metrics on the evaluation dataset provided by the competition organizers are
shown in Table 8, and Table 9 presents the official ranking of participants in Subtask 1.</p>
          <p>The results confirm the trend observed previously in the validation sets. Model 3 (Llama 3.2 - 3B)
achieved the best performance, with a score of 0.8011, establishing it as the most effective architecture
for the task in our implementations. It was followed by Model 2 (RoBERTuito + class weighting) with
a score of 0.7878, an improvement over its previous relative performance, suggesting that the
class-weighting adjustment may have had a more positive effect in this setting. Finally, Model 1
(RoBERTuito with hybrid features) reached 0.7666, the lowest of the three, indicating that while its
hybrid approach was competitive in the test environment, it failed to maintain the same level of
effectiveness on a more diverse competition set. These findings reinforce the robustness of Model 3 and
its adaptability for clickbait classification.</p>
          <p>According to the data in Table 9, the difference from the fourth and fifth positions was marginal
(less than 0.003 points), which highlights the competitiveness of the proposed system. This result
supports the effectiveness of Model 3 (Llama 3.2 - 3B) and its consistency compared to systems
developed by other research groups, providing a solid foundation on which incremental improvements
can be applied to climb the rankings in future editions of the competition.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions</title>
      <p>The work developed for the binary classification of clickbait on Twitter has shown promising results in
terms of the performance of the evaluated models. Although the techniques employed to handle class
imbalance, specifically undersampling and oversampling, achieved an acceptable balance in the dataset,
the implementation of these methods could be improved to maximize their effectiveness.</p>
      <p>Model 1, which combines the RoBERTuito model with manually defined features based on a predefined
vocabulary, achieved acceptable performance during internal testing. However, its performance in the
competitive evaluation was lower than that of the other models, reflecting limitations in its generalization
capability. The reliance on the predefined keyword vocabulary turned out to be a restrictive factor,
as not all relevant expressions in the competition dataset were covered. Despite these drawbacks,
the model shows high improvement potential through the dynamic incorporation of new words and
the integration of strategies such as embeddings or contextual synonyms, which would allow better
adaptation to new domains and greater robustness against linguistic variations in clickbait.</p>
      <p>Model 2 showed good overall performance, with stable accuracy and a weighted F1-score that remained
close to 91%. However, the model exhibited clear overfitting, reflected in the increase of the validation
loss while the training loss decreased. This phenomenon suggests that the model may have memorized
specific patterns from the training set, which affects its ability to generalize. Furthermore, it showed a
bias towards the majority class, with suboptimal performance in detecting the minority class (clickbait),
where recall reached only 83.77%. Improvements could include additional techniques such as threshold
optimization and the use of SMOTE to increase the diversity of examples. Additionally, a detailed error
analysis would help identify patterns in incorrect predictions, which could fine-tune the detection of
complex cases.</p>
      <p>Model 3, based on the Llama 3.2 model and trained with the QLoRA strategy, demonstrated excellent
performance in clickbait classification, achieving a near-perfect balance between precision and recall
for both classes. The use of an effective data balancing strategy significantly improved the model's
ability to handle imbalanced classes. Although the model only completed 7 of the 8 planned epochs due
to early stopping activation, no performance degradation was observed.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Acknowledgments</title>
      <p>The authors would like to acknowledge the support provided by the master’s degree scholarship program
in engineering at the Universidad Tecnológica de Bolívar (UTB) in Cartagena, Colombia.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used GPT-4 for grammar, spelling, and translation
assistance. After using this tool, the author(s) reviewed and edited the content as needed and take full
responsibility for the publication’s content.
The GitHub repository containing the implementation and resources of this work is available via:
• GitHub</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>E.</given-names>
            <surname>Loayza Maturrano</surname>
          </string-name>
          ,
          <article-title>Los títulos de los clickbait en las redes sociales desde la teoría de los actos de habla [The clickbait titles in social networks from speech act theory]</article-title>
          ,
          <source>Tierra Nuestra</source>
          <volume>18</volume>
          (
          <year>2024</year>
          )
          <fpage>35</fpage>
          -
          <lpage>46</lpage>
          . doi:10.21704/rtn.v18i1.1867.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Ávila</surname>
          </string-name>
          ,
          <article-title>El clickbait: clases de palabras para la construcción de un titular engañoso</article-title>
          , in:
          <fpage>107</fpage>
          -
          <lpage>124</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Omidvar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>An</surname>
          </string-name>
          ,
          <article-title>Using neural network for identifying clickbaits in online news media</article-title>
          ,
          <year>2018</year>
          . URL: https://arxiv.org/abs/1806.07713. arXiv:1806.07713.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiegmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Völske</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hagen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <article-title>Heuristic feature selection for clickbait detection</article-title>
          ,
          <year>2018</year>
          . URL: https://arxiv.org/abs/1802.01191. arXiv:1802.01191.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P.</given-names>
            <surname>Rajapaksha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Farahbakhsh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Crespi</surname>
          </string-name>
          ,
          <article-title>BERT, XLNet or RoBERTa: The best transfer learning model to detect clickbaits</article-title>
          ,
          <source>IEEE Access</source>
          <volume>9</volume>
          (
          <year>2021</year>
          )
          <fpage>154704</fpage>
          -
          <lpage>154716</lpage>
          . URL: https://www.scopus.com/inward/record.uri?eid=2-s2.0-85120472076&amp;doi=10.1109%2fACCESS.2021.3128742&amp;partnerID=40&amp;md5=b335f674b85faab3690a249980e74b11. doi:10.1109/ACCESS.2021.3128742.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D.-M.</given-names>
            <surname>Broscoteanu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. T.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <article-title>A novel contrastive learning method for clickbait detection on RoCliCo: A Romanian clickbait corpus of news articles</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2310.06540. arXiv:2310.06540.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kydd</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. A.</given-names>
            <surname>Shepherd</surname>
          </string-name>
          ,
          <article-title>Deep breath: A machine learning browser extension to tackle online misinformation</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2301.03301. arXiv:2301.03301.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>I.</given-names>
            <surname>García-Ferrero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Altuna</surname>
          </string-name>
          ,
          <article-title>NoticIA: A clickbait article summarization dataset in Spanish</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2404.07611. arXiv:2404.07611.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>G.</given-names>
            <surname>Mordecki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Moncecchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Couto</surname>
          </string-name>
          ,
          <article-title>Te ahorré un click: A revised definition of clickbait and detection in Spanish news</article-title>
          ,
          in:
          <source>Advances in Artificial Intelligence - IBERAMIA 2024: 18th Ibero-American Conference on AI, Montevideo, Uruguay, November 13-15, 2024, Proceedings</source>
          , Springer-Verlag, Berlin, Heidelberg,
          <year>2025</year>
          , pp.
          <fpage>387</fpage>
          -
          <lpage>399</lpage>
          . URL: https://doi.org/10.1007/978-3-031-80366-6_32. doi:10.1007/978-3-031-80366-6_32.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>B.</given-names>
            <surname>Gamage</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Labib</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joomun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. H.</given-names>
            <surname>Lim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Wong</surname>
          </string-name>
          ,
          <article-title>BaitRadar: A multi-model clickbait detection algorithm using deep learning</article-title>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2505.17448. arXiv:2505.17448.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>G.</given-names>
            <surname>Mordecki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chiruzzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Laguna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Prada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rosá</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sastre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Moncecchi</surname>
          </string-name>
          ,
          <article-title>Overview of TA1C at IberLEF 2025: Detecting and Spoiling Clickbait in Spanish-Language News</article-title>
          ,
          <source>Procesamiento del Lenguaje Natural</source>
          <volume>75</volume>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J. Á.</given-names>
            <surname>González-Barba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chiruzzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Jiménez-Zafra</surname>
          </string-name>
          ,
          <article-title>Overview of IberLEF 2025: Natural Language Processing Challenges for Spanish and other Iberian Languages</article-title>
          , in:
          <source>Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2025), co-located with the 41st Conference of the Spanish Society for Natural Language Processing (SEPLN 2025)</source>
          , CEUR-WS.org,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Pérez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Furman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Alonso Alemany</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Luque</surname>
          </string-name>
          ,
          <article-title>RoBERTuito: a pre-trained language model for social media text in Spanish</article-title>
          , in:
          <source>Proceedings of the Thirteenth Language Resources and Evaluation Conference</source>
          , European Language Resources Association, Marseille, France,
          <year>2022</year>
          , pp.
          <fpage>7235</fpage>
          -
          <lpage>7243</lpage>
          . URL: https://aclanthology.org/2022.lrec-1.785.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>F.</given-names>
            <surname>Pedregosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Varoquaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gramfort</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Michel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Thirion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Grisel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Blondel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Prettenhofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Weiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Dubourg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Vanderplas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Passos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cournapeau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Brucher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Perrot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Duchesnay</surname>
          </string-name>
          ,
          <article-title>Scikit-learn: Machine learning in Python</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>12</volume>
          (
          <year>2011</year>
          )
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>AI@Meta</string-name>
          ,
          <article-title>Llama 3 model card</article-title>
          (
          <year>2024</year>
          ). URL: https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>