<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>J. Rojas-Simón);</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>UAEMex Participation at TA1C-2025: Leveraging Lexical and Semantic Information for Clickbait Detection in Spanish</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jonathan Rojas-Simón</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christian Ruiz-Ugalde</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Noé Torres-Pedroza</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>María del</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carmen García-Galindo</string-name>
          <email>marycarmeng142@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Verónica Neri-Mendoza</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yulia Ledeneva</string-name>
          <email>ynledeneva@uaemex.mx</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>René Arnulfo</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>NLPClassKit</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>ASCII</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Logistic Regression</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Multilayer Perceptron (MLP)</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Autonomous University of the State of Mexico, Instituto Literario 100</institution>
          ,
          <addr-line>Toluca 50000</addr-line>
          ,
          <country country="MX">Mexico</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Secretariat of Science, Humanities, Technology and Innovation</institution>
          ,
          <addr-line>No. 1582, Insurgentes Sur Avenue, Credito Constructor, Benito Juarez, Mexico City</addr-line>
          ,
          <country country="MX">Mexico</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>In today's digital landscape, social media information has become a valuable source for capturing readers' attention quickly through clickbait. However, this information sometimes leads users to low-quality or misleading information. For this reason, the TA1C shared task in IberLEF 2025 focuses on detecting clickbait from social media posts in Spanish. This paper presents an NLP framework developed by the UAEMex team that exploits lexical and semantic information from texts to detect this kind of information. This framework was also complemented with traditional classifiers (KNN, LR, NaiveBayes, and MLP) to create models that best fit the task. According to the obtained results, semantic representations provided by BETO (Spanish version of BERT) provide better representations with LR and MLP classifiers, obtaining 0.7538 in F1-score, which is a competitive performance compared to state-of-the-art approaches and baselines in the task.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        According to Mordecki, clickbait is defined as “a technique for generating headlines and teasers
that deliberately omit part of the information with the goal of raising the readers’ curiosity,
capturing their attention and enticing them to click” [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This definition emphasizes the intentional
use of curiosity as an attention-grabbing mechanism. By exploiting this cognitive bias, media
outlets capture the user’s attention almost involuntarily, distracting them from more relevant
content. Other researchers have defined clickbait as “a viral journalism strategy that seeks to
provoke users into clicking a hyperlink by means of news selection and writing techniques that
function as bait” [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] a definition that approaches the phenomenon more as an editorial technique
than a psychological effect.
      </p>
      <p>
        In practice, clickbait has detrimental effects on the quality of journalism and the formation of
public opinion, as it displaces relevant information in favor of attractive but superficial or
misleading content. Various authors agree that clickbait undermines the quality of public discourse
and the healthy functioning of democratic societies. Palau-Sampio warns that leading newspapers
such as El País have adopted sensationalist writing strategies that blur the boundaries between
quality journalism and entertainment [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Mordecki similarly argues that when citizens receive
incomplete or irrelevant information, their ability to make informed decisions is compromised [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        Potthast et. al., also criticizes media outlets that prioritize maximizing clicks for economic gain at
the expense of content quality [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        To enable automatic clickbait detection using machine learning, some researchers have
compiled datasets to train models that have proven effective for natural language processing (NLP)
tasks. However, Mordecki and colleagues in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] point out two major issues with existing datasets:
(a) most datasets are in English language, and there is no consensus on a precise definition of
clickbait, and (b) the dataset construction methods may introduce bias into what is considered
clickbait. For instance, reputation-based methods label all headlines from sensationalist sources as
clickbait, even if some are legitimate. Likewise, gatekeeper-based methods rely on expert
judgments, which may reflect subjective biases rooted in experience, editorial values, or personal
perceptions.
      </p>
      <p>
        To address these problems, Mordecki and collaborators recently introduced the TA1C (Te
Ahorré Un Click) dataset—the first manually annotated Spanish-language corpus for clickbait
detection, consisting of 3,500 tweets from 18 media outlets [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Its main contribution lies in
providing rigorous and less subjective annotations. The dataset was built using clear operational
criteria to identify deliberate omission of information, considering the reader’s context and prior
knowledge, and recognizing when adjectives create informational gaps.
      </p>
      <p>
        The dataset was publicly presented under the title Detecting and Spoiling Clickbait in
SpanishLanguage News [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] at IberLEF 2025 [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], within the Content Curation and Generation track. The
authors of the present article participated in that shared task and report here on the results
obtained and we will present a component-based framework designed to facilitate the development
of classification models for clickbait classification projects, which will be detailed in the following
sections.
      </p>
      <p>This paper is organized as follows: Section 2 reviews related work on clickbait detection using
various datasets and NLP methods. Section 3 describes the proposed system for clickbait
classification. Section 4 discusses the results and hyperparameters for each method, and Section 5
outlines the main conclusions and feature experimentations.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Studies</title>
      <p>
        Automatic clickbait detection has gained increasing attention in NLP community due to its
implications for digital misinformation, user trust, and the quality of online journalism [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Various
datasets have been carefully compiled, typically consisting of short headlines or social media posts
labeled as clickbait or non-clickbait, which are used to train machine learning models.
      </p>
      <p>
        To enable these models to operate effectively, a vectorization phase is required in which textual
data is transformed into numerical representations suitable for computational processing.
Vectorization techniques range from traditional approaches such as bag-of-words or TF-IDF [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] to
dense representations like Word2Vec [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], GloVe [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], or contextual embeddings derived from
transformer-based models like BERT [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. The choice of vectorization method directly influences the
model's ability to capture semantic structures and signals associated with clickbait, such as vague
pronouns, emotional cues, or curiosity gaps.
      </p>
      <p>
        Regarding machine learning algorithms, models commonly used for clickbait detection range
from traditional approaches such as Logistic Regression (LR), Support Vector Machines (SVM),
Random Forests (RF), and Multilayer Perceptron (MLP), to more recent and complex deep learning
architectures—including Convolutional Neural Networks (CNN), Recurrent Neural Networks
(RNN), and transformer-based models—designed to more effectively capture the contextual
information inherent in natural language [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Below are some studies related to clickbait detection,
highlighting the diversity of datasets, vectorization techniques, machine learning models, and
evaluation metrics employed.
      </p>
      <p>
        One early approach is proposed in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], who built a custom dataset using Facebook posts
categorized based on the reputation of their sources. The author applied Word2Vec for word
embedding and trained a fully connected deep neural network with dropout and batch
normalization, achieving an accuracy of 0.93.
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], the researchers also used Facebook posts to build their dataset and employed Word2Vec
for vectorization. Their experiments included both Feedforward Neural Networks and
Convolutional Neural Networks (CNN), with CNN achieving an impressive accuracy of 98.3%.
Their study demonstrated the effectiveness of deep learning architectures in capturing textual
patterns indicative of clickbait.
      </p>
      <p>
        Other researchers developed a dataset of headlines drawn from both reputable media outlets
and sources known for their clickbait [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Using TF-IDF for text representation, they trained
several classifiers, including Logistic Regression (LR), Support Vector Machines (SVM), and
Random Forests (RF), with RF yielding the best performance at an F1-score of 0.93. This work
underscored the importance of feature engineering in traditional machine learning pipelines [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], the authors worked with the Webis Clickbait Corpus 2017, comprising 38,000 tweets
annotated with clickbait intensity scores on a scale from 0 to 1. They used bag-of-words and
ngram features, applying SVM and Ridge Regression for regression-based predictions, with their
ensemble model achieving a Mean Squared Error (MSE) of 0.04.
      </p>
      <p>
        In a recent study, academics presented in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] the TA1C dataset. This study explored both TF-IDF
and transformer-based embeddings (fastText and Beto) and evaluated LR, XGBoost, and GPT-4,
achieving the highest F1 score of 0.84 using the Beto model.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Proposed System</title>
      <p>This section provides a detailed overview of the proposed approach. First, we detail the data
stratification process used for evaluation. Second, we provide a description of the preprocessing
steps applied to the datasets. Next, we describe the text representation methods employed, and
finally, it introduces the classification algorithms used, along with their corresponding
hyperparameter settings.</p>
      <sec id="sec-3-1">
        <title>3.1. Data stratification</title>
        <p>Following the TA1C shared task (Te Ahorré Un Click – Clickbait Detection and Spoiling in Spanish),
we used the official clickbait detection corpus made available for the competition.</p>
        <p>
          The TA1C dataset consists of 3,500 news items collected from Twitter (currently known as X),
originating from 18 reputable media sources that are national or international (excluding local
outlets), with a generalist focus (not topic-specific) and high popularity. The corpus is divided into
three subsets: 2,100 instances for training, 700 for validation, and 700 for testing [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
        <p>The TA1C shared task includes two subtasks. Subtask 1, called Clickbait Detection, aims to
determine if the content of a tweet that links to a piece of news is clickbait. In contrast, Subtask 2,
named Clickbait Spoiling, consists of generating or extracting from the article a short text – up to
280 characters – that, as concisely as possible, either satisfies the curiosity sparked by the headline
or indicates that the article does not provide a clear answer. In this paper we focus on Subtask 1.</p>
        <p>During the first phase, the organizers released two datasets: a training set and a development
set. Only the training set included labels, allowing participants to build supervised classification
models. At this stage, the development set labels were not disclosed; however, participants could
evaluate their models using this data through the CodaLab platform [13]. Table 1 shows the class
distribution of the training set.</p>
        <p>In the second phase, the organizers re-released the training set along with the now-labeled
development set and an unlabeled test set, which was used for the final evaluation. The class
distribution of the development set is detailed in Table 2.</p>
        <p>The class distribution imbalance, with 71.50% clickbait texts in the training set and 71.00% in the
development set, indicates a high prevalence of clickbait content. This disproportion may bias
models toward the majority of the class. Furthermore, detecting clickbait requires analyzing not
only the superficial content of the text but also semantic and contextual aspects that enable
understanding of the communicative intent and potential ambiguities present in the message.</p>
        <p>Each dataset contains the followings columns: Tweet ID, Tweet Date, Media Name, Media
Origin, Teaser Text and Tag Value. The data from Teaser Text column was processed and
vectorized to generate input features for the classification models and the data from Tag Value column
was used as the label to predict. Table 3 shows two representative examples of texts labeled as
clickbait and two corresponding to the non-clickbait class.</p>
        <p>The following examples illustrate certain linguistic and discursive features that differentiate
content labeled as clickbait from non-clickbait. In general, texts classified as clickbait tend to
employ questions, ambiguous phrasing, or curiosity-inducing expressions aimed at encouraging
the reader to click in order to access the full content. In contrast, non-clickbait texts present
information in a clear, direct, and specific manner, without omitting key elements of the text.</p>
        <sec id="sec-3-1-1">
          <title>México llega a 668 mil 381 casos de Covid; hay 70 mil muertes.</title>
          <p>#ÚltimaHora México llegó a 70 mil 821 muertes por Covid-19,
con 668 mil 381 casos confirmados de coronavirus</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>Policías recibían hasta $400.000 por avisarles a ladrones cuándo robar en TransMilenio, según investigación de la Fiscalía</title>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Preprocessing</title>
        <p>To ensure the quality and consistency of the input data before training the models, a preprocessing
stage was applied to the TA1C dataset. This step is crucial in NLP tasks, as it reduces noise,
standardizes textual input, and facilitates the extraction of relevant features. The processing aimed
to transform raw tweets into structures and clean format for machine learning algorithms. Below
we describe the preprocessing techniques considered in this study:
• Tokenization: As an initial step, tokenization was applied to split the text into individual
units (tokens), typically corresponding to characters or words to improve the association or
“understanding” of words for each sentence.
• Normalization: In this process we removed non-alphanumeric characters, and all texts were
converted to lowercase.
• Stopwords removal: In some cases, we removed common words that carry little semantic
weight, such as “el”, “la”, “y”, or “de” in Spanish — were removed using a predefined list
from the NLTK library. This step reduces noise and helps focus the model on more
informative terms.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Model architecture</title>
        <p>The model’s architecture is built upon a component-based framework called NLPClassKit. This
toolkit consists of a set of software components designed to facilitate the development of
classification models for NLP projects. It is currently available in a GitLab repository
(https://gitlab.com/JohnRojas/NLPClassKit) and follows a process-oriented architecture as
illustrated in Figure 1.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.3.1. Text vectorization</title>
        <p>Once the text documents were extracted or preprocessed, they were fed into various vectorization
methods. These techniques are crucial for representing and extracting relevant information from
the text, whether at the lexical or semantic level. Some of these approaches are implemented in a
software component called Text2Vec, developed in Python and hosted in a GitLab repository
(https://gitlab.com/MLComponents1/Text2Vec). Technical aspects related to syntax; parameters
and the execution process are described in the README file of the repository. This module encodes
text using the following strategies:</p>
        <p>One-hot-encoding (OHE): A straightforward technique that represents words as binary vectors.
First, vocabulary is built from the text, assigning a unique index to each term. Then, each word is
converted into a binary vector with a value of 1 in the position corresponding to its presence in the
document and 0 elsewhere. Although OHE requires minimal computational resources, it produces
high-dimensional vectors, especially with large vocabularies [14]. The length of the output vectors
is variable and depends on the input vocabulary; for example, during our training phase, a vector of
dimension 18,770 was produced using this method with the test set.</p>
        <p>TF-IDF (Term Frequency-Inverse Document Frequency): A frequency-based method that evaluates
the importance of a word in a document relative to a corpus. It combines Term Frequency (TF),
which counts how often a word appears in a document with Inverse Document Frequency (IDF)
that measures its rarity across the corpus. This weighting reduces the impact of common terms and
highlights those more relevant to the document’s content [15]. On the other hand, TF-IDF encoding
outputs a sparse vector representation in which each dimension corresponds to a specific term
from the corpus vocabulary. The value of each dimension reflects the weight of that term in the
document based on its TF-IDF score. The length of the resulting vector depends on the size of the
vocabulary, which is determined by steps such as tokenization, stop words removal, etc.</p>
        <p>Doc2Vec: An extension of the Word2Vec model that generates vector representations for entire
documents, paragraphs, or sentences rather than individual words. This approach captures the
overall semantic context of the text and encodes it into a fixed-length vector, facilitating tasks like
document comparison, classification, or clustering [16]. The length of the output vector is variable
and can be controlled through its parameters. Other parameters such as context size, the use of
iterations, or the use of bag-of-words or skip-grams, among others, can also be controlled by the
model user.</p>
        <p>BERT (Bidirectional Encoder Representations from Transformers): A pre-trained model based on
attention mechanisms, capable of interpreting word context bidirectionally (analyzing both
preceding and subsequent words). Unlike traditional methods, BERT uses a transformer
architecture to capture complex relationships and deep contextual meanings. Additionally, it can be
fine-tuned with domain-specific texts to enhance performance in specialized tasks [17]. BERT-base
is a language model with 12 layers, 768 hidden units, 12 attention heads, and a vocabulary of 30,522
tokens. It can process sequences of up to 512 tokens and has approximately 110 million parameters.
The extended version, BERT-large, reaches 340M parameters.</p>
        <p>ASCII: An approach that analyzes the frequency of ASCII characters in a document to estimate
the probability of each character’s occurrence. Though simpler than other techniques, it is useful
for specific tasks like style or formatting classification [18]. This encoding always outputs a
256dimensional vector, in which the first 128 characters correspond to upper- and lower-case letters,
numbers, punctuation marks and some control characters from the English language, and the
remaining characters include symbols and accented letters from other languages.</p>
      </sec>
      <sec id="sec-3-5">
        <title>3.3.2. Classification</title>
        <p>The vector representations generated in the preceding stage serve as input for multiple supervised
machine learning models, each specifically tailored to categorize textual data through pattern
recognition. Below is a concise overview of the primary algorithms employed in this research:</p>
        <p>Naive Bayes (NB): This probabilistic classification model operates on the foundation of Bayes'
Theorem. It computes the likelihood of a text sample belonging to a specific category by analyzing
its feature set. Although NB employs a simplified approach by assuming feature independence, it
demonstrates notable efficacy in text classification scenarios, especially when this assumption
approximately holds true.</p>
        <p>K-Nearest Neighbors (KNN): As a non-parametric classification method, KNN determines class
membership for new instances by identifying the predominant class among its K closest
neighboring points within the vector space. The algorithm operates on a majority voting system
and proves particularly valuable for complex, non-linear decision boundaries. The selection of
neighbors depends crucially on distance computations, commonly employing measures such as
Euclidean distance or cosine similarity.</p>
        <p>Logistic Regression (LR): Logistic Regression is a linear classification algorithm used to predict
binary outcomes. It models the relationship between a set of input features X and a binary target
variable y ∈{0 , 1 }. The prediction is made using the sigmoid function:</p>
        <p>P ( y = 1) = 1 +1e- z , z = ∑ wi xi ,
(1)
where wi are the model parameters and xi are the input features. The output is a probability value
between 0 and 1, indicating the likelihood that the input belongs to class 1.</p>
        <p>Multilayer Perceptron (MLP): The MLP represents a fundamental deep learning architecture
comprising three distinct fully connected layer types: (i) an input layer that receives feature
representations, (ii) multiple hidden layers capable of learning hierarchical feature transformations,
and (iii) an output layer that produces classification probabilities. This neural network architecture
demonstrates effectiveness in modeling non-linear relationships within complex datasets, making it
well-suited for text classification applications where it can learn intricate patterns in textual
representations.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments and obtained results</title>
      <p>Finally, the outcome of the classification process is presented in the final stage, where each
algorithm described in the previous section produces the following output:
•
•
•</p>
      <p>Confusion Matrix (CM): Displays the number of texts correctly and incorrectly classified.
Predictions: Lists the predicted labels generated during the training phase of each algorithm.
Model Performance: Provides a detailed evaluation of each classification model based on
metrics such as Recall, Precision, and F1-score.</p>
      <p>During the experimentation and testing phase, the team members conducted over 1,500
experiments to explore various parameter configurations in both vectorization methods (VecM)
and classifiers (Clf). We shared the best results among ourselves, aiming to explore as many
options as possible. In the final phase, predictions were made on the test set of 700 unlabeled
tweets, which were used to generate the official scores shown in Table 4. The performance of each
model is presented in the Precision (P), Recall (R), and F1-score (F1) metrics. Moreover, we show
the dimensions of vector representations (Dvec) and the number of parameters of BERT-based
models ( N p) for each experiment.</p>
      <p>Table 4 presents a comparative analysis of different features and classifiers tested in the
NLPClassKit system. The best overall performance on the test set was achieved by UAEMex-1 and
UAEMex-2, both using the pre-trained model albert-base-spanish combined with a LR classifier.
These models obtained the highest F1-score of 0.7538, indicating a strong balance between
precision and recall. UAEMex team corresponded to the set of classifiers with simple majority
voting.</p>
      <p>Interestingly, UAEMex-3, which used the same text vectorization but replaced LR with a MLP,
showed a slightly lower precision (0.7273) but improved recall (0.7665) resulting in a competitive
F1-score of 0.7463. In contrast, when we used ASCII vectorization combined with
bert-basemultilingual-cased (UAEMex-4-2 and UAEMex-4-3), we achieved the lowest performance with F of
0.6610 and 0.6369, suggesting that this combination is less effective for the clickbait detection task.
Meanwhile, configurations that combined bert-base-spanish-wwm-cased with albert-base-spanish
(UAEMex-4-4 and UAEMex-4-5) outperformed the multiligual-based system, reaching an F1-score
of 0.7122 and 0.7352. However, these results were slightly below those of the best models, where
only alBERT features were used. These results demonstrate that not only the classifier architecture
but also the quality and linguistic adequacy of the vector representations are key factors in
optimizing performance in Spanish text classification tasks.</p>
      <p>Finally, we combined the embedding generated by bert-base-spanish-wwm-cased and
albertbase-spanish. Although this may initially seem redundant, given that both models are pre-trained
on Spanish; However, we decided to carry out this experiment because they have different
architectures, which means they learn and represent distinct aspects of the language. On one hand,
BERT has a greater capacity to capture complex semantic relationships; while albert, being a lighter
and more efficient model, tends to generalize better in simpler or noisier contexts.
ABS: albert-base-spanish, ASC: ASCII, ABBM: ASCII + bert-base-multilingual-cased, and BSWC:
bert-base-spanish-wwm-cased + albert-base-spanish.</p>
      <p>In general, we expected that combination of both features would allow us to leverage their
complementary strengths and thus enhance the classification system is performance, but in this
case, the results showed a slight decrease in the F-measure compared to using albert-base-spanish.
This drop in performance could be attributed to redundancies or conflicts in the embeddings
generated by the two models, which may introduce noise rather than providing additional value.</p>
      <p>Regarding vectorization methods, the best performance was obtained with albert, a linguistic
model based on BERT, trained specifically for the Spanish language by the Computer Science
Department of the University of Chile and available in the Hugging Face repository [13]. It should
be noted that other Spanish-oriented BERT models were also evaluated, such as
bert-base-spanishwwm-uncased and bert-base-multilingual-cased, which showed lower performance than albert
when used with the same classifiers. Table 5 presents the hyperparameter settings corresponding
to the machine learning models that obtained the best results.</p>
      <p>
        After the evaluation phase was completed, the results were published on the CodaLab platform,
an open-source system designed to host and manage scientific challenges [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. The performance of
the participating teams, including our best experiments, is summarized in Table 6. It presents both
the F1 scores obtained during the official competition classification and the internal evaluation
metrics calculated during the training phase. This table compares the results of our team with those
of the other participants.
      </p>
      <p>According to the F1 scores, our team ranked 10th, 11th, 12th, and 13th. Compared to the
topperforming team (gsdeyson), the difference was 0.047 points. It is worth noting that all the metrics
presented in the table were calculated using a combined training set composed of the 2,800 labeled
tweets provided during the development phase and an additional 700 labeled tweets from the
evaluation phase.</p>
      <p>Additionally, the confusion matrices corresponding to the best results are displayed in Figure 2.
It can be observed that the models exhibit a slight bias toward predicting that a given text does not
contain clickbait. This tendency is primarily attributed to the class imbalance in the training data,
with 1,001 clickbait examples versus 2,499 non-clickbait examples (see Tables 1 and 2). On the
other hand, we observe that the best-performing model (UAEMex-2) does not show a better
prediction of clickbait in the training stage compared to UAEMex-3 and UAEMex-4-5 models. We
assume that this may be an indicator of generalization in the proposed model, enabling a balance
between non-clickbait (0) and clickbait (1).</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions and Future Works</title>
      <p>In this study, the team collaborated to address the clickbait detection task proposed by Mordecki
and colleagues as part of the IberLEF 2025 forum, with the goal of contributing to the development
of algorithms capable of eliminating unethical practices exhibited by some digital media creators
and promoting high-quality journalism.</p>
      <p>During the competition, various text vectorization techniques were explored, allowing for a
comparative analysis between several custom BERT-based models for Spanish and traditional
vectorization approaches. This experimentation enabled the identification of the most suitable
method for this specific natural language processing task. Among the tested models, Albert proved
to be the most effective, consistently outperforming other Spanish-language BERT models during
the evaluation phase.</p>
      <p>Additionally, four classic supervised machine learning algorithms were tested, yielding several
noteworthy findings. For instance, while the Multilayer Perceptron achieved strong results during
training, Logistic Regression demonstrated superior generalization performance during the
competition, which is the primary goal of a classification model. Similarly, the ensemble classifier
based on majority voting achieved comparable performance to logistic regression, likely due to its
internal inclusion of similar models.</p>
      <p>For future work, we recommend further exploration of BERT-based Spanish word embedding
models and experimentation with alternative classifiers beyond those evaluated in this study.
Furthermore, no dimensionality reduction techniques were applied in this project; consequently,
classification efficiency could potentially be enhanced through the implementation of methods such
as Principal Component Analysis (PCA), Genetic Algorithms (GAs), or Information Gain, among
others. Additionally, the implementation of a probability-based ensemble classifier is suggested as a
promising line of research.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used Grammarly tool in order to: Grammar and
spelling check. After using this tool, the authors reviewed and edited the content as needed and
take full responsibility for the publication’s content.
advances in social networks analysis and mining (ASONAM). pp. 9-16.
https://doi.org/10.1109/ASONAM.2016.7752207
[13] Pavao, A., Guyon, I., Letournel, A.-C., Tran, D.-T., Baro, X., Escalante, H. J., Escalera, S.,
Thomas, T., &amp; Xu, Z. (2023). CodaLab Competitions: An Open Source Platform to Organize
Scientific Challenges. Journal of Machine Learning Research, 24(198), pp. 1-6.
[14] Rojas-Simon, J., Ledeneva, Y., &amp; Garcia-Hernandez, R. A. (2022). Evaluation of Text Summaries
Based on Linear Optimization of Content Metrics (Vol. 1048). Springer International
Publishing. https://doi.org/10.1007/978-3-031-07214-7.
[15] Le, Q., &amp; Mikolov, T. (2014). Distributed representations of sentences and documents. In</p>
      <p>International conference on machine learning. 32(2), pp. 1188-1196.
[16] Abubakar, H. D., Umar, M., &amp; Bakale, M. A. (2022). Sentiment classification: Review of text
vectorization methods: Bag of words, Tf-Idf, Word2vec and Doc2vec. SLU Journal of Science
and Technology, 4(1), pp. 27-33. https://doi.org/10.56471/slujst.v4i.266
[17] Devlin J., Chang M.-W., Lee K., &amp; Toutanova, K. (2019). BERT: Pre-training of Deep
Bidirectional Transformers for Language Understanding. pp. 4171–4186, Accessed: May 20,
2025. [Online]. Available: https://github.com/tensorflow/tensor2tensor.
[18] Departamento de Ciencias de la Computación de la Universidad de Chile (DCC UChile). (2023).</p>
      <p>ALBERT Base Spanish BERT: dccuchile/albert-base-spanish. Available:
https://huggingface.co/dccuchile/albert-base-spanish.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Mordecki</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moncecchi</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Couto</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>2025</year>
          ).
          <article-title>Te Ahorré Un Click: A Revised Definition of Clickbait and Detection in Spanish News</article-title>
          . En L.
          <string-name>
            <surname>Correia</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Rosá</surname>
          </string-name>
          , &amp; F. Garijo (Eds.),
          <source>Advances in Artificial Intelligence - IBERAMIA</source>
          <year>2024</year>
          (pp.
          <fpage>387</fpage>
          -
          <lpage>399</lpage>
          ). Springer Nature Switzerland. https://doi.org/10.1007/978-3-
          <fpage>031</fpage>
          -80366-6_
          <fpage>32</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Bazaco</surname>
            ,
            <given-names>Á.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Redondo</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Sánchez-García</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          (
          <year>2019</year>
          ).
          <article-title>El clickbait, como estrategia del periodismo viral: Concepto y metodología</article-title>
          .
          <source>Revista Latina de Comunicación Social</source>
          ,
          <volume>74</volume>
          , pp.
          <fpage>94</fpage>
          -
          <lpage>115</lpage>
          . https://doi.org/10.4185/RLCS-2019-1323
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Palau-Sampio</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (
          <year>2016</year>
          ).
          <article-title>Reference press metamorphosis in the digital context: clickbait and tabloid strategies in Elpais.com</article-title>
          .
          <source>Communication &amp; Society</source>
          ,
          <volume>29</volume>
          (
          <issue>2</issue>
          ), pp.
          <fpage>63</fpage>
          -
          <lpage>80</lpage>
          . https://doi.org/10.15581/003.29.35924
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gollub</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Komlossy</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schuster</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wiegmann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Garces</given-names>
            <surname>Fernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. P.</given-names>
            ,
            <surname>Hagen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            , &amp;
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <surname>B.</surname>
          </string-name>
          (
          <year>2018</year>
          ).
          <article-title>Crowdsourcing a Large Corpus of Clickbait on Twitter</article-title>
          .
          <string-name>
            <surname>En E. M. Bender</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Derczynski</surname>
          </string-name>
          , &amp; P. Isabelle (Eds.),
          <source>Proceedings of the 27th International Conference on Computational Linguistics</source>
          (pp.
          <fpage>1498</fpage>
          -
          <lpage>1507</lpage>
          ).
          <article-title>Association for Computational Linguistics</article-title>
          . https://aclanthology.org/C18-1127
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Mordecki</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Chiruzzo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Laguna</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prada</surname>
            ,
            <given-names>J. J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosá</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sastre</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Moncecchi</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          (
          <year>2025</year>
          ).
          <article-title>Overview of TA1C at IberLEF 2025: Detecting and Spoiling Clickbait in SpanishLanguage News</article-title>
          . Procesamiento de Lenguaje Natural,
          <volume>75</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>González-Barba</surname>
            ,
            <given-names>J. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chiruzzo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jiménez-Zafra</surname>
            ,
            <given-names>S. M.</given-names>
          </string-name>
          (
          <year>2025</year>
          ).
          <article-title>Overview of IberLEF 2025: Natural Language Processing Challenges for Spanish and other Iberian Languages</article-title>
          .
          <source>In Proceedings of the Iberian Languages Evaluation Forum (IberLEF</source>
          <year>2025</year>
          ),
          <article-title>co-located with the 41st Conference of the Spanish Society for Natural Language Processing (SEPLN 2025), CEURWS</article-title>
          .org
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>2013</year>
          ).
          <article-title>Efficient Estimation of Word Representations in Vector Space</article-title>
          .
          <source>In 1st International Conference on Learning Representations, ICLR 2013 - Workshop Track Proceedings</source>
          ,
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          . https://doi.org/10.48550/arXiv.1301.3781
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Pennington</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Socher</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>GloVe: Global Vectors for Word Representation</article-title>
          . En A.
          <string-name>
            <surname>Moschitti</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Pang</surname>
          </string-name>
          , &amp; W. Daelemans (Eds.),
          <source>In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          (pp.
          <fpage>1532</fpage>
          -
          <lpage>1543</lpage>
          ).
          <article-title>Association for Computational Linguistics</article-title>
          . https://doi.org/10.3115/v1/
          <fpage>D14</fpage>
          -1162
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
          </string-name>
          , M.-W.,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          (
          <year>2019</year>
          ).
          <article-title>BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</article-title>
          .
          <source>In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers). pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          , Association for Computational Linguistics. https://doi.org/10.18653/v1/
          <fpage>N19</fpage>
          -1423
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Agrawal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>2016</year>
          ).
          <article-title>Clickbait detection using deep learning</article-title>
          .
          <source>In 2016 2nd International Conference on Next Generation Computing Technologies (NGCT)</source>
          , pp.
          <fpage>268</fpage>
          -
          <lpage>272</lpage>
          . https://doi.org/10.1109/NGCT.
          <year>2016</year>
          .7877426
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Rony</surname>
            ,
            <given-names>M. M. U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hassan</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Yousuf</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2017</year>
          ).
          <article-title>Diving Deep into Clickbaits: Who use them to what extents in which Topics with what Effects?</article-title>
          <source>In Proceedings of the 2017 IEEE/ACM international conference on advances in social networks analysis and mining 2017</source>
          , pp.
          <fpage>232</fpage>
          -
          <lpage>239</lpage>
          . https://doi.org/10.1145/3110025.3110054
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Chakraborty</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paranjape</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kakarla</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Ganguly</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          (
          <year>2016</year>
          ).
          <article-title>Stop clickbait: Detecting and preventing clickbaits in online news media</article-title>
          . In 2016 IEEE/ACM international conference on
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>