<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Gambling Disorders and Type of Addiction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Pedro Pablo Molina</string-name>
          <email>pedropmolinaa@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Isabel Segura-Bedmar</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Human Language and Accessibility Technologies Group (HULAT), Computer Science and Engineering Department, University Carlos III of Madrid, Leganés</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0001</lpage>
      <abstract>
        <p>This paper describes our solutions to the two tasks of MentalRiskES 2025 [1], a competition dedicated to the early detection of mental disorders. In this edition, the tasks focused on detecting problems related to gambling addiction. While designing the solutions, we explored various approaches to better adapt to the environment in which the data was collected. For the first task, we used an SVM, an SVM with zero-shot assistance, and a transformer model. For the second task, we applied Data Augmentation to an SVM model and to two transformer models. The primary objective of this paper is to present the different techniques used, their limitations, and the conclusions drawn from the experiments.</p>
      </abstract>
      <kwd-group>
        <kwd>Early Risk Detection</kwd>
        <kwd>Support Vector Machine</kwd>
        <kwd>Transformer</kwd>
        <kwd>Data Augmentation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>CEUR Workshop Proceedings, ISSN 1613-0073</p>
      <p>roberta-large-xnli-anli [9]. The model was applied with the labels “sano” (healthy) and “enfermo” (ill) to distinguish messages written by a healthy user from those written by a user with a mental condition.</p>
      <p>3. Finally, we used the pre-trained model myahan007/bert-base-spanish-wwm-cased-finetuned-tweets, a BERT-based model [10] pretrained with a Whole Word Masking strategy on a Spanish corpus derived from tweets and other informal texts [11], for which different training data were considered using Data Augmentation.</p>
      <p>For Task 2:</p>
      <p>1. An SVM model, with Data Augmentation applied to the lootboxes class because it contained very few samples.</p>
      <p>2. The same training dataset as in the previous run, but with the pre-trained model myahan007/bert-base-spanish-wwm-cased-finetuned-tweets [10].</p>
      <p>3. Finally, the pre-trained model myahan007/bert-base-spanish-wwm-cased-finetuned-tweets [10] again, but with Data Augmentation applied to several classes to balance them better, on the assumption that improved generalization could lead to better results on new messages.</p>
      <p>These are the main runs we submitted; the attempts that led to them are discussed in more detail later.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Description of Dataset and Tasks</title>
      <p>The competition provided a series of messages sent by different users on both Telegram and Twitch. As mentioned earlier, these platforms are designed for the constant exchange of information with minimal filters, and their importance has grown significantly in recent years.</p>
      <p>A single dataset was provided for both tasks, containing all messages from each user along with their
label, applicable to both Task 1 and Task 2.</p>
      <p>The purpose of this competition is to identify mental disorders at an early stage. To simulate a
realistic environment, the organization implemented a server that emulates an authentic conversation,
delivering data packets containing one message per user. The system must assign a label to each user
based on both the current message and all previous messages before receiving the next packet. The
ultimate goal is to predict the possible mental disorder of each user as quickly as possible.</p>
      <sec id="sec-2-1">
        <title>2.1. Task 1: Risk Detection of Gambling Disorders</title>
        <p>This is a binary classification task aimed at determining whether a user is at high risk (label = 1) or low
risk (label = 0) of developing a gambling-related disorder based on their messages. The objective is to
enable early detection and facilitate timely interventions. Table 1 shows the class distribution for Task
1 for the trial and training datasets provided by the organizers.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Task 2: Type of Addiction Detection</title>
        <p>This is a multi-class classification task aimed at determining the specific type of addiction associated with the disorder. The available labels are Betting, Online Gaming, Trading, and Lootboxes. The class distribution is shown in Table 2.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. System Architecture and Techniques</title>
      <p>Given the characteristics of the dataset, we trained all models on each user’s messages concatenated into a single text, separated by line breaks. Because this approach can produce very long inputs, we tested different models capable of handling varying input sizes.</p>
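      <p>As a minimal sketch of this input construction, the messages of each user can be grouped and joined with a line break. The function name and the (user_id, text) tuple layout are illustrative assumptions, not the code used in the experiments.</p>
      <preformat><![CDATA[
```python
from collections import defaultdict


def concatenate_by_user(messages):
    """Group (user_id, text) pairs and join each user's texts with a line break."""
    grouped = defaultdict(list)
    # Messages are assumed to arrive in chronological order.
    for user_id, text in messages:
        grouped[user_id].append(text)
    return {user_id: "\n".join(texts) for user_id, texts in grouped.items()}


# Invented example messages, for illustration only.
example = [("u1", "hola"), ("u2", "otra ronda"), ("u1", "lo perdi todo xd")]
documents = concatenate_by_user(example)
```
]]></preformat>
      <p>Each value in documents is then a single text per user, which is the unit passed to the vectorizer or tokenizer.</p>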
      <sec id="sec-3-1">
        <title>3.1. Data Augmentation</title>
        <p>Because the dataset is small, we used Data Augmentation (DA), which involves the artificial creation of training data for machine learning through transformations. This is a widely studied research field across various machine learning disciplines [12], and several techniques are available for Natural Language Processing (NLP). We differentiate between Task 1 and Task 2 because their classes differ, and our goal is to generate data tailored to each task’s requirements.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Task 1: Risk Detection of Gambling Disorders</title>
        <p>In this case, we applied three techniques:</p>
        <p>• Back Translation: a common DA technique that translates a sentence into another language and then back into the original language [13]. It aims to generate paraphrased versions of the original texts while preserving their semantic meaning. We used the MarianMT models from the Helsinki-NLP project, specifically Helsinki-NLP/opus-mt-es-en [14] and Helsinki-NLP/opus-mt-en-es [15], for Spanish-English-Spanish translation. This method is particularly effective in improving generalization for text classification tasks by increasing data diversity without manual annotation [16].</p>
        <p>• Summarization: summarization-based augmentation uses a language model to compress messages while preserving their most relevant information. We prompted LLMs such as filipealmeida/Mistral-7B-Instruct-v0.1-sharded [17] and NousResearch/Nous-Hermes-2-Mistral-7B-DPO [18]. Each input text was passed with an instruction to summarize the message while retaining the semantic intent associated with its labeled class. This technique creates concise variants of the original messages, which helps the model generalize by focusing on the core content [19].</p>
        <p>• LLM prompting: prompts were specifically designed to simulate user messages from informal platforms (e.g., Twitch or Discord) related to high and low risk. We used the model NousResearch/Nous-Hermes-2-Mistral-7B-DPO, which is optimized using Direct Preference Optimization (DPO) and trained on high-quality synthetic datasets. This method provides high-variability, realistic texts aligned with the characteristics of each class [20].</p>
        <p>These data augmentation techniques are widely recognized as effective strategies in scenarios with limited training data, as they help increase dataset diversity and improve model generalization [21]. However, in our experiments these methods did not significantly improve our models’ performance. This is mainly due to the subtle semantic overlap between texts labeled as high and low risk, which makes it hard for augmented data to improve class discrimination. The most effective technique in our case was LLM prompting. Back translation, by contrast, sometimes altered or lost the meaning of key words that are critical for identifying the class, which reduced its effectiveness in distinguishing between classes.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Task 2: Type of Addiction Detection</title>
        <p>In this case, considering the classes we need to predict and the imbalance between classes, the best
approach is to generate synthetic data using the previously mentioned LLM prompting technique.</p>
        <p>We used the Hugging Face model NousResearch/Nous-Hermes-2-Mistral-7B-DPO, which is a flagship
7B model improved through DPO (Direct Preference Optimization). This model has shown significant
improvements across multiple benchmarks, including AGIEval, BigBench Reasoning, GPT4All, and
TruthfulQA. It was previously trained on 1,000,000 instructions and chats of GPT-4 quality or higher,
primarily using synthetic data and other high-quality datasets. This choice was made due to its ability
to generate realistic and contextually appropriate texts for the task.</p>
        <p>We first generated data for the lootboxes class with the following settings:</p>
        <p>• Temperature (0.95), Top-k (60), Top-p (0.95), Repetition Penalty (1.1): configured to ensure diversity and realism.</p>
        <p>• Minimal cleaning: removal of dashes and quotation marks at the beginning of each generated message.</p>
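        <p>The settings above can be sketched as follows. The dictionary keys follow the usual Hugging Face generation argument names, and do_sample, the regular expression, and the function name clean_generated are assumptions made for illustration; the exact call used in the experiments is not given in the text.</p>
        <preformat><![CDATA[
```python
import re

# Sampling values reported in the text. The key names and do_sample=True
# are assumptions about how they would be passed to a generate() call.
GENERATION_KWARGS = {
    "do_sample": True,
    "temperature": 0.95,
    "top_k": 60,
    "top_p": 0.95,
    "repetition_penalty": 1.1,
}

# Leading whitespace, dashes, and quotation marks (straight, curly, guillemets).
_LEADING_NOISE = re.compile(r"""^[\s\-\u2013\u2014'"\u201c\u201d\u00ab\u00bb]+""")


def clean_generated(text):
    """Remove dashes and quotation marks at the beginning of a generated message."""
    return _LEADING_NOISE.sub("", text)
```
]]></preformat>
        <p>For example, clean_generated applied to a message that starts with a dash and an opening quotation mark returns the message text without them, while text that starts with a letter is returned unchanged.</p>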
        <p>This is the prompt, translated from Spanish into English:</p>
        <preformat>prompt_base = """ Generate a message from a user in a chat while playing video games with lootboxes.
Do not literally mention “boxes” all the time. Use phrases that express frustration, excitement, or
addiction when opening random rewards. The style should be very informal, like Twitch or Discord: use
expressions like "xd", "haha", "I blew all my money", "one more", "this is a scam", etc. You can mention:
knives, skins, stickers, drops, boosters, packs, legendary drops, credits, gems, stattrak, foil items, trash
or epic rewards, duplicates, odds, pity counter, upgrades, reroll, or any typical lootbox dynamics. It
can also reference games like CS:GO, FIFA, Overwatch, Clash Royale, Valorant, Fortnite, etc. Message: """</preformat>
        <p>This prompt simulates typical messages from players on platforms like Twitch or Discord while interacting with lootboxes. The style is informal and expressive, including phrases like “xd”, “haha”, “this is a scam”, and references to common items: skins, knives, trash rewards, duplicates, probabilities. It also mentions games like CS:GO, FIFA, Overwatch, and Valorant.</p>
        <p>Similarly, this process was carried out for online gaming and betting, with the same intention but
adjusted to the characteristics of each class:</p>
        <preformat>prompt_base = """ Generate a message from a user in a chat while participating in online gambling
(online gaming). Do not use the word “casino” in every message. It should show frustration or addiction
with a very informal style, as if chatting on Twitch or Discord: "xd", "the slot machine doesn’t pay
anything", "it gave me 50 free spins and I got 10 cents", "my entire salary went on blackjack", "one
more round and I’m done (or not haha)". You can mention: roulette, slot machines, blackjack, jackpots,
welcome bonuses, free spins, multipliers, mega wins, wilds, dead spins, etc. Message: """</preformat>
        <preformat>prompt_base = """ Generate a message from a user in a chat while participating in sports betting. Do
not repeat the word “bet” all the time. The message should reflect emotions like frustration, euphoria,
or addiction when betting money on sports events, especially football. The style should be very
informal, as if speaking on Twitch or Discord: use expressions like "xd", "I blew it on the combo", "VAR
ruined me again", "Haaland saved me", "haha this parlay is hopeless". You can mention odds, combos,
penalties, live betting, VAR, goals in the 90+3, corner bets, cards, cashout, losing by one goal or winning
by a miracle, etc. Message: """</preformat>
        <p>These prompts were designed to generate realistic, context-specific messages, preserving the unique
linguistic features of each class while maintaining an informal and spontaneous conversational style.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Approaches for Task 1: Risk Detection of Gambling Disorders</title>
        <p>During the development phase, to train and compare our models consistently, we used the same data throughout, dividing the combined trial+train set into training (80%) and test (20%) subsets.</p>
        <sec id="sec-3-4-1">
          <title>3.4.1. Classical Machine Learning Classifiers</title>
          <p>Classical algorithms are an effective alternative for text classification, so we included them in our experiments. They can also handle texts of any length.</p>
          <p>We first experimented with different classical models using TF-IDF and basic preprocessing: the text was converted to lowercase, and stopwords and non-alphabetic characters were removed. The SnowballStemmer algorithm was then applied to obtain the root of each word.</p>
          <p>We can see that the SVM model already stands out. We also decided to perform hyperparameter
tuning with RandomSearch using F1-score as the objective metric to truly confirm if it is the best, and
we obtained:</p>
          <p>• Lemmatization with SpaCy: This approach aims to retain the linguistically accurate form of each word, which is especially useful in tasks where lexical meaning is important. The SpaCy library was used for this purpose [22], as it supports lemmatization while considering both grammatical category and syntactic context. Standard normalization steps were applied beforehand: lowercasing, tokenization via nltk.word_tokenize() [23], stopword removal (using nltk.corpus.stopwords), and filtering of non-alphabetic tokens. This approach was particularly useful for evaluating models based on richer semantic representations, such as TF-IDF [24].</p>
          <p>• Stemming with SnowballStemmer: The SnowballStemmer algorithm from NLTK was used [23], configured for Spanish. Although stemming does not guarantee valid words in the language, it can be advantageous for frequency-focused methods such as bag-of-words by reducing vocabulary variability [25]. The same preprocessing steps as in the previous method were applied, replacing lemmatization with stemming to allow a direct comparison in terms of performance and processing speed.</p>
          <p>• Stemming with TweetTokenizer: Given that the competition dataset includes brief and informal messages, often with social media expressions, a variant using NLTK’s TweetTokenizer [23] was tested. This tokenizer is designed to preserve symbols like hashtags, mentions, contractions, and emojis. It was combined with the same stemming algorithm (SnowballStemmer) and stopword removal, aiming for a preprocessing setup better adapted to short and colloquial texts [26].</p>
          <p>• Emoji processing without alphanumeric cleaning: To assess the role emojis may play in risk detection, a configuration was tested where no special character removal or morphological transformations were applied. Instead, the emoji library was used to convert each emoji into its textual description in Spanish using emoji.demojize() [27]. This approach was intended to evaluate the specific contribution of emojis to model performance.</p>
          <p>• Stopword removal only with TweetTokenizer: Finally, a minimal configuration was considered in which only Spanish stopwords were removed after applying TweetTokenizer. This setting was included as a simple baseline to compare against more complex techniques such as lemmatization or emoji handling, in order to assess their actual contribution to system performance.</p>
          <p>These five configurations were selected after testing various combinations and were deemed sufficiently representative of diverse text preprocessing strategies, from linguistically oriented approaches to ones adapted to informal social media language. They also offered a good trade-off between simplicity, execution time, and the ability to capture relevant information in the context of the MentalRiskES 2025 competition.</p>
          <p>To find the best hyperparameters, we used GridSearch with a relatively small search space:</p>
          <p>• C: [0.1, 1, 10]</p>
          <p>• kernel: ['linear', 'rbf']</p>
          <p>• gamma: ['scale', 'auto']</p>
          <p>• tol: [1e-3, 1e-2, 1e-1]</p>
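          <p>A search over this space could be set up roughly as follows with scikit-learn. This is a sketch under stated assumptions: the pipeline step names, scoring="f1", cv=2, and the toy texts and labels are invented for illustration and are not the configuration or data used in the experiments.</p>
          <preformat><![CDATA[
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# TF-IDF features followed by an SVM classifier (step names are arbitrary).
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svm", SVC()),
])

# Search space listed in the text: 3 * 2 * 2 * 3 = 36 combinations.
param_grid = {
    "svm__C": [0.1, 1, 10],
    "svm__kernel": ["linear", "rbf"],
    "svm__gamma": ["scale", "auto"],
    "svm__tol": [1e-3, 1e-2, 1e-1],
}

# Assumptions: F1 as the selection metric and 2-fold cross-validation.
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=2)

# Toy data, invented for illustration only (1 = high risk, 0 = low risk).
texts = [
    "aposte todo el sueldo otra vez",
    "perdi dinero en la ruleta",
    "una apuesta mas y lo dejo",
    "otra vez perdi en las tragaperras",
    "hoy vimos una pelicula",
    "me gusta cocinar pasta",
    "manana voy al gimnasio",
    "el partido estuvo aburrido",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

search.fit(texts, labels)
best_params = search.best_params_
```
]]></preformat>
          <p>Prefixing each parameter with the step name (svm__) lets the vectorizer be refit inside every fold, so no vocabulary information leaks from the validation split.</p>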
          <p>After combining both, we obtained the following results:</p>
          <p>We kept stopwords_only because its processing time and computational cost are lower than those of the emoji configuration.</p>
        </sec>
        <sec id="sec-3-4-2">
          <title>3.4.2. Using zero-shot Classifier</title>
          <p>After attempting Data Augmentation without improving the results, we decided to analyze the options
that seemed most promising.</p>
          <p>In this context, the idea of applying a pipeline with a transformer for text classification using zero-shot
emerged. The goal was to classify each user’s messages one by one, where 0 indicates that the message
does not suggest gambling addiction, and 1 indicates that it does. This approach was conceived to
enable early detection of the moment when a high risk of gambling addiction begins to appear.</p>
          <p>After extensive research, we did not find any suitable model specifically dedicated to gambling addiction detection. Therefore, we decided to use the vicgalle/xlm-roberta-large-xnli-anli model from Hugging Face [28], with the labels “sano” and “enfermo”.</p>
          <p>Finally, for each user, we stored all messages together and added the score_enfermo column, computed as the ratio of messages labeled “enfermo” to the total number of messages for that user.</p>
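          <p>The ratio can be sketched as below. The input is assumed to be the list of top-ranked zero-shot labels for one user’s messages; the handling of users without messages and the example labels are assumptions made for illustration.</p>
          <preformat><![CDATA[
```python
def score_enfermo(message_labels):
    """Fraction of a user's messages whose zero-shot label is "enfermo"."""
    if not message_labels:
        return 0.0  # assumption: a user without messages gets a score of 0
    flagged = sum(label == "enfermo" for label in message_labels)
    return flagged / len(message_labels)


# Hypothetical per-message labels for two users.
labels_by_user = {
    "u1": ["sano", "enfermo", "sano", "enfermo"],
    "u2": ["sano", "sano", "sano"],
}
scores = {user: score_enfermo(labels) for user, labels in labels_by_user.items()}
```
]]></preformat>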
          <p>With this setup, we studied whether there is any correlation between risk and score_enfermo, in order to validate the approach.</p>
          <p>• Point-biserial correlation coefficient: a special case of the Pearson correlation coefficient used to measure the relationship between a binary variable and a continuous variable. It is equivalent to computing the Pearson correlation coefficient directly between the binary variable (encoded as 0 and 1) and the continuous variable. Intuitively, it measures how much the mean of the continuous variable differs between the two groups defined by the binary variable. The point-biserial correlation coefficient of 0.19 indicates a weak positive correlation between the binary risk label and score_enfermo: the score tends to be slightly higher when the label is 1 than when it is 0, but the relationship is not strong. The p-value of 0.0003 is well below the usual significance level of 0.05, so the correlation, although weak, is statistically significant. In other words, a correlation of 0.19 would be unlikely to arise by chance if there were no real (albeit small) relationship between the two variables in the population from which the data originate.</p>
          <p>• Degree of dispersion: another way to check for a relationship is a graph showing the distribution of scores as a function of the risk field. In this case, the graph does not provide much insight.</p>
          <p>• ROC curve: it allows us to determine the optimal threshold. Using roc_curve from sklearn, we found an optimal threshold of 0.299. A score_enfermo value greater than 0.299 therefore suggests that the user is at risk.</p>
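          <p>The equivalence stated in the first item can be checked numerically. The sketch below computes the point-biserial coefficient from the group means and compares it with a plain Pearson correlation; the risk and score values are invented for illustration and are unrelated to the 0.19 reported above.</p>
          <preformat><![CDATA[
```python
from math import sqrt
from statistics import mean, pstdev


def point_biserial(binary, values):
    """r_pb = (M1 - M0) / s * sqrt(p * (1 - p)), with s the population std of values."""
    group1 = [v for b, v in zip(binary, values) if b == 1]
    group0 = [v for b, v in zip(binary, values) if b == 0]
    p = len(group1) / len(values)  # proportion of samples with label 1
    return (mean(group1) - mean(group0)) / pstdev(values) * sqrt(p * (1 - p))


def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Invented toy data: binary risk labels and a continuous score per user.
risk = [0, 0, 0, 1, 1, 1]
score = [0.10, 0.20, 0.30, 0.25, 0.40, 0.55]

r_pb = point_biserial(risk, score)
r = pearson(risk, score)
```
]]></preformat>
          <p>Both functions return the same value up to floating-point error. In practice, scipy.stats.pointbiserialr returns the coefficient together with its p-value.</p>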
          <p>After this, we used the score_enfermo column in the SVM. We performed the vectorization of
the messages using TF-IDF to obtain a matrix with textual features. Then, we concatenated the
score_enfermo column to this matrix using pd.concat(), thus creating a feature set where each row
includes both the textual information and the score_enfermo value.</p>
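          <p>One way to build such a feature set is sketched below. Note the swap: the text reports pd.concat(), whereas this sketch uses scipy.sparse.hstack so that the TF-IDF matrix stays sparse. The texts and scores are invented for illustration.</p>
          <preformat><![CDATA[
```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented example: one concatenated text and one score_enfermo value per user.
texts = [
    "otra ronda mas y lo dejo",
    "hoy vimos una pelicula",
    "perdi todo en la ruleta",
]
score_enfermo = np.array([0.40, 0.05, 0.31])

# Textual features, shape (n_users, n_terms).
tfidf = TfidfVectorizer().fit_transform(texts)

# Append the score as one extra column, giving shape (n_users, n_terms + 1).
score_column = csr_matrix(score_enfermo.reshape(-1, 1))
features = hstack([tfidf, score_column]).tocsr()
```
]]></preformat>
          <p>The resulting matrix can be passed directly to an SVM, with the last column carrying the zero-shot score.</p>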
          <p>As we can see, the results have improved in almost all preprocessing methods.</p>
        </sec>
        <sec id="sec-3-4-3">
          <title>3.4.3. Transformers Models</title>
          <p>Finally, we chose to use Transformer models in this task due to their ability to capture the complex
context and linguistic nuances present in the messages, especially in Spanish texts from social media.</p>
          <p>Given the size, language, and colloquial expressions typical of these texts, we decided to experiment with the following Transformer models:</p>
          <p>• bert-base-uncased [29]: This is the original version of the BERT model (Bidirectional Encoder Representations from Transformers) [30], trained on BookCorpus and English Wikipedia. It has 12 transformer layers, 12 attention heads, and a hidden size of 768 per token, totaling 110 million parameters. It is an uncased model, meaning it does not distinguish between uppercase and lowercase letters. Although it is not adapted to Spanish, it was included as a baseline due to its widespread use and proven effectiveness in many English text classification tasks. Its inclusion provides a general reference point for comparison with more specialized models.</p>
          <p>• myahan007/bert-base-spanish-wwm-cased-finetuned-tweets [31]: This model is based on dccuchile/bert-base-spanish-wwm-cased, a Spanish adaptation of BERT [30] trained with Whole Word Masking (WWM) on Spanish Wikipedia and other corpora. The version used here was further fine-tuned on a corpus of tweets, making it especially suitable for this task, since the dataset consists of informal, abbreviated, and colloquial messages from social platforms. Its use is justified by its better adaptation to real-world, non-standard language.</p>
          <p>• bertin-project/bertin-roberta-base-spanish [32, 33]: This model was trained from scratch using the RoBERTa architecture (Robustly Optimized BERT Pretraining Approach) [34] on the mc4-es corpus, which contains large volumes of Spanish text from Common Crawl. Unlike BERT, RoBERTa removes tasks like Next Sentence Prediction and employs longer, dynamic training. This monolingual Spanish variant allows us to assess the benefits of specific pretraining in Spanish without multilingual interference and under a more robust architecture. Its inclusion helps evaluate whether training from scratch in Spanish provides an advantage over multilingual or adapted models.</p>
          <p>• PlanTL-GOB-ES/RoBERTa-base-bne [35]: Developed by Spain’s National Language Technologies Plan (PlanTL), this model also uses the RoBERTa architecture [34], but was trained on CORPES XXI and other high-quality linguistic resources from the National Library of Spain. It aims to faithfully represent contemporary normative Spanish, so it was included to compare its performance against models adapted to more informal registers. Its linguistic quality and carefully curated training data make it a solid reference for tasks in standard Spanish.</p>
          <p>• distilbert-base-multilingual-cased [36]: DistilBERT [37] is a compressed version of BERT [30] created via knowledge distillation, reducing the number of layers to 6 while maintaining 12 attention heads and a hidden size of 768, resulting in a 40% smaller model. The multilingual version used was trained on over 100 languages. Although not specialized in Spanish or social texts, its smaller size allows for greater computational efficiency, making it a practical alternative in resource-constrained environments. Its inclusion allows us to assess the trade-off between performance and efficiency in multilingual tasks.</p>
          <p>Together, the selected models span different approaches: general vs. specialized, monolingual vs. multilingual, and training on normative vs. informal texts.</p>
          <p>Experiments using transformer-based models relied on default hyperparameters, though fine-tuning
was applied as described in Table ??. This process involves adding an output layer (e.g., softmax for
binary or multi-class classification) and training the model on the competition-provided message dataset.
This technique enables strong results even on small datasets by leveraging the model’s pretraining
knowledge.</p>
          <p>Due to the dataset’s limited size, the same fine-tuning setup was used across all models to avoid
overfitting and ensure fair comparison. Additionally, a TrainerCallback was used to manage early
stopping.</p>
        </sec>
        <sec id="sec-3-4-4">
          <title>3.4.4. Transformer Models</title>
          <p>In this case, the results were not as good as expected. Models better adapted to informal language, such as the tweet-based Spanish model, clearly outperformed others pretrained on normative corpora or in different languages. This supports the idea that domain and text-type adaptation is crucial [38].</p>
          <p>Additionally, generalist models or those trained on cleaner corpora tend to underperform when
dealing with noisy texts, emojis, or abbreviations. This suggests that specialized pretraining can yield
significant gains without complex adjustments. These models are likely better at understanding tone,
intent, and idiosyncrasies in social media messages—important indicators of risk in this task.</p>
          <p>Multilingual or compact models like distilbert-base-multilingual-cased showed decent performance,
which is promising for low-resource settings. However, they still lag behind the tweet-specialized
Spanish model.</p>
          <p>Finally, note that transformer models are limited to sequences of up to 512 tokens [39]. Since user messages were concatenated, relevant information might have been lost in longer texts. In real-world scenarios or early prediction stages with shorter messages, these models may perform better by processing the full text without truncation.</p>
        </sec>
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Approaches for Task 2: Type of Addiction Detection</title>
        <p>To properly train and compare our model, we used the same dataset, dividing the trial+train dataset
into training (80%) and testing (20%). Additionally, we used the stratify parameter in the train_test_split
function to ensure that the class proportion in both sets is similar to the original distribution. This
helps maintain a proper class balance, which is especially important when working with imbalanced
data like this.</p>
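        <p>A stratified 80/20 split of this kind can be sketched as follows. The class counts, the label spellings, the placeholder texts, and random_state are hypothetical values chosen for illustration; the real distribution is the one reported in the paper’s tables.</p>
        <preformat><![CDATA[
```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical, deliberately imbalanced labels (100 users in total).
labels = (
    ["betting"] * 50
    + ["onlinegaming"] * 30
    + ["trading"] * 15
    + ["lootboxes"] * 5
)
texts = [f"mensaje {i}" for i in range(len(labels))]

# stratify=labels keeps the class proportions similar in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

train_counts = Counter(y_train)
test_counts = Counter(y_test)
```
]]></preformat>
        <p>Without stratify, a class as small as lootboxes could end up entirely in one of the two subsets, which would make the evaluation of that class meaningless.</p>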
        <p>In the following table, we can see how these classes are balanced in the training set.</p>
        <sec id="sec-3-5-1">
          <title>3.5.1. Support Vector Machine</title>
          <p>In this case, an SVM model was used again, and to obtain the best model, the same combination of
preprocessing techniques and hyperparameters as in task 1 was applied.</p>
          <p>If we analyze the confusion matrix of the best model, stopwords_only with the selected
hyperparameters, we can see that the problem lies in the class imbalance, especially with the lootboxes
class.</p>
          <p>[Table: results by Preprocess configuration (lemmatization, stemming, tweet_stemming, emoji, stopwords_only); values missing in the source]</p>
          <p>Therefore, we applied Data Augmentation to the lootboxes class. We tested different Hugging Face models for prompting to generate synthetic training data. The model that best suited the task was NousResearch/Nous-Hermes-2-Mistral-7B-DPO, because it realistically mimics user messages on platforms like Twitch or Telegram, including slang and emoticons. The results improved significantly:</p>
          <p>[Table: results by Preprocess configuration after Data Augmentation (lemmatization, stemming, tweet_stemming, emoji, stopwords_only); values missing in the source]</p>
          <p>As the confusion matrix shows, the new best model achieves better results for all classes.</p>
        </sec>
        <sec id="sec-3-5-2">
          <title>3.5.2. Transformers Models</title>
          <p>The experiments with Transformers used the default hyperparameters, although we applied fine-tuning as detailed below. We also added a TrainerCallback to manage early stopping. Training ran on Kaggle’s T4 x2 GPU accelerator.</p>
          <p>For the transformer model, we followed the same process as in Task 1, and the best model for this context was again myahan007/bert-base-spanish-wwm-cased-finetuned-tweets. We followed a procedure similar to the one used for the SVM, first evaluating the model without data augmentation and then adding DA for the lootboxes class, as in the previous case.</p>
          <p>Subsequently, we proposed the idea of using the same transformer model but with synthetic data
for several classes, so that the training set would be perfectly balanced. This idea emerged to achieve
better generalization and to provide more specific information for each class through DA prompting, as
previously explained. In this case, we generated 39 instances for betting, 23 for online gaming, and 89
for lootboxes.</p>
          <p>Finally, we compared the results obtained from the three approaches.</p>
          <p>Since the competition allows three runs per task, we chose the best model, which is the one with DA for lootboxes, along with the last one, which balances the classes with synthetic data in order to train on more messages and generalize better.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Runs</title>
      <p>Finally, we selected the best models, which are presented in the following table:</p>
      <p>For the model selection, we opted for those that demonstrated both strong technical foundations and
high performance in the evaluation phase.</p>
      <p>For Task 1, we used a Support Vector Machine (SVM) model with the stopwords_only preprocessing
strategy, which is particularly well-suited for text classification tasks. In Run 1, we enriched the input
features by including a zero-shot prediction score, which can provide an additional advantage by
anticipating certain labels in ambiguous cases. Finally, we wanted to include a transformer-based
model, known for its powerful capabilities in capturing complex language patterns. This model is also
pretrained, which helps improve performance even with limited task-specific data.</p>
      <p>For Task 2, we explored the use of SVM for a multi-class classification scenario. We used the
tweet_stemming preprocessing strategy and applied data augmentation to the lootboxes class, as it
was the most imbalanced. In addition to the classical approach, we reused the best transformer model
from Task 1 to assess its generalization capability in a multi-class setting. In Run 1, we applied data
augmentation specifically to lootboxes, which yielded excellent results. In Run 2, we balanced all classes
through augmentation to enhance the model’s ability to generalize across categories.</p>
      <sec id="sec-4-1">
        <title>4.1. Run Configuration</title>
        <p>The transformer models were stored in a personal Hugging Face account, while the SVM models
were saved directly in a Kaggle notebook. Because they were already trained, they could be retrieved
easily and used directly for inference.</p>
        <p>For the prediction process, each user’s messages were collected as they arrived, and each new message
was concatenated with the previous ones. This improved efficiency by avoiding the need to reconstruct
the full message history repeatedly. Additionally, whenever the same preprocessing method was needed
more than once, its output was reused rather than recomputed, reducing execution time.</p>
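        <p>A minimal sketch of this incremental scheme is shown below. The class name <monospace>UserHistory</monospace>, its methods, and the single lowercase strategy are hypothetical and stand in for the real preprocessing strategies. The sketch also assumes that preprocessing acts on each message independently, so that appending preprocessed messages gives the same text as preprocessing the whole history.</p>
        <preformat>
```python
class UserHistory:
    """Keeps one running, preprocessed text per user and preprocessing strategy."""

    def __init__(self, preprocessors):
        # preprocessors: mapping from strategy name to a callable str -> str
        self.preprocessors = preprocessors
        self.histories = {}  # (user_id, strategy) -> concatenated text
        self.calls = 0       # number of preprocessing calls, for illustration

    def add_message(self, user_id, message):
        """Preprocess the new message once per strategy and append it."""
        for name, func in self.preprocessors.items():
            self.calls += 1
            cleaned = func(message)
            previous = self.histories.get((user_id, name), "")
            self.histories[(user_id, name)] = (previous + " " + cleaned).strip()

    def text(self, user_id, strategy):
        """Return the accumulated text that any model sharing this strategy can read."""
        return self.histories.get((user_id, strategy), "")


# Two rounds of messages for one user; every run that relies on the
# "lowercase" strategy reads the same cached text.
history = UserHistory({"lowercase": str.lower})
history.add_message("u1", "Hola")
history.add_message("u1", "Apuesto HOY")
print(history.text("u1", "lowercase"))
```
        </preformat>
        <p>Only the newly arrived message is preprocessed in each round, so the cost per round does not grow with the length of the history.</p>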
        <p>Following the competition’s guidelines, carbon emissions were also measured using the CodeCarbon
tool. Metrics such as total RAM usage, CPU usage percentage, floating point operations per second
(FLOPS), total processing time, and CO₂ emissions in kilograms were recorded. These efficiency metrics
help assess the environmental impact and resource demand of the system, providing insight into the
solution’s suitability for low-resource environments like mobile devices or personal computers.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>In this section, we present the results obtained in the competition.</p>
      <p>5.1. Task 1</p>
      <p>The results obtained are slightly below our expectations, but compared with the best results in the
competition, our performance was still quite satisfactory.</p>
      <p>Despite not reaching the top positions, Run 0 achieved the lowest ERDE30 (0.242) and latencyTP (2)
in the competition, together with a speed of 0.990, indicating that this model achieves more
effective early detection than the other runs. Similarly, Run 1 obtained quite competitive
values, with an ERDE30 of 0.248, a latencyTP of 3, and a speed of 0.981. Although the
ERDE30 of Run 2 is higher, at 0.340, it is still acceptable within the competition.</p>
      <p>SVM models stand out in the early detection setting of Task 1 thanks to their efficient processing, low
latency, and high speed (close to 0.990), which together yield a low ERDE30. This is due to their use of TF-IDF
vectors, which allow fast predictions even with few messages. Although Transformers are better at
capturing context, their 512-token input limit and longer inference time make them less effective
for early detection. Additionally, SVM models tend to favor the “High Risk” class over “Low Risk”;
although this produces some false positives, it is useful for early alerts, since it
prioritizes the detection of high-risk users. This combination of speed, low latency, and focus on high-risk
alerts makes SVM models particularly effective for this task.</p>
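      <p>For reference, the early-detection measures discussed above can be written down compactly. The sketch below follows the usual definition of ERDE from the early risk detection literature, as used in the eRisk evaluations [4], where a true positive is penalized by a sigmoid latency cost and latencyTP is the median delay of the true positives. The function names, the cost constants, and the toy decisions are assumptions for illustration and are not necessarily the settings used by the organizers.</p>
      <preformat>
```python
import math
from statistics import median


def latency_cost(k, o):
    """Sigmoid latency cost lc_o(k) = 1 - 1 / (1 + exp(k - o))."""
    # Clamp the exponent so very late decisions do not overflow math.exp.
    return 1.0 - 1.0 / (1.0 + math.exp(min(k - o, 700)))


def erde(decisions, o, c_fp, c_fn=1.0, c_tp=1.0):
    """Mean ERDE_o over users; each decision is (truth, prediction, delay)."""
    total = 0.0
    for truth, prediction, delay in decisions:
        if prediction and not truth:
            total += c_fp                            # false positive
        elif truth and not prediction:
            total += c_fn                            # false negative
        elif truth and prediction:
            total += latency_cost(delay, o) * c_tp   # late true positives cost more
        # true negatives add no cost
    return total / len(decisions)


def latency_tp(decisions):
    """Median delay (messages read before deciding) over true positives."""
    return median(d for truth, prediction, d in decisions if truth and prediction)


# Toy decisions: (is_high_risk, predicted_high_risk, messages_seen).
decisions = [(True, True, 2), (False, False, 5), (False, True, 1), (True, False, 10)]
print(erde(decisions, o=30, c_fp=0.5), latency_tp(decisions))
```
      </preformat>
      <p>With a threshold of 30 messages, a correct alert raised after two messages costs almost nothing, which is why models that decide early obtain a low ERDE30.</p>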
      <p>As we can see, the results for Task 2 are very good, with a podium position in two of
the three runs. These first two models perform well because the DA technique was applied correctly,
producing good synthetic data for the most underrepresented class, lootboxes.</p>
      <p>However, in Run 2 the DA technique slightly worsens performance. A likely reason is that
generating a large amount of synthetic data makes the training set drift away from the language
actually found in Telegram or Twitch messages, since the quality of the generated data is not
perfect.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>
        Participating in the MentalRiskES 2025 competition [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] has been an enriching experience that allowed us
to apply and consolidate knowledge in natural language processing, machine learning, and
transformer-based models. Throughout the project, we achieved solid results by combining classical approaches
such as SVM with more advanced techniques, including transformer models and data augmentation
strategies.
      </p>
      <p>One of the most relevant aspects of this project was the use of data augmentation (DA) techniques,
particularly the generation of synthetic data through prompts using large language models (LLMs).
These methods proved especially effective in scenarios with limited training data. The strong
performance observed in our models for Task 2 after applying DA techniques confirms their effectiveness in
low-resource or imbalanced contexts [40]. This aligns with findings from previous editions of similar
competitions, where top-performing systems also leveraged such strategies.</p>
      <p>Another key factor in our results was the use of transformer-based models. Despite their limitations
in capturing full context in long inputs, these models proved highly effective. Their pretraining on
large corpora, many of which share characteristics with the dataset used in this competition
(informal, abbreviated, conversational language), allowed for valuable knowledge transfer. In
fact, our best-performing models in terms of F1-score, both for the binary and multi-class tasks, were
based on this architecture.</p>
      <p>Among the main limitations of our work, we must highlight the ineffectiveness of data augmentation
for Task 1. Consequently, none of the three submissions for this task employed DA. This technique
failed to work adequately, likely because of the challenge of capturing the differences between
high- and low-risk users, which can be quite nuanced and difficult to simulate synthetically.</p>
      <p>Another significant limitation, as previously mentioned, lies in the restricted capacity of our models
to process long texts. In this competition, message lengths varied, and longer texts posed challenges for
our models, making it harder to capture complete context and key information.</p>
      <p>Lastly, we did not focus on the computational cost or emissions generated by model training and
inference during the competition. These aspects should be considered in future work to improve the
sustainability and efficiency of proposed solutions.</p>
      <sec id="sec-6-1">
        <title>Future Work</title>
        <p>To address current limitations, explore new techniques, and enhance the models’ ability to handle more
complex contexts, we have identified several future research directions:</p>
        <list list-type="bullet">
          <list-item>
            <p>Apply Data Augmentation to Task 1: Explore more targeted text generation techniques that
better capture the subtle differences between high- and low-risk users, in order to generate more
realistic and useful synthetic data.</p>
          </list-item>
          <list-item>
            <p>Evaluate specialized models for long sequences: Include architectures such as Longformer
[41], which may improve performance on long texts by capturing the user’s full message context
more effectively.</p>
          </list-item>
          <list-item>
            <p>Expand the study of generative models: Investigate the use of other LLMs to produce
higher-quality synthetic data, and explore generation control techniques through more precise prompts.</p>
          </list-item>
          <list-item>
            <p>Incorporate explainability techniques: Integrate interpretability methods such as LIME
[42] or SHAP [43] to analyze model behavior and identify the most relevant textual features in
classification. This will help avoid black-box decisions and improve the transparency of results.</p>
          </list-item>
          <list-item>
            <p>Optimize computational resources: Explore solutions that strike a balance between
performance and efficiency, reducing computational costs without compromising prediction quality.</p>
          </list-item>
          <list-item>
            <p>Explore multimodal approaches: Include other types of data (e.g., metadata, temporal
information, or images if available in future editions) to improve predictions. For example, Bucur et al.
(2023) propose a multimodal transformer model enriched with temporal information (time2vec),
combining text and images to detect depression on social media, achieving better results than
text-only models [44].</p>
          </list-item>
        </list>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Acknowledgments</title>
      <p>This work was supported by grant PID2023-148577OB-C21 (Human-Centered AI: User-Driven Adapted Language
Models-HUMAN_AI), funded by MICIU/AEI/10.13039/501100011033 and by FEDER/UE.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used Gemini and Grammarly to check
grammar and spelling and to reword text. After using these services, the authors reviewed and edited the content
as needed and take full responsibility for the publication’s content.</p>
      <p>Iberian Languages Evaluation Forum (IberLEF 2025), co-located with the 41st Conference of the
Spanish Society for Natural Language Processing (SEPLN 2025), CEUR-WS.org, 2025.</p>
      <p>[6] P. Álvarez Ojeda, M. V. Cantero-Romero, A. Semikozova, A. Montejo-Ráez, The PRECOM-SM
corpus: Gambling in Spanish social media, in: Proceedings of the 31st International Conference
on Computational Linguistics, 2025, pp. 17–28.</p>
      <p>[7] C. Cortes, V. Vapnik, Support-vector networks, Machine Learning 20 (1995) 273–297.</p>
      <p>[8] S. Kowsari, K. Jafari Meimandi, M. Heidarysafa, S. Mendu, L. E. Barnes, D. E. Brown, Twenty
years of machine-learning-based text classification: A systematic review, Algorithms 16 (2023)
236. doi:10.3390/a16050236.</p>
      <p>[9] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott,
L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, in:
D. Jurafsky, J. Chai, N. Schluter, J. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the
Association for Computational Linguistics, Association for Computational Linguistics, Online,
2020, pp. 8440–8451. URL: https://aclanthology.org/2020.acl-main.747/. doi:10.18653/v1/2020.acl-main.747.</p>
      <p>[10] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers
for language understanding, arXiv preprint arXiv:1810.04805 (2019).</p>
      <p>[11] J. Cañete, G. Chaperon, R. Fuentes, J. Pérez, B. Poblete, Spanish pre-trained BERT models and
evaluation data, https://github.com/dccuchile/beto, 2020. Accessed May 18, 2025.</p>
      <p>[12] M. Bayer, M.-A. Kaufhold, C. Reuter, A survey on data augmentation for text classification, ACM
Computing Surveys 55 (2022) 1–39.</p>
      <p>[13] S. Lee, L. Liu, W. Choi, Iterative translation-based data augmentation method for text classification
tasks, IEEE Access 9 (2021) 160437–160445.</p>
      <p>[14] Helsinki-NLP, Opus-MT Spanish-English translation model, 2020. URL:
https://huggingface.co/Helsinki-NLP/opus-mt-es-en, accessed June 8, 2025.</p>
      <p>[15] Helsinki-NLP, Opus-MT English-Spanish translation model, 2020. URL:
https://huggingface.co/Helsinki-NLP/opus-mt-en-es, accessed June 8, 2025.</p>
      <p>[16] R. Sennrich, B. Haddow, A. Birch, Improving neural machine translation models with monolingual
data, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics
(Volume 1: Long Papers), ACL, 2016, pp. 86–96.</p>
      <p>[17] Filipe Almeida, Mistral-7B-Instruct-v0.1-sharded, 2024. URL:
https://huggingface.co/filipealmeida/Mistral-7B-Instruct-v0.1-sharded, accessed June 4, 2025.</p>
      <p>[18] NousResearch, Nous-Hermes-2-Mistral-7B-DPO, 2024. URL:
https://huggingface.co/NousResearch/Nous-Hermes-2-Mistral-7B-DPO, accessed June 4, 2025.</p>
      <p>[19] J. Zhang, Y. Zhao, M. Saleh, P. J. Liu, PEGASUS: Pre-training with extracted gap-sentences for
abstractive summarization, in: International Conference on Machine Learning, PMLR, 2020, pp.
11328–11339.</p>
      <p>[20] Y. Wang, et al., Large language models are zero-shot data generators, in: International Conference
on Learning Representations, 2023.</p>
      <p>[21] C. Shorten, T. M. Khoshgoftaar, A survey on image data augmentation for deep learning, Journal
of Big Data 6 (2019) 60.</p>
      <p>[22] M. Honnibal, I. Montani, S. Van Landeghem, A. Boyd, spaCy: Industrial-strength Natural Language
Processing in Python (2020). URL: https://spacy.io. doi:10.5281/zenodo.1212303.</p>
      <p>[23] E. Loper, S. Bird, NLTK: The natural language toolkit, arXiv preprint cs/0205028 (2002).</p>
      <p>[24] G. Salton, C. Buckley, Term-weighting approaches in automatic text retrieval, Information
Processing &amp; Management 24 (1988) 513–523.</p>
      <p>[25] M. F. Porter, Snowball: A language for stemming algorithms, 2001.</p>
      <p>[26] K. Gimpel, N. Schneider, B. O’Connor, D. Das, D. P. Mills, J. Eisenstein, M. Heilman, D. Yogatama,
J. Flanigan, N. A. Smith, Part-of-speech tagging for Twitter: Annotation, features, and experiments,
in: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics:
Human Language Technologies, 2011, pp. 42–47.</p>
      <p>[27] C. Ceriello, contributors, emoji: Emoji for Python, https://github.com/carpedm20/emoji, 2024.
Version used: 2.11.0.</p>
      <p>[28] V. Gallego, XLM-RoBERTa-large-XNLI-ANLI, 2023. URL:
https://huggingface.co/vicgalle/xlm-roberta-large-xnli-anli, accessed June 6, 2025.</p>
      <p>[29] Google Research, google-bert/bert-base-uncased, 2019. URL:
https://huggingface.co/google-bert/bert-base-uncased, BERT base uncased model for English.</p>
      <p>[30] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers
for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter
of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long
and Short Papers), 2019, pp. 4171–4186.</p>
      <p>[31] myahan007, myahan007/bert-base-spanish-wwm-cased-finetuned-tweets, 2021. URL:
https://huggingface.co/myahan007/bert-base-spanish-wwm-cased-finetuned-tweets, Spanish BERT-base,
Whole Word Masking, fine-tuned on tweets.</p>
      <p>[32] BERTIN Project, bertin-project/bertin-roberta-base-spanish, 2021. URL:
https://huggingface.co/bertin-project/bertin-roberta-base-spanish, RoBERTa-base adapted for Spanish.</p>
      <p>[33] J. De la Rosa, E. G. Ponferrada, P. Villegas, P. G. d. P. Salas, M. Romero, M. Grandury, BERTIN:
Efficient pre-training of a Spanish language model using perplexity sampling, arXiv preprint
arXiv:2207.06814 (2022).</p>
      <p>[34] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov,
RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint arXiv:1907.11692 (2019).</p>
      <p>[35] PlanTL-GOB-ES, PlanTL-GOB-ES/roberta-base-bne, 2021. URL:
https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne, RoBERTa-base trained on the BNE corpus.</p>
      <p>[36] Hugging Face, distilbert/distilbert-base-multilingual-cased, 2019. URL:
https://huggingface.co/distilbert/distilbert-base-multilingual-cased, multilingual DistilBERT, cased version.</p>
      <p>[37] V. Sanh, L. Debut, J. Chaumond, T. Wolf, DistilBERT, a distilled version of BERT: smaller, faster,
cheaper and lighter, arXiv preprint arXiv:1910.01108 (2019).</p>
      <p>[38] S. Gururangan, A. Marasović, S. Swayamdipta, K. Lo, I. Beltagy, D. Downey, N. A. Smith, Don’t
stop pretraining: Adapt language models to domains and tasks, arXiv preprint arXiv:2004.10964
(2020).</p>
      <p>[39] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin,
Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).</p>
      <p>[40] Y. Wang, C. Xu, Q. Sun, H. Hu, C. Tao, X. Geng, D. Jiang, PromDA: Prompt-based data augmentation
for low-resource NLU tasks, arXiv preprint arXiv:2202.12499 (2022).</p>
      <p>[41] I. Beltagy, M. E. Peters, A. Cohan, Longformer: The long-document transformer, arXiv preprint
arXiv:2004.05150 (2020).</p>
      <p>[42] M. T. Ribeiro, S. Singh, C. Guestrin, “Why should I trust you?”: Explaining the predictions of any
classifier, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, 2016, pp. 1135–1144.</p>
      <p>[43] W. Zhao, T. Joshi, V. N. Nair, A. Sudjianto, SHAP values for explaining CNN-based text classification
models, arXiv preprint arXiv:2008.11825 (2020).</p>
      <p>[44] A.-M. Bucur, A. Cosma, P. Rosso, L. P. Dinu, It’s just a matter of time: Detecting depression
with time-enriched multimodal transformers, in: European Conference on Information Retrieval,
Springer, 2023, pp. 200–215.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Mármol-Romero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Álvarez Ojeda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Moreno-Muñoz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Plaza-del Arco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Molina-González</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Martín-Valdivia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. A.</given-names>
            <surname>Ureña-López</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Montejo-Ráez</surname>
          </string-name>
          ,
          <article-title>Overview of MentalRiskES at IberLEF 2025: Early detection of mental disorders risk in Spanish</article-title>
          ,
          <source>Procesamiento del Lenguaje Natural</source>
          <volume>75</volume>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <collab>World Health Organization</collab>
          , Trastornos mentales, https://www.who.int/es/news-room/fact-sheets/detail/mental-disorders,
          <year>2024</year>
          . Accessed May 18, 2025.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <collab>Plan Nacional sobre Drogas</collab>
          ,
          <article-title>1 de cada 12 jóvenes de 18 a 25 años que participa en apuestas online desarrolla problemas con el juego</article-title>
          , https://www.dsca.gob.es/es/comunicacion/notas-prensa/12-jovenes-18-25-anos-participa-apuestas-online-desarrolla-problemas-juego,
          <year>2024</year>
          . Press release. Accessed May 18, 2025.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Martín-Rodilla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          ,
          <article-title>Overview of eRisk at CLEF 2021: Early risk prediction on the Internet (extended overview)</article-title>
          ,
          <source>CLEF (Working Notes)</source>
          <volume>1</volume>
          (
          <year>2021</year>
          )
          <fpage>864</fpage>
          –
          <lpage>887</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J. Á.</given-names>
            <surname>González-Barba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chiruzzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Jiménez-Zafra</surname>
          </string-name>
          ,
          <article-title>Overview of IberLEF 2025: Natural Language Processing Challenges for Spanish and other Iberian Languages</article-title>
          , in: Proceedings of the
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>