<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>wangkongqiang at MentalRiskES@IberLEF 2025: Early Detection of Mental Disorders Risk in Spanish</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kongqiang Wang</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Yunnan University, School of Information Science and Engineering</institution>
          ,
          <addr-line>Kunming, Yunnan, 650500</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>According to a recent report by the World Health Organization, 1 in every 8 people in the world suffers from a mental disorder. The organizers of MentalRiskES at IberLEF 2025 consider early identification a key effective intervention against these problems. The task we participated in was Task 1: Risk Detection of Gambling Disorders, a binary classification task aimed at determining whether a user is at high risk ( label = 1 ) or low risk ( label = 0 ) of developing a gambling-related disorder based on their messages. The objective is to enable early detection and facilitate timely interventions. We compare the performance of two different modeling approaches: fine-tuning a roberta-base model and using sentence embeddings as inputs to a linear regressor, with the latter yielding better results. Our final experimental results are Accuracy 0.519, Macro_R 0.500, ERDE_5 0.332, and ERDE_30 0.250.</p>
      </abstract>
      <kwd-group>
        <kwd>Mental Health</kwd>
        <kwd>Natural Language Processing</kwd>
        <kwd>Transformers</kwd>
        <kwd>Sentence Embedding</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The first thing we did was to group all the messages by the user they belonged to and concatenate
them into a single string, obtaining a total of 357 concatenated texts ( one per user ). This was done to
obtain a single representation of each user’s conversation history ( from which the labels were
assigned ) that could be used as input for the models.
• Augment the training data for the training dataset settings.</p>
      <p>To increase the amount of data available for training and, at the same time, attempt to model early
detection ( obtaining predictions early in the lifetime of the message history ), we augmented
the training set by adding observations that contained only half of their messages ( the first half
and the second half ) or one third of their messages ( the first, middle, and last thirds of the
data ). This was done by first sorting the messages of each user in the training set by their date
and then taking only these partial histories; the resulting dataset was appended to the original
training set to obtain a new one with six times the number of observations to be used for training.
• Pre-trained model for model training.</p>
      <p>
        We used two models: PlanTL-GOB-ES/roberta-base-bne [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
and somosnlp-hackathon-2023/roberta-base-bne-finetuned-suicide-es [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The model we
fine-tuned was a version of RoBERTa pre-trained for detecting suicidal behavior in Spanish
texts. We chose this model because it had previously been trained for a task that
shares similar characteristics with ours.
• Embedding model: an optional package, required if regression estimators are used.
      </p>
      <p>The sentence embeddings were obtained after concatenating the messages of each user into a
single string. The difference in performance between roberta-base-bne-suicide-es encodings and
the other embeddings can be explained by the fact that we take advantage of the information
gained from this model’s prior fine-tuning for suicide detection, a task which likely shares semantic
similarities with our data.
• Regression estimators: the optional sklearn package, for using sentence embeddings as inputs
to a linear regressor.</p>
      <p>
        The estimators mentioned in the paper are implementations of common regressors from Python’s
Scikit-Learn library. These include: Ordinary (" lr ") and Ridge (" ridge ") [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] Least Squares
Regression, AdaBoost regression (" ada "), Light Gradient Boosting Machine (" lgbm "),
Support Vector Regression (" svr "), Random Forests (" rf ") [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], and a Multi-Layer Perceptron (" mlp "). References for
the implementations of these models can be found in the Scikit-Learn documentation.
      </p>
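      <p>As an illustration, a minimal sketch of this regressor collection ( the hyperparameters are our assumptions, not the paper’s settings; " lgbm " comes from the separate lightgbm package rather than Scikit-Learn ):</p>
      <preformat>
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from lightgbm import LGBMRegressor  # "lgbm" requires the lightgbm package

# Hyperparameters below are illustrative defaults, not the paper's settings.
ESTIMATORS = {
    "lr": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "ada": AdaBoostRegressor(n_estimators=100),
    "lgbm": LGBMRegressor(n_estimators=100),
    "svr": SVR(kernel="rbf"),
    "rf": RandomForestRegressor(n_estimators=300),
    "mlp": MLPRegressor(hidden_layer_sizes=(128,), max_iter=500),
}
      </preformat>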
      <p>The rest of the paper is organized as follows: In the next section, we analyze the dataset used for the
task ( Section 2 ). Then, we describe in detail our methodology for training and evaluating the models (
Section 3 ). Finally, we discuss the results obtained ( Section 4 ) and present our conclusions and future
lines of work ( Section 5 ).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Dataset Analysis</title>
      <p>The dataset given for the task consisted of a total of thousands of individual messages from 357 Telegram
/ Twitch / Reddit users ( see Table 1 ), each with a variable number of messages. The annotation process
consisted of labeling each user based on the evidence in their conversation history of suffering from
a gambling disorder. Thus, a total of 2 labels were used for the task. Annotators were asked to assign one of
the following two labels to each user: at high risk ( label = 1 ) or low risk ( label = 0 ).</p>
      <p>To increase the amount of data available for training and, at the same time, attempt to model early
detection ( obtaining predictions early in the lifetime of the message history ), we augmented the
training set by adding observations that contained only part of their messages. This was done by first
sorting the messages of each user in the training set by their date and then taking only these partial
histories; the resulting dataset was appended to the original training set to obtain a new one with six
times the number of observations to be used for training. Taking the sample with the subject id user1002
as an instance, Table 2 shows how five additional partial observations are created from the original data.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>We proceeded to evaluate different techniques to solve this subtask. Two main predictive-modeling
approaches were explored: the first involved fine-tuning a pre-trained language model on this
subtask, and the second involved training a standard ML regressor using sentence embeddings encoded
from the user’s messages as features. The following section describes the steps taken for each approach,
first describing how the data was pre-processed and later explaining the training and evaluation process
for this subtask.</p>
      <sec id="sec-3-1">
        <title>3.1. Data Processing and Augmentation</title>
        <p>Based on the detailed description and practical examples provided earlier in the article, a brief account
is given here, covering the following steps ( a code sketch follows this list ):
• concat_messages : the messages of the Telegram users are spliced together in
chronological order, that is, in round sequence, keyed on their user
ids.
• augment_data : the dataset is expanded with the halves and thirds of each user’s history, eventually
growing to six times the original size ( including the original data; only the official dataset was
provided, and no additional extended datasets were used ).</p>
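        <p>A minimal sketch of these two steps, assuming a pandas DataFrame with subject_id, date, and message columns ( the column names are illustrative ):</p>
        <preformat>
import pandas as pd

def concat_messages(df: pd.DataFrame) -> pd.Series:
    # One chronologically ordered document per user.
    return (df.sort_values("date")
              .groupby("subject_id")["message"]
              .apply(" ".join))

def augment_data(msgs: list) -> list:
    # Original history plus its two halves and three thirds: six views per user.
    h, t = len(msgs) // 2, len(msgs) // 3
    return [msgs,
            msgs[:h], msgs[h:],
            msgs[:t], msgs[t:2 * t], msgs[2 * t:]]
        </preformat>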
        <p>To prepare for training, the original data was split into training and validation sets, leaving a random
54 users ( 15% ) in the latter as a stratified validation split, where each set receives the same proportion
of samples of each class. The stratification was done using the labels of Subtask 1 to ensure equal
representation of the classes in both sets.</p>
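        <p>A sketch of this split, assuming per-user texts and Subtask 1 labels ( the variable names are illustrative ):</p>
        <preformat>
from sklearn.model_selection import train_test_split

# Hold out 15% of users (54 of 357), stratified on the Subtask 1 labels
# so both sets keep the same class proportions.
train_texts, val_texts, y_train, y_val = train_test_split(
    user_texts, labels, test_size=0.15, stratify=labels, random_state=42)
        </preformat>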
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Solving Subtask 1 by Solving for Regression</title>
        <p>
          From the discussion in Section 2, it should be clear that all labels of the subtask give the same
amount of information about the condition of the subject and the likelihood of predicting it from
the available data. This observation led us to consider models that solve this one subtask by
training only with the labels of Subtask 1. This allowed us to reduce the number of models that had to
be trained and focus on solving for a single data modality ( regression on [ 0, 1 ] ).
        </p>
        <p>
          We approached simple regression in a standard way, training models to minimize
the Mean Squared Error between the output values and the real ones. Additionally, we included a
post-processing step of clipping the output predictions of models of this type to the [ 0, 1 ] range to
ensure that they were valid probabilities. Single-output regression using standard machine learning
regressors, on the other hand, was not as trivial as the simple binary classification case. We used many
regressors from the sklearn Python package.
        </p>
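        <p>A minimal sketch of this post-processing step ( the 0.5 decision threshold is our assumption, not stated in the paper ):</p>
        <preformat>
import numpy as np

raw = np.array([-0.1, 0.42, 1.3])    # example regressor outputs
probs = np.clip(raw, 0.0, 1.0)       # clip to valid probabilities in [0, 1]
labels = (probs >= 0.5).astype(int)  # binary high-risk / low-risk decision
        </preformat>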
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Modeling Approaches</title>
        <p>
          3.3.1. Training a Regressor with Sentence Embeddings
A sentence embedding [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] is a semantically meaningful real-valued vector representation of a sentence,
obtained from the outputs of the hidden layers of a language model. The properties of this representation
are such that sentences that express similar meanings are mapped ( encoded ) close to each other in the
vector space.
        </p>
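        <p>A minimal sketch of obtaining such embeddings with the Transformers library; mean pooling over the token states is our assumption, as the paper does not state the pooling strategy used:</p>
        <preformat>
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "hackathon-somos-nlp-2023/roberta-base-bne-finetuned-suicide-es"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
encoder = AutoModel.from_pretrained(MODEL)

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=512, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state   # (batch, tokens, 768)
    mask = batch["attention_mask"].unsqueeze(-1)      # ignore padding tokens
    return (hidden * mask).sum(1) / mask.sum(1)       # masked mean pooling
        </preformat>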
        <p>In this way, the process of encoding text as numeric vectors can be used directly to extract features
for a classifier or regressor, which will try to learn from the semantic information of these encodings to
predict the label of their corresponding messages. Note, however, that this approach requires
a pre-trained model to perform the encoding. Furthermore, it assumes that the model will be
good enough at capturing the semantic information of the input texts for the classifier
/ regressor to learn from it.</p>
        <p>Assuming that this is the case, this approach has the advantage that it is much faster to train
these kinds of regressors on regular CPUs, with the most time-consuming part being obtaining the
embeddings of the training / evaluation messages, which only has to be done once. However, it is
necessary to evaluate different encoding models and different classifiers / regressors ( prediction models
) to find the best combination for the task at hand.</p>
        <p>
          As such, we conducted experiments using different language models to find the best encoding model.
In particular, we tested two different versions of RoBERTa trained on different corpora in Spanish. These
versions are described in Table 3. Additionally, we experimented with over 10 different regressors,
including Least Squares Linear regression, Random Forest, and Gradient Boosting [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], among others.
These models were chosen due to their ease of implementation and the fact that they are commonly
used in the literature [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
        </p>
        <p>The process of training and evaluating these models then proceeded as follows: First, the training
set was encoded using the language model, and the resulting embeddings were used as features for a
regressor. The regressor was then trained using the labels of Subtask 1 ( the most informative ones ),
and the resulting model was used to predict the labels of the validation set. The predictions were then
evaluated with the root mean squared error ( RMSE ). This process was repeated for each combination
of language model and regressor.</p>
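        <p>A sketch of this loop, reusing the embed function and the ESTIMATORS dictionary sketched earlier ( all names are illustrative ):</p>
        <preformat>
from sklearn.metrics import mean_squared_error

X_train = embed(train_texts).numpy()
X_val = embed(val_texts).numpy()
rmse = {}
for name, est in ESTIMATORS.items():
    est.fit(X_train, y_train)                  # train on Subtask 1 labels
    preds = est.predict(X_val).clip(0.0, 1.0)  # keep outputs in [0, 1]
    rmse[name] = mean_squared_error(y_val, preds) ** 0.5
best = min(rmse, key=rmse.get)                 # lowest validation RMSE wins
        </preformat>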
        <p>The Appendix contains the results of this experiment. Based on them, roberta-suicide-es was deemed to be
the best model for encoding the texts. Additionally, Table 8 shows a detailed report of the evaluation of
the best regression model with these embeddings.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Fine-tuning a Language Model for Regression</title>
        <p>
          Apart from the approach mentioned above, we also experimented with the pure Deep Learning ( DL )
[
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] approach of taking a language model and fine-tuning it with the labels of the corresponding subtask.
The model we fine-tuned was a version of RoBERTa pre-trained for detecting suicidal behavior in
Spanish texts. We chose this model because it had previously been trained for a task that
shares similar characteristics with ours. Intermediate fine-tuning has been shown in prior literature to improve the results
of downstream tasks. The HuggingFace Transformers and PyTorch libraries in Python
were used for loading the model weights and implementing the training loop. We changed the head
of the pre-trained model to a linear layer with output dimension 1 for simple regression. The
models were trained on an NVIDIA GeForce RTX 3090 24G GPU for a total of 30 epochs, where the
weights of the pre-trained model remained fully frozen for the first half and were then progressively
unfrozen [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] one step per epoch after that ( see Table 4 ).
        </p>
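        <p>A sketch of this setup; the gradual-unfreezing helper is our illustration, while the actual schedule and hyperparameters are those of Table 4:</p>
        <preformat>
from transformers import AutoModelForSequenceClassification

# num_labels=1 turns the classification head into a single-output linear
# regression head (HuggingFace then applies an MSE loss to the labels).
model = AutoModelForSequenceClassification.from_pretrained(
    "hackathon-somos-nlp-2023/roberta-base-bne-finetuned-suicide-es",
    num_labels=1)

def set_trainable(model, n_unfrozen):
    # Freeze the whole encoder, then unfreeze its top n_unfrozen layers.
    for p in model.roberta.parameters():
        p.requires_grad = False
    if n_unfrozen > 0:
        for layer in model.roberta.encoder.layer[-n_unfrozen:]:
            for p in layer.parameters():
                p.requires_grad = True
        </preformat>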
        <p>We used an Adam optimizer with a Mean Squared Error ( MSE ) loss for the simple regression models.
However, this did not improve the results empirically as compared with simply normalizing the outputs of
the predictions after inference. The formula of this loss is LOSS_custom = LOSS_cross-entropy.
Other hyperparameters are shown in Table 4.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>Using the approaches mentioned in the prior section, we came up with different models to solve the
subtask of Task 1 of MentalRiskES. The results in this section were obtained by selecting the
best-performing models after evaluating the different approaches and hyperparameters on the validation
set. The final predictions were obtained from a test set of messages from 136 subjects never observed
during the training process and evaluated against the task’s true labels.</p>
      <p>
        In the tables below ( Table 5, Table 6 ), we report the relevant metrics obtained for this subtask
and compare them against the ones obtained from baseline models provided by the organizers of the
competition. In particular, we report both absolute metrics, obtained after observing all the messages
of each subject, and early detection metrics, obtained after incrementally observing the messages
across several rounds. Additionally, Table 7 displays the inference-time CO2 emissions and energy
consumption of each model, based on computing their absolute predictions on the test set. These values
were estimated using the codecarbon [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] Python library.
      </p>
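      <p>A minimal sketch of how such estimates can be collected with codecarbon ( the wrapped inference call is illustrative ):</p>
      <preformat>
from codecarbon import EmissionsTracker

tracker = EmissionsTracker()         # samples CPU / GPU / RAM energy use
tracker.start()
predictions = model.predict(X_test)  # the measured inference step
emissions_kg = tracker.stop()        # estimated kg of CO2-equivalent
      </preformat>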
      <p>[ Tables 5-8: Tables 5 and 6 compare our runs ( run0, run1, run2, which use
roberta-base-bne-finetuned-suicide-es embeddings with a Ridge regressor ) against the organizers’
baselines ( baseline1: Robertuito; baseline2: RoBERTa Base ). Table 7 reports the inference-time CO2
emissions and energy consumption of each model ( duration_mean, emissions_mean, cpu_energy_mean,
gpu_energy_mean, ram_energy_mean, energy_consumed_mean ), with mean emissions on the order of
1.2E-05 for our runs and no values reported for the baselines. Table 8 reports, per estimator,
r2_score_risk and mean_squared_error_risk for the embedding-and-regressor combinations. ]</p>
      <p>For the absolute metrics, we show the accuracy, precision, recall, and F1 scores for the classification
task ( Subtask 1 ) and the root mean squared error ( RMSE ) and coefficient of determination ( R2 ) for
the regression task ( the regression view of Subtask 1 ). The early detection metrics include the early-risk detection
error ( ERDE ) computed after observing different rounds of messages, as well as other metrics ( more
details are provided in the competition guidelines ).</p>
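      <p>For reference, a sketch of the ERDE metric in the standard form used in the early-risk literature ( the competition guidelines give the authoritative definition and cost values ):</p>
      <preformat>
\mathrm{ERDE}_o(d, k) =
\begin{cases}
c_{fp} &amp; \text{if } d \text{ is a false positive} \\
c_{fn} &amp; \text{if } d \text{ is a false negative} \\
lc_o(k) \cdot c_{tp} &amp; \text{if } d \text{ is a true positive} \\
0 &amp; \text{otherwise}
\end{cases}
\qquad
lc_o(k) = 1 - \frac{1}{1 + e^{k - o}}
      </preformat>
      <p>where k is the number of messages observed before decision d was made and o is the deadline parameter ( 5 or 30 ), so correct alarms are penalized increasingly as they are delayed.</p>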
      <p>The metrics are shown along with the name of the model used to obtain them. The models are named
as follows: [ model name ]-[ approach ]. For example, roberta-suicide-es-fine-tuning refers to the model
trained with the Task 1 ( binary classification ) labels by fine-tuning the RoBERTa model pre-trained for
suicide detection. The " approach " can be either embeddings or fine-tuning, for the two approaches
described in Section 3.</p>
      <p>Furthermore, all ML regressors trained with embeddings as features were simple regressors, and all
embeddings were obtained using roberta-suicide-es encodings, as this combination yielded the best
results on the evaluation set. The embedding-and-regressor combinations for Task 1 are shown in Table 8.</p>
      <p>The R2-score can be understood in a simple way as using the mean as the error reference, to
see whether the prediction error is greater or smaller than that reference error. R2-score = 1: the
predicted values in the sample are exactly equal to the true values, without any error, indicating that
the independent variables in the regression analysis explain the dependent variable
perfectly. R2-score = 0: the numerator is equal to the denominator, and each predicted value
of the sample equals the mean. The R2-score is not the square of r; it may also be negative ( numerator &gt;
denominator ), in which case the model is equivalent to blind guessing and it would be better to directly predict the average
value of the target variable. See Figure 1 for the specific formula.</p>
      <p>MSE is the abbreviation of Mean Squared Error, a commonly used indicator of the
prediction accuracy of regression models. It represents the average of the sum of squares of the
differences between the predicted values and the true values ( see Figure 2 ). RMSE is the abbreviation
of Root Mean Squared Error, likewise a commonly used indicator of prediction accuracy. It
represents the average magnitude of the difference between the predicted value and the true value
( see Figure 3 ). Here, y_i is
the true value of the i-th sample, ŷ_i is the predicted value of the model for the i-th sample, and m is the
number of samples.</p>
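      <p>Assuming the usual textbook forms ( the figures themselves are not reproduced here ), the formulas behind Figures 1-3 are:</p>
      <preformat>
R^2 = 1 - \frac{\sum_{i=1}^{m} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{m} (y_i - \bar{y})^2},
\qquad
\mathrm{MSE} = \frac{1}{m} \sum_{i=1}^{m} (y_i - \hat{y}_i)^2,
\qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}}
      </preformat>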
      <p>The smaller the MSE and RMSE are, the higher the prediction accuracy of the model. However, it
should be noted that MSE and RMSE are greatly affected by outliers. Therefore, in practical applications,
a comprehensive evaluation needs to be conducted in combination with other indicators ( such as the
maximum error, max-error ).</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>The results show that the approaches considered in this work were successful at modeling the
predictive subtask, with at least one of our models outperforming the baselines in most cases. We can
make the following observations:</p>
      <p>• The best-performing approach for Task 1 seems to be the one that uses the embeddings of the
messages as input to a simple-output regression model. At least one model trained with this approach
reached the top ranking for the task’s absolute ranking metrics and outperformed the baseline absolute
metrics across this task.</p>
      <p>• Most notably, the regression method obtained the best metrics for the task across
all models, outperforming the fine-tuning approach by over 20% in the absolute metrics and reaching
the highest spot in the early-risk metrics for this task on our validation dataset ( the training dataset split
with 15% of subject ids held out for validation ).</p>
      <p>• Models trained for single-output regression perform very well on the binary classification and simple
regression tasks, even outperforming the models trained with transformer-specific targets on their own
subtask. This suggests that using one model with an MLPRegressor to solve for single targets was indeed
a good approach to this problem.</p>
      <p>
        • The models obtained with the pure DL approach of fine-tuning a RoBERTa [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] model are estimated
to produce 3-4x lower emissions at inference time than the hybrid approach of training linear
regressors on sentence embeddings. This gap is likely because the fine-tuning approach requires less
computation at inference time: the hybrid approach must compute the sentence
embeddings before feeding them to the regressors, while the fine-tuned model makes its prediction in one forward
pass.
      </p>
      <p>Another conclusion we can draw from these insights is that while our models achieve great results
on the absolute ranking metrics, they do not perform as well on the metrics that assess early-risk
performance. In our work, we did not model explicitly for an early detection scenario; we only added
information about prior messages through data augmentation. This limitation means our models may
not perform as well in real-world situations where we aim to detect signs of gambling disorder early in
a conversation.</p>
      <p>Thus, it may be important to explore different training approaches to improve the performance of
early-risk detection. This might include directly employing online learning to predict and update the
model as new messages come in, or incorporating an ensemble of models that make independent decisions
about a message’s risk level and combining them for a final decision. Additionally, we may also look
into more efficient implementations of the hybrid approach to minimize the disparity in emissions
compared to pure DL models. These improvements are crucial when considering the deployment of our
models in real-world situations and will be the focus of future work.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Acknowledgments</title>
      <p>We thank the MentalRiskES@IberLEF 2025 organizers for running the competition and providing the dataset and
other support, and we thank the students of Yunnan University, as well as the individuals and groups that assisted in the
research and the preparation of this work.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Declaration on Generative AI</title>
      <sec id="sec-7-1">
        <title>The author(s) have not employed any Generative AI tools.</title>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>A. Appendices</title>
      <sec id="sec-8-1">
        <title>The sources for the fine-tuned pre-trained models are available via:</title>
        <p>• fine-tuned pre-trained RoBERTa model ( PlanTL-GOB-ES/roberta-base-bne ), see Figure
4.
• fine-tuned pre-trained RoBERTa model (
somosnlp-hackathon-2023/roberta-base-bne-finetuned-suicide-es ), see Figure 5.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] P. Álvarez-Ojeda, M. V. Cantero-Romero, A. Semikozova, A. Montejo-Ráez, The precom-sm corpus: Gambling in Spanish social media, in: Proceedings of the 31st International Conference on Computational Linguistics, 2025, pp. 17-28.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] J. Xiong, O. Lipsitz, F. Nasri, L. M. W. Lui, H. Gill, L. Phan, D. Chen-Li, M. Iacobucci, R. Ho, A. Majeed, R. S. McIntyre, Impact of COVID-19 pandemic on mental health in the general population: A systematic review, Journal of Affective Disorders 277 (2020) 55-64. URL: https://www.sciencedirect.com/science/article/pii/S0165032720325891. doi:10.1016/j.jad.2020.08.001.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] J. Á. González-Barba, L. Chiruzzo, S. M. Jiménez-Zafra, Overview of IberLEF 2025: Natural Language Processing Challenges for Spanish and other Iberian Languages, in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2025), co-located with the 41st Conference of the Spanish Society for Natural Language Processing (SEPLN 2025), CEUR-WS.org, 2025.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] A. M. Mármol-Romero, P. Álvarez-Ojeda, A. Moreno-Muñoz, F. M. P. del Arco, M. D. Molina-González, M.-T. Martín-Valdivia, L. A. Ureña-López, A. Montejo-Ráez, Overview of MentalRiskES at IberLEF 2025: Early detection of mental disorders risk in Spanish, Procesamiento del Lenguaje Natural 75 (2025).</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] A. G. Fandiño, J. A. Estapé, M. Pàmies, J. L. Palao, J. S. Ocampo, C. P. Carrino, C. A. Oller, C. R. Penagos, A. G. Agirre, M. Villegas, MarIA: Spanish language models, Procesamiento del Lenguaje Natural 68 (2022). URL: https://upcommons.upc.edu/handle/2117/367156#.YyMTB4X9A-0.mendeley. doi:10.26342/2022-68-3.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] D. L. Padial, D. Gómez, hackathon-somos-nlp-2023/roberta-base-bne-finetuned-suicide-es · Hugging Face (2023). URL: https://huggingface.co/hackathon-somos-nlp-2023/roberta-base-bne-finetuned-suicide-es.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] A. E. Hoerl, R. W. Kennard, Ridge regression: Biased estimation for nonorthogonal problems, Technometrics 12 (1970) 55-67. URL: https://www.jstor.org/stable/1267351. doi:10.2307/1267351.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] L. Breiman, Random forests, Machine Learning 45 (2001) 5-32. URL: https://doi.org/10.1023/A:1010933404324. doi:10.1023/A:1010933404324.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] C. S. Perone, R. Silveira, T. S. Paula, Evaluation of sentence embeddings in downstream and linguistic probing tasks, 2018. URL: http://arxiv.org/abs/1806.06259.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] J. Friedman, Greedy function approximation: A gradient boosting machine, The Annals of Statistics 29 (2001). doi:10.1214/aos/1013203451.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825-2830.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, S. Chintala, PyTorch: An imperative style, high-performance deep learning library, in: Advances in Neural Information Processing Systems 32, Curran Associates, Inc., 2019, pp. 8024-8035. URL: http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] C. C. Liu, J. Pfeiffer, I. Vulić, I. Gurevych, Improving generalization of adapter-based cross-lingual transfer with scheduled unfreezing, 2023. URL: http://arxiv.org/abs/2301.05487.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] V. Schmidt, K. Goyal, A. Joshi, B. Feld, L. Conell, N. Laskaris, D. Blank, J. Wilson, S. Friedler, S. Luccioni, CodeCarbon: Estimate and track carbon emissions from machine learning computing, 2021.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, 2019. URL: http://arxiv.org/abs/1907.11692. doi:10.48550/arXiv.1907.11692.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>