<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>HULAT-UC3M at Task1@eRisk 2025: Detecting Depression Using Machine Learning Approaches</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Javier Campos-Molina</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paloma Martínez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science and Engineering Department, Universidad Carlos III de Madrid</institution>
          ,
          <addr-line>Leganés, Madrid</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper describes the participation of the HULAT-UC3M research group in Task 1: Search for Symptoms of Depression of the eRisk 2025 shared task [1]. A pipeline composed of three steps is proposed. The first step trains an SVM multiclass classifier on embeddings from the all-MiniLM-L6-v2 pretrained model to classify each sentence into its corresponding symptom. The second step applies a filter to select the 1000 most representative sentences to be submitted, and the final step scores the sentences chosen in the previous step using a rule-based model and an encoder-based transformer (RoBERTa) for sentiment analysis. The best model achieves an NDCG of 0.053 and a P@10 of 0.157.</p>
      </abstract>
      <kwd-group>
        <kwd>Natural Language Processing</kwd>
        <kwd>Machine learning</kwd>
        <kwd>Depression detection</kwd>
        <kwd>Classification</kwd>
        <kwd>LLM</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Starting in 2019, one of the participants used a text classifier called SS3 [<xref ref-type="bibr" rid="ref5">5</xref>] [<xref ref-type="bibr" rid="ref6">6</xref>] to solve task 3 [<xref ref-type="bibr" rid="ref7">7</xref>]. That task is related, but not exactly the same as the one addressed in this paper, since it consists of classifying depression severity instead of classifying sentences into symptoms and scoring them. SS3 is a probabilistic model that uses statistics to associate words: for each word, it estimates the probability of being associated with other words, depending on whether it has previously appeared together with them or not. It is important to mention that SS3 is not a transformer, although it may appear similar in the sense that it assigns a probability between 0 and 1 to each word in relation to others. SS3 does not use self-attention as transformers do; instead, it relies on probabilistic functions such as confidence (cf), support (sf), and credibility (cv) to model context.
      </p>
      <p>
        In the 2020 edition [<xref ref-type="bibr" rid="ref8">8</xref>], some systems proposed solutions based on the RoBERTa model, which is more powerful than BERT since it was trained on roughly ten times more data. The model provides its own tokenizer, which was used to create the tokens; the embeddings were then computed and, finally, the classification was performed with a softmax as the last layer to obtain the probabilities. The team using this approach obtained the best accuracy, above 69% [<xref ref-type="bibr" rid="ref8">8</xref>].
      </p>
      <p>
        In the 2021 edition [<xref ref-type="bibr" rid="ref9">9</xref>], some of the proposed systems followed approaches similar to those of 2020. One of the studies used BERT and RoBERTa together [<xref ref-type="bibr" rid="ref10">10</xref>], and the comparison of results was in favor of RoBERTa, as expected for the reason previously mentioned. Other systems proposed different probabilistic methods, similar to SS3 in 2019. In this case, one of the groups participating in the task proposed a system using Latent Dirichlet Allocation (LDA) [<xref ref-type="bibr" rid="ref11">11</xref>], which consists of a Bayesian network, in combination with sentence transformers and classical classifiers. LDA is very popular in unsupervised learning tasks. On the other hand, and although it is a task related to self-harm rather than depression, some projects used interesting systems [<xref ref-type="bibr" rid="ref9">9</xref>]. The self-harm task was a binary classification task over users, separating those who have harmed themselves at some point from those who have not. One of the teams [<xref ref-type="bibr" rid="ref12">12</xref>] used YAKE for one of their runs, a model that extracts the most important words from a sentence, but it did not work as expected since it removed the important signs of self-harm from the sentences. An additional run used the VADER model [<xref ref-type="bibr" rid="ref12">12</xref>], but no results are available for it because the team did not submit the run.
      </p>
      <p>
        In the 2022 edition [<xref ref-type="bibr" rid="ref13">13</xref>], the system described in [<xref ref-type="bibr" rid="ref14">14</xref>] used RoBERTa, but in addition they used a model called MiniLM, derived from the RoBERTa and BERT architectures, which keeps full self-attention but is multilingual and able to work with different languages, whereas RoBERTa is specialized in English texts only. MiniLM is also smaller and faster than RoBERTa, which can help in terms of efficiency. Another team introduced a fully connected neural network (FCNN) in their third run, combined with previously used systems such as support vector machines (SVM) and transformers. The results were very good in terms of recall (0.816 compared to 0.745 for the first team) but not in terms of precision (0.283 compared to 0.682). The team that won the competition [<xref ref-type="bibr" rid="ref15">15</xref>] used a bag-of-words (BOW) approach combined with entropy-based weighting and an SVM classifier. BOW is a technique for converting text into numerical representations, enabling the use of classical machine learning models such as the SVM used in this system. This team applied TF-IDF weighting enhanced with entropy, giving higher importance to the most relevant words while reducing the influence of frequent but uninteresting terms. Additionally, they employed chi-square feature selection, retaining only the most relevant terms from the previous step and reducing the total amount of data, which further improved classification speed.
      </p>
      <p>
        The eRisk 2023 edition [<xref ref-type="bibr" rid="ref16">16</xref>] changed from binary classification to multiclass classification. The objective of the task was to classify sentences into symptoms of depression according to the BDI-II questionnaire [<xref ref-type="bibr" rid="ref17">17</xref>] and give them a score from 0 to 10. With the rise of generative AI, some teams [<xref ref-type="bibr" rid="ref18">18</xref>] started using LLMs to generate more data, as is done in other approaches outside this task [<xref ref-type="bibr" rid="ref19">19</xref>][<xref ref-type="bibr" rid="ref20">20</xref>]. One of the teams [<xref ref-type="bibr" rid="ref18">18</xref>] used ChatGPT to generate more data with a prompt for each of the symptoms. This was combined with newer models that perform well at capturing semantic relations, such as MentalRoBERTa, but the results were not good compared to the first team in the competition: the maximum average precision across their 5 runs was 0.104, far below 0.319. Another team [<xref ref-type="bibr" rid="ref21">21</xref>] also attempted to compare sentences based on their similarity by computing sentence embeddings with transformer-based models. However, due to the high computational cost of encoding all sentences, they first used the BM25 model [<xref ref-type="bibr" rid="ref21">21</xref>] as a lightweight filtering step. Only the top-ranked sentences, the ones most similar to those from the BDI-II questionnaire, were retained and then processed with the transformer models for similarity evaluation. The results were not good, as the maximum AP of their runs was 0.039. Finally, the winners of the 2023 task [<xref ref-type="bibr" rid="ref22">22</xref>] used word2vec embeddings to capture semantic and grammatical similarities between words. A soft cosine similarity was then applied to compare each sentence in the dataset with the individual sentences describing each of the 21 symptoms from the BDI-II questionnaire. Specifically, if a symptom in the BDI-II is described by four different sentence options, the similarity between a dataset sentence and each of these options is computed individually, resulting in four similarity scores. These scores are then weighted using predefined weights for each option: the weighted similarity for each option is obtained by multiplying the similarity score by its corresponding weight, and the total similarity between the sentence and the symptom is calculated as the sum of all the weighted similarities.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Method and system description</title>
      <p>
        Before developing the solution, an analysis of the labeled data provided by the eRisk organization [
        <xref ref-type="bibr" rid="ref1 ref4">4, 1</xref>
        ] was carried out to check that there was enough data to train a model and that all the labels were in the correct format. The main objective of the task is to obtain a score for each of the symptoms for 1000 sentences taken from a test dataset. The process is structured in three steps. The first one is to train a multiclass classifier to assign each sentence to its corresponding symptom. After that, the sentences are filtered according to different criteria for the different runs and, finally, the sentences chosen in the previous step are scored using different methods.
      </p>
      <sec id="sec-3-1">
        <title>3.1. Multi classification of the sentences</title>
        <p>
          For the multiclass classification problem, the model is trained using the annotated data from 2024 [
          <xref ref-type="bibr" rid="ref1 ref4">4, 1</xref>
          ], which is the data offered for training. The organizers provide unanimity labels, meaning that all annotators agreed on labeling a sentence as relevant to a particular symptom, and also majority labels. The model uses only the unanimity sentences, because the majority dataset could introduce noisier sentences on which not all annotators agreed. This can lead to misclassifications for some sentences, even though the majority dataset contains more training data.
        </p>
        <p>The proposed system uses a classical machine learning model, a support vector machine (SVM, https://scikit-learn.org/stable/modules/svm.html), to perform the multiclass classification over the 21 symptoms present in the BDI-II questionnaire [<xref ref-type="bibr" rid="ref17">17</xref>]. Taking only the relevant sentences, the model was trained on the embeddings created with the pretrained model "all-MiniLM-L6-v2" (https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2). An analysis of the results was made by dividing the data into train and validation sets to test the model; the results of this test are reported in Section 4. Figure 1 represents the steps followed to build the model and how it was used.</p>
        <p>We trained the SVM with the probability hyperparameter set to true in the scikit-learn implementation, so that class probabilities could be predicted for the test sentences and only those with a probability higher than the applied threshold were kept. Five different thresholds have been tested: 0.75, 0.80, 0.85, 0.90 and 0.95. The number of sentences remaining after each filter is shown in Table 1.</p>
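        <p>To make this step concrete, the following sketch (not the exact code of the submitted system) shows how the all-MiniLM-L6-v2 embeddings, the SVM with probability estimates and the confidence filter could be combined. The function names and data variables are illustrative assumptions.</p>
        <preformat>
# Illustrative sketch of the classification and filtering step described above.
from sentence_transformers import SentenceTransformer
from sklearn.svm import SVC
import numpy as np

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def train_symptom_classifier(train_sentences, train_labels):
    """Train the 21-way symptom classifier on the unanimity-labeled sentences."""
    X = encoder.encode(train_sentences)
    clf = SVC(probability=True)  # probability=True enables predict_proba
    clf.fit(X, train_labels)
    return clf

def filter_by_confidence(clf, test_sentences, threshold=0.95):
    """Keep (sentence, predicted symptom, confidence) for confident predictions."""
    X = encoder.encode(test_sentences)
    probs = clf.predict_proba(X)
    kept = []
    for sentence, p in zip(test_sentences, probs):
        confidence = float(np.max(p))
        if confidence > threshold:
            kept.append((sentence, clf.classes_[int(np.argmax(p))], confidence))
    return kept
        </preformat>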
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Selection of the sentences</title>
        <p>Once the messages from the test dataset have been filtered, a sample of at most 1000 of these test messages is selected, which are the ones to be submitted as runs for the proposed task. For this purpose, three different approaches have been implemented for selecting the messages.</p>
        <p>The first one (Figure 2) is to select the top 1000 sentences by confidence, in other words, those to which the multiclass classification model assigned the highest probability of belonging to their symptom. The pros and cons of this selection are clear: the advantage is that the best samples are sent to the competition and therefore better results are expected; however, some symptoms may not be represented at all if none of their sentences are classified with high confidence, and if the evaluation averages the results over all the symptoms, the missing symptoms can score 0.</p>
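        <p>A minimal sketch of this first strategy, reusing the (sentence, symptom, confidence) tuples produced by the filtering sketch in the previous section, could look as follows.</p>
        <preformat>
# Rank all classified sentences by predicted probability and keep the global top k.
def select_top_k(scored_sentences, k=1000):
    """scored_sentences: list of (sentence, symptom, confidence) tuples."""
    ranked = sorted(scored_sentences, key=lambda item: item[2], reverse=True)
    return ranked[:k]
        </preformat>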
        <p>The second way to take this sample is to use the sentences previously rated above 0.95 confidence. We chose 0.95 because the lower thresholds leave a large number of sentences for some of the symptoms and may introduce too much randomness into the selection. Table 1 shows the number of sentences that remain for each symptom after filtering by this confidence value.</p>
        <p>To select the 1000 sentences, they are taken proportionally from each group: for each symptom, its contribution is calculated as a fraction of 1000, and sentences are randomly selected from the subset already extracted. In case of decimals, we round down to the nearest whole number, which ensures that the total number of sentences does not exceed 1000. This approach guarantees that sentences of all symptoms are included. Figure 3 shows the process. The following formula formalizes the above for each symptom, where N_i is the number of sentences for that symptom and the denominator sums the sentences of all 21 symptoms:
⌊ ( N_i / Σ_{j=1}^{21} N_j ) × 1000 ⌋</p>
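        <p>A possible implementation of this proportional sampling, assuming the high-confidence sentences are grouped by symptom in a dictionary, is sketched below; it follows the floor-based formula above.</p>
        <preformat>
import math
import random

# Allocate a quota per symptom proportional to its share of high-confidence
# sentences (rounded down), then sample that many sentences at random.
def proportional_sample(sentences_by_symptom, total=1000, seed=0):
    """sentences_by_symptom: dict mapping symptom id -> list of sentences."""
    rng = random.Random(seed)
    grand_total = sum(len(v) for v in sentences_by_symptom.values())
    selected = {}
    for symptom, sents in sentences_by_symptom.items():
        quota = math.floor(len(sents) / grand_total * total)  # floor(N_i / sum_j N_j * 1000)
        selected[symptom] = rng.sample(sents, min(quota, len(sents)))
    return selected
        </preformat>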
        <p>In the last approach, after manually reviewing the sentences, we observed that some of them did not describe the symptom as felt by the author but as felt by someone else. A sub-selection of reflexive sentences was therefore implemented within the subset of 0.90 confidence (Figure 4). We took a lower confidence threshold because, for some of the symptoms, we did not reach the minimum number of sentences we were looking for in this test. Sentences containing reflexive pronouns or first-person English pronouns such as 'I' or 'I'm' and their variants were selected. We then took the same number of sentences from each symptom to test with an equal distribution; rounding up, this leaves 47 sentences per symptom.</p>
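        <p>A simple way to implement this filter is a regular expression over first-person pronouns. Beyond 'I' and 'I'm', the exact pronoun list is not reported above, so the list in the sketch below is only an assumption.</p>
        <preformat>
import re

# Keep sentences containing first-person or reflexive English pronouns.
# The pronoun list is an assumption beyond the 'I' / "I'm" variants mentioned above.
FIRST_PERSON = re.compile(r"\b(i|i'm|i've|i'd|i'll|me|my|mine|myself)\b", re.IGNORECASE)

def keep_reflexive(sentences):
    return [s for s in sentences if FIRST_PERSON.search(s)]
        </preformat>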
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Scoring the sentences</title>
        <p>The next step is to score the selected sentences out of 10. Two approaches have been implemented, one using a model called VADER (https://www.nltk.org/api/nltk.sentiment.vader.html) and the other using roberta-base-sentiment (https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment). Both are models mainly used for sentiment polarity classification, but they have also been used for scoring in some cases.</p>
        <p>In the case of VADER, it returns four values for each parsed sentence. Three of them are values between 0 and 1 indicating how positive, negative or neutral the sentence is, and they sum to 1. The fourth, 'compound', is a number between -1 and 1 that combines the three previous values: the more negative the sentence, the more negative the compound value, and vice versa.</p>
        <p>The objective was to test what values it gave for sentences that were cataloged under some of the depression symptoms treated in this task, and then, with a formula, map that value to a score out of 10. When analyzing the sentences, we saw that they were never completely negative or completely positive, although it was especially difficult for them to be cataloged as positive, so we added a multiplier to this score to increase the difference between the symptoms of greater and lesser severity: a multiplier of 1.1 for the negative case and 1.2 for the positive case, as long as the opposite polarity value was 0. That is, if a sentence had only negative and neutral values, the multiplier was applied, but if a sentence had all three values, or at least both positive and negative, only the 'compound' value was taken into account. It should be noted that if VADER returns a positive compound value, the phrase has no negative connotation, so it should have a low score out of 10, and the opposite applies if VADER returns a negative value. The following formula shows how the 'compound' value, together with the multiplier explained above, is used to calculate the score rounded to two decimal places; the returned value is capped at a maximum of 10 and a minimum of 0.</p>
        <p>score = ( ( 1 − sentiment_score × regulator ) / 2 ) × 10
where sentiment_score is the 'compound' value and regulator is the 1.1 or 1.2 multiplier explained before.</p>
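        <p>The following sketch, based on NLTK's VADER implementation, illustrates how the regulator and the formula above can be combined; it is an approximation of the submitted scorer rather than its exact code.</p>
        <preformat>
from nltk.sentiment.vader import SentimentIntensityAnalyzer  # requires nltk.download('vader_lexicon')

analyzer = SentimentIntensityAnalyzer()

def vader_score(sentence):
    s = analyzer.polarity_scores(sentence)  # keys: 'neg', 'neu', 'pos', 'compound'
    regulator = 1.0
    if s["pos"] == 0 and s["neg"] > 0:
        regulator = 1.1   # only negative and neutral values present
    elif s["neg"] == 0 and s["pos"] > 0:
        regulator = 1.2   # only positive and neutral values present
    score = (1 - s["compound"] * regulator) / 2 * 10
    return round(min(10.0, max(0.0, score)), 2)  # capped to [0, 10], two decimals
        </preformat>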
        <p>The roberta-base-sentiment model follows a very similar approach to VADER. In this case, the model returns a field called label and another called score. The label takes three values, LABEL_0, LABEL_1 and LABEL_2, where 0 represents sentences categorized as negative, 1 neutral sentences and 2 positive sentences. The score is a value between 0 and 1 that corresponds to the confidence of assigning the sentence to that label (negative, neutral, positive). In this case, if the label is negative (0) we multiply the score by 10, if it is neutral (1) we multiply it by 5, and if it is positive (2) we multiply 1 − score by 10. In this way, sentences with more severe symptoms are given a higher score.</p>
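        <p>A sketch of this scorer using the Hugging Face transformers pipeline is shown below; the label-to-score mapping follows the rules just described.</p>
        <preformat>
from transformers import pipeline

sentiment = pipeline("sentiment-analysis",
                     model="cardiffnlp/twitter-roberta-base-sentiment")

def roberta_score(sentence):
    result = sentiment(sentence)[0]  # e.g. {'label': 'LABEL_0', 'score': 0.87}
    label, confidence = result["label"], result["score"]
    if label == "LABEL_0":                  # negative
        return round(confidence * 10, 2)
    if label == "LABEL_1":                  # neutral
        return round(confidence * 5, 2)
    return round((1 - confidence) * 10, 2)  # LABEL_2, positive
        </preformat>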
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results and discussion</title>
      <p>Some internal tests were done for the multiclass classification task, as it is the only part we can really evaluate in this way due to the lack of labeled data for the scores. In that case, we divided the sentences into train and test sets, 80% and 20% respectively. The results report precision, recall and F1-score, together with the number of test sentences in the column named support. The results are shown in Table 2, broken down by symptom.</p>
      <p>Table 3 displays the results provided by the eRisk organizers for the Task 1 participants, in order to compare our performance with the best team in the task. The table shows the results for majority, meaning that at least 2 of the 3 assessors marked a sentence as correct.</p>
      <p>Our submitted runs were a mix of the previously described methods, combining different approaches for selecting the 1000 sentences with different ways of scoring them. There was a naming error in one of the runs using the roBERTa scorer, as two runs share the same name: one used the roBERTa scorer with the top 1000 sentences selected by confidence, so it should be called roBERTa top, while the other used the roBERTa scorer with the sample of 1000 sentences instead of the top ones. The runs whose name starts with VADER use the VADER model to score the sentences, VADER top scoring the top sentences by confidence and VADER sample scoring the sampled sentences. Finally, reflexives roBERTa uses the reflexive sample and the roBERTa scorer.</p>
      <sec id="sec-4-1">
        <title>Table 3: Results for majority (NDCG)</title>
        <p>INESC-ID maxcos 0.235
INESC-ID unanimity 0.354
INESC-ID max 0.350
INESC-ID mix23 0.312
INESC-ID aug-best 0.247
HULAT_UC3M roBERTa (top) 0.018
HULAT_UC3M vader top 0.015
HULAT_UC3M reflexives roBERTa 0.013
HULAT_UC3M roBERTa (sample) 0.004
HULAT_UC3M vader sample 0.004</p>
        <p>Across all evaluation metrics (Average Precision (AP), R-Precision (R-PREC), and Normalized Discounted Cumulative Gain (NDCG)), our runs performed significantly worse than those of the INESC-ID team, which obtained the best scores of all the teams participating in the task. Their submissions achieved consistently strong results, with the best run reaching an AP score of 0.354.</p>
        <p>Our best-performing run was the one using the RoBERTa model, where we selected the top 1000 sentences based on confidence scores from our multiclass classification model, with an AP of 0.018. It was closely followed by a similar run using the VADER model for scoring with the same top-1000 sentence selection. These two runs outperformed our other three submissions by a wide margin; in fact, they were up to four times more precise than the runs that used sentence sampling (runs 4 and 5) with a confidence threshold of 0.95, as previously described in this document.</p>
        <p>From these results, we can conclude that the multiclass classification model was effective. The two runs that used high-confidence sentence selection clearly outperformed the runs based on lower-confidence sampling, suggesting that confidence-based filtering had a strong positive impact on performance. However, the methods used to score the sentences are not appropriate.</p>
        <p>Table 4 shows the results in the case of unanimity, meaning that all three annotators have to agree that the sentence is well classified and scored.</p>
      </sec>
      <sec id="sec-4-11">
        <title>NDCG</title>
      </sec>
      <sec id="sec-4-12">
        <title>INESC-ID</title>
      </sec>
      <sec id="sec-4-13">
        <title>INESC-ID</title>
      </sec>
      <sec id="sec-4-14">
        <title>INESC-ID</title>
      </sec>
      <sec id="sec-4-15">
        <title>INESC-ID</title>
      </sec>
      <sec id="sec-4-16">
        <title>INESC-ID</title>
      </sec>
      <sec id="sec-4-17">
        <title>HULAT_UC3M</title>
      </sec>
      <sec id="sec-4-18">
        <title>HULAT_UC3M</title>
      </sec>
      <sec id="sec-4-19">
        <title>HULAT_UC3M</title>
      </sec>
      <sec id="sec-4-20">
        <title>HULAT_UC3M</title>
        <p>HULAT_UC3M
unanimity 0.269
max 0.223
mix23 0.201
aug-best 0.167
maxcos 0.164
reflexives roBERTa 0.013
roBERTa 0.008
vader top 0.006
roBERTa 0.002
vader sample 0.001</p>
        <p>In this case, the best of our runs is the one that chooses reflexive sentences. It does not get worse compared to the majority results in Table 3, with 0.013 in both cases. This result suggests that a reflexive sentence is more likely to be correctly selected than a non-reflexive one, since all the sentences that were correctly scored were also ranked under unanimity, given that the score is the same for majority and unanimity. For our other runs, the precision drops by half or more, while the best score of the best team in this case corresponds to the run called unanimity, which may suggest that they used the data labeled as unanimity from the corpus provided by eRisk. The result achieved by INESC-ID supports our choice of using only the unanimity sentences to train the classifier, removing part of the noise and the unclear sentences from the labeled data.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions and Future Work</title>
      <p>We have presented a general overview of our participation in eRisk Task 1, Search for Symptoms of Depression, using mainly machine learning approaches. As mentioned in the previous section, the approaches did not perform well in general, even though we could draw some conclusions and ideas for future testing.</p>
      <p>As possible future work, we would like to change the way the sentences are scored, since our conclusions indicate that it was the weakest point of our systems. The VADER and roBERTa models are probably not adequate for this task, or we have not fine-tuned them enough to get the best out of them. To replace these approaches, we would like to try generative AI to generate labeled data with scores, giving in the prompt some examples of the labeled data and the BDI-II sentences. Other teams, as mentioned in the related work section, used it to generate data in general, but we would like to use it only to create data with scores, with the idea of feeding it directly into a machine learning architecture such as an SVM: a two-level architecture with an SVM classifier followed by an SVM regressor for each of the symptoms to produce the scores.</p>
      <p>Another solution could be to use generative AI directly to score the previously chosen sentences, without training a machine learning architecture or generating new data. In this case, the scores would depend directly on the prompt, so it would be important to give the generative AI several accurate examples in order to achieve good results. In this approach we would like to involve a professional in depression to create some scored sentences for each of the symptoms to be included in the prompt. Without good and accurate examples, this approach does not work, as it depends directly on the choices made by the AI.</p>
      <p>On the other hand, some deep learning approaches could be really helpful for generating the scores. One of the approaches we would like to test is training a neural network (NN) on very extreme sentences from depressed users on the internet, with a softmax classifier as the last layer of that NN. The purpose of this approach is to return the probability of the sentence belonging to each symptom as a list of 21 values, one per symptom. The score would then be computed by multiplying the maximum probability over all symptoms by 10, and the sentence would be assigned to the symptom with the highest probability. In the end, sentences clearly belonging to one symptom would get a very high score and, if the model is unsure, the probability and therefore the score would be low, since the score depends on the probabilities returned by the model. This approach would replace the previously mentioned multiclass classifier and requires appropriate data, since the labeled data provided by the eRisk task does not contain only severe sentences for each symptom, so we would need to combine it with data generation using generative AI or similar data augmentation approaches.</p>
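      <p>Since this is future work, the following is only a hypothetical sketch of how such a softmax-based scorer could behave, assuming a 21-way classifier that already outputs a probability distribution for each sentence.</p>
      <preformat>
import numpy as np

# Hypothetical future scorer: assign the sentence to the most probable symptom
# and use that probability, scaled to 0-10, as the severity score.
def score_from_softmax(probabilities):
    """probabilities: array-like of 21 values summing to 1."""
    p = np.asarray(probabilities)
    symptom = int(np.argmax(p))    # index of the most probable symptom
    score = float(np.max(p)) * 10  # maximum probability scaled to a 0-10 score
    return symptom, round(score, 2)
      </preformat>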
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work was partially supported by Grant PID2023-148577OB-C21 (Human-Centered AI: User-Driven
Adapted Language Models-HUMAN AI) by MICIU/AEI/ 10.13039/501100011033 and by FEDER/UE.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>The team has used generative AI, particularly ChatGPT, for spell checking while writing this document and for code related to the LaTeX format. In addition, some minor errors in the code have been fixed with it.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          , Overview of erisk 2025:
          <article-title>Early risk prediction on the internet, in: Experimental IR Meets Multilinguality</article-title>
          , Multimodality, and Interaction - 16th
          <source>International Conference of the CLEF Association, CLEF</source>
          <year>2025</year>
          , Madrid, Spain, September 9-
          <issue>12</issue>
          ,
          <year>2025</year>
          , Proceedings,
          <string-name>
            <surname>Part</surname>
            <given-names>II</given-names>
          </string-name>
          , volume To be
          <source>published of Lecture Notes in Computer Science</source>
          , Springer,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>World Health Organization (WHO)</surname>
          </string-name>
          ,
          <article-title>Depressive disorder (depression)</article-title>
          ,
          <year>2023</year>
          . URL: https://www.who.int/news-room/fact-sheets/detail/depression, last access: 16 May
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>World Health Organization (WHO)</surname>
          </string-name>
          ,
          <article-title>Depressive disorder (depression)</article-title>
          ,
          <year>2025</year>
          . URL: https://www.who.int/news-room/fact-sheets/detail/suicide, last access: 16 May
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          , Overview of erisk 2025:
          <article-title>Early risk prediction on the internet (extended overview)</article-title>
          ,
          <source>in: Working Notes of the Conference and Labs of the Evaluation Forum (CLEF</source>
          <year>2025</year>
          ), Madrid, Spain,
          <fpage>9</fpage>
          -
          <issue>12</issue>
          <year>September</year>
          ,
          <year>2025</year>
          , volume To be published of CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S. G.</given-names>
            <surname>Burdisso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Errecalde</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. M. y Gómez</surname>
          </string-name>
          ,
          <article-title>Using text classification to estimate the depression level of reddit users</article-title>
          ,
          <source>Journal of Computer Science and Technology</source>
          , Vol.
          <volume>21</volume>
          , Iss. 1, pp. e1 (
          <year>2021</year>
          ). URL: https://doi.org/10.24215/16666038.21.e1. doi: 10.24215/16666038.21.e1.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S. G.</given-names>
            <surname>Burdisso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Errecalde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Montes-y Gómez</surname>
          </string-name>
          ,
          <article-title>A text classification framework for simple and efective early depression detection over social media streams</article-title>
          ,
          <source>Expert Systems with Applications</source>
          <volume>133</volume>
          (
          <year>2019</year>
          )
          <fpage>182</fpage>
          -
          <lpage>197</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <article-title>Overview of erisk 2019 early risk prediction on the internet</article-title>
          ,
          <source>in: Experimental IR Meets Multilinguality, Multimodality, and Interaction: 10th International Conference of the CLEF Association, CLEF</source>
          <year>2019</year>
          , Lugano, Switzerland, September 9-
          <issue>12</issue>
          ,
          <year>2019</year>
          , Proceedings 10, Springer,
          <year>2019</year>
          , pp.
          <fpage>340</fpage>
          -
          <lpage>357</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R.</given-names>
            <surname>Martínez-Castaño</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Htait</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Azzopardi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Moshfeghi</surname>
          </string-name>
          ,
          <article-title>Early risk detection of self-harm and depression severity using bert-based transformers: ilab at clef erisk 2020 (</article-title>
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Martín-Rodilla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          , Overview of erisk at clef 2021:
          <article-title>Early risk prediction on the internet (extended overview)</article-title>
          .,
          <string-name>
            <surname>CLEF</surname>
          </string-name>
          (Working Notes)
          <volume>1</volume>
          (
          <year>2021</year>
          )
          <fpage>864</fpage>
          -
          <lpage>887</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10] S.
          <string-name>
            <surname>-H. Wu</surname>
            ,
            <given-names>Z.-J.</given-names>
          </string-name>
          <string-name>
            <surname>Qiu</surname>
          </string-name>
          ,
          <article-title>A roberta-based model on measuring the severity of the signs of depression</article-title>
          .,
          <source>in: CLEF (Working Notes)</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>1071</fpage>
          -
          <lpage>1080</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>R.</given-names>
            <surname>Manna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Monti</surname>
          </string-name>
          , et al.,
          <source>Unior nlp at erisk</source>
          <year>2021</year>
          :
          <article-title>Assessing the severity of depression with part of speech and syntactic features</article-title>
          ,
          <source>in: CEUR WORKSHOP PROCEEDINGS</source>
          , volume
          <volume>2936</volume>
          ,
          <string-name>
            <surname>CEUR</surname>
          </string-name>
          ,
          <year>2021</year>
          , pp.
          <fpage>1022</fpage>
          -
          <lpage>1030</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>L.</given-names>
            <surname>Barros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Trifan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Oliveira</surname>
          </string-name>
          ,
          <article-title>Vader meets bert: sentiment analysis for early detection of signs of self-harm through social mining</article-title>
          .,
          <source>in: CLEF (working notes)</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>897</fpage>
          -
          <lpage>907</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Martín-Rodilla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          , Overview of erisk at clef 2022:
          <article-title>Early risk prediction on the internet (extended overview)</article-title>
          ,
          <source>in: CEUR Workshop Proceedings (CEUR-WS. org)</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>A.-M. Bucur</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Cosma</surname>
            ,
            <given-names>L. P.</given-names>
          </string-name>
          <string-name>
            <surname>Dinu</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Rosso</surname>
          </string-name>
          ,
          <article-title>An end-to-end set transformer for user-level classification of depression and gambling disorder</article-title>
          ,
          <source>arXiv preprint arXiv:2207.00753</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>H.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lijin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sruthi</surname>
          </string-name>
          , T. Basu,
          <article-title>Nlp-iiserb@ erisk2022: Exploring the potential of bag of words, document embeddings and transformer based framework for early prediction of eating disorder, depression and pathological gambling over social media</article-title>
          .,
          <source>in: CLEF (Working Notes)</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>972</fpage>
          -
          <lpage>986</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Martín-Rodilla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          , Overview of erisk 2023:
          <article-title>Early risk prediction on the internet</article-title>
          ,
          <source>in: International Conference of the Cross-Language Evaluation Forum for European Languages</source>
          , Springer,
          <year>2023</year>
          , pp.
          <fpage>294</fpage>
          -
          <lpage>315</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] A. T. Beck, C. H. Ward, M. Mendelson, J. Mock, J. Erbaugh, An inventory for measuring depression, Archives of General Psychiatry 4 (1961) 561-571.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] A.-M. Bucur, Utilizing ChatGPT generated data to retrieve depression symptoms from social media, arXiv preprint arXiv:2307.02313 (2023).</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] S. Ubani, S. O. Polat, R. Nielsen, ZeroShotDataAug: Generating and augmenting training data with ChatGPT, arXiv preprint arXiv:2304.14334 (2023).</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] H. Dai, Z. Liu, W. Liao, X. Huang, Y. Cao, Z. Wu, L. Zhao, S. Xu, F. Zeng, W. Liu, et al., AugGPT: Leveraging ChatGPT for text data augmentation, IEEE Transactions on Big Data (2025).</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] D. Maupomé, T. Soulas, F. Rancourt, G. Cantin-Savoie, G. Winterstein, S. Mosser, M.-J. Meurs, Lightweight methods for early risk detection, in: CLEF (Working Notes), 2023, pp. 718-726.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] N. Recharla, P. Bolimera, Y. Gupta, A. K. Madasamy, Exploring depression symptoms through similarity methods in social media posts, in: CLEF (Working Notes), 2023, pp. 763-772.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>