
Conference and Labs of the Evaluation Forum, September 9-12, 2024

UMUTeam at eRisk@CLEF 2024: Fine-Tuning Transformer Models with Sentiment Features for Early Detection and Severity Measurement of Eating Disorders

Ronghao Pan

José Antonio García-Díaz

Tomás Bernal-Beltrán

Rafael Valencia-García

Facultad de Informática, Universidad de Murcia, Campus de Espinardo, 30100, Spain

2024


This paper describes the participation of the UMUTeam in the eRisk shared task organized at CLEF 2024. We addressed Tasks 2 and 3, which concern the early detection of signs of anorexia and measuring the severity of eating disorder signs, respectively. For this purpose, several approaches were used, including the fine-tuning of a sentence transformer model for measuring the severity of eating disorder signs and the fine-tuning of pre-trained Transformer-based language models with sentiment features for detecting signs of anorexia. For Task 2, we reached 5th position in both the decision-based and the ranking-based evaluation. For Task 3, we placed 5th out of 5 participants; however, our model shows a more balanced overall accuracy and performance across most metrics.

Keywords: Mental disorders, Deep learning, Natural Language Processing, Fine-tuning, Transformers

1. Introduction

Mental health is the state of a person’s psychological and emotional well-being. It includes the ability to manage emotions, cope with stress, maintain satisfying relationships, work productively, and contribute to the community. It can be influenced by many factors, including genetics, life experiences, social environment, stress, and brain chemistry [1]. In recent years, there has been an increase in mental illness, an alarming phenomenon that has captured the attention of public health officials, experts, researchers, and governments around the world. According to a recent report by the World Health Organization (WHO), one in eight people in the world suffers from a mental illness. Therefore, there is an urgent need to address the factors contributing to the increase in these diseases and to implement effective strategies to improve the mental and physical health of the world’s population.

Several studies have shown that excessive use of social networking sites can have negative effects on mental health, especially in adolescents and young adults, making it a topic of growing interest and concern in research and public health [2]. This relationship highlights the importance of early detection of mental health symptoms in order to intervene effectively and prevent these problems from worsening.

For this reason, interest in detecting and identifying mental disorders in social network streams has grown in recent years, driven by advances in Natural Language Processing (NLP) technologies, the increasing prevalence of mental health problems, and their relationship with digital platforms [3]. In addition, several mental health-related tasks have emerged in major evaluation campaigns, such as MentalRiskES [4] at the Iberian Languages Evaluation Forum (IberLEF) and eRisk [5] at the Conference and Labs of the Evaluation Forum (CLEF).

The eRisk Lab focuses on the development of assessment methodologies and metrics for the early detection of risks on the Internet, especially those related to health and safety issues. The lab was launched at CLEF in Dublin in 2017 and has held eight editions through 2024. Throughout these editions, the lab has presented numerous collections and models that address different application domains. Previous editions have explored topics such as depression, eating disorders, gambling, and self-harm detection. Lab tasks include early warning and severity assessment challenges, which involve automated analysis of temporal text streams to predict specific risks and compute detailed symptom estimates from users’ writings.

The eRisk@CLEF 2024 lab [6, 7] focuses on the early detection of signs of anorexia, the search for symptoms of depression, and measuring the severity of signs of eating disorders. For each task, the organizers defined a test collection and proposed evaluation metrics.

This paper presents the participation of the UMUTeam in the tasks related to the early detection of signs of anorexia and measuring the severity of signs of eating disorders. For this purpose, several approaches were employed, including fine-tuning a sentence transformer model to measure the severity of the signs of eating disorders and fine-tuning pre-trained Transformer-based language models with sentiment features to detect signs of anorexia. The rest of the paper is organized as follows. Section 2 presents the task and the provided dataset. Section 3 describes the methodology of our proposed system for each task. Section 4 presents and discusses the results obtained. Finally, Section 5 concludes the paper and outlines perspectives for future work.

2. Task description

This edition of eRisk focuses on detecting symptoms of depression, signs of anorexia, and the severity of symptoms associated with eating disorders. Its datasets and challenges involve the automated analysis of temporal text streams to predict specific problems and to compute detailed symptom estimates from user writings. The shared task is divided into three tasks:

• Task 1: Search for symptoms of depression. This task, a continuation of eRisk 2023’s Task 1, involves ranking sentences from user writings by their relevance to the symptoms of depression outlined in the BDI questionnaire.

• Task 2: Early detection of signs of anorexia. This task is a continuation of eRisk 2018’s T2 and 2019’s T1. Participants must process pieces of evidence sequentially and detect signs of anorexia as early as possible, primarily using Text Mining solutions on social network texts. The test collection follows the format of the collection described in [8] and comprises writings of social media users, categorized into individuals with anorexia and control users.

• Task 3: Measuring the severity of the signs of eating disorders. This task involves estimating the level of features associated with an eating disorder diagnosis from a history of user posts. Participants are given each user’s posting history and are asked to complete a standard eating disorder questionnaire based on the clues found in the posts. The questionnaire is derived from the Eating Disorder Examination Questionnaire (EDE-Q), a 28-item self-report questionnaire adapted from the Eating Disorder Examination (EDE) semi-structured interview; only questions 1-12 and 19-28 are used.

In this edition, we participated in Task 2 and Task 3. Table 1 summarizes the training dataset, reporting the number of subjects, the number of submissions (posts and comments), the average number of submissions per subject, the average number of days from first to last submission, and the average number of words per submission.

For Task 3, which is a continuation of Task 3 of eRisk 2022 and 2023, we used only the 2023 dataset, which contains a total of 404,404 text-question pairs.

3. Methodology

This section details the processes, techniques, and tools used for Task 2 and Task 3.

3.1. Task 2

Figure 1 shows the general architecture of our approach for Task 2. First, we preprocessed the data by selecting the user messages most relevant for identifying anorexia. Second, we divided the dataset into two subsets with an 80-20 ratio: a training set, used to train the model, and a validation set, held out from the training data and used to evaluate the model’s performance during training. Third, we took the last hidden state of the pre-trained language model as the text representation and used a sentiment analysis model to obtain sentiment features from the texts. Finally, the last hidden state and the logits of the sentiment analysis model are concatenated and fed to a neural network that acts as the classification head. This network consists of a normalization layer (LayerNorm), a dropout layer, linear layers with Tanh activations, and a final linear layer that produces the anorexia prediction.
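The classification head described above can be sketched as a plain-Python forward pass. This is only an illustration under stated assumptions, not the actual implementation: the layer sizes, the random weights, the use of a single hidden layer, and the helper names `layer_norm`, `linear`, `random_layer`, and `classification_head` are all hypothetical, and dropout is omitted because it acts as the identity at inference time.

```python
import math
import random

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def linear(x, weights, bias):
    """Dense layer: `weights` holds one row of input weights per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def random_layer(n_in, n_out, rng):
    """Hypothetical randomly initialised layer, used only for illustration."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def classification_head(hidden_state, sentiment_logits, params):
    """Concatenate text and sentiment features, then LayerNorm -> Linear+Tanh -> Linear."""
    features = list(hidden_state) + list(sentiment_logits)
    x = layer_norm(features)
    # Dropout would be applied here during training; it is the identity at inference.
    (w1, b1), (w2, b2) = params
    x = [math.tanh(v) for v in linear(x, w1, b1)]
    return linear(x, w2, b2)  # two logits: control vs. anorexia

rng = random.Random(0)
hidden_size, n_sentiment, n_hidden = 8, 3, 4  # toy sizes, not the real model dimensions
params = (random_layer(hidden_size + n_sentiment, n_hidden, rng),
          random_layer(n_hidden, 2, rng))
hidden_state = [rng.gauss(0, 1) for _ in range(hidden_size)]  # stands in for the encoder output
sentiment_logits = [1.2, -0.3, -0.9]  # made-up negative / neutral / positive logits
logits = classification_head(hidden_state, sentiment_logits, params)
```

In a real system the hidden state would come from the fine-tuned encoder and the head would be trained jointly with it; the sketch only shows how the two feature sources are combined.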

For this task, we used only the 2019 dataset. From Table 1, we can see that at the post and comment level there are a total of 253,752 posts, of which 24,874 come from users with anorexia and 228,878 from control users, indicating a significant imbalance. Therefore, we applied a preprocessing step to prevent the model from always predicting the majority class and to reduce the noise in the dataset.

Sentiment analysis involves the use of NLP techniques to identify and categorize opinions expressed in a text, specifically to determine whether the sentiment is positive, negative, or neutral. For example, [9] shows the relationship between emotions and mental illness, as well as the importance of automatic emotion recognition in the health field. In the context of anorexia, this analysis can help identify patterns in language that may indicate the presence of the disease. In our preprocessing, we kept only negative texts from users with anorexia and positive and neutral texts from control users.
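The selection rule in the previous paragraph can be sketched as a simple filter. The record layout (`text`, `user_label`, `sentiment`) and the function name `select_texts` are assumptions made for illustration; the sentiment label is assumed to have been produced beforehand by a sentiment analysis model.

```python
def select_texts(posts):
    """Keep negative posts from anorexia users and positive/neutral posts from control users."""
    selected = []
    for post in posts:
        if post["user_label"] == "anorexia" and post["sentiment"] == "negative":
            selected.append(post)
        elif post["user_label"] == "control" and post["sentiment"] in {"positive", "neutral"}:
            selected.append(post)
    return selected

# Made-up records; the sentiment field stands in for the sentiment model's prediction.
posts = [
    {"text": "I hate how I look", "user_label": "anorexia", "sentiment": "negative"},
    {"text": "Nice weather today", "user_label": "anorexia", "sentiment": "positive"},
    {"text": "Great game last night", "user_label": "control", "sentiment": "positive"},
    {"text": "This traffic is awful", "user_label": "control", "sentiment": "negative"},
]
selected = select_texts(posts)
```

Only the first and third records survive the filter, which shows why negative but harmless posts from control users disappear from the training data.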

To address this task, we followed a supervised learning approach. To train our model, we used the two datasets obtained after the selection process. It is worth mentioning that the organizers only provided training data, so we created a custom validation split. This split was created using stratified sampling in order to preserve the balance between labels. Table 2 shows the distribution of the processed dataset in the training and validation sets. The training set contains a total of 4,656 texts from users suffering from anorexia and 11,309 from users not suffering from anorexia. The validation set contains 1,164 anorexia texts and 2,828 texts that are not related to anorexia. Moreover, we deleted all mentions, URLs, and hashtags from the texts, and identified and removed sequences such as “amp;format=png” and “amp;s=7b66887b445eb00d7d842b15e15e15e15f4759f3deb03d”, among others.

For this task, we evaluated the BERT [10], RoBERTa [11], and RoBERTa-large [11] models for text representation and the Cardiff NLP TweetEval model for sentiment analysis.
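A stratified 80-20 split of the kind described above can be sketched with the standard library alone. The function name `stratified_split`, the seed, and the synthetic examples are assumptions; the exact counts in Table 2 depend on how the original split rounded the class sizes, so this sketch is not expected to reproduce them digit for digit.

```python
import random
from collections import defaultdict

def stratified_split(examples, validation_ratio=0.2, seed=42):
    """Split (text, label) pairs so that each label keeps the same share in both subsets."""
    by_label = defaultdict(list)
    for example in examples:
        by_label[example[1]].append(example)
    rng = random.Random(seed)
    train, validation = [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        n_val = round(len(items) * validation_ratio)
        validation.extend(items[:n_val])
        train.extend(items[n_val:])
    return train, validation

# Synthetic placeholder data: 50 anorexia texts (label 1) and 120 control texts (label 0).
examples = [(f"anorexia text {i}", 1) for i in range(50)]
examples += [(f"control text {i}", 0) for i in range(120)]
train, validation = stratified_split(examples)
```

Because the split is performed per label, the anorexia-to-control ratio is the same in the training and validation sets.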

BERT [10] is a language model developed by Google in 2018 based on the Transformer architecture, a neural network designed to process sequential data such as text or audio. BERT was pre-trained on large amounts of text, allowing it to capture general linguistic knowledge. This pre-trained model can then be fine-tuned for specific natural language processing tasks such as sentiment analysis, machine translation, or question answering.

RoBERTa [11] is an extension of the BERT language model developed by Facebook AI. It focuses on large-scale training, removes the next sentence prediction objective, and uses more robust training dynamics. These improvements make RoBERTa more effective and accurate than BERT at a variety of natural language processing tasks.

RoBERTa-large [11] is a larger and more powerful version of the RoBERTa language model. Like RoBERTa, it is based on the BERT architecture, but it has more layers and parameters. This additional capacity allows it to capture more complex and general linguistic patterns.

The Cardiff NLP TweetEval model [12] is a RoBERTa-based model trained specifically for sentiment analysis of tweets. It was trained on approximately 58 million tweets and fine-tuned for sentiment analysis on the TweetEval benchmark dataset.

Finally, for early detection, we evaluated a strategy that issues a decision once the number of a user’s messages showing signs of anorexia exceeds a certain threshold.

3.2. Task 3

This task involves estimating traits associated with an eating disorder diagnosis from a set of user posts. The organizers provided each user’s posting history along with a standardized eating disorder questionnaire. Thus, the primary goal of this task is to predict potential responses to the questionnaire based on the user’s posting history.

The questionnaire in question is the Eating Disorder Examination Questionnaire (EDE-Q), a 28-item self-report questionnaire derived from the semi-structured interview known as the Eating Disorder Examination (EDE). In this case, our goal is to predict responses to questions 1-12 and 19-28. The dataset consists of 28 instances of users’ posting history along with their corresponding responses to the EDE-Q questionnaire.

For this task, we adopted a fine-tuning approach using a sentence transformer model that uses textual similarity to measure the similarity between potential responses (user thread text) and each question in the EDE-Q. To achieve this, we processed the user text, mapped it to the 22 questions, and assigned a score based on the user’s responses to the questionnaire. To derive a scale-based score, we defined specific intervals for each possible answer within the questionnaire.

0. NO DAYS / not at all (0 to 0.1)
1. 1-5 DAYS / slightly (0.1 to 0.2)
2. 6-12 DAYS / slightly (0.2 to 0.3)
3. 13-15 DAYS / moderately (0.3 to 0.4)
4. 16-22 DAYS / moderately (0.4 to 0.5)
5. 23-27 DAYS / markedly (0.5 to 0.7)
6. EVERY DAY / markedly (0.7 to 1.0)

Thus, within the training set, each text is associated with specific questions and assigned a score, which is a randomly generated value that falls within the interval corresponding to the user’s response. We also chose a custom 80-20 split for validation. The training set contains 323,523 text-question relations along with their respective scores, while the validation set contains 80,881 such relations.
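The interval scheme can be sketched as a pair of mappings: one that draws a random training score for a known answer, and one that converts a predicted score back into an answer. The names `ANSWER_INTERVALS`, `answer_to_score`, and `score_to_answer` are hypothetical, and sending a score that lies exactly on a boundary to the higher answer is an assumption, since the intervals as listed share their endpoints.

```python
import random

# Score interval assigned to each EDE-Q answer (0-6), following the list above.
ANSWER_INTERVALS = {
    0: (0.0, 0.1),
    1: (0.1, 0.2),
    2: (0.2, 0.3),
    3: (0.3, 0.4),
    4: (0.4, 0.5),
    5: (0.5, 0.7),
    6: (0.7, 1.0),
}

def answer_to_score(answer, rng):
    """Draw a random training score inside the interval of the given answer."""
    low, high = ANSWER_INTERVALS[answer]
    return rng.uniform(low, high)

def score_to_answer(score):
    """Map a similarity score to an answer; boundary values go to the higher answer (assumption)."""
    for answer, (low, high) in ANSWER_INTERVALS.items():
        if low <= score < high:
            return answer
    return 6 if score >= 1.0 else 0  # clamp scores that fall outside [0, 1)

rng = random.Random(0)
training_scores = {answer: answer_to_score(answer, rng) for answer in ANSWER_INTERVALS}
```

Note that the last two intervals are wider than the others, so high-frequency answers cover a larger share of the score range.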

For this task, the dataset was first processed by removing contractions, mentions, hashtags, URLs, and AMP expressions, and by extracting emoji features using the emoji Python library. Second, we fine-tuned the multi-qa-mpnet-base-dot-v1 (https://huggingface.co/sentence-transformers/multi-qa-mpnet-base-dot-v1) and sentence-transformers/all-MiniLM-L6-v2 (https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) models with cosine similarity as the loss function, 10 epochs, and 1000 warm-up steps. multi-qa-mpnet-base-dot-v1 is based on the MPNet (Masked and Permuted Pre-training for Language Understanding) architecture, which builds on the BERT (Bidirectional Encoder Representations from Transformers) model. sentence-transformers/all-MiniLM-L6-v2 is an all-round model tuned for many use cases and trained on a large and diverse dataset of over 1 billion training pairs.
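The removal of mentions, hashtags, URLs, and AMP expressions can be sketched with regular expressions. The patterns and the function name `clean_text` are assumptions chosen for illustration and may differ from the rules actually applied; contraction handling and emoji feature extraction are left out.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
AMP_RE = re.compile(r"\bamp;\S*")   # residues such as "amp;format=png"
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
SPACE_RE = re.compile(r"\s+")

def clean_text(text):
    """Strip URLs, AMP residues, mentions, and hashtags, then collapse whitespace."""
    for pattern in (URL_RE, AMP_RE, MENTION_RE, HASHTAG_RE):
        text = pattern.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()

cleaned = clean_text("Check this @user #diet https://example.com/img?x=1&amp;format=png now")
```

URLs are removed first so that fragments inside a link are not matched separately by the other patterns.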

4. Results

This section describes the systems submitted by our team in each run and the results obtained in each task.

4.1. Task 2

• Run 0: This run uses a classification model obtained by fine-tuning RoBERTa-base with sentiment features on the preprocessed set of posts. The threshold of the early detection strategy is 10, i.e., a decision is made when more than 10 of the user’s posts are classified as showing signs of anorexia.

• Run 1: This run uses the same classification model as Run 0, but with a threshold of 15 for the early detection strategy.

• Run 2: In this run, we used the fine-tuned DeBERTa as the classification model and a threshold of 10 for the early detection strategy.

• Run 3: This run uses the same classification model as Run 2, but with a threshold of 15 for the early detection strategy.

• Run 4: This run has the same structure as Run 2, but changes the early detection strategy threshold to 20.
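The threshold-based early detection strategy shared by these runs can be sketched as follows. The function name `early_decision` and the toy prediction stream are assumptions; each element of the stream stands for the per-post output of the classifier (1 for a post classified as showing signs of anorexia, 0 otherwise).

```python
def early_decision(post_predictions, threshold=10):
    """Return (decision, posts_read): alert once more than `threshold` posts are positive."""
    positives = 0
    for posts_read, is_positive in enumerate(post_predictions, start=1):
        positives += int(is_positive)
        if positives > threshold:
            return 1, posts_read
    return 0, len(post_predictions)

# Toy stream of 40 posts in which every other post is classified as positive.
stream = [1, 0] * 20
decision_10 = early_decision(stream, threshold=10)
decision_15 = early_decision(stream, threshold=15)
decision_20 = early_decision(stream, threshold=20)
```

On this toy stream, a higher threshold delays the alert, and with a threshold of 20 the user is never flagged because only 20 posts are positive.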

Table 4 shows the results of the decision-based evaluation of Task 2, specifically the precision, recall, and F1 score over the five runs. Precision is low and varies little across runs, ranging from 0.14 to 0.16. Recall is very high across all runs, between 0.98 and 0.99, demonstrating the model’s effectiveness in capturing almost all relevant instances. The F1 score, which balances precision and recall, shows a slight improvement from 0.25 in Run 0 to 0.27 in Run 4. The ERDE5 and ERDE50 metrics, which measure early risk detection errors, remain relatively stable, indicating consistent early detection performance across all runs. Latency, which reflects how many user writings are processed before a correct positive decision is made, increases from 18.0 in Run 0 to 35.5 in Run 4. Speed, which decreases as decisions are delayed, drops slightly to a low of 0.87 in Run 4.
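The relation between latency and speed can be illustrated with the latency penalty commonly used in eRisk evaluations. The sketch below assumes that definition (a sigmoid penalty with p = 0.0078 and speed equal to one minus the median penalty over true positives); neither the formula nor the constant is stated in this paper, and the helper names are hypothetical.

```python
import math
from statistics import median

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def latency_penalty(k, p=0.0078):
    """Penalty in [0, 1) for a true positive decided after reading k writings (assumed form)."""
    return -1.0 + 2.0 / (1.0 + math.exp(-p * (k - 1)))

def speed(true_positive_latencies, p=0.0078):
    """One minus the median latency penalty over correctly flagged users (assumed form)."""
    return 1.0 - median([latency_penalty(k, p) for k in true_positive_latencies])

fast = speed([18])       # a single true positive flagged after 18 writings
slow = speed([35, 36])   # median latency of 35.5 writings
```

Under these assumptions, a median latency of about 35 writings yields a speed of roughly 0.87, in line with the value reported for Run 4.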

Overall, Run 4 achieves the highest precision and F1 score at the cost of higher latency, while Runs 0 and 2 offer lower latency at slightly lower precision and F1 score. With these results, we ranked fifth in the decision-based evaluation.

4.2. Task 3

For this task, we submitted two runs based on fine-tuning a pre-trained sentence transformer model that uses textual similarity to measure how closely potential responses (user thread text) match each question in the EDE-Q.

• Run 0: This run uses the fine-tuned multi-qa-mpnet-base-dot-v1 model to measure the similarity between each EDE-Q question and the user’s posts. Each user post is fed into the system, which computes the degree of similarity between the EDE-Q question and the post. Based on this score, a candidate answer to the question is assigned according to the intervals defined in Section 3.2. Once all posts have been processed, the most frequent answer for each question is taken as the final answer.

• Run 1: This run follows the same approach as Run 0, but uses the fine-tuned sentence-transformers/all-MiniLM-L6-v2 model.
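The final aggregation step, in which the most repeated answer across a user's posts becomes the answer to each question, can be sketched as a majority vote. The function names and the tie-breaking rule (the lowest answer wins) are assumptions, since the handling of ties is not described.

```python
from collections import Counter

def most_repeated_answer(post_answers):
    """Return the most frequent answer; ties go to the lowest answer (assumption)."""
    counts = Counter(post_answers)
    best = max(counts.values())
    return min(answer for answer, count in counts.items() if count == best)

def fill_questionnaire(answers_per_question):
    """Map each question id to the most repeated per-post answer."""
    return {question: most_repeated_answer(answers)
            for question, answers in answers_per_question.items()}

# Made-up per-post answers (0-6) for two questions of one user.
answers_per_question = {1: [0, 0, 3, 6], 2: [5, 5, 2, 2]}
final_answers = fill_questionnaire(answers_per_question)
```

For question 2 the answers 5 and 2 are tied, so the assumed rule selects 2; a different tie-breaking rule would change such cases.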

Table 6 shows the results obtained in the evaluation of Task 3 according to different metrics: MAE (Mean Absolute Error), MZOE (Mean Zero-One Error), MAE_macro, GED (Global Eating Disorder Score), RS (Restraint Subscale), ECS (Eating Concern Subscale), SCS (Shape Concern Subscale), and WCS (Weight Concern Subscale).

First, we looked at the Mean Absolute Error (MAE), which measures the average size of the errors in the predictions without considering their direction. Run 1 achieved an MAE of 2.227, while Run 0 achieved an MAE of 2.366. This lower error indicates that Run 1 had a higher overall accuracy in its predictions.

The MZOE metric measures the fraction of questions for which the predicted answer differs from the actual one. Run 0 had an MZOE of 0.798 compared to 0.859 for Run 1. This means that Run 0 gave the exact answer more often.

As for MAE_macro, which evaluates the mean absolute error balanced across classes, Run 1 performed better with a value of 2.286 compared to 2.833 for Run 0. This result indicates that Run 1 achieved a more balanced performance across the different answer categories, which is crucial in situations where all classes are equally important.
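The three error measures discussed so far can be sketched for a single questionnaire as follows. The function names are hypothetical, and averaging MAE_macro only over the answer classes that actually occur is an assumption; the official evaluation also averages these values over all users.

```python
from collections import defaultdict

def mae(true_answers, predicted_answers):
    """Mean absolute difference between actual and predicted answers."""
    pairs = list(zip(true_answers, predicted_answers))
    return sum(abs(t - p) for t, p in pairs) / len(pairs)

def mzoe(true_answers, predicted_answers):
    """Fraction of questions whose predicted answer differs from the actual one."""
    pairs = list(zip(true_answers, predicted_answers))
    return sum(1 for t, p in pairs if t != p) / len(pairs)

def mae_macro(true_answers, predicted_answers):
    """MAE computed per actual answer class, then averaged over the classes present."""
    errors = defaultdict(list)
    for t, p in zip(true_answers, predicted_answers):
        errors[t].append(abs(t - p))
    return sum(sum(e) / len(e) for e in errors.values()) / len(errors)

# Made-up answers: three questions answered 0 and one answered 6.
true_answers = [0, 0, 0, 6]
predicted_answers = [0, 1, 0, 2]
scores = (mae(true_answers, predicted_answers),
          mzoe(true_answers, predicted_answers),
          mae_macro(true_answers, predicted_answers))
```

In this toy case the large error on the rare answer 6 weighs more heavily in MAE_macro than in MAE, which is why the two metrics can rank systems differently.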

The GED measures how accurately the model predicts the global eating disorder score, with lower values being better. Run 0 had a GED of 3.261, while Run 1 had a GED of 3.286. Although the difference is small, Run 0 showed slightly better performance on this overall measure.

For the RS, which measures accuracy in predicting dietary restraint behavior, both runs showed very similar results, with Run 1 scoring 3.269 and Run 0 scoring 3.285. This parity indicates that both runs are comparable in terms of accuracy on this specific subscale.

On the ECS, Run 0 showed better performance with an ECS of 2.659 compared to 2.911 for Run 1. This result suggests that Run 0 was more effective at capturing specific eating concerns.

On the SCS, Run 1 performed better with an SCS of 2.560 compared to 2.771 for Run 0. This data suggests that Run 1 was more accurate in predicting body shape concerns.

Finally, on the WCS, Run 1 also outperformed Run 0 with a WCS of 2.026 compared to 2.218. This demonstrates a better ability of Run 1 to predict weight concern.

In summary, although Run 1 showed better overall accuracy and more balanced performance on most metrics, Run 0 excelled in specific aspects such as MZOE, GED, and ECS.

5. Conclusion

This paper summarizes UMUTeam’s participation in the eRisk collaborative task of the 2024 edition of CLEF. The eRisk Lab focuses on the development of assessment methods and metrics for the early detection of risks on the Internet, especially related to health and safety issues. In this edition, the focus is on detecting symptoms of depression, early detection of signs of anorexia, and measuring the severity of signs of eating disorders in three related subtasks.

In this shared task, we have focused on Task 2 and Task 3, which are related to early detection of signs of anorexia and measuring the severity of eating disorder signs. For this purpose, several approaches were used, including the fine-tuning of a sentence transformer model for measuring the severity of eating disorder signs and the fine-tuning of pre-trained Transformers-based language models with sentiment features for detecting anorexia signs.

In Task 2, we presented five runs based on different settings, using different fine-tuned models as the classification model and different thresholds for the early detection strategy. We ranked fifth in the decision-based evaluation: Run 4 achieved the highest precision and F1 score at the cost of higher latency, while Runs 0 and 2 offered lower latency with slightly lower precision and F1 score. We also ranked fifth in the ranking-based evaluation. In this case, all five runs give identical results: P@10 is consistently 0.20, NDCG@10 is 0.12, and NDCG@100 is 0.14.

From the results obtained, we can see that the sheer number of comments may not be enough; the context and severity of the comments are also important. We also found that removing certain negative comments from users labeled as “control” runs the risk of the model not learning to properly distinguish between negative comments that are normal and those that are indicative of a mental disorder, which could degrade the performance of the system.

In Task 3, we presented two runs based on fine-tuning a pre-trained sentence transformer model that uses textual similarity to measure the similarity between possible answers (user thread text) and each question in the EDE-Q. In this case, Run 1, which is based on a fine-tuned sentence-transformers/all-MiniLM-L6-v2 model, achieved the best overall accuracy and a more balanced performance on most metrics.

As future work, we suggest adding the user’s previous context as an input to improve performance, and not removing all negative comments from users marked as “control”, so that the model learns to distinguish correctly between negative comments that are normal and those that are indicative of a mental disorder. Furthermore, it is important to examine the relationship between indicators of mental illness and hate speech [13], the use of humor [14], and the demographic and psychographic characteristics of the message authors [15].

Acknowledgments

This work is part of the research projects LaTe4PoliticES (PID2022-138099OB-I00), funded by MICIU/AEI/10.13039/501100011033 and the European Regional Development Fund (ERDF), a way of making Europe, and LT-SWM (TED2021-131167B-I00), funded by MICIU/AEI/10.13039/501100011033 and by the European Union NextGenerationEU/PRTR, and “Services based on language technologies for political microtargeting” (22252/PDC/23), funded by the Autonomous Community of the Region of Murcia through the Regional Support Program for the Transfer and Valorization of Knowledge and Scientific Entrepreneurship of the Seneca Foundation, Science and Technology Agency of the Region of Murcia. Mr. Ronghao Pan is supported by the Programa Investigo grant, funded by the Region of Murcia, the Spanish Ministry of Labour and Social Economy and the European Union - NextGenerationEU under the “Plan de Recuperación, Transformación y Resiliencia (PRTR)”.

[1] Dattani, Rodés-Guirao, Ritchie, Roser, Mental health, Our World in Data (2023).

[2] Sacco, Camilleri, Eberhardt, Umla-Runge, Newbury-Birch, A systematic review and meta-analysis on the prevalence of mental disorders among children and adolescents in Europe, European Child & Adolescent Psychiatry (2022). doi:10.1007/s00787-022-02131-2.

[3] R. A. Calvo, D. N. Milne, M. S. Hussain, Christensen, Natural language processing in mental health applications using non-clinical texts, Natural Language Engineering 23 (2017) 649-685. URL: https://api.semanticscholar.org/CorpusID:17828909.

[4] A. M. M.-R., Adrián Moreno-Muñoz, Flor Miriam Plaza-del-Arco, María Dolores Molina-González, Maria Teresa Martín-Valdivia, Luis Alfonso Ureña-López, Arturo Montejo-Ráez, Overview of MentalRiskES at IberLEF 2023: Early Detection of Mental Disorders Risk in Spanish, Procesamiento del Lenguaje Natural 71 (2023) 329-350. URL: http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6564.

[5] Parapar, Martín-Rodilla, D. E. Losada, Crestani, Overview of eRisk 2023: Early Risk Prediction on the Internet, in: A. Arampatzis, E. Kanoulas, T. Tsikrika, S. Vrochidis, A. Giachanou, D. Li, M. Aliannejadi, M. Vlachos, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction, Springer Nature Switzerland, Cham, 2023, pp. 294-315.

[6] Parapar, Martín Rodilla, Losada, Crestani, Overview of eRisk 2024: Early Risk Prediction on the Internet, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction. 15th International Conference of the CLEF Association, CLEF 2024, 2024.

[7] Parapar, Martín Rodilla, Losada, Crestani, Overview of eRisk 2024: Early Risk Prediction on the Internet (Extended Overview), in: Working Notes of the Conference and Labs of the Evaluation Forum CLEF 2024, 2024.

[8] D. E. Losada, Crestani, A test collection for research on depression and language use, in: International Conference of the Cross-Language Evaluation Forum for European Languages, Springer, 2016, pp. 28-39.

[9] Salmerón-Ríos, J. A. García-Díaz, Pan, Valencia-García, Fine grain emotion analysis in Spanish using linguistic features and transformers, PeerJ Computer Science 10 (2024) e1992. doi:10.7717/peerj-cs.1992.

[10] Devlin, Chang, Lee, Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, CoRR abs/1810.04805 (2018). URL: http://arxiv.org/abs/1810.04805. arXiv:1810.04805.

[11] Liu, Ott, Goyal, Du, Joshi, Chen, Levy, Lewis, Zettlemoyer, V. Stoyanov, RoBERTa: A Robustly Optimized BERT Pretraining Approach, CoRR abs/1907.11692 (2019). URL: http://arxiv.org/abs/1907.11692. arXiv:1907.11692.

[12] Barbieri, Camacho-Collados, Neves, L. Espinosa-Anke, TweetEval: Unified benchmark and comparative evaluation for tweet classification, arXiv preprint arXiv:2010.12421 (2020).

[13] J. A. García-Díaz, S. M. Jiménez-Zafra, M. A. García-Cumbreras, Valencia-García, Evaluating feature combination strategies for hate-speech detection in Spanish using linguistic features and transformers, Complex & Intelligent Systems (2022) 1-22.

[14] J. A. García-Díaz, Valencia-García, Compilation and evaluation of the Spanish SatiCorpus 2021 for satire identification using linguistic features and transformers, Complex & Intelligent Systems 8 (2022) 1723-1736.

[15] J. A. García-Díaz, Colomo-Palacios, Valencia-García, Psychographic traits identification based on political ideology: An author analysis study on Spanish politicians' tweets posted in 2020, Future Generation Computer Systems 130 (2022) 59-74.