
ELiRF-VRAIN at MentalRiskES 2024: Using LongFormer for Early Detection of Mental Disorders Risk

Andreu Casamayor

Vicent Ahuir

Antonio Molina

Lluís-Felip Hurtado

Valencian Research Institute for Artificial Intelligence (VRAIN), Universitat Politècnica de València, Camino de Vera s/n, 46022 Valencia, Spain

2024


This paper describes the approaches taken by the ELiRF-VRAIN team at the shared tasks of MentalRiskES at IberLEF 2024 [1]. These shared tasks involved two activities focused on identifying mental illness on Spanish-language social media: detection of disorder and context detection. Our work consisted of three approaches: one approach based on a Support Vector Machine and the other two based on Transformer architecture pre-trained models, one using BERT-like models and the other using LongFormer models. In order to fine-tune our models, we used a data augmentation process on the data provided by the organization. According to the results, our approaches fit the task correctly.

Keywords: Longformer, Transformers, Support Vector Machine, Mental disorder detection

1. Introduction

A mental disorder is characterized by a clinically significant disturbance in an individual’s cognition, emotional regulation, or behavior. It is usually associated with distress or impairment in important areas of functioning [ 2 ].

According to the World Health Organization (WHO), 1 in every 8 people is living with a mental disorder, with anxiety and depressive disorders being the most common [3]. Although the problem is widely known, the number of people affected is still increasing, and discrimination against them still exists. Governments are currently working to prevent and treat mental illness. However, the lack of human and material resources means that many people receive inadequate treatment or none at all. In addition, early detection of mental disorders is often difficult.

In this context, detecting mental disorders risk through analyzing social media interactions has acquired great relevance in recent years. Many factors make the problem of mental disorders detection complicated, such as availability, amount, and quality of data. Providing quality labeled data in Spanish and promoting the creation of models for this early detection is precisely the objective of the MentalRiskES shared tasks.

In the 2024 edition, the competition consisted of three tasks [4]: (1) Detection of mental disorder, (2) Context Detection, and (3) Suicidal ideation detection. Our team participated in the first two tasks.

To tackle Task 1, we considered three different approaches:
1. The first approach is based on a classic machine learning algorithm: Support Vector Machines (SVM). SVMs have demonstrated adequate behavior in long-text classification tasks such as this one. We consider this approach an assessment of the performance of classical models.
2. The second approach is based on Transformers [5]. We use a pre-trained RoBERTa model [6] as a basis and then run a fine-tuning process to adjust it to the task domain. We considered two different datasets for fine-tuning: the one provided by the organization and an expanded version of it obtained through a data augmentation process.
3. The last approach is similar to the second one; however, to capture more context, we use a pre-trained LongFormer model [7], whose larger input size allows it to process longer message histories. We used the same datasets as in the previous approach for the fine-tuning phase.

We submitted three runs for Task 1, one for each approach. The best model of each approach was chosen through a previous validation stage in which different parameters and datasets were considered.

To tackle Task 2, we submitted one system based on the third approach of the first task, a LongFormer-based solution. We chose that approach since it was the most promising one based on the evaluation results of Task 1.

2. Description of Dataset and Tasks

The datasets delivered by the organization consisted of a collection of messages sent to different groups on Telegram [8]. These public groups are in Spanish and related to mental illnesses. The messages were anonymized and, subsequently, labeled by ten annotators at the user level; that is, each user was labeled considering his/her messages.

Two different datasets were delivered: one for the first two tasks and a different one for the third task. The first dataset, the one with which we worked, has the following sample distribution: 20 users for trial, 465 users for train, and 400 users for test.

As stated above, the main objective of this competition is to predict mental disorders as soon as possible. To achieve realistic behavior, the organization emulated a real conversation by setting up a server that gives out packets of data containing a message for each user. The system must predict the label of each user, considering the current message and all their previous messages, before it receives the next packet. The goal is to predict each user’s mental disorder, if any, as quickly as possible.

2.1. Task 1: Disorder Detection

Task 1 is a multiclass classification task whose objective is to predict whether users suffer from depression, anxiety, or no disorder.

Table 1 shows the distribution among the different labels in the dataset for the first task.

Table 1: Distribution of labels (None, Depression, Anxiety, Total) in the dataset for Task 1.

To maximize the available samples for the training process, we joined the Train and Trial partitions to train our systems; the Total column of Table 1 shows the final sample distribution of our training dataset.

2.2. Task 2: Context Detection

Task 2 is a two-level multiclass multilabel task: in addition to detecting the mental illness, the context or contexts in which it appears must be detected. There are 7 contexts: addiction, emergency, family, work, social, other, and none.

The label distribution in this total dataset can be seen in Table 2. It shows how the contexts of Family, Social, Other, and None are the most common.

Table 2: Label distribution by context.
Context     Depression   Anxiety   Total
Addiction   9            3         12
Emergency   7            10        17

3. System architecture and Techniques

In this kind of task, an important aspect to take into account is the amount of context required to perform the detection correctly. Since each user can have many messages, the size of the input to the system must be considered. One goal of our team was to study the impact of context in these tasks, that is, to measure the capabilities of different systems depending on how much context they can manage. We selected three different systems to achieve this goal: the first based on Support Vector Machines (SVM), the second based on a RoBERTa model, and the third based on a LongFormer model. Each system has a different context size:
• SVM has no limit on the input size; it creates a vector as long as the vocabulary size.
• The selected RoBERTa model has a limit of 512 tokens in the input.
• The selected LongFormer model has a limit of 4096 tokens in the input.

Regarding the dataset, we translated all the data into English because the base Transformer models were pre-trained on documents in this language. We used the EasyNMT library [9] and the OPUS-MT Spanish-English model [10] (https://huggingface.co/Helsinki-NLP/opus-mt-es-en). Furthermore, we created two different datasets to train and evaluate the performance of the transformer-based systems.

Dataset 1. We created only one sample per user by accumulating all his/her messages, for both positive and negative labeled users.

Dataset 2. If we had some a priori evidence of the message in which a user begins to present symptoms of mental illness risk, we could label the samples built from previous messages as negative, and the samples containing that message and subsequent ones as positive. In this way, we could increase the number of positive samples in order to achieve a more precise model. This data augmentation process is explained below.

To carry out our experimentation, we divided the original dataset into two partitions: training (80% of users) and development (20% of users), maintaining the proportions of positive and negative samples in each of the partitions. Table 4 shows the distribution of samples in Dataset 1.

Table 4: Distribution of samples (None, Depression, Anxiety, Total) in Dataset 1.
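The user-level split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `stratified_user_split`, the fixed random seed, and the rounding of the per-label development size are assumptions made for the example.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def stratified_user_split(
    user_labels: Dict[str, str],
    dev_fraction: float = 0.2,
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Split user ids into train/dev while keeping label proportions."""
    # Group users by label so each label is split independently.
    by_label = defaultdict(list)
    for user, label in sorted(user_labels.items()):
        by_label[label].append(user)

    rng = random.Random(seed)
    train, dev = [], []
    for label in sorted(by_label):
        users = by_label[label]
        rng.shuffle(users)
        # Assumption: the dev size per label is rounded to the nearest user.
        n_dev = round(len(users) * dev_fraction)
        dev.extend(users[:n_dev])
        train.extend(users[n_dev:])
    return train, dev
```

Splitting at the user level, rather than at the message level, keeps all messages of a user in the same partition.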

3.1. Data Augmentation

The data augmentation process aims to create more samples per positive user. As stated above, this requires some evidence of the message in which a user begins to express symptoms of illness. To obtain it, we relied on the predictions of the SVM-based classifier: we assume that none of the messages preceding the SVM decision express symptoms of illness. We followed these steps:
1. For positive users, we calculated how many messages the SVM needs to classify the user as positive (depression or anxiety). Each user has a different trigger value.
2. For false negatives, we used the mean of the true-positive trigger values as the trigger value.
3. For each positive user in the original dataset, let $n$ be the number of messages that the SVM model needs to determine this user’s mental disorder risk, $MAX$ be the maximum number of messages the model supports as input, and $m_i$ the $i$-th message from the user.
   a) We created $n - 1$ negative samples as follows: $(m_1), (m_1 m_2), (m_1 m_2 m_3), \ldots, (m_1 \ldots m_{n-1})$
   b) and $MAX - n + 1$ positive samples: $(m_1 \ldots m_n), (m_1 \ldots m_{n+1}), \ldots, (m_1 \ldots m_{MAX})$
4. Note that the value of $MAX$ depends on which model was used and the number of tokens in the messages. That is, we discard messages from an accumulated history of more than 512 tokens for RoBERTa and 4096 for LongFormer. So, if $n > MAX$, only negative samples are generated.
5. For negative users, we created new samples accumulating the history as before, stopping when $MAX$ was reached.

The result of this technique is a new dataset with a higher number of positive samples for the training.
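The procedure can be sketched in a few lines of Python. This is an illustrative reading of the steps above, not the authors' implementation: the function name `augment_user`, the `count_tokens` callback (a stand-in for the model tokenizer), and joining messages with a single space are assumptions.

```python
from typing import Callable, List, Tuple


def augment_user(
    messages: List[str],
    label: str,
    trigger: int,
    count_tokens: Callable[[str], int],
    max_tokens: int,
) -> List[Tuple[str, str]]:
    """Create one sample per accumulated message history of a user.

    Histories with fewer than `trigger` messages are labelled "none";
    histories with at least `trigger` messages keep the user's label.
    Generation stops once a history exceeds `max_tokens`
    (e.g. 512 for RoBERTa, 4096 for LongFormer).
    For negative users, pass label="none": every sample is then negative.
    """
    samples = []
    for k in range(1, len(messages) + 1):
        # Assumption: messages are concatenated with a single space.
        text = " ".join(messages[:k])
        if count_tokens(text) > max_tokens:
            break  # the accumulated history no longer fits the model input
        samples.append((text, label if k >= trigger else "none"))
    return samples
```

With a trigger of 2 and a user labelled "depression", the first history is labelled "none" and every later history that still fits the token budget is labelled "depression".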

Table 5: Distribution of samples (None, Depression, Anxiety, Total) after data augmentation.

3.2. Task 1: Disorder Detection

3.2.1. Classical Machine Learning Classifier Approach

To evaluate the importance of context, we wanted to use a classical machine learning classifier that can handle all the context. One of the most important issues of Transformer-based models is their poor ability to deal with long texts because of the limit on their input size. This issue can affect performance, since the input cannot hold the whole sample and valuable information may be lost.

First, we ran an experiment comparing different types of classical machine learning classifiers. The Scikit-learn library [11] provided us with the tools to develop this experiment. We trained all the classifiers with their default configuration in order to select the best one. The results can be seen in Table 6, which shows that the best classifier was the Linear SVM.

Table 6: Comparison of classical classifiers (Linear SVM, Gradient Boosting, K-Neighbors, Random Forest).

Once the classifier was chosen, we wanted to test different approaches:
• Data preprocessing:
  1. First approach: transform the text into tokens using TweetTokenizer and then remove stop words.
  2. Second approach: same as the first approach, with additional steps to clean the text (removing non-alphanumeric characters, among others) and lemmatize the tokens.
• Sentiment analysis: we used the model "lxyuan/distilbert-base-multilingual-cased-sentiments-student" [12] to perform a sentiment analysis of every message per user. We obtained three results (positive, negative, and neutral messages), all normalized in the end. We added these results as new features alongside the TF-IDF representation.
• TF-IDF: the TfidfVectorizer class in Scikit-learn was used to vectorize the data. We tested different configurations for the analyzer and ngram_range parameters, and used the default values for the others.
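To make the "char_wb" analyzer setting more concrete, the following pure-Python sketch extracts character n-grams inside word boundaries, padding each word with one space on both sides. It is a simplified illustration written for this text, not Scikit-learn's code; the function name `char_wb_ngrams` is hypothetical, and the handling of very short words may differ from the library. In practice one would simply pass analyzer="char_wb" and ngram_range=(4, 5) to TfidfVectorizer.

```python
from typing import List


def char_wb_ngrams(text: str, min_n: int = 4, max_n: int = 5) -> List[str]:
    """Return character n-grams that never cross word boundaries."""
    ngrams: List[str] = []
    for word in text.split():
        padded = f" {word} "  # mark the word boundaries with spaces
        for n in range(min_n, max_n + 1):
            if len(padded) < n:
                break  # simplification: skip n-grams longer than the word
            ngrams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return ngrams
```

Character n-grams of this kind tend to be robust to spelling variation, which is common in social media text.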

To find the best model for every approach, we performed an exhaustive grid search over some specific parameters, such as the regularization parameter C, different values of tol (tolerance for the stopping criterion), and different loss functions.

We obtained 6 different approaches. Table 7 shows the different configurations used in the experimentation; the TF-IDF column refers to the type of analyzer (word or char) and the n-gram range used. The last column refers to the best model found in the grid search.

The results in Table 8 show that the best configuration is SVM-4, which uses the most complete preprocessing, sentiment analysis, "char_wb" as the analyzer, and (4-5) as ngram_range. This model was used for Run0 in Task 1.

Table 7: Configurations used in the SVM experimentation.
Model   TF-IDF                  Best Model
SVM-1   "char_wb", 4-5 n-gram   'C': 1, 'loss': 'squared_hinge', 'tol': 0.1
SVM-2   "char_wb", 4-5 n-gram   'C': 1, 'loss': 'squared_hinge', 'tol': 0.1
SVM-3   "char_wb", 4-5 n-gram   'C': 1, 'loss': 'hinge', 'tol': 0.1
SVM-4   "char_wb", 4-5 n-gram   'C': 1, 'loss': 'hinge', 'tol': 0.1
SVM-5   "word", 1-2 n-gram      'C': 10, 'loss': 'hinge', 'tol': 0.1
SVM-6   "word", 1-2 n-gram      'C': 10, 'loss': 'hinge', 'tol': 0.1
SVM-7   "word", 1-2 n-gram      'C': 10, 'loss': 'hinge', 'tol': 0.1
SVM-8   "word", 1-2 n-gram      'C': 10, 'loss': 'hinge', 'tol': 0.1

3.2.2. BERT-like Model Approach

It is well known that the state-of-the-art models in NLP are based on Transformers. Models like BERT or RoBERTa usually provide good versatility for classification tasks. However, these models usually cannot handle more than 512 tokens, which can be a problem for tasks with long contexts such as the current ones. Therefore, we used one of these models as a baseline against which to compare models with a better capacity to handle large contexts. Research by Pourkeyvan et al. [13] shows that the state of the art in mental disorder detection is MentalRoBERTa [14]. MentalRoBERTa is a RoBERTa-like model specialized in mental health. It is pre-trained on a special corpus of texts from mental health forums, clinical notes, and general-domain text. Consequently, MentalRoBERTa adapts better to mental health-related language, which opens up many possible applications in this domain.

The model chosen was AIMH/mental-roberta-large [15], a RoBERTa model trained on Reddit posts related to mental health. This model can be found in the HuggingFace [16] public hub (https://huggingface.co/AIMH/mental-roberta-large). Furthermore, we wanted to compare a domain-specific RoBERTa model, like MentalRoBERTa, with the general-domain RoBERTa model used as the baseline of the competition (RoBERTa base).

Once we chose our pre-trained model, we performed an experiment that consisted of testing two fine-tuning processes: one with the Dataset 1 (RoBERTa-1) and the other with the Dataset 2 (RoBERTa-2); the second dataset is the one with data augmentation. Table 9 shows the configuration used in the fine-tuning process.

Table 9: Fine-tuning configuration.
Parameter             Value
optimizer             AdamW
learning rate         7e-5
lr scheduler type     linear
weight decay          0.01
number of epochs      10
training batch size   16

Table 10 shows the results of each model on the development partition. The results show that the best model is RoBERTa-2, the one fine-tuned with data augmentation. In our participation, this model was used for Run1 in Task 1.

Table 10: Results of the RoBERTa models on the development partition.
Model       Data Augmentation   Precision   Recall   F1-score
RoBERTa-1   No                  0.81        0.82     0.81
RoBERTa-2   Yes                 0.94        0.94     0.93

3.2.3. LongFormer Approach

As stated above, one of the most important disadvantages of Transformer-based models such as BERT or RoBERTa is their limited capacity to handle large contexts. However, a Transformer variant called LongFormer [7] can handle long texts.

LongFormer stands for “Long-Document Transformer” and can process long contexts more efficiently than Transformer models such as BERT or RoBERTa. The LongFormer architecture has the following characteristics:
• New attention mechanism: an efficient attention mechanism that uses a sliding window, where each token only attends to a fixed number of neighboring tokens, reducing the complexity.
• Global attention selection: the architecture can select which tokens are attended globally and which are only attended locally.
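The two mechanisms can be illustrated with a small boolean attention mask. The sketch below is a toy illustration of the idea and not the LongFormer implementation; the function names and the tiny sequence length are assumptions chosen for readability. A window of size w lets each token see w/2 neighbors on each side, and a global token attends to, and is attended by, every position.

```python
from typing import Iterable, List


def longformer_style_mask(
    seq_len: int, window: int, global_idx: Iterable[int] = ()
) -> List[List[bool]]:
    """mask[i][j] is True when token i may attend to token j."""
    half = window // 2
    # Local sliding-window attention: a band around the diagonal.
    mask = [[abs(i - j) <= half for j in range(seq_len)] for i in range(seq_len)]
    # Global attention is symmetric: full row and full column.
    for g in global_idx:
        for k in range(seq_len):
            mask[g][k] = True
            mask[k][g] = True
    return mask


def attended_pairs(mask: List[List[bool]]) -> int:
    """Count the (query, key) pairs that are actually computed."""
    return sum(sum(row) for row in mask)
```

For 8 tokens, a window of 2, and one global token, 34 of the 64 possible pairs are computed. For a fixed window, the number of pairs grows linearly with the sequence length instead of quadratically, which is what makes inputs of 4096 tokens tractable.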

The pre-trained model chosen was AIMH/mental-longformer-base-4096 [17], a pre-trained LongFormer for the mental health domain. This model can be found at https://huggingface.co/AIMH/mental-longformer-base-4096.

As with the RoBERTa model, we fine-tuned the LongFormer with the two datasets: Dataset 1 without data augmentation (LongFormer-T1-1), and Dataset 2 with data augmentation (LongFormer-T1-2). We used the same fine-tuning parameters as in the RoBERTa experimentation; the configuration is in Table 9.

Table 11 shows the results of the experimentation, where LongFormer-T1-2 (fine-tuned with data augmentation) achieves better performance than LongFormer-T1-1 (fine-tuned without data augmentation). This model was Run2 in our participation.

Table 11: Results of LongFormer-T1-1 and LongFormer-T1-2.

The experimentation for Task 1 shows that the best system is LongFormer-T1-2, so we chose only this approach to take part in Task 2. We used the pre-trained LongFormer as the base model, increased the number of samples of the competition dataset with data augmentation, replaced the Task 1 labels with the new ones, and fine-tuned the model. The LongFormer-T2 model was used for Run0 in the second task.

Table 12 shows the results of the fine-tuning process.

Table 12: Results of LongFormer-T2.
Model        Precision   Recall   F1-score
LongFormer   0.99        0.98     0.97

4. Runs

The reason for choosing these models was to assess the importance of context in predicting mental illness. Each model has a different input length and can therefore handle larger or smaller context sizes.

On the one hand, BERT-like models performed better than SVMs in the first task, even though BERT-like models can handle less context than SVMs. On the other hand, LongFormer performed slightly better than BERT-like models in the first task since LongFormer can handle larger contexts.

4.1. Run Configuration

Besides selecting the model for each run, the classification systems had additional parameters that needed to be set:

Task 1:
• For every round in the competition, the input to the classifier was a new sample created by combining the user's new message with the previous ones.
• Each system has an initial context; in other words, we made our systems wait until the initial context was sufficiently large. This context was different for each system:
  – SVM: an initial context of 50 tokens after preprocessing.
  – RoBERTa and LongFormer: an initial context of 100 tokens.
• The RoBERTa and LongFormer systems have a token limit; when the input was full, we simply returned the last prediction made.
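The per-round logic for Task 1 can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the submitted system: `UserState`, `update_user`, the `classify` and `count_tokens` callbacks, and the choice of answering "none" while waiting for the initial context are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class UserState:
    messages: List[str] = field(default_factory=list)
    prediction: str = "none"  # assumption: answer "none" until enough context
    full: bool = False        # True once the token limit has been reached


def update_user(
    state: UserState,
    new_message: str,
    classify: Callable[[str], str],
    count_tokens: Callable[[str], int],
    min_tokens: int,
    max_tokens: float,
) -> str:
    """Process one round for one user and return the current prediction."""
    if state.full:
        return state.prediction  # input is full: repeat the last prediction
    candidate = " ".join(state.messages + [new_message])
    n_tokens = count_tokens(candidate)
    if n_tokens > max_tokens:
        state.full = True
        return state.prediction
    state.messages.append(new_message)
    if n_tokens >= min_tokens:  # wait until the initial context is large enough
        state.prediction = classify(candidate)
    return state.prediction
```

Under this reading, the LongFormer system would use min_tokens=100 and max_tokens=4096, while the SVM would use min_tokens=50 and max_tokens=float("inf"), since it has no input limit.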

Task 2:

For the second task, we combined the best model from Task 1 (LongFormer-T1-2) and the one fine-tuned specifically for Task 2 (LongFormer-T2). The first model was used to discriminate between negative and positive cases. If the sample was detected as positive, then LongFormer-T2 was used to predict the context.
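This two-stage decision can be sketched as a small cascade. The sketch is an assumption-laden illustration: the callbacks `detect_disorder` and `detect_contexts` stand in for LongFormer-T1-2 and LongFormer-T2, and returning the context ["none"] for negative users is a choice made for this example, not something stated in the text.

```python
from typing import Callable, List, Tuple


def predict_disorder_and_context(
    text: str,
    detect_disorder: Callable[[str], str],
    detect_contexts: Callable[[str], List[str]],
) -> Tuple[str, List[str]]:
    """Run the Task 1 model first; query the Task 2 model only for positives."""
    disorder = detect_disorder(text)
    if disorder == "none":
        # Assumption: negative users receive the "none" context.
        return disorder, ["none"]
    return disorder, detect_contexts(text)
```

One consequence of this design is that the second, more expensive prediction is only computed for users flagged as positive by the first stage.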

5. Results

5.1. Task 1

Table 14 shows that the best system is Run 2, which corresponds to LongFormer-T1-2: the pre-trained LongFormer fine-tuned with data augmentation. This run achieved the first position in the competition. The only two runs that beat the Baseline were our Run1 and Run2, indicating the importance of appropriate data selection.

Although the best runs used Transformer-based models, the run with SVM achieves a similar result, only 1% less than Run1. This indicates that classical approaches like SVMs continue to be useful in detecting mental illnesses because of their ability to handle large contexts. Therefore, SVMs are still well suited to situations with low computational resources.

5.2. Task 2

As can be seen from Table 15, the results obtained by our system in the competition are not as good as those obtained in the development partition, which might indicate that the model was overfitted during the fine-tuning process. Further analysis is needed to find the source of the low generalization capabilities of the developed model.

5.3. Carbon emission

One of the main goals of the competition is to identify systems that complete tasks with minimal resource consumption [1]. This will help them pinpoint technologies that can operate on mobile devices or personal computers and those with the lowest carbon emissions. Therefore, we include the following information:
• Total time to process (in milliseconds).
• CO2 emissions (in kg).

Using the provided script, which utilizes the CodeCarbon API [18] to calculate emissions, we present our team’s computer configuration in Table 16. This table details the types and quantities of CPUs and GPUs employed, as well as the total RAM used. We present the results for the LongFormer-T1-2 Run 2.

Table 16: Computer configuration.
Measurement        Value
CPU_Count          24
GPU_Count          1
CPU_Model          12th Gen Intel(R) Core(TM) i9-12900K
GPU_Model          NVIDIA GeForce RTX 4090
RAM_Total_Size     128 GB
Country_ISO_Code   ESP

Figure 1 illustrates the variation in emissions and duration during the experimentation. A direct correlation exists between each measurement, indicating that rounds with longer durations emitted more CO2. Since every round utilized the same models and configurations, the primary factor influencing emissions was the length of the round and the accumulated context of the user.

Figure 1: (a) CO2 emissions (kg) of each round; (b) duration (milliseconds) of each round.

Figure 2 displays the cumulative energy consumption of each component. The GPU is the highest energy-consuming component, accounting for approximately 83% of the total energy usage. The CPU follows, consuming 16.5%, while RAM accounts for only 0.5% of the total energy consumption.

6. Conclusion

In this paper, we have presented the participation of the ELiRF-VRAIN team in the shared tasks of MentalRiskES at IberLEF 2024. In addition to testing classic classification models and state-of-the-art transformer models, our team’s most innovative contributions were using LongFormer models to expand the context available for making the decision and enlarging the training corpus through data augmentation.

The results obtained support the correctness of our proposal: our team was the only one to exceed the baseline presented by the organization of the shared task.

For future work, two lines of improvement are identified. On the one hand, try to improve early detection so that the system does not need as much initial context to make the right decision; on the other hand, use Explainable Artificial Intelligence (XAI) techniques to better understand the system’s behavior.

Acknowledgments

This work is partially supported by MCIN/AEI/10.13039/501100011033 and "ERDF A way of making Europe" under grant PID2021-126061OB-C41. Partially supported by the Vicerrectorado de Investigación de la Universitat Politècnica de València PAID-01-23. It is also partially supported by the Spanish Ministerio de Universidades under the grant FPU21/05288 for university teacher training and by the Generalitat Valenciana under CIPROM/2021/023 project.

References

[1] L. Chiruzzo, S. M. Jiménez-Zafra, F. Rangel, Overview of IberLEF 2024: Natural Language Processing Challenges for Spanish and other Iberian Languages, in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2024), co-located with the 40th Conference of the Spanish Society for Natural Language Processing (SEPLN 2024), CEUR-WS.org, 2024.

[2] World Health Organization, Mental disorders, 2022. URL: https://www.who.int/news-room/fact-sheets/detail/mental-disorders, accessed: 2024-05-15.

[3] World Health Organization, Mental disorders fact sheet, 2022. URL: https://www.who.int/news-room/fact-sheets/detail/mental-disorders, accessed: 2024-05-21.

[4] A. M. Mármol-Romero, A. Moreno-Muñoz, F. M. P. del Arco, M. D. Molina-González, M.-T. Martín-Valdivia, L. A. Ureña-López, A. Montejo-Ráez, Overview of MentalRiskES at IberLEF 2024: Early detection of mental disorders risk in Spanish, Procesamiento del Lenguaje Natural 73 (2024).

[5] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems 30 (2017). URL: https://arxiv.org/abs/1706.03762, accessed: 2024-05-15.

[6] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint arXiv:1907.11692 (2019). URL: https://arxiv.org/abs/1907.11692.

[7] I. Beltagy, M. E. Peters, A. Cohan, Longformer: The long-document transformer, arXiv preprint arXiv:2004.05150 (2020). URL: https://arxiv.org/abs/2004.05150.

[8] A. M. Mármol Romero, A. Moreno Muñoz, F. M. Plaza-del Arco, M. D. Molina González, M. T. Martín Valdivia, L. A. Ureña-López, A. Montejo Ráez, MentalRiskES: A new corpus for early detection of mental disorders in Spanish, in: N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, N. Xue (Eds.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), ELRA and ICCL, Torino, Italia, 2024, pp. 11204–11214. URL: https://aclanthology.org/2024.lrec-main.978.

[9] N. Reimers, EasyNMT: A simple interface to state-of-the-art machine translation models, 2020. URL: https://github.com/UKPLab/EasyNMT, accessed: 2024-05-15.

[10] J. Tiedemann, S. Thottingal, OPUS-MT - Building open translation services for the World, in: Proceedings of the 22nd Annual Conference of the European Association for Machine Translation (EAMT), Lisbon, Portugal, 2020.

[11] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, É. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830. URL: https://jmlr.org/papers/v12/pedregosa11a.html.

[12] L. X. Yuan, distilbert-base-multilingual-cased-sentiments-student (revision 2e33845), 2023. URL: https://huggingface.co/lxyuan/distilbert-base-multilingual-cased-sentiments-student. doi:10.57967/hf/1422.

[13] A. Pourkeyvan, R. Safa, A. Sorourkhah, Harnessing the power of Hugging Face transformers for predicting mental health disorders in social networks, IEEE Access 12 (2024) 28025–28035. URL: http://dx.doi.org/10.1109/ACCESS.2024.3366653. doi:10.1109/access.2024.3366653.

[14] S. Ji, T. Zhang, L. Ansari, J. Fu, P. Tiwari, E. Cambria, MentalBERT: Publicly available pretrained language models for mental healthcare, in: N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, J. Odijk, S. Piperidis (Eds.), Proceedings of the Thirteenth Language Resources and Evaluation Conference, European Language Resources Association, Marseille, France, 2022, pp. 7184–7190. URL: https://aclanthology.org/2022.lrec-1.778.

[15] AIMH, MentalRoBERTa: A robustly optimized BERT pretraining approach for mental health, 2024. URL: https://huggingface.co/AIMH/mental-roberta-large, accessed: 2024-05-15.

[16] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, A. M. Rush, Transformers: State-of-the-art natural language processing, 2020. URL: https://arxiv.org/abs/1910.03771. arXiv:1910.03771.

[17] AIMH, MentalLongformer: A long-document transformer model for mental health, 2024. URL: https://huggingface.co/AIMH/mental-longformer-base-4096, accessed: 2024-05-15.

[18] CodeCarbon, CodeCarbon: Track and reduce your carbon emissions from machine learning workloads, 2024. URL: https://mlco2.github.io/codecarbon/index.html, accessed: 2024-05-15.