ELiRF-VRAIN at MentalRiskES 2024: Using LongFormer for Early Detection of Mental Disorders Risk

Andreu Casamayor, Vicent Ahuir, Antonio Molina* and Lluís-Felip Hurtado
Valencian Research Institute for Artificial Intelligence (VRAIN), Universitat Politècnica de València, Camino de Vera s/n, 46022 Valencia, Spain

Abstract
This paper describes the approaches taken by the ELiRF-VRAIN team in the MentalRiskES shared tasks at IberLEF 2024 [1]. These shared tasks comprised two activities focused on identifying mental illness on Spanish-language social media: disorder detection and context detection. Our work followed three approaches: one based on a Support Vector Machine and two based on pre-trained Transformer models, one using BERT-like models and the other using LongFormer models. To fine-tune our models, we applied a data augmentation process to the data provided by the organization. According to the results, our approaches fit the tasks well.

Keywords
LongFormer, Transformers, Support Vector Machine, Mental disorder detection

IberLEF 2024, September 2024, Valladolid, Spain
* Corresponding author.
ancase3@upv.es (A. Casamayor); vahuir@dsic.upv.es (V. Ahuir); amolina@dsic.upv.es (A. Molina); lhurtado@dsic.upv.es (L. Hurtado)
https://vrain.upv.es/elirf/ (A. Casamayor, V. Ahuir, A. Molina, L. Hurtado)
ORCID: 0009-0003-6000-3828 (A. Casamayor); 0000-0001-5636-651X (V. Ahuir); 0000-0001-6537-8803 (A. Molina); 0000-0002-1877-0455 (L. Hurtado)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Introduction

A mental disorder is characterized by a clinically significant disturbance in an individual's cognition, emotional regulation, or behavior. It is usually associated with distress or impairment in important areas of functioning [2]. According to the World Health Organization (WHO), 1 in every 8 people lives with a mental disorder, with anxiety and depressive disorders being the most common [3]. Although the problem is widely known, the number of affected people is still increasing, and discrimination against them persists. Governments currently work to prevent and treat mental illness; however, the lack of human and material resources means that many people receive inadequate treatment or none at all. In addition, early detection of mental disorders is often difficult.

In this context, detecting mental disorder risk by analyzing social media interactions has acquired great relevance in recent years. Many factors make the problem of mental disorder detection complicated, such as the availability, amount, and quality of data. Providing quality labeled data in Spanish and promoting the creation of models for this early detection is precisely the objective of the MentalRiskES shared tasks. In the 2024 edition, the competition consisted of three tasks [4]: (1) Detection of mental disorders, (2) Context detection, and (3) Suicidal ideation detection. Our team participated in the first two tasks. To tackle Task 1, we considered three different approaches:

1. The first approach is based on a classic machine learning algorithm: Support Vector Machines (SVM).
SVMs have demonstrated adequate behavior in long-text classification tasks such as this one. We consider this approach an assessment of the performance of classical models.

2. The second approach is based on Transformers [5]. We use a pre-trained RoBERTa model [6] as a basis and then run a fine-tuning process to adjust it to the task domain. We considered two different datasets for fine-tuning: the one provided by the organization and a version of that dataset expanded through a data augmentation process.

3. The last approach is similar to the second one; however, to capture more context, we use a pre-trained LongFormer model [7]. Thanks to its larger input layer, the model is able to capture more context. We used the same datasets as in the previous approach for the fine-tuning phase.

We submitted three runs for Task 1, one for each approach. The best model of each approach was chosen through a previous validation stage in which different parameters and datasets were considered. To tackle Task 2, we submitted one system based on the third approach of the first task, a LongFormer-based solution. We chose that approach because it was the most promising one according to the evaluation results of Task 1.

2. Description of Dataset and Tasks

The datasets delivered by the organization consist of messages sent to different public groups on Telegram [8]. These groups have the characteristic of being in Spanish and related to mental illnesses. The messages were anonymized and subsequently labeled by ten annotators at the user level; that is, each user was labeled considering his/her messages. Two different datasets were delivered: one for the first two tasks and a different one for the third task. The first dataset, the one we worked with, has the following sample distribution: 20 users for trial, 465 users for train, and 400 users for test.

As stated above, the main objective of this competition is to predict mental disorders as soon as possible. To achieve realistic behavior, the organization emulated a real conversation by setting up a server that delivers packets of data containing one message per user. The system must predict the label of each user, considering the current message and all their previous messages, before it receives the next packet. The goal is to predict each user's mental disorder, if any, as quickly as possible. A minimal sketch of this round-based interaction is shown below.
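The following sketch illustrates this round-based client loop. It is a schematic outline under stated assumptions: the endpoint URLs, the payload fields (nick, message), and the predict function are placeholders, since the official MentalRiskES client API is not described here.

```python
import requests

# Hypothetical server address; the official MentalRiskES endpoints differ.
SERVER = "https://example.org/mentalriskes"

def predict(history: str) -> str:
    """Placeholder for any of our classifiers (SVM, RoBERTa, LongFormer)."""
    return "none"

def run_client() -> None:
    histories: dict[str, list[str]] = {}  # accumulated messages per user
    round_id = 0
    while True:
        # Each round, the server releases one new message per user.
        packet = requests.get(f"{SERVER}/round/{round_id}").json()
        if not packet:  # no more rounds
            break
        predictions = {}
        for item in packet:
            user = item["nick"]
            histories.setdefault(user, []).append(item["message"])
            # Predict from the full accumulated history of the user.
            predictions[user] = predict(" ".join(histories[user]))
        # The answer must be sent before the next packet is served.
        requests.post(f"{SERVER}/answer/{round_id}", json=predictions)
        round_id += 1
```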
2.1. Task 1: Disorder Detection

Task 1 is a multiclass classification task whose objective is to predict whether users suffer from depression, anxiety, or no disorder. Table 1 shows the label distribution in the dataset for the first task.

            Train  Trial  Total
None          213     10    223
Depression    164      5    169
Anxiety        88      5     93
Total         465     20    485

Table 1: Distribution of samples across the Train and Trial partitions of the Task 1 dataset.

To maximize the samples available for training, we joined the Train and Trial partitions; the Total column of Table 1 shows the final sample distribution of our training dataset.

2.2. Task 2: Context Detection

Task 2 is a two-level multiclass, multilabel task: in addition to detecting the mental illness, the context or contexts in which it appears must be detected. There are seven contexts: addiction, emergency, family, work, social, other, and none. The label distribution over the whole dataset can be seen in Table 2; the contexts Family, Social, Other, and None are the most common.

            Addiction  Emergency  Family  Work  Social  Other  None
Depression          9          7      47     9      66     26    52
Anxiety             3         10      14     8      25     33    26
Total              12         17      61    17      91     59    78

Table 2: Distribution of samples across the Task 2 dataset.

Table 3 shows how many contexts there are per user. The most common situation is a single context per user.

            1 class  2 classes  3 classes  4 classes
Depression      131         30          7          1
Anxiety          71         18          4          0
Total           202         48         11          1

Table 3: Number of classes per user in the Task 2 dataset.

3. System Architecture and Techniques

In this kind of task, an important aspect is the amount of context required to perform the detection correctly. Since each user can have many messages, the input size of the system is a factor to consider. One goal of our team was to study the impact of context in these tasks, that is, to measure the capabilities of different systems depending on how much context they can manage. We selected three different systems to achieve this goal: the first based on Support Vector Machines (SVM), the second based on a RoBERTa model, and the third based on a LongFormer model. Each system handles a different context size:

• The SVM has no limit on the input size; it creates a vector as long as the vocabulary size.
• The selected RoBERTa model has a limit of 512 input tokens.
• The selected LongFormer model has a limit of 4096 input tokens.

Regarding the dataset, we translated all the data into English because the Transformer base models were pre-trained on documents in this language. We used the EasyNMT library [9] and the OPUS-MT Spanish-English model [10] (https://huggingface.co/Helsinki-NLP/opus-mt-es-en). Furthermore, we created two different datasets to train and evaluate the performance of the Transformer-based systems.

Dataset 1. We created a single sample per user by accumulating all his/her messages, for both positively and negatively labeled users.

Dataset 2. If we had some a priori evidence of the message in which a user begins to present symptoms of mental illness risk, we could label the samples built from previous messages as negative, and the samples containing that message and subsequent ones as positive. In this way, we could increase the number of positive samples in order to obtain a more precise model. This data augmentation process is explained below.

To carry out our experimentation, we divided the original dataset into two partitions, training (80% of users) and development (20% of users), maintaining the proportions of positive and negative samples in each partition. Table 4 shows the distribution of samples in Dataset 1.

            Training  Development
None             178           45
Depression       134           35
Anxiety           76           17
Total            388           97

Table 4: Distribution of samples in Dataset 1 for the training and development partitions.

3.1. Data Augmentation

The data augmentation process aims to create more samples per positive user. As stated above, we need some evidence of the message in which a user begins to express symptoms of illness. For this, we relied on the predictions of the SVM-based classifier: we can assume that all messages prior to the SVM decision do not express symptoms of illness. We followed these steps:

1. For positive users, we calculated how many messages the SVM needs to classify the user as positive (depression or anxiety). Each user has a different trigger value.
2. For false negatives, we used the mean of the true-positive trigger values as the trigger value.
3. For each positive user in the original dataset, let n be the number of messages the SVM model needs to determine the user's mental disorder risk, MAX the maximum number of messages the model supports as input, and m_i the i-th message from the user. Then:
   a) we created n - 1 negative samples: (m_1), (m_1 m_2), (m_1 m_2 m_3), ..., (m_1 ... m_{n-1});
   b) and MAX - n + 1 positive samples: (m_1 ... m_n), (m_1 ... m_n m_{n+1}), ..., (m_1 ... m_MAX).
4. Note that the value of MAX depends on the model used and on the number of tokens in the messages; we discard messages once the accumulated history exceeds 512 tokens for RoBERTa or 4096 tokens for LongFormer. Hence, if n > MAX, only negative samples are generated.
5. For negative users, we created new samples by accumulating the history as before, stopping when MAX was reached.

The result of this technique is a new dataset with a larger number of positive samples for training. A sketch of the sample-generation step is shown below.
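The sketch below illustrates step 3 for a single positive user. The function name and signature are ours, the generic positive/negative labels stand in for the actual depression/anxiety labels, and MAX is expressed as a message count derived from the 512- or 4096-token limit.

```python
def augment_user(messages: list[str], n: int,
                 max_messages: int) -> list[tuple[str, str]]:
    """Generate (text, label) samples for one positive user.

    messages     -- the user's messages in chronological order
    n            -- 1-based index of the message at which the SVM first
                    flags the user as positive (the trigger value)
    max_messages -- MAX: how many messages fit in the model input, given
                    the 512 (RoBERTa) or 4096 (LongFormer) token limit
    """
    samples = []
    # (a) n-1 negative samples: growing prefixes that stop before m_n.
    for i in range(1, n):
        samples.append((" ".join(messages[:i]), "negative"))
    # (b) MAX-n+1 positive samples: prefixes containing m_n and beyond.
    #     If n > MAX, this range is empty and only negatives are produced.
    for i in range(n, min(max_messages, len(messages)) + 1):
        samples.append((" ".join(messages[:i]), "positive"))
    return samples
```

For negative users, the same accumulation loop applies, labeling every prefix as negative and stopping at MAX.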
            Train  Development
None         4856           45
Depression   2832           35
Anxiety      1387           17
Total        9075           97

Table 5: Distribution of samples in the augmented dataset (Dataset 2): training and development partitions.

3.2. Task 1: Disorder Detection

3.2.1. Classical Machine Learning Classifier Approach

To evaluate the importance of context, we wanted to use a classical machine learning classifier able to handle the full context. One of the most important issues of Transformer-based models is their limited ability to deal with long texts, because of their bounded input size. This limitation can affect performance, since the input cannot hold the whole sample and valuable information may be lost.

First, we ran an experiment comparing different types of classical machine learning classifiers, using the tools provided by the Scikit-learn library [11]. All classifiers were evaluated with their default configuration in order to select the best one. The results can be seen in Table 6: the best classifier was the Linear SVM.

                   precision  recall  f1-score
Linear SVM              0.74    0.74      0.73
Gradient Boosting       0.50    0.48      0.49
K-Neighbors             0.44    0.50      0.47
Random Forest           0.61    0.55      0.59

Table 6: Results of the different classifiers on the development partition. The scores are macro-averaged precision, recall, and F1-score.

Once the classifier was chosen, we tested different approaches:

• Data preprocessing:
  1. First approach: tokenize the text with TweetTokenizer and then remove stop words.
  2. Second approach: same as the first, with additional cleaning steps (removing non-alphanumeric characters, among others) and token lemmatization.
• Sentiment analysis: we used the model lxyuan/distilbert-base-multilingual-cased-sentiments-student [12] to run a sentiment analysis of every message per user. We obtained three scores (positive, negative, and neutral messages), normalized at the end, and added them as new features alongside the TF-IDF vector.
• TF-IDF: the TfidfVectorizer class of Scikit-learn was used to vectorize the data. We tested different configurations of the analyzer and ngram_range parameters, and used the default values for the rest.

To find the best model for every approach, we performed an exhaustive grid search over specific parameters: the regularization parameter C, the tolerance for the stopping criterion (tol), and the loss function. This yielded eight different configurations; a sketch of this search is shown below.
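A minimal sketch of this grid search, assuming train_texts and train_labels hold the preprocessed user histories and their labels; the sentiment features are omitted for brevity, and the grids shown only illustrate the kind of values explored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# TF-IDF over character n-grams inside word boundaries (as in SVM-1..4),
# followed by a linear SVM.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char_wb", ngram_range=(4, 5))),
    ("svm", LinearSVC()),
])

# Illustrative search space over C, loss, and tol.
param_grid = {
    "svm__C": [1, 10],
    "svm__loss": ["hinge", "squared_hinge"],
    "svm__tol": [1e-4, 1e-2, 0.1],
}

search = GridSearchCV(pipeline, param_grid, scoring="f1_macro", cv=5)
# search.fit(train_texts, train_labels)
# print(search.best_params_)
```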
Table 7 shows the different configurations used in the experimentation; the TF-IDF column indicates the analyzer type ("word" or "char_wb") and the n-gram range used, and the last column shows the best model found by the grid search. As Table 8 shows, the best configuration is SVM-4, which uses the more complete data preprocessing, sentiment analysis, "char_wb" as the analyzer, and (4, 5) as the ngram_range. This model was used for Run0 in Task 1.

        Preprocessing  Sentiment  TF-IDF                   Best model
SVM-1   1              No         "char_wb", 4-5 n-grams   C: 1, loss: squared_hinge, tol: 0.1
SVM-2   2              No         "char_wb", 4-5 n-grams   C: 1, loss: squared_hinge, tol: 0.1
SVM-3   1              Yes        "char_wb", 4-5 n-grams   C: 1, loss: hinge, tol: 0.1
SVM-4   2              Yes        "char_wb", 4-5 n-grams   C: 1, loss: hinge, tol: 0.1
SVM-5   1              No         "word", 1-2 n-grams      C: 10, loss: hinge, tol: 0.1
SVM-6   2              No         "word", 1-2 n-grams      C: 10, loss: hinge, tol: 0.1
SVM-7   1              Yes        "word", 1-2 n-grams      C: 10, loss: hinge, tol: 0.1
SVM-8   2              Yes        "word", 1-2 n-grams      C: 10, loss: hinge, tol: 0.1

Table 7: Summary of the different configurations of the SVM classifiers.

        precision  recall  f1-score
SVM-1        0.74    0.74      0.73
SVM-2        0.76    0.75      0.75
SVM-3        0.75    0.75      0.74
SVM-4        0.79    0.76      0.76
SVM-5        0.70    0.68      0.69
SVM-6        0.72    0.70      0.71
SVM-7        0.71    0.69      0.70
SVM-8        0.73    0.72      0.72

Table 8: Results of the different configurations of the SVM classifiers on the development partition. SVM-4 achieves the best result for each metric.

3.2.2. BERT-like Model Approach

It is well known that the state-of-the-art models in NLP are based on Transformers. Models like BERT or RoBERTa usually provide good versatility for classification tasks. However, these models usually cannot handle more than 512 tokens, which can be a problem for tasks with long contexts such as the current ones. Therefore, we used one of these models as a baseline against which to compare models with a better capacity to handle long contexts.

Research by Pourkeyvan et al. [13] shows that the state of the art in mental disorder detection is MentalRoBERTa [14], a RoBERTa-like model specialized in mental health. This model is pre-trained on a corpus combining texts from mental health forums and clinical notes with general-domain text. Consequently, MentalRoBERTa adapts better to mental health-related language, which enables many applications in this domain. The model chosen was AIMH/mental-roberta-large [15], a RoBERTa model trained on Reddit posts related to mental health; it can be found on the HuggingFace [16] public hub (https://huggingface.co/AIMH/mental-roberta-large). Furthermore, we wanted to compare a domain-specific RoBERTa model, like MentalRoBERTa, with a non-domain RoBERTa model, the baseline of the competition (RoBERTa base).

Once we chose our pre-trained model, we performed an experiment consisting of two fine-tuning processes: one with Dataset 1 (RoBERTa-1) and the other with Dataset 2 (RoBERTa-2), the dataset with data augmentation. Table 9 shows the configuration used in the fine-tuning process; a sketch of this setup is shown below.

parameter            value
optimizer            AdamW
learning rate        7e-5
lr scheduler type    linear
weight decay         0.01
number of epochs     10
training batch size  16

Table 9: Parameters for the fine-tuning process.
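A sketch of this setup with the HuggingFace Trainer, assuming train_ds and dev_ds are tokenized datasets built from Dataset 1 or Dataset 2; the output directory name is ours.

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "AIMH/mental-roberta-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=3)  # none / depression / anxiety

# Table 9 hyperparameters; the Trainer's default optimizer is AdamW.
args = TrainingArguments(
    output_dir="roberta-task1",  # hypothetical output directory
    learning_rate=7e-5,
    lr_scheduler_type="linear",
    weight_decay=0.01,
    num_train_epochs=10,
    per_device_train_batch_size=16,
)

train_ds = dev_ds = None  # placeholders for the tokenized datasets
trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds, eval_dataset=dev_ds)
# trainer.train()
```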
Table 10 shows the results of each model on the development partition. The best model is RoBERTa-2, the one fine-tuned with data augmentation. In our participation, this model was used for Run1 in Task 1.

            Data Augmentation  Precision  Recall  F1-score
RoBERTa-1   No                      0.81    0.82      0.81
RoBERTa-2   Yes                     0.94    0.94      0.93

Table 10: RoBERTa results for Task 1 on the development partition.

3.2.3. LongFormer Approach

As stated before, one of the most important disadvantages of BERT-like or RoBERTa-like Transformer models is their limited capacity to handle long contexts. However, there is a Transformer variant designed for long texts: LongFormer [7]. LongFormer is the abbreviation of "Long-Document Transformer" and can process long contexts more efficiently than Transformer models such as BERT or RoBERTa. The LongFormer architecture has the following characteristics:

• New attention mechanism: an efficient attention mechanism that uses a sliding window, where each token only attends to a fixed number of neighboring tokens, reducing the complexity.
• Global attention selection: the architecture can select which tokens are attended globally and which are attended only locally.

The pre-trained model chosen was AIMH/mental-longformer-base-4096 [17], a LongFormer pre-trained for the mental health domain; it can be found at https://huggingface.co/AIMH/mental-longformer-base-4096. A sketch of how such a checkpoint is loaded is shown below.
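A brief sketch (not the authors' exact code) of loading this checkpoint with the HuggingFace transformers library:

```python
from transformers import (LongformerForSequenceClassification,
                          LongformerTokenizerFast)

model_name = "AIMH/mental-longformer-base-4096"
tokenizer = LongformerTokenizerFast.from_pretrained(model_name)
model = LongformerForSequenceClassification.from_pretrained(
    model_name, num_labels=3)

# Up to 4096 tokens of accumulated user history fit into a single input.
enc = tokenizer("...accumulated user messages...",
                truncation=True, max_length=4096, return_tensors="pt")
# For sequence classification, HuggingFace's Longformer applies the
# sliding-window (local) attention everywhere and automatically puts
# global attention on the first ([CLS]) token.
logits = model(**enc).logits
```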
As with the RoBERTa model, we fine-tuned the LongFormer with the two datasets: Dataset 1 without data augmentation (LongFormer-T1-1) and Dataset 2 with data augmentation (LongFormer-T1-2). We used the same fine-tuning parameters as in the RoBERTa experimentation; the configuration is shown in Table 9. Table 11 shows the results of the experimentation: LongFormer-T1-2 (fine-tuned with data augmentation) achieves better performance than LongFormer-T1-1 (fine-tuned without data augmentation). This model was Run2 in our participation.

                 Data Augmentation  Precision  Recall  F1-score
LongFormer-T1-1  No                      0.83    0.84      0.83
LongFormer-T1-2  Yes                     0.95    0.95      0.94

Table 11: LongFormer results for Task 1 on the development partition.

3.3. Task 2

The experimentation for Task 1 shows that the best system is LongFormer-T1-2, so we only followed this approach in Task 2. We used the LongFormer pre-trained model as the base model, increased the number of samples of the competition dataset with data augmentation, replaced the labels with the Task 2 ones, and fine-tuned the model. The LongFormer-T2 model was used for Run0 in the second task. Table 12 shows the results of the fine-tuning process.

               Precision  Recall  F1-score
LongFormer-T2       0.99    0.98      0.97

Table 12: LongFormer results for Task 2 on the development partition.

4. Runs

Table 13 summarizes the model selected for each run, together with the performance achieved by each system on the development partition.

      Task  Model            Precision  Recall  F1-score
Run0  1     SVM-4                 0.79    0.76      0.76
Run1  1     RoBERTa-2             0.94    0.94      0.93
Run2  1     LongFormer-T1-2       0.95    0.95      0.94
Run0  2     LongFormer-T2         0.99    0.98      0.97

Table 13: Summary of the approaches chosen for each run and the performance achieved by each system on the development partition.

The reason for choosing these models was to assess the importance of context in predicting mental illness: each model has a different input length capability and can therefore handle a larger or smaller context. On the one hand, BERT-like models performed better than SVMs in the first task, even though BERT-like models can handle less context than SVMs. On the other hand, LongFormer performed slightly better than BERT-like models in the first task, since LongFormer can handle larger contexts.

4.1. Run Configuration

Besides selecting the model for each run, the classification systems contained additional parameters that needed to be set.

Task 1:
• For every round in the competition, the classifier input was a new sample created by combining the user's new message with the previous ones.
• Each system requires an initial context; in other words, our systems wait until the initial context is sufficiently large. This threshold is different for each system:
  – SVM: an initial context of 50 tokens after preprocessing.
  – RoBERTa and LongFormer: an initial context of 100 tokens.
• The RoBERTa and LongFormer systems have a token limit; once the input was full, we simply returned the last prediction made.

Task 2: For the second task, we combined the best model from Task 1 (LongFormer-T1-2) and the one fine-tuned specifically for Task 2 (LongFormer-T2). The first model was used to discriminate between negative and positive cases; if a sample was detected as positive, LongFormer-T2 was then used to predict the context. A sketch of this two-stage pipeline is shown below.
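A schematic sketch of this cascade, assuming the two fine-tuned models and a shared tokenizer are already loaded; the 0.5 threshold and the sigmoid-based multilabel decoding are our assumptions, not settings reported in the paper.

```python
import torch

CONTEXTS = ["addiction", "emergency", "family", "work",
            "social", "other", "none"]

def predict_task2(history, gate_model, context_model, tokenizer):
    enc = tokenizer(history, truncation=True, max_length=4096,
                    return_tensors="pt")
    # Stage 1: LongFormer-T1-2 decides whether the user is a positive case.
    disorder = gate_model(**enc).logits.argmax(dim=-1).item()
    if disorder == 0:  # assuming class 0 encodes "no disorder"
        return []
    # Stage 2: LongFormer-T2 predicts the context(s) (multilabel).
    probs = torch.sigmoid(context_model(**enc).logits)[0]
    return [c for c, p in zip(CONTEXTS, probs) if p > 0.5]
```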
5. Results

5.1. Task 1

Table 14 shows the results achieved by our team in Task 1. Its rows correspond to our runs, plus a special row showing the highest values of the competition. The systems in the competition were ranked using the Macro-F1 score (last column).

         Model       Accuracy   Macro-P    Macro-R    Macro-F1
Run0     SVM         0.848      0.840      0.838      0.833
Run1     RoBERTa     0.850      0.853      0.845      0.840
Run2     LongFormer  0.890 (1)  0.875 (1)  0.880 (1)  0.874 (1)
Highest  -           0.890      0.875      0.880      0.874

Table 14: Results of the three runs on Task 1. Highest refers to the highest values achieved in the competition. The values in parentheses indicate our position in the ranking.

Table 14 shows that the best system is Run 2, which corresponds to LongFormer-T1-2: the pre-trained LongFormer fine-tuned with data augmentation. This run achieved the first position in the competition. The only two runs that beat the baseline were our Run1 and Run2, indicating the importance of appropriate data selection. Although the best runs used Transformer-based models, the SVM run achieved a similar result, less than one point of Macro-F1 below Run1. This indicates that classical approaches like SVMs continue to be useful for detecting mental illness because of their ability to handle large contexts; they therefore remain well suited to situations with low computational resources.

5.2. Task 2

Table 15 shows the results for Task 2. Our run was fifth in the official ranking, which is based on the Macro-F1 score.

         Accuracy   Macro-P    Macro-R    Macro-F1
Run0     0.065 (4)  0.262 (3)  0.177 (5)  0.208 (5)
Highest  0.077      0.358      0.508      0.268

Table 15: Results for Task 2. Highest refers to the highest values achieved in the competition. The values in parentheses indicate our position in the ranking.

As can be seen in Table 15, the results obtained by our system in the competition are not as good as those obtained on the development partition, which might indicate that the model overfitted during the fine-tuning process. Further analysis is needed to find the source of the low generalization capability of the developed model.

5.3. Carbon Emission

One of the main goals of the competition is to identify systems that complete the tasks with minimal resource consumption [1]. This helps pinpoint technologies that can operate on mobile devices or personal computers, as well as those with the lowest carbon emissions. Therefore, we report the following information: the total processing time (in milliseconds) and the CO2 emissions (in kg). Emissions were calculated with the provided script, which relies on the CodeCarbon API [18]; a minimal sketch of this kind of tracking is shown below.
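A minimal sketch of this kind of measurement with the CodeCarbon library; the project name is ours, and the organization's actual script may structure the measurement differently.

```python
import time
from codecarbon import EmissionsTracker

tracker = EmissionsTracker(project_name="mentalriskes-run2")
tracker.start()
t0 = time.time()

# ... process one round: accumulate user histories and predict ...

emissions_kg = tracker.stop()            # kg of CO2-equivalent
duration_ms = (time.time() - t0) * 1000  # total processing time
print(f"{duration_ms:.0f} ms, {emissions_kg:.6f} kg CO2eq")
```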
Table 16 presents our team's computer configuration, detailing the types and quantities of CPUs and GPUs employed, as well as the total RAM. We report the measurements for the LongFormer-T1-2 system (Run 2).

Measurement       Value
CPU_Count         24
GPU_Count         1
CPU_Model         12th Gen Intel(R) Core(TM) i9-12900K
GPU_Model         NVIDIA GeForce RTX 4090
RAM_Total_Size    128 GB
Country_ISO_Code  ESP

Table 16: Computer configuration.

Figure 1 illustrates the variation in emissions and duration during the experimentation. The two measurements are directly correlated: rounds with longer durations emitted more CO2. Since every round used the same models and configurations, the primary factors influencing emissions were the length of the round and the accumulated context of the users.

[Figure 1: Emissions and duration graphs. (a) CO2 emissions (kg) of each round; (b) duration (milliseconds) of each round.]

Figure 2 displays the cumulative energy consumption of each component. The GPU is the highest energy-consuming component, accounting for approximately 83% of the total energy usage. The CPU follows, consuming 16.5%, while RAM accounts for only 0.5% of the total energy consumption.

[Figure 2: Accumulated values of energy (kWh) during the rounds.]

6. Conclusion

In this paper, we have presented the participation of the ELiRF-VRAIN team in the MentalRiskES shared tasks at IberLEF 2024. In addition to testing classic classification models and state-of-the-art Transformer models, our team's most innovative contribution was using LongFormer models to expand the context available for making the decision and increasing the training corpus through data augmentation. The results obtained support the soundness of our proposal, ours being the only team to exceed the baseline presented by the organization of the shared task. For future work, two lines of improvement are identified: on the one hand, improving early detection so that the system does not need as much initial context to make the right decision; on the other hand, using Explainable Artificial Intelligence (XAI) techniques to better understand the system's behavior.

Acknowledgments

This work is partially supported by MCIN/AEI/10.13039/501100011033 and "ERDF A way of making Europe" under grant PID2021-126061OB-C41, and partially supported by the Vicerrectorado de Investigación de la Universitat Politècnica de València under PAID-01-23. It is also partially supported by the Spanish Ministerio de Universidades under grant FPU21/05288 for university teacher training and by the Generalitat Valenciana under project CIPROM/2021/023.

References

[1] L. Chiruzzo, S. M. Jiménez-Zafra, F. Rangel, Overview of IberLEF 2024: Natural Language Processing Challenges for Spanish and other Iberian Languages, in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2024), co-located with the 40th Conference of the Spanish Society for Natural Language Processing (SEPLN 2024), CEUR-WS.org, 2024.
[2] World Health Organization, Mental disorders, 2022. URL: https://www.who.int/news-room/fact-sheets/detail/mental-disorders, accessed: 2024-05-15.
[3] World Health Organization, Mental disorders fact sheet, 2022. URL: https://www.who.int/news-room/fact-sheets/detail/mental-disorders, accessed: 2024-05-21.
[4] A. M. Mármol-Romero, A. Moreno-Muñoz, F. M. Plaza-del-Arco, M. D. Molina-González, M.-T. Martín-Valdivia, L. A. Ureña-López, A. Montejo-Ráez, Overview of MentalRiskES at IberLEF 2024: Early detection of mental disorders risk in Spanish, Procesamiento del Lenguaje Natural 73 (2024).
[5] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems 30 (2017). URL: https://arxiv.org/abs/1706.03762.
[6] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint arXiv:1907.11692 (2019). URL: https://arxiv.org/abs/1907.11692.
[7] I. Beltagy, M. E. Peters, A. Cohan, Longformer: The long-document transformer, arXiv preprint arXiv:2004.05150 (2020). URL: https://arxiv.org/abs/2004.05150.
[8] A. M. Mármol Romero, A. Moreno Muñoz, F. M. Plaza-del-Arco, M. D. Molina González, M. T. Martín Valdivia, L. A. Ureña-López, A. Montejo Ráez, MentalRiskES: A new corpus for early detection of mental disorders in Spanish, in: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), ELRA and ICCL, Torino, Italia, 2024, pp. 11204–11214. URL: https://aclanthology.org/2024.lrec-main.978.
[9] N. Reimers, EasyNMT: A simple interface to state-of-the-art machine translation models, 2020. URL: https://github.com/UKPLab/EasyNMT, accessed: 2024-05-15.
[10] J. Tiedemann, S. Thottingal, OPUS-MT — Building open translation services for the World, in: Proceedings of the 22nd Annual Conference of the European Association for Machine Translation (EAMT), Lisbon, Portugal, 2020.
[11] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, É. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830. URL: https://jmlr.org/papers/v12/pedregosa11a.html.
[12] L. X. Yuan, distilbert-base-multilingual-cased-sentiments-student (revision 2e33845), 2023. URL: https://huggingface.co/lxyuan/distilbert-base-multilingual-cased-sentiments-student. doi:10.57967/hf/1422.
[13] A. Pourkeyvan, R. Safa, A. Sorourkhah, Harnessing the power of Hugging Face transformers for predicting mental health disorders in social networks, IEEE Access 12 (2024) 28025–28035. doi:10.1109/ACCESS.2024.3366653.
[14] S. Ji, T. Zhang, L. Ansari, J. Fu, P. Tiwari, E. Cambria, MentalBERT: Publicly available pretrained language models for mental healthcare, in: Proceedings of the Thirteenth Language Resources and Evaluation Conference, European Language Resources Association, Marseille, France, 2022, pp. 7184–7190. URL: https://aclanthology.org/2022.lrec-1.778.
[15] AIMH, MentalRoBERTa: A robustly optimized BERT pretraining approach for mental health, 2024. URL: https://huggingface.co/AIMH/mental-roberta-large, accessed: 2024-05-15.
[16] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, A. M. Rush, Transformers: State-of-the-art natural language processing, 2020. URL: https://arxiv.org/abs/1910.03771. arXiv:1910.03771.
[17] AIMH, MentalLongformer: A long-document transformer model for mental health, 2024. URL: https://huggingface.co/AIMH/mental-longformer-base-4096, accessed: 2024-05-15.
[18] CodeCarbon, CodeCarbon: Track and reduce your carbon emissions from machine learning workloads, https://mlco2.github.io/codecarbon/index.html, 2024. Accessed: 2024-05-15.