<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Neurocomputing</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1177/0165551516677946</article-id>
      <title-group>
        <article-title>CMM PLN at MentalRiskES: A Traditional Machine Learning Approach for Detection of Eating Disorders and Depression in Chat Messages</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rodrigo Guerra</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Benjamín Pizarro</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carlos Muñoz-Castro</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrés Carvallo</string-name>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matías Rojas</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Claudio Aracena</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jocelyn Dunstan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Center for Mathematical Modeling, Universidad de Chile</institution>
          ,
          <country country="CL">Chile</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science, Pontificia Universidad Católica de Chile</institution>
          ,
          <addr-line>Santiago</addr-line>
          ,
          <country country="CL">Chile</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Faculty of Physical and Mathematical Sciences, Universidad de Chile</institution>
          ,
          <country country="CL">Chile</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Institute for Mathematical and Computational Engineering, Pontificia Universidad Católica de Chile</institution>
          ,
          <country country="CL">Chile</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Millennium Institute Foundational Research on Data</institution>
          ,
          <country country="CL">Chile</country>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>National Center for Artificial Intelligence</institution>
          ,
          <country country="CL">Chile</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>408</volume>
      <issue>2020</issue>
      <fpage>189</fpage>
      <lpage>215</lpage>
      <abstract>
        <p>This paper describes our approaches to solving the MentalRiskES task, part of the IberLEF (Iberian Languages Evaluation Forum) shared tasks. The task aims to identify eating disorders and depression in a user from a series of Telegram messages. Our proposed system uses the traditional TF-IDF method to represent the messages and then feeds these representations to machine learning models. The best results for classification were obtained with the Naive Bayes classifier, while in the regression tasks the best models were the Gradient Boosting Regressor and the Linear Regressor. Despite their simplicity, we show that these traditional approaches can still achieve competitive results in recent NLP tasks, obtaining, in particular, the best ranking-based results for depression detection.</p>
      </abstract>
      <kwd-group>
        <kwd>Natural Language Processing</kwd>
        <kwd>Text Classification</kwd>
        <kwd>Text Regression</kwd>
        <kwd>Chat Messages</kwd>
        <kwd>Eating Disorders</kwd>
        <kwd>Depression</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The global mental health crisis [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], as emphasized by alarming statistics from the World Health
Organization, demands novel and effective solutions for early risk identification. The emergence
of social media as a window into individuals’ mental states offers an unprecedented opportunity
to identify and mitigate risks before they escalate. However, a gap exists in addressing the
non-English-speaking population, mainly Spanish speakers.
      </p>
      <p>
        This paper focuses on a proposed method that utilizes data from the MentalRiskES task [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ],
aimed at the early identification of mental disorders in Spanish messages from Telegram users.
Our solution processes a user’s message stream, identifying risk at the earliest possible point, an
approach that prioritizes both accuracy and detection speed.
      </p>
      <p>
        Our method tackles two primary tasks: 1) the detection of eating disorders and 2) the
detection of depression. Depending on the task’s specifics, we approach the problem from
various angles, including binary classification, simple regression, multiclass classification, and
multi-output regression. Each approach is designed to discern whether a user has the given
mental disorder and to what degree. We combine machine learning algorithms and different
text representations, including traditional techniques such as TF-IDF and modern approaches
that leverage transformer-based language models trained on Spanish, such as
BETO [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ], DistilBETO [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], or BERTIN [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], to solve these tasks. While these modern language
models have significantly advanced in natural language understanding, we give evidence
that traditional machine learning models can still yield competitive performance when paired
with appropriate text representations. We aim to show that our method, which combines
traditional and contemporary approaches, can identify individuals with potential eating and
mental disorders more effectively and rapidly. Detailed information about our approach, its
implementation, and its evaluation will be provided in the subsequent sections of this article.
      </p>
      <p>In this paper, we first present the tasks and data. The methodology section describes the preprocessing
and the methods used to solve these tasks. The experimental methodology section then presents the pipeline
for each task and the model selection criteria, followed by the results and analysis and, finally, the
conclusions and future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Tasks and Data Description</title>
      <p>
        2.1. Tasks
In our MentalRiskES [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] approach, we focus on two tasks: detecting eating disorders and
depression through the analysis of Telegram messages. Both tasks use various techniques,
including binary and multi-class classification and regression models, to predict the risk of
either condition or related sub-conditions. The first task involves detecting eating disorders and
estimating the likelihood of anorexia or bulimia. The second task identifies depression and its
probability and categorizes it into specific types. More details on each task and their subtasks
are provided as follows.
2.1.1. Task 1
This task aims to detect eating disorders. The subtasks are the following:
1. Subtask 1a: A binary classification. Labels are 0 for “control” and 1 for “suffer”.
2. Subtask 1b: Regression of the probability of suffering anorexia or bulimia. 0 means 100%
negative, and 1 means 100% positive.
2.1.2. Task 2
This task aims to detect depression. The subtasks we approached are:
1. Subtask 2b: Regression of the probability of suffering depression. 0 means 100% negative,
and 1 means 100% positive.
2. Subtask 2d: Multi-output regression over the depression-related classes (“control”,
“suffering+against”, “suffering+in favor”, and “suffering+other”). The system should
provide a probability of belonging to each class.
2.2. Data
To tackle the tasks of eating disorders and depression detection from Telegram messages, we
used the dataset provided by MentalRiskES at the IberLEF 2023 challenge [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The eating
disorder detection task consists of 50 rounds of messages, while the depression detection task
has 100 rounds of messages. We strategically divided the dataset, using 80% for training our
models and reserving the remaining 20% for dev-test. Furthermore, our approach ensured
stratification based on the label for the classification tasks.
      </p>
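      <p>A minimal sketch of this split, assuming one row per user and using scikit-learn, is shown below; the variable names are illustrative and not part of the official data format.</p>
      <preformat>
# Hedged sketch: 80/20 split of users, stratified by label for the
# classification tasks. user_ids and labels are illustrative names.
from sklearn.model_selection import train_test_split

def split_users(user_ids, labels, seed=42):
    # 80% of users for training, 20% reserved as the dev-test (validation) set
    return train_test_split(
        user_ids, labels, test_size=0.2, stratify=labels, random_state=seed
    )
      </preformat>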
      <p>Systems are evaluated according to different evaluation perspectives. For classification, the
perspectives are:
• Absolute classification : measures the absolute classification from the first round onward. If the
classifier outputs 1 in a round, the system keeps that output until the last round.</p>
      <p>The performance per round is measured with Micro and Macro F1.
• Early detection effectiveness : measures how fast the system detects the disorder. The
metric is the Early Risk Detection Error (ERDE), which measures the correctness and the
delay of the classification at a specific round.</p>
      <p>For regression, the evaluation perspectives are:
• Error basis: measures the error between the predictions of the model and the observed
values, using the Root Mean Square Error (RMSE).
• Ranking basis: measures the concordance between a ranking of the outputs and the observed
values. The metric is Precision@k (p@k), computed over the top k ranked users at a
specific round.</p>
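      <p>The snippet below is a hedged sketch of how these two regression metrics can be computed; it reflects our reading of the task description, and the official implementation of p@k may differ in details.</p>
      <preformat>
import numpy as np

def rmse(y_true, y_pred):
    # Root Mean Square Error between observed values and predictions
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sqrt(np.mean(diff ** 2)))

def precision_at_k(y_true_binary, scores, k):
    # p@k as we interpret it: fraction of true positives among the k users
    # ranked highest by the model at a given round
    top_k = np.argsort(scores)[::-1][:k]
    return float(np.mean(np.asarray(y_true_binary)[top_k]))
      </preformat>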
      <p>The evaluation perspectives and the metrics used to rank the systems are the following: for
the classification tasks, the absolute classification perspective is ranked with Macro-F1 and the
early detection effectiveness perspective with ERDE30.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>This section outlines the methodology employed to tackle the tasks and subtasks of the
MentalRiskES Challenge. Our process begins with a detailed overview of our data preprocessing
strategy, which centers on preparing text data for input into our chosen machine learning
models. Following this, we present and contrast our proposed methods against the challenge’s
baseline models, which are state-of-the-art transformer-based architectures. Our objective is to
illustrate the competitiveness of traditional machine-learning techniques against these baselines.
Comprehensive details about our preprocessing steps and each method are thoroughly explained
in the subsequent subsections.
3.1. Preprocessing
As a first step, we concatenated all the messages for each user. We analyzed this data to see
which words were more common in the complete dataset and per class. The most important finding in
Task 1 is that the word “Ana” was common in the class “suffer”. This information was useful in
finding out that teenagers, mainly females, identify themselves as Ana for anorexia and Mia for bulimia.</p>
      <p>
        After that, we lemmatized all the messages with spacy-spanish-lemmatizer [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], and removed
the word “mia” from the stopword set. Then, for feature extraction, we first used Term
Frequency Inverse Document Frequency (TF-IDF) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] with 2-grams to maintain some context. This representation is
the input for the machine learning models.
3.2. Methods
This subsection outlines our methods for our experiments, centering on traditional machine
learning models for solving these intricate tasks related to mental disorders. Initially, we
discuss the challenge-proposed baselines, incorporating state-of-the-art language models like
RoBERTa [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and DeBERTa [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Subsequently, we shed light on our solutions for the binary and
multi-class classification tasks related to eating disorders and depression. We have leveraged
traditional machine learning classifiers, including Support Vector Classifier, Multinomial Naive
Bayes, K-Nearest Neighbors, and Extreme Gradient Boosting. For regression tasks, our toolkit
comprises SGD Regressor, Ridge Regressor, Linear Regressor, Gradient Boost Regressor, and
Random Forest Regressor. We provide more specifics on these models in the following
subsections.
      </p>
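      <p>A minimal sketch of this preprocessing and feature-extraction pipeline follows. It assumes a standard spaCy Spanish model for lemmatization (the exact setup of spacy-spanish-lemmatizer may differ) and scikit-learn’s TfidfVectorizer with bigrams; model names and parameters are illustrative.</p>
      <preformat>
# Hedged sketch of the preprocessing described above, not the exact pipeline.
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
from spacy.lang.es.stop_words import STOP_WORDS

nlp = spacy.load("es_core_news_sm")  # assumed Spanish model

def lemmatize(text):
    # Lemmatize a concatenated user history with a Spanish spaCy pipeline
    return " ".join(token.lemma_ for token in nlp(text))

# Keep "mia" out of the stopword set, since it can signal bulimia
stopwords = [w for w in STOP_WORDS if w != "mia"]

# Bigram TF-IDF over the lemmatized, concatenated messages per user
vectorizer = TfidfVectorizer(ngram_range=(2, 2), stop_words=stopwords)
# X = vectorizer.fit_transform(lemmatized_user_texts)
      </preformat>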
      <sec id="sec-3-1">
        <title>3.2.1. Baselines</title>
        <p>
          The baselines for MentalRiskES tasks are the following:
• RoBERTa-Base [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]: is a model developed by Facebook AI and is a variant of BERT
optimized for better performance. It uses dynamic masking and trains on larger batches
of data. It discards the Next Sentence Prediction (NSP) task used in BERT, enabling it to
produce more nuanced and context-aware text representations. This makes it a powerful
tool for various natural language processing tasks, such as regression and classification,
after adding a linear layer.
• RoBERTa-Large [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]: similar to the RoBERTa Base, optimizes the BERT approach for
superior performance. The ’Large’ variant denotes a more extensive architecture with
355 million parameters, compared to RoBERTa Base’s 125 million, allowing it to model
more complex data patterns at a higher computational cost.
• DeBERTa [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]: is a transformer-based model used for natural language processing tasks.
        </p>
        <p>It builds upon the BERT model by introducing two main enhancements: disentangled
attention and an enhanced mask decoder. Disentangled attention in DeBERTa separates
the content and position streams in the self-attention mechanism. The model processes
the content (the what) and position (the where) information separately in the attention
scores, allowing for a more nuanced understanding of the context. The enhanced mask
decoder applies a scaling factor to the dot-product self-attention mechanism, improving
the model’s ability to model dependencies in masked language modeling tasks. These
improvements lead to a more refined understanding of the context in the text and allow
DeBERTa to excel in various NLP tasks, from classification to sentiment analysis and
beyond.</p>
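        <p>As an illustration only, the snippet below sketches this kind of baseline with the Hugging Face transformers library: a pretrained encoder plus a linear head obtained through AutoModelForSequenceClassification. The model name and settings are placeholders, not the organizers’ exact configuration.</p>
        <preformat>
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "roberta-base"  # placeholder checkpoint, for illustration only
tokenizer = AutoTokenizer.from_pretrained(model_name)
# num_labels=2 adds a binary classification head; num_labels=1 gives regression
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

inputs = tokenizer("mensaje de ejemplo", return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, num_labels)
probs = torch.softmax(logits, dim=-1)
        </preformat>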
      </sec>
      <sec id="sec-3-2">
        <title>3.2.2. Classification</title>
        <p>
          To accomplish classification tasks, we employed four traditional supervised algorithms that
have demonstrated strong results across diverse tasks in Natural Language Processing
(NLP), as implemented in the scikit-learn library [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]; a minimal usage sketch is given after the list below.
        </p>
        <p>
          • Support Vector Classifier (SVC) : The Support Vector Machine or Support Vector
Classifier (SVC) is a powerful method for regression and classification tasks [
          <xref ref-type="bibr" rid="ref11">11, 12</xref>
          ]. The
main objective of SVC is to find an optimal hyperplane that separates different classes,
maximizing the margin between them in the feature space. The algorithm uses the
training subset to identify the support vectors, which correspond to the
points nearest to the hyperplane. These support vectors play a crucial role in determining
the position and orientation of the hyperplane [12], directly impacting the decision
function. Additionally, SVC can handle both linearly and non-linearly
separable datasets by employing kernel functions [13]. These functions transform the
input data into a higher-dimensional space where a linear separation may exist. The most
popular kernels are Linear, Radial Basis Function (RBF), Polynomial, and Sigmoid.
• Multinomial Naive Bayes (Multinomial NB) Classifier : The algorithm is supported
by the well-known Bayes’ theorem [14], which provides the foundation for a simple
probabilistic method known as Naive Bayes (NB). This algorithm predicts the conditional
probability of an event by making certain naive independence assumptions. It calculates the probability
of occurrence based on previous knowledge of circumstances associated with the event,
leveraging the principles of Bayesian probability. The multinomial
model [15] becomes particularly relevant when dealing with discrete
features (e.g., word counts in text). The multinomial classifier effectively captures
the patterns and information embedded in the discrete features of the text, enabling
accurate classification.
• K Nearest Neighbors Classifier (KNN) : The KNN algorithm [16] is a classical and
uncomplicated classifier, yet highly efficient and effective in several tasks [17, 18]. The
procedure classifies unlabeled data based on the k nearest labeled points
from a previous sample [19], commonly called training examples. The essence of this
approach depends mainly on the distance measure between examples, the number
of neighbors k, and the neighborhood combination criterion, where k is
an integer ≥ 1 that defines the scope of the observed neighborhood.
• Extreme Gradient Boosting Classifier (XGB) : it operates under the principle of
gradient boosting, which involves building multiple weak predictive models, typically decision
trees, and combining them in an ensemble for stronger and more accurate predictions.
When applied to text classification, XGBoost starts with preprocessing the text data.
This involves steps such as tokenization, removing stop words, and vectorization. After
preprocessing, each document is represented as a vector, forming a high-dimensional
dataset. XGBoost then builds decision trees iteratively. For each iteration, it adds a new
tree that attempts to correct the errors made by the ensemble of existing trees. It does this
by assigning more weight to the instances that were misclassified in previous iterations.
This process continues until a specified number of trees are added or no further
improvements can be made. XGBoost also employs a technique called regularization, which helps
to prevent overfitting by penalizing complex models, leading to better performance on
unseen data.
        </p>
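        <p>The following is a minimal, hedged sketch of the classifiers above applied to TF-IDF features; hyperparameters are the library defaults except the MultinomialNB smoothing value used for Task 1a, XGB is assumed to come from the separate xgboost package, and names such as X_train are illustrative.</p>
        <preformat>
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier

classifiers = {
    "Linear SVC": SVC(kernel="linear"),
    "RBF SVC": SVC(kernel="rbf"),
    "Multinomial NB": MultinomialNB(alpha=0.01),  # additive smoothing
    "KNN": KNeighborsClassifier(),
    "XGB": XGBClassifier(),
}

# X_train: TF-IDF matrix of concatenated user messages; y_train: 0/1 labels
# for name, clf in classifiers.items():
#     clf.fit(X_train, y_train)
#     predictions = clf.predict(X_dev)
        </preformat>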
      </sec>
      <sec id="sec-3-3">
        <title>3.2.3. Regression</title>
        <p>
          To accomplish regression tasks, we employed five traditional supervised algorithms used in
Natural Language Processing (NLP), as implemented in the scikit-learn library [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] (a minimal usage sketch is given after the list):
• SGD Regressor: is a linear model that employs a stochastic approximation to the gradient
descent optimization. It is particularly effective for large-scale data due to its efficiency
and ease of implementation. The model predicts the outcome using a linear combination
of feature values and model parameters.
• Ridge Regressor: is a technique for analyzing multiple regression data suffering from
multicollinearity. By adding a degree of bias to the regression estimates, it reduces the
standard errors. This technique employs L2 regularization, which discourages large
coefficients by penalizing the squares of the coefficients.
• Linear Regressor: is a simple form of regression that assumes a linear relationship
between the independent and dependent variables. It can be single (one independent
variable) or multiple (more than one independent variable). The model calculates the
best-fit line for the observed data by minimizing the sum of the squares of the vertical
deviations from each data point to the line.
• Gradient Boosting Regressor Trees (GBRT): The GBRT algorithm corresponds to a
variant of Gradient Boosting (GB) that is particularly suited to regression tasks, where
the objective is not to predict classes, as in XGB, but to predict continuous
values. The Gradient Boosting Regressor involves the sequential addition of decision trees,
assigning greater importance to instances with higher prediction errors. This iterative
process continues until a predetermined number of trees are added, or no improvements
are found.
• Random Forest Regressor: is an ensemble learning method that builds multiple decision
trees and outputs the mean prediction of the individual trees. It reduces overfitting
common in single decision trees, and improves generalization by introducing randomness
in constructing the trees.
        </p>
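        <p>Analogously, the sketch below instantiates the regressors listed above with their scikit-learn defaults; clipping predictions to [0, 1] is our own illustrative choice rather than part of the official setup.</p>
        <preformat>
from sklearn.linear_model import SGDRegressor, Ridge, LinearRegression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

regressors = {
    "SGD": SGDRegressor(),
    "Ridge": Ridge(),
    "Linear": LinearRegression(),
    "Gradient Boost": GradientBoostingRegressor(),
    "Random Forest": RandomForestRegressor(),
}

# X_train: TF-IDF matrix; y_train: probability of the class "suffer" in [0, 1]
# for name, reg in regressors.items():
#     reg.fit(X_train, y_train)
#     y_pred = reg.predict(X_dev).clip(0.0, 1.0)  # keep outputs in [0, 1]
        </preformat>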
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Methodology</title>
      <p>
        As preprocessing, we used the Spanish lemmatizer from spaCy [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. For feature extraction, we
used TF-IDF [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] with bigrams or trigrams, keeping whichever gave better results.
      </p>
      <p>Afterwards, we trained the models on all the rounds concatenated for each task, using the users in
the training set. To evaluate the models, we predicted on the users of the dev-test set, also known
as the validation set, where each round concatenates all the previous messages.</p>
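      <p>The following sketch illustrates this round-by-round evaluation under our assumptions about the data layout (a dictionary from user to the list of messages per round); it is not the official evaluation client.</p>
      <preformat>
def evaluate_per_round(model, vectorizer, rounds_per_user):
    """Predict for every user at every round using the messages seen so far.

    rounds_per_user: dict mapping user_id to [msg_round_1, msg_round_2, ...]
    Returns a list with one {user_id: prediction} dict per round.
    """
    n_rounds = max(len(msgs) for msgs in rounds_per_user.values())
    predictions = []
    for r in range(1, n_rounds + 1):
        users, texts = [], []
        for user, msgs in rounds_per_user.items():
            users.append(user)
            texts.append(" ".join(msgs[:r]))  # concatenate messages up to round r
        X = vectorizer.transform(texts)
        predictions.append(dict(zip(users, model.predict(X))))
    return predictions
      </preformat>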
      <p>Finally, for the selection criteria, we plotted the evolution of the metrics for each model, and we
chose the model that ended with the best value of a given metric and with stability (its predictions do not
change strongly over a few incoming messages). The experiment modules are shown in Figure 1.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>5.1. Task 1: Eating Disorders Detection
For this task, there are two approaches to detecting an eating disorder. The first is a classification
problem over the classes “suffer” and “control”. The second is a simple regression of the
probability of the class “suffer”.</p>
      <sec id="sec-5-1">
        <title>5.1.1. Binary Classification</title>
        <p>In this task, our approach obtains a Macro-F1 of 0.827, ranking 9 out of 25
models; the best model obtains 0.966. Our method outperformed the baseline models in accuracy,
Macro-P, Macro-R, and Macro-F1. In early risk detection, our model obtains an ERDE30 of 0.074,
ranking 5 out of 25 and outperforming the baselines; the best model in this metric obtains 0.018.</p>
        <p>The preprocessing for this task was lemmatization followed by bigram TF-IDF. The classifiers
used for this task are Linear SVC, RBF SVC, KNN, XGB, and MultinomialNB, all with the scikit-learn
default parameters, except for MultinomialNB, which uses alpha = 0.01 for the additive smoothing
parameter.</p>
        <p>We observe instability in the metrics over the first messages for every model. KNN, RBF SVC,
and XGB have an unstable evolution. This means that, in absolute classification, the performance of
these models would be lower because their decisions change between rounds.</p>
        <p>On the other hand, MultinomialNB and Linear SVC have better metrics and similar
stability. We chose MultinomialNB because, after 8 rounds, it has the highest metrics and a
more stable evolution.</p>
        <p>Table 3 shows the results of the validation and test set. The metrics of the absolute
classification show that Multinomial Naive Bayes performs better than transformers and the mean of
the rest of the systems.</p>
        <sec id="sec-5-1-1">
          <title>Accuracy</title>
        </sec>
        <sec id="sec-5-1-2">
          <title>Macro-P</title>
        </sec>
        <sec id="sec-5-1-3">
          <title>Macro-R</title>
        </sec>
      </sec>
      <sec id="sec-5-2">
        <title>Macro-F1 Model Validation</title>
        <sec id="sec-5-2-1">
          <title>Multinomial Naive Bayes</title>
          <p>Test</p>
        </sec>
        <sec id="sec-5-2-2">
          <title>Multinomial Naive Bayes</title>
        </sec>
        <sec id="sec-5-2-3">
          <title>BaseLine - Roberta Large</title>
        </sec>
        <sec id="sec-5-2-4">
          <title>BaseLine - Deberta</title>
        </sec>
        <sec id="sec-5-2-5">
          <title>Other systems (median)</title>
        </sec>
        <sec id="sec-5-2-6">
          <title>Other systems (mean)</title>
        </sec>
        <sec id="sec-5-2-7">
          <title>BaseLine - Roberta Base</title>
          <p>Table 4 shows the results for early prediction. At ERDE5, the performance is not well because,
in the submission, we could not send the first 8 rounds. When simulating this in validation, we
see similar values. For ERDE30, we can see that our model is superior to the others but with a
higher latency, which means that latency-weighted F1 is lower than all the transformers-based
models.</p>
        </sec>
        <sec id="sec-5-2-8">
          <title>ERDE5</title>
        </sec>
        <sec id="sec-5-2-9">
          <title>ERDE30 latencyTP</title>
          <p>speed latency-weightedF1</p>
          <p>The failed prediction in the test partition is mainly caused by an early prediction of the
class “sufer”. This could have been avoided with a higher Multinomial Naive Bayes classifier
threshold when predicting probabilities of the class “sufer”.</p>
        </sec>
      </sec>
      <sec id="sec-5-3">
        <title>5.1.2. Simple Regression</title>
        <p>Our model obtains an RMSE of 0.244, ranking 8 out of 20. The baselines
outperformed our model; the best model is Baseline Roberta Base with an RMSE of 0.178.
On the other hand, in the ranking-based evaluation with p@30, our model obtains 0.733, and the best
model is Baseline Roberta Large with 0.900.</p>
        <p>The preprocessing for this task was lemmatization, with bigram TF-IDF for feature extraction.
All models used their default parameters. The regressors were SGD, Ridge,
Linear, Gradient Boost, and Random Forest.</p>
        <p>Figure 3 shows the evolution of the Mean Absolute Error (panel a) and the R2 Score (panel b) for Task 1b. In the
first round, the Linear Regressor has better metrics than the rest of the models, but after 2
rounds, Random Forest and Gradient Boost are always better. We chose the Gradient Boost
Regressor because, after 10 rounds, it has the lowest Mean Absolute Error and the highest R2 Score.</p>
        <p>Table 5 shows that the transformers perform better than the rest of the systems in both
evaluation perspectives. Regression is more difficult, so we expected complex models
to outperform the traditional machine learning models.</p>
        <p>[Table 5 residue: RMSE and Pearson correlation on validation and test for the Gradient Boost Regressor, BaseLine - Roberta Base, BaseLine - Roberta Large, BaseLine - Deberta, and the other systems (mean and median); values not recoverable.]</p>
        <p>[Table 6 residue: p@5 and p@30 per system; only a column of values (0.800, 1.000, 0.800, 0.600, 0.763, 0.700, 0.800, 0.800, 0.900, 0.700, 0.694, 0.700, 0.700) survived extraction.]</p>
        <p>Table 6 shows that our model performs better than the other systems, but again,
transformer-based models outperform our Gradient Boost Regressor at round 25.</p>
        <p>We compared the evolution of p@5 and p@30 for the baseline methods and ours over the
sampled rounds in Figures 4a and 4b. Transformers have had the same or better
performance since the beginning of the rounds. This is probably due to their larger vocabulary
and more accurate representation of words.
5.2. Task 2: Depression Detection
For this task, we only approached the regression problems. The first is a simple
regression to predict the probability of a user suffering from depression. The second is a
multi-output regression over the classes “control”, “suffering+against”, “suffering+in favor”, and
“suffering+other”.</p>
      </sec>
      <sec id="sec-5-6">
        <title>5.2.1. Simple Regression</title>
        <p>Our model obtains an RMSE of 0.309, ranking 4 out of 19. The best model is Baseline
Roberta Base with an RMSE of 0.277. In the ranking-based evaluation, our model has the best
results in every metric, with a p@30 of 0.600.</p>
        <p>For this task, the feature extraction was bigram TF-IDF. All the models used their default
parameters. The regressors were SGD, Ridge, Linear, Gradient Boost, and Random Forest.
Figure 5 shows the evolution of the Mean Absolute Error and the R2 Score on the validation set. We
can see erratic behavior of the Random Forest and Gradient Boost Regressors in both metrics. SGD
has a more stable behavior but the worst metrics over all rounds. The best models are Ridge and
Linear Regressor. After 20 messages, both are stable, but the Linear Regressor has a lower Mean
Absolute Error.</p>
        <p>[Figure 5 panel captions: (a) Mean Absolute Error evolution in Task 2b; (b) R2 Score evolution in Task 2b.]</p>
        <p>Table 8 shows that at round 25, Linear Regressor has the same or better performance in every
ranking-based evaluation.</p>
        <p>[Table residue (Task 2b RMSE and ranking-based results): RMSE, p@5, and p@30 on validation and test for the Linear Regressor, BaseLine - Roberta Base, BaseLine - Roberta Large, BaseLine - Deberta, and the other systems (mean and median); only a column of values (0.800, 0.400, 0.600, 0.800, 0.320, 0.400, 0.800, 0.500, 0.800, 0.600, 0.300, 0.300, 0.700) survived extraction.]</p>
        <p>If we look at the evolution of the rankings p@5 and p@30 in Figures 6a and 6b, respectively,
our model, the Linear Regressor, starts below Roberta Base and Large at p@5 in round 1, but
from round 25 it has the best performance, with metrics similar to Roberta Base. On the
other hand, at p@30, the Linear Regressor is the best in rounds 1 and 25, but the metric
gets lower afterwards.</p>
        <p>Unlike Task 1b, a machine learning model performed better than the transformer-based models
in the ranking metrics. This could be explained by Task 2b having a less sparse vocabulary (9,046
unique words over 100 messages of the training set, versus 8,860 unique words over 50
messages in Task 1), resulting in a more robust machine learning model for this task
that can compete with a transformer.</p>
        <p>The fact that this model outperformed transformer-based models in the ranking metrics is
unexpected, considering that fine-tuning a language model is being compared with a Linear Regressor.
This leads us to think that, with a smaller corpus, it is not always necessary to apply big, complex
models; depending on the behavior of a simpler model, one can then try models that might perform
better, always considering that a model requiring higher computing capacity will
produce higher emissions.</p>
      </sec>
      <sec id="sec-5-8">
        <title>5.2.2. Multi-output Regression</title>
        <p>Our model obtains a mean RMSE of 0.349, ranking 5 out of 7. The best model for this task
is Baseline Deberta with 0.246. In the ranking-based evaluation on the p@30 metric, our approach ranks 7 out of 7
models with 0.175, and the best model obtains 0.350.</p>
        <p>For this task, we used trigram TF-IDF. The regressors for this task were SGD, Ridge,
Linear, Gradient Boost, and Random Forest, all with default parameters.</p>
        <p>Figure 7 shows an unstable evolution of the Gradient Boost and Random Forest Regressors, with
a lower Mean Absolute Error in the first rounds (panel a). However, as the rounds pass, their error becomes similar
to that of the Linear Regressor. On the other hand, the R2 Score (panel b) shows that the Linear Regressor, after the 20th
round, always has a higher score and remains stable, so we chose it for the submission.</p>
        <p>Table 9 shows that our Linear Regressor has a better RMSE than Roberta Base and Large. However,
the metrics RMSE sa and Pearson sa show that our model is biased towards the class “suffering+against”,
and we can see the same behavior in Roberta Base and Large. This can be explained by
the imbalanced probabilities in this task. In contrast, Deberta is less biased, as its metrics
are more similar across classes.</p>
        <p>[Table 9 residue: RMSE (mean, sf, sa, so, c) and Pearson correlation (mean, sf, sa, so, c) on validation and test for the Linear Regressor, BaseLine - Deberta, BaseLine - Roberta Base, BaseLine - Roberta Large, and the other systems (mean and median); values not recoverable.]</p>
        <p>Table 10 shows that the Linear Regressor cannot rank properly in comparison with the rest of the
models. This task is more complex than the others, so the transformers’ pretrained knowledge of
the language can provide a more precise output.</p>
        <p>[Table 10 residue: p@5 and p@30 on validation and test for the Linear Regressor, BaseLine - Roberta Base, BaseLine - Roberta Large, BaseLine - Deberta, and the other systems (mean and median); only the values 0.350 and 0.225 survived extraction.]</p>
        <p>Figure 8 shows the evolution of p@5 and p@30 (the latter in panel b): after round 25, the Linear Regressor cannot
reach the rankings that the transformer-based models achieve.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Error Analysis</title>
      <p>This section describes the failed predictions and characterization, possible improvements, and
lessons learned from each task.</p>
      <p>For eating disorders, the classification errors were mainly associated with sporty men
with a late mention of intentions to lose weight. The tendency of the previous cumulative
messages affected the model; we could address this by predicting the probability over all the
text, predicting over the latest message by itself, and weighting the two with a hyperparameter
(see the sketch below). Another factor that can affect this behavior is the high-dimensional TF-IDF
representation. We could limit the number of features and also prevent overfitting to certain n-grams.</p>
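      <p>The sketch below illustrates these two ideas: capping the TF-IDF vocabulary with max_features, and blending the probability computed over the whole history with the probability of the latest message. The cap of 20,000 features and the weight lam are illustrative assumptions, not tuned values.</p>
      <preformat>
from sklearn.feature_extraction.text import TfidfVectorizer

# Cap the vocabulary to limit the dimensionality of the TF-IDF representation
vectorizer = TfidfVectorizer(ngram_range=(2, 2), max_features=20000)

def blended_probability(clf, vectorizer, history_text, last_message, lam=0.7):
    # Weighted combination of the cumulative-history probability and the
    # probability of the latest message alone (lam is a hyperparameter)
    p_history = clf.predict_proba(vectorizer.transform([history_text]))[0, 1]
    p_last = clf.predict_proba(vectorizer.transform([last_message]))[0, 1]
    return lam * p_history + (1.0 - lam) * p_last
      </preformat>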
      <p>In the ERDE metric, the main source of error was certain words in the context of gym and
sports, such as ‘subir peso’, ‘bajar peso’, or ‘alimenticio’, or users identifying themselves as ‘Ana’ because
their real name is Ana, which leads to wrong early predictions. We could improve this by requiring
a high probability threshold for the first messages and lowering it as the rounds continue, as sketched below.</p>
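      <p>A possible implementation of this idea is sketched below: a decision threshold that starts high and decays linearly over the rounds. The start and end values and the linear schedule are assumptions for illustration.</p>
      <preformat>
def round_threshold(round_idx, start=0.9, end=0.5, n_rounds=50):
    # Linearly decay the decision threshold from `start` to `end` over the rounds
    frac = min(round_idx / n_rounds, 1.0)
    return start + (end - start) * frac

def decide(prob_suffer, round_idx):
    # Emit a positive decision only when the probability clears the
    # round-dependent threshold
    return int(prob_suffer >= round_threshold(round_idx))
      </preformat>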
      <p>In the eating disorder regression problem, in RMSE, there could be an overfitting problem,
because the validation errors are better than all the baselines. Still, on the test set, those models
outperform our model, and the same happens with the Pearson coefficient. As
before, the feature extractor should have a limited number of n-grams. The errors were mainly on subjects
with low probabilities who were trying to lose weight healthily.</p>
      <p>In the ranking-based evaluation of the regression, the failed predictions were similar to the previous ones.
Still, the messages were more associated with food or diets in a gym context, such as fat
percentage, “frutos secos”, “batidos”, or “suplemento alimenticio”, but also with subjects talking about
people wanting to start going to the gym.</p>
      <p>In depression detection, the failed regressions were mostly associated with people with
depression who are trying to help others or give advice to someone else. Also, some subjects with
low-probability labels use the “cara llorando” (crying face) emoji a lot. We could solve this by adding context
to the linear regressor, or by using other message representations that keep and weight information
from previous messages; this way, we could prevent forgetting valuable context.</p>
      <p>In multi-output regression, the linear regressor was biased towards the class “suffering+against”,
which has the highest overall probability after “control”. There is significant confusion among all
classes because the feature extractor and the model cannot capture the context of the messages.
There are many errors, and there is no pattern in them.</p>
      <p>To summarize, the main sources of error are related to overfitting towards a class,
because the feature extractor and the model cannot capture the context of the messages. To improve
the metrics, we could weight the probabilities of previous predictions and
use them to preserve context. Context was essential for the second task.</p>
      <p>Also, for feature extraction, we should find better values for the maximum number of TF-IDF features to
prevent overfitting on overly specific words that do not add information to the messages or that
introduce noise to the model. Besides, we could try other text representations of the messages,
which might yield better results with other classifiers or regressors.</p>
      <p>Another way to improve the results is to cross-validate at specific rounds and choose a
model according to those results, to prevent a partition biased in favor of one class, or to use another
strategy, such as histograms, for a more balanced probability distribution.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusions and future work</title>
      <p>Despite the growing interest in transformer-based techniques in NLP, we show that traditional
machine-learning models can outperform transformers in more straightforward tasks. Still, we
have to consider the size of the corpus and vocabulary: with a smaller corpus, it is more likely that a
transformer performs better than a machine learning model.</p>
      <p>Machine learning can compete with a transformer model on classification tasks, both in absolute
classification and in early detection. Still, it has to be well calibrated for the task, or at least
be stable during training. We can perform better with a higher probability threshold on
the model output when classifying the messages. In this type of task, if training is unstable,
meaning that the classification changes regularly for the same subject, machine learning
models will not achieve good performance.</p>
      <p>On the other hand, transformers outperform machine learning models in the ranking metrics for simple
regression. They usually have better metrics at the beginning because they are pre-trained
models that can preserve the context of the messages. Still, with simpler models and a bigger
corpus for feature extraction, machine learning can achieve similar or better results as the
number of messages increases.</p>
      <p>In a more complex task, such as multi-output regression, transformers outperform machine
learning because a multi-output setting has less data to train on per class compared to a simple
regression, and a higher class imbalance than binary tasks. This can result in an overfitted
model, so transformers have a clear advantage over machine learning models because of their
pretraining on a much bigger corpus.</p>
      <p>In future work, we propose to test other feature extraction techniques, such as TF-IDF
with different parameters, bag of words, n-grams, or word2vec, and to use
other models, such as neural networks, for the various round-based problems. A more theoretical
approach to this problem would be to derive a bound or an optimal relation between
vocabulary, corpus, and features for specific models, and to use this corpus as an experimental set.</p>
      <p>As a final thought, we want to emphasize that, despite the better performance of the
transformer-based models, we can compare the emissions produced in training and deployment
and question how much improvement in a metric is worth a higher-emission model. To
make that decision, we have to consider how important the decision is and measure the impact
on the environment and on people’s lives regarding bias, influence, quality of life, or privacy.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgements</title>
      <p>This work was funded by ANID Chile: Basal Funds for Center of Excellence FB210017 (CENIA),
FB210005 (CMM); Millennium Science Initiative Program ICN17_002 (IMFD) and ICN2021_004
(iHealth), Fondecyt grant 11201250, and National Doctoral Scholarships 21211659 (Claudio
Aracena) and 21221155 (Carlos Muñoz-Castro). This research was partially supported by the
supercomputing infrastructure of the NLHPC (ECM-02) and the Patagón supercomputer of
Universidad Austral de Chile (FONDEQUIP EQM180042).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B. N.</given-names>
            <surname>Rudd</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Beidas</surname>
          </string-name>
          ,
          <article-title>Digital mental health: the answer to the global mental health crisis?</article-title>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Mármol-Romero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Moreno-Muñoz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Plaza-del-Arco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Molina-González</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Martín-Valdivia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. A.</given-names>
            <surname>Ureña-López</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Montejo-Ráez</surname>
          </string-name>
          , Overview of MentalriskES at IberLEF 2023:
          <article-title>Early Detection of Mental Disorders Risk in Spanish</article-title>
          ,
          <source>Procesamiento del Lenguaje Natural</source>
          <volume>71</volume>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Cañete</surname>
          </string-name>
          , G. Chaperon,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fuentes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-H.</given-names>
            <surname>Ho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pérez</surname>
          </string-name>
          ,
          <article-title>Spanish pre-trained bert model and evaluation data</article-title>
          ,
          <source>Pml4dc at iclr</source>
          <year>2020</year>
          (
          <year>2020</year>
          )
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>V.</given-names>
            <surname>Araujo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Carvallo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kundu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cañete</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mendoza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. E.</given-names>
            <surname>Mercer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bravo-Marquez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-F.</given-names>
            <surname>Moens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Soto</surname>
          </string-name>
          ,
          <article-title>Evaluation benchmarks for spanish sentence representations</article-title>
          ,
          <source>arXiv preprint arXiv:2204.07571</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Cañete</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Donoso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bravo-Marquez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Carvallo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Araujo</surname>
          </string-name>
          ,
          <article-title>Albeto and distilbeto: Lightweight spanish language models</article-title>
          ,
          <source>arXiv preprint arXiv:2204.09145</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>J. D.</surname>
          </string-name>
          De la Rosa, E. G. Ponferrada, M. Romero, P. Villegas, P. González de Prado Salas, M. Grandury,
          <article-title>Bertin: Efficient pre-training of a Spanish language model using perplexity sampling</article-title>
          ,
          <source>Procesamiento del Lenguaje Natural</source>
          <volume>68</volume>
          (
          <year>2022</year>
          )
          <fpage>13</fpage>
          -
          <lpage>23</lpage>
          . URL: http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6403.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Honnibal</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Montani</surname>
          </string-name>
          , spaCy 2:
          <article-title>Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing (</article-title>
          <year>2017</year>
          ). URL: https://github.com/pablodms/spacy-spanish-lemmatizer, to appear.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>F.</given-names>
            <surname>Pedregosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Varoquaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gramfort</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Michel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Thirion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Grisel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Blondel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Prettenhofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Weiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Dubourg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Vanderplas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Passos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cournapeau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Brucher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Perrot</surname>
          </string-name>
          , E. Duchesnay,
          <article-title>Scikit-learn: Machine learning in Python</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>12</volume>
          (
          <year>2011</year>
          )
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <article-title>Roberta: A robustly optimized bert pretraining approach</article-title>
          , arXiv preprint arXiv:1907.11692 (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>P.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          , W. Chen, Deberta:
          <article-title>Decoding-enhanced bert with disentangled attention</article-title>
          , arXiv preprint arXiv:2006.03654 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Cervantes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Garcia-Lamont</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Rodríguez-Mazahua</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lopez</surname>
          </string-name>
          ,
          <article-title>A comprehensive survey on support vector machine classification: Applications, challenges and trends,</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>