UNSL at eRisk 2022: Decision policies with history for early classification

Juan Martín Loyola1,3, Horacio Thompson1,2, Sergio Burdisso1,4 and Marcelo Errecalde1

1 Universidad Nacional de San Luis (UNSL), Ejército de Los Andes 950, San Luis, C.P. 5700, Argentina
2 Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET)
3 Instituto de Matemática Aplicada San Luis (IMASL), CONICET-UNSL, Av. Italia 1556, San Luis, C.P. 5700, Argentina
4 Idiap Research Institute, Rue Marconi 19, 1920 Martigny, Switzerland

Abstract
For the 2022 edition of the CLEF eRisk Laboratory, our research group at Universidad Nacional de San Luis (UNSL) introduced new approaches and improvements with respect to our last participation. We proposed two decision policies for EarlyModel that take into account the historical information available to the models, and incorporated two score normalization steps into the SS3 model. We also significantly reduced the runtime needed to process the inputs. Despite not having achieved the best performances overall, our team obtained the best results for ERDE50 in tasks T1 and T2. Besides, considering 𝐹latency, we were the third-best team for both tasks. Finally, a couple of our models achieved some of the best performances for the ranking-based metrics in task T1.

Keywords
Early Risk Detection, Early Classification, Learned Early Alert Policy

1. Introduction

The 2022 edition of the early risk prediction on the Internet laboratory (eRisk) [1] presents two tasks for early risk detection: early detection of signs of pathological gambling and early detection of depression. Both had been introduced in previous editions [2, 3, 4], thus participants had corpora available for training or validation. The performance was assessed with the same metrics as in the last edition. That is, the standard classification measures (precision, recall, and 𝐹1 score), measures that penalize delay in the response (ERDE [5] and 𝐹latency [6]), and ranking-based evaluation metrics were used.
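For reference, the two delay-penalizing measures can be written as follows (our transcription of the definitions in [5] and [6]; 𝑜 and 𝑝 are parameters fixed by the evaluation, 𝑘 is the number of writings seen before the decision, and the costs 𝑐 are set by the organizers):

```latex
\mathrm{ERDE}_{o}(d,k) =
\begin{cases}
  c_{fp} & \text{if } d \text{ is a false positive} \\
  c_{fn} & \text{if } d \text{ is a false negative} \\
  lc_{o}(k)\, c_{tp} & \text{if } d \text{ is a true positive} \\
  0 & \text{if } d \text{ is a true negative}
\end{cases}
\qquad
lc_{o}(k) = 1 - \frac{1}{1+e^{\,k-o}}

F_{\text{latency}} = F_{1} \cdot \left(1 - \operatorname*{median}_{\text{TP}} \; \mathit{penalty}(k)\right),
\qquad
\mathit{penalty}(k) = -1 + \frac{2}{1+e^{-p\,(k-1)}}
```

Thus ERDE rewards correct positive decisions made early (small 𝑘), while 𝐹latency scales the positive-class 𝐹1 by the speed of the true-positive decisions.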
The 𝐹1 and 𝐹latency scores were computed with respect to the positive class only. The remaining sections describe the models and datasets used, and discuss the results obtained. Section 2 gives a brief description of the methods and points out the main differences from our last participation. Sections 3 and 4 describe the datasets, the models’ parameters, and their results for Task 1 and Task 2, respectively. Finally, Section 5 closes with the conclusions of the work.

CLEF 2022 – Conference and Labs of the Evaluation Forum, September 5–8, 2022, Bologna, Italy
§ https://github.com/jmloyola/unsl_erisk_2022
$ jmloyola@unsl.edu.ar (J. M. Loyola); hjthompson@unsl.edu.ar (H. Thompson); sburdisso@unsl.edu.ar (S. Burdisso); merreca@unsl.edu.ar (M. Errecalde)
ORCID: 0000-0002-9510-6785 (J. M. Loyola)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

2. Method

Our participation in this edition followed similar steps to the last edition [7]. The corpus generation and data pre-processing steps did not change. The same kinds of models were used, but we improved upon them and proposed new decision policies. Also, to compare our performance with the models from last year, we trained and validated our models using the generated corpus and tested them using the provided corpus. Finally, we reduced the time each run took to process the writings. In what follows, the enhancements with respect to our previous participation are presented.

2.1. Early risk detection models

The models proposed for early risk detection were based on the early classification framework presented by Loyola et al. [8]. This framework divides the task into two separate but related problems: classifying with partial information and deciding the moment of classification.
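The two-component split can be sketched as a simple reading loop. The following is a minimal illustration with a hypothetical classifier (CPI) and decision policy (DMC) as stand-ins, not the actual implementation:

```python
def early_classify(posts, cpi, dmc):
    """Process a user's posts one at a time.

    cpi: classifier with partial information -> positive-class probability.
    dmc: decision policy -> True when an alarm should be raised.
    Both components are hypothetical stand-ins for the models in this paper.
    """
    history = []                       # past CPI scores, used by history-aware policies
    for delay, _post in enumerate(posts, start=1):
        prob = cpi(posts[:delay])      # classify with the partial input seen so far
        history.append(prob)
        if dmc(prob, delay, history):  # decide the moment of classification
            return "positive", delay   # raise the alarm and stop reading
    return "negative", len(posts)      # input exhausted without an alarm


# Toy usage: CPI = fraction of posts containing "risk";
# DMC = a SimpleStopCriterion-like threshold rule (illustrative values).
cpi = lambda seen: sum("risk" in p for p in seen) / len(seen)
dmc = lambda prob, delay, history: prob > 0.6 and delay >= 2
label, delay = early_classify(["ok", "risk here", "more risk"], cpi, dmc)
```

The point of the abstraction is that any CPI can be paired with any DMC, which is how the policies described below were swapped in and out.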
The task of classification with partial information (CPI) consists of obtaining an effective model that predicts the class of a document using only the information available so far. On the other hand, the task of deciding the moment of classification (DMC) involves determining the point at which we can stop reading the input with some certainty that the prediction made will be correct. We used the same kinds of models as last year: EarlyModel [7], EARLIEST [9], and SS3 [10]. The main difference was in the DMC component. We proposed two new decision policies for EarlyModel and two different normalization steps for the SS3 scores. The new decision policies consider the model’s past scores and other information from the document’s context. The motivation behind the normalization of the SS3 scores was to restrict the scores to the interval [0, 1] and avoid an infinitely increasing score1. Next, we briefly describe each early risk detection model and the proposed improvements.

EarlyModel. This model tackles both early classification subtasks separately. Thus, it is composed of two parts: a classifier with partial information and a decision policy in charge of raising an alarm. While the input is being processed, the partial classifier categorizes it, returning the class probability. The class probability and other information from the context are fed to the decision policy, which decides whether we should stop processing the input and raise an alarm or continue reading. The EarlyModels were built following our last edition’s workflow. The best pairs were selected and integrated with a decision policy. This year three policies were evaluated:

1 Note that the score of each user is additive. That is, as new posts arrive, the user’s score could potentially increase more and more, never reaching a limit. Since SS3 has a global decision policy, the scores of all users are considered to make a decision.
Even users whose input has already been fully processed impact the decision of active users.

• SimpleStopCriterion. The decision policy used last year. To decide if an alarm for a user had to be emitted, three attributes were considered: the predicted class, the current delay, and the predicted positive-class probability. If the user is predicted as positive, the probability of belonging to the positive class exceeds a certain threshold (𝛿), and more than 𝑛 posts have been processed, then an alarm is raised. Different combinations of 𝛿 and 𝑛 were tested, selecting the best ones according to the 𝐹latency measure obtained in the validation stage.
• HistoricStopCriterion. To avoid hasty alarms, SimpleStopCriterion was extended by considering some of the model’s previous positive-class probabilities. This policy states that if the current probability exceeds the threshold and the last 𝑚 predictions also do so, the system must issue a user-at-risk alarm and end the analysis; otherwise, it is necessary to continue evaluating the user.
• LearnedDecisionTreeStopCriterion. A decision policy based on a learned decision tree. This was only used in the depression task, since the laboratory organizers provided two datasets, one of which we could use to train the model without leakage of information. In order to train this model, we labelled the point 𝑝 at which the system could emit an alarm. Fifty positive users of the depression training corpus for eRisk 20182 were labelled for this task. Half of them were used for training and the other half for validation. For each user, multiple samples were generated, one for each of the posts. A sample at time 𝑡 contained all the publications at times 𝑖 such that 𝑖 ≤ 𝑡. To label each sample, we compared the time 𝑡 of the sample with the decision point 𝑝. If 𝑡 < 𝑝, a negative label was assigned to the sample. For the following ten samples, 𝑝 ≤ 𝑡 < 𝑝 + 10, the positive label was assigned.
The features calculated for each sample were: the current positive-class probability given by the CPI; the average of the last ten positive-class probabilities given by the CPI; the average of the last five positive-class probabilities given by the CPI; the median of the last ten positive-class probabilities given by the CPI; the current delay; the current label assigned by the CPI; the average of the last ten labels assigned by the CPI; the number of words in the top 0.01 percentile of information gain for the depression training corpus for eRisk 2018; and the number of words in the top 0.015 percentile of chi-squared for the same corpus. Then, a decision tree model was trained using group k-fold to ensure all the samples for a user were in the same group. Grid search was used to find the best parameters. Finally, the best model was evaluated on the test corpus we built.

EARLIEST. An end-to-end deep learning model that tackles both early classification tasks at the same time. The model is composed of three parts: a base RNN that summarizes the partial input, a controller that decides whether or not to continue processing the input, and a discriminator that classifies the partial input once the controller halts the processing. Reinforcement learning is used to train the model. This year, a couple of bugs in the implementation were fixed and the model was forced to make decisions only after seeing at least five posts.

2 Note that this corpus was not used for training, validation or testing of the system, thus any leakage was avoided.

SS3. A two-part early classification model where SS3 [10] is used as the classifier with partial information and a user-global early alert policy is used to halt the processing. The policy used to raise an alarm for a particular user takes into account its score value, globally, with respect to the current scores of all the other users. This year the users’ scores were normalized using two approaches:

• N1.
First, to limit the score to the interval [0, 1], the softmax function was applied to the confidence values (cv) that the model gave to the positive and negative classes. Given the additive nature of the confidence values and the behaviour of the softmax function when its inputs grow3, the confidence values were divided by the current delay of the document. For the users that have no more posts, the delay is equal to the total number of posts sent. Finally, the score for a user was calculated as:

𝑠𝑜𝑓𝑡𝑚𝑎𝑥(𝑐𝑣positive /delay, 𝑐𝑣negative /delay)

• N2. Second, the proportion of the positive confidence value over the sum of the positive and negative confidence values was used. Thus, the score for a user was computed as:

𝑐𝑣positive /(𝑐𝑣positive + 𝑐𝑣negative )

2.2. Training workflow

Following other teams’ decisions from previous editions of the laboratory, we augmented the provided corpus with posts from Reddit. However, this time, we did not use the provided corpus for the initial training and validation of the models. The generated Reddit corpus was used to select the models and their hyper-parameters. Once we had selected the best models, we evaluated them on the provided corpus. Note that these datasets were used for testing in previous editions of the laboratory, thus we were able to compare the performance. Finally, before we started processing the writings, we retrained the models with the obtained hyper-parameters using all the available datasets. In order to speed up the training process and to simulate the laboratory pipeline, a mock server was developed4. This server replicates the GET/POST behaviour of the eRisk laboratory. That is, a team can ask for new writings using the same GET request structure as the one used during the laboratory. Similarly, the same POST request structure can be used to send the team’s response for each run. Besides this, the mock server can:

• Manage teams: create, list, and get information about a team.
• Show a separation plot for a given team and time.
• Plot the user score evolution for a given team and run.
• Plot the elapsed times of teams.
• Show the table with the results of all finished experiments.
• Plot the server’s elapsed times to answer the requests.

3 When the values given to the function grow beyond one, the function assigns most of the probability to the largest input. On the other hand, when the inputs tend to zero, the softmax function returns equal probabilities for all the inputs.
4 The source code and instructions to run the mock server are available at: https://github.com/jmloyola/erisk_mock_server.

Table 1
Details of the corpora used for Task T1: the different training and validation sets, as well as the test set used by the eRisk organizers to evaluate the participating models. The number of users (total, positives, and negatives) and the number of posts of each corpus are reported. The median, minimum, and maximum number of posts per user and words per post in each corpus are detailed.

Corpus          #users                  #posts       #posts per user     #words per post
                Total    Pos     Neg                 Med   Min    Max    Med   Min    Max
T1_test         2,079     81   1,998   1,177,590     297     3  2,001     11     0  6,728
T1_valid        2,348    164   2,184   1,130,799     244    10  2,001     11     1  8,241
T1_redd_train   1,746    286   1,460     158,924      51    31  1,188     20     1  7,479
T1_redd_valid   1,746    286   1,460     161,204      53    31  1,337     20     1  3,234

2.3. Speed up runs

One of the concerns about our participation last year was the time it took our models to process the writings. Input/output operations to communicate with the laboratory server caused part of this delay. Each model had to wait for the previous one to finish sending its responses before starting to process the input. Therefore, this year we processed each writing concurrently, which allowed us to reduce the total processing time. That is, some models could already be processing the input while others could still be sending data and/or waiting for server responses.
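A minimal sketch of this concurrent scheme (hypothetical names; in the real runs the awaited operations are GET/POST round trips to the laboratory server):

```python
import asyncio

async def process_run(model_name, writings):
    """Hypothetical per-run coroutine: while one run awaits I/O
    (simulated here with sleep), the other runs keep making progress."""
    responses = []
    for w in writings:
        await asyncio.sleep(0)  # stand-in for a GET/POST round trip to the server
        responses.append(f"{model_name}:{w}")
    return responses

async def main():
    writings = ["post-1", "post-2"]
    # All runs are scheduled concurrently instead of one after the other.
    runs = [process_run(m, writings) for m in ("UNSL#0", "UNSL#1", "UNSL#2")]
    return await asyncio.gather(*runs)

results = asyncio.run(main())
```

With `asyncio.gather`, the total wall-clock time is dominated by the slowest run rather than by the sum of all runs' I/O waits.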
We used the asyncio5 Python package to implement this behaviour.

3. Task T1: Early Detection of Signs of Pathological Gambling

In this section, the details of our participation in eRisk’s early detection of pathological gambling task are given. Namely, the details of the datasets and the five models submitted to this challenge are introduced. Finally, the results obtained after the evaluation stage are shown.

3.1. Datasets

For Task T1, the eRisk organizers provided a corpus to train, validate or test the participating models. The corpus was made available as a set of XML files, one for each user. In our case, we used the provided corpus to compare our results with last edition’s models, thus we initially did not train or validate our models using it. In order to train and validate our models, a complementary corpus was built using data from Reddit following the same steps as last year [7]. This corpus was split into a training and a validation set, each containing half of the users. Finally, once the best hyper-parameters had been found using the generated datasets, all the datasets (including the one provided by the organizers) were used to retrain the models before deploying them. Table 1 shows the details of each complementary corpus along with the validation and test datasets provided for this task. In this table, “T1_test” refers to the test set used to evaluate all participating models, “T1_valid” to the validation set provided by the organizers, and “T1_redd_train” and “T1_redd_valid” to the training and validation sets built using Reddit.

5 https://docs.python.org/3/library/asyncio.html

Note that the corpus used to evaluate the participating models this year was more unbalanced than last year’s, probably making this year’s task more difficult to address: in the provided corpus, 6.9% of the users are positive, whereas in the corpus used to evaluate the participating models, only 3.8% are.
On the other hand, T1_redd_train and T1_redd_valid had a much lower number of total posts and posts per user than the datasets used by the organizers. Finally, T1_test contained empty posts (without words), which could be due to users who edited their posts after posting, deleting their content.

3.2. Models

This section describes the details of the models used by our team to tackle this task. Namely, from the results obtained after the model selection and hyper-parameter optimization stage, the following five models were selected for participation:

UNSL#0. An EarlyModel with a bag-of-words (BoW) representation and a logistic regression classifier. Word unigrams were used for the BoW representation with term frequency times inverse document frequency (commonly known as tf-idf) as the weighting scheme. For the logistic regression, a balanced weighting for the classes was used, that is, each input was weighted inversely proportionally to its class frequency in the input data. Finally, for the decision-making policy, a SimpleStopCriterion with threshold 𝛿 = 0.7 and minimum number of posts 𝑛 = 10 was used.

UNSL#1. An EarlyModel with a BoW representation and a support vector machine (SVM) classifier. For the BoW representation, character 4-grams were used with tf-idf as the weighting scheme. The support vector machine was parameterized with a radial basis function kernel with gamma=“scale” and regularization parameter 𝐶 = 2, with classes weighted inversely proportionally to their frequencies in the input data. Finally, for the decision-making policy, a SimpleStopCriterion with threshold 𝛿 = 0.7 and minimum number of posts 𝑛 = 10 was used.

UNSL#2. An EarlyModel based on BERT with an extended vocabulary. For the fine-tuning process, the following parameters were used: architecture=BERT-base-uncased6, optimizer=Adam, learning_rate=3e-5, batch_size=8, and n_epochs=3.
Also, 25 new words were added to BERT’s vocabulary [11] by applying the following process: an SS3 model was trained on all available data (Reddit and eRisk 2021’s datasets), then words were ordered according to their confidence values for the positive class, and finally, the top-25 most important words were extracted. Finally, the decision policy HistoricStopCriterion was applied with a threshold 𝛿 = 0.7, a minimum number of posts 𝑛 = 10, and considering the last 𝑚 = 10 previous predictions.

6 https://huggingface.co/bert-base-uncased

UNSL#3. An SS3 model7 with a policy value of 𝛾 = 2.5 and the normalization N1.

UNSL#4. An EARLIEST model with a doc2vec representation. Each post was represented as a 300-dimensional vector. To learn this representation, the Reddit training corpus T1_redd_train was used. The base recurrent neural network chosen was a one-layer LSTM with an input feature dimension of 300 and 256 hidden units. The discriminator of the EARLIEST model reduced the hidden state of the LSTM to two dimensions representing the probabilities of the positive and negative classes. Finally, the value of lambda used for training was 𝜆 = 1e-4.

3.3. Results

Table 2 shows the results obtained for the decision-based performance metrics. As can be seen, our team achieved the best performance in terms of the ERDE50 measure. Besides, considering all teams, we obtained the third-best team performance for the ERDE5, 𝐹1, and 𝐹latency measures. Despite not having obtained the best results, our team achieved competitive performance in most metrics, exceeding the average level among all teams for this edition. Analyzing the performance of our team, the EarlyModels (UNSL#0, #1, and #2) reached the best score in the ERDE50 measure, but only UNSL#1 could stand out, achieving the best 𝐹1 and 𝐹latency scores. However, in ERDE5, these models showed a lower performance.
The SS3 model (UNSL#3) proved to be competitive, achieving acceptable performance in all metrics and, in particular, obtaining the best results for ERDE5. Finally, the EARLIEST model (UNSL#4) did not perform well on this task. On the other hand, Table 3 shows the results obtained for the performance metrics based on rankings. Our team achieved the best performance in most metrics with respect to the different rankings used for the evaluation. In particular, the EarlyModels (UNSL#0 and UNSL#1) were able to stand out from the other models presented by our team. The only exceptions were NDCG@100 with 1 and 500 posts, where the results were very close to the best of the competition: for 1 post, the best was 0.76 by UNED-NLP#2 and our best model obtained 0.70; and for 500 posts, the best result was 0.95 by UNED-NLP#4 and our best model got 0.93. It is also interesting to note that most of the NDCG@100 values improved as more writings were processed. In early classification problems, this is usually common since the models’ accuracy improves as the size of the input increases. However, an excessive delay in classification can become a problem, as it could put people’s lives at risk. An early decision could be made considering less data, but still, accuracy remains key to the final performance of the models. This behavior is not present for the ranking measures that only consider the top 10 scores (P@10 and NDCG@10) because their reduced sample size makes them more unstable. One sample could bias the final measure, unlike when 100 samples are used.

7 SS3 models were coded in Python using the “PySS3” package [12] (https://github.com/sergioburdisso/pyss3).

Table 2
Decision-based evaluation results for Task T1. The best teams taking into account the ERDE5, ERDE50, and 𝐹latency are shown (values in bold), as well as the median and mean values of the results reported for CLEF eRisk 2022.
Model        P      R      𝐹1     ERDE5   ERDE50   latencyTP   speed   𝐹latency
UNSL#0       0.401  0.951  0.564  0.041   0.008      11.0      0.961    0.542
UNSL#1       0.461  0.938  0.618  0.041   0.008      11.0      0.961    0.594
UNSL#2       0.398  0.914  0.554  0.041   0.008      12.0      0.957    0.531
UNSL#3       0.365  0.864  0.513  0.017   0.009       3.0      0.992    0.509
UNSL#4       0.052  0.988  0.100  0.051   0.030       5.0      0.984    0.098
UNED-NLP#4   0.809  0.938  0.869  0.020   0.008       3.0      0.992    0.862
SINAI#1      0.575  0.802  0.670  0.015   0.009       1.0      1.000    0.670
BLUE#0       0.260  0.975  0.410  0.015   0.009       1.0      1.000    0.410
Mean         0.223  0.846  0.279  0.034   0.021       4.8      0.985    0.297
Median       0.116  0.963  0.205  0.037   0.015       2.8      0.993    0.211

Table 3
Ranking-based evaluation results for Task T1. Results are reported according to the three ranking metrics obtained after processing 1, 100, 500, and 1000 posts, respectively.

Ranking      Metric      UNSL#0  UNSL#1  UNSL#2  UNSL#3  UNSL#4
1 post       P@10         1.00    1.00    0.90    1.00    0.10
             NDCG@10      1.00    1.00    0.90    1.00    0.07
             NDCG@100     0.68    0.70    0.66    0.69    0.32
100 posts    P@10         1.00    1.00    1.00    0.60    0.10
             NDCG@10      1.00    1.00    1.00    0.58    0.07
             NDCG@100     0.90    0.90    0.77    0.72    0.32
500 posts    P@10         1.00    1.00    0.90    0.80    0.20
             NDCG@10      1.00    1.00    0.92    0.81    0.13
             NDCG@100     0.93    0.92    0.78    0.77    0.33
1000 posts   P@10         1.00    1.00    0.90    0.80    0.30
             NDCG@10      1.00    1.00    0.90    0.81    0.22
             NDCG@100     0.95    0.93    0.77    0.78    0.37

4. Task T2: Early Detection of Depression

In this section, the details of our participation in eRisk’s early detection of depression task are given. Namely, the details of the datasets and the five models submitted to this challenge are introduced. Finally, the results obtained after the evaluation stage are shown.

4.1. Datasets

For Task T2, the eRisk organizers provided datasets to train and validate the participating models. Each corpus was made available as a set of XML files, one for each user. Part of the supplied training corpus was used to train the decision policy LearnedDecisionTreeStopCriterion.
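The training procedure for this policy, group k-fold cross-validation so that all of a user's samples share a fold, combined with a grid search over tree hyper-parameters (Section 2.1), can be sketched with scikit-learn on synthetic data (the data and the parameter grid here are illustrative, not the actual training setup):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, GroupKFold
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Synthetic stand-ins: one row per (user, time) sample, 9 features as in the paper.
X = rng.random((200, 9))
y = rng.integers(0, 2, 200)
groups = np.repeat(np.arange(20), 10)   # 20 users, 10 samples each

# GroupKFold guarantees that all samples of a given user fall in the same fold,
# so no user appears in both the training and the validation split.
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={
        "class_weight": ["balanced"],
        "criterion": ["gini", "entropy"],
        "max_depth": [2, 4, 8],
    },
    cv=GroupKFold(n_splits=5),
)
grid.fit(X, y, groups=groups)
best_tree = grid.best_estimator_
```

Passing `groups` to `fit` is what routes the user identifiers into the `GroupKFold` splitter.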
The provided validation corpus was used to compare our results with the models from eRisk 20188.

Table 4
Details of the corpora used for Task T2: the different training and validation sets, as well as the test set used by the eRisk organizers to evaluate the participating models. The number of users (total, positives, and negatives) and the number of posts of each corpus are reported. The median, minimum, and maximum number of posts per user and words per post in each corpus are detailed.

Corpus          #users                  #posts       #posts per user       #words per post
                Total    Pos     Neg                 Med      Min    Max    Med   Min    Max
T2_test         1,400     98   1,302     898,326     457.0      6  2,000     12     0  8,009
T2_train          887    135     752     531,394     321.0     10  2,000     13     1  7,450
T2_valid          820     79     741     545,188     411.5     10  2,000     13     1  7,280
T2_redd_train   1,056    499     557     142,059      66.0     31  2,282     21     1  6,792
T2_redd_valid   1,057    500     557     130,534      61.0     31  2,220     20     1  6,629

In order to train and validate our models, a complementary corpus was built using data from Reddit following the same steps as last year [7]. This corpus was split into a training and a validation set, each containing half of the users. Next, the best hyper-parameters were found using the generated datasets and, finally, all the datasets (including the ones provided by the organizers) were used to retrain the models before deploying them. Table 4 shows the details of each complementary corpus along with the validation and test datasets provided for this task. In this table, “T2_test” refers to the test set used to evaluate all participating models, “T2_train” and “T2_valid” to the training and validation sets provided by the organizers, and “T2_redd_train” and “T2_redd_valid” to the training and validation sets built using Reddit. As with Task T1, the corpus used to evaluate the participating models was more unbalanced than in previous years: in the provided training and validation sets, 15.21% and 9.63% of the users were positive, respectively, whereas in the test set only 7% were.
On the other hand, the number of total posts and posts per user in T2_redd_train and T2_redd_valid was notably smaller than in T2_test (used for evaluation). Finally, as with T1_test, T2_test contained empty posts, that is, posts with no words in them.

4.2. Models

This section describes the details of the models used by our team to address this task. Namely, from the results obtained after model selection and hyper-parameter optimization, the following five models were selected for participation:

UNSL#0. An EarlyModel with a latent semantic analysis representation and a logistic regression classifier. Fifty factors were used to represent each user, transforming the input and projecting it into the space generated by the singular value decomposition algorithm trained on T2_redd_train. For the logistic regression, a balanced weighting for the classes was used, that is, each input was weighted inversely proportionally to its class frequency in the input data. Finally, for the decision-making policy, a LearnedDecisionTreeStopCriterion was used. The first levels of the obtained decision tree are shown in Figure 1 and the full decision tree is shown in Figure 2. The best hyper-parameters found after grid search were: class_weight=“balanced”, criterion=“entropy”, max_depth=4, min_samples_leaf=1, and splitter=“best”.

8 The last year the task of early detection of depression was evaluated was 2018.

Figure 1: First levels of the decision tree classifier learned for task T2 in the context of the LearnedDecisionTreeStopCriterion decision policy.

UNSL#1. An EarlyModel with a BoW representation and an SVM classifier. For the BoW representation, character 3-grams were used with tf-idf as the weighting scheme. The support vector machine was parameterized with a radial basis function kernel with gamma=“scale” and regularization parameter 𝐶 = 8.
Finally, for the decision-making policy, a SimpleStopCriterion with threshold 𝛿 = 0.7 and minimum number of posts 𝑛 = 10 was used.

UNSL#2. An SS3 model with a policy value of 𝛾 = 2.5 and the normalization N1.

UNSL#3. An SS3 model with a policy value of 𝛾 = 2 and the normalization N2.

UNSL#4. An EARLIEST model with the same hyper-parameters and structure as the one used in UNSL#4 for Task T1. In this case, to learn the doc2vec representation, T2_redd_train was used.

4.3. Results

Table 5 shows the results obtained for the decision-based performance metrics. As can be seen, our team achieved the best performance in terms of the ERDE50 measure, together with SCIR2. Besides, considering all teams, we obtained the second-best team performance for the ERDE5 measure and we were the third-best team for 𝐹latency. If we compare our models’ performance, we can see that UNSL#2 (SS3) outperformed the rest in almost every metric. Looking at the performance of both SS3 models (UNSL#2 and UNSL#3), we can see that the choice of normalization step and the value of 𝛾 play a critical role. However, considering the EarlyModels, UNSL#0 and UNSL#1 achieved measures barely over the median of all teams. Finally, the EARLIEST model (UNSL#4) did not perform well on this task either.

Table 5
Decision-based evaluation results for Task T2. The best teams taking into account the ERDE5, ERDE50, and 𝐹latency are shown (values in bold), as well as the median and mean values of the results reported for CLEF eRisk 2022.

Model               P      R      𝐹1     ERDE5   ERDE50   latencyTP   speed   𝐹latency
UNSL#0              0.161  0.918  0.274  0.079   0.042      14.5      0.947    0.260
UNSL#1              0.310  0.786  0.445  0.078   0.037      12.0      0.957    0.426
UNSL#2              0.400  0.755  0.523  0.045   0.026       3.0      0.992    0.519
UNSL#3              0.144  0.929  0.249  0.055   0.035       3.0      0.992    0.247
UNSL#4              0.080  0.918  0.146  0.099   0.074       5.0      0.984    0.144
NLPGroup-IISERB#0   0.682  0.745  0.712  0.055   0.032       9.0      0.969    0.690
LauSAn#4            0.201  0.724  0.315  0.039   0.033       1.0      1.000    0.315
SCIR2#3             0.316  0.847  0.460  0.079   0.026      44.0      0.834    0.383
Mean                0.200  0.730  0.278  0.068   0.048      22.8      0.922    0.288
Median              0.149  0.847  0.249  0.070   0.041       6.0      0.981    0.256

On the other hand, Table 6 shows the results obtained for the performance metrics based on rankings. Here, our team did not accomplish a performance as good as with task T1. The EarlyModel UNSL#1 was the only one to reach the best performance among all teams when reading one post for P@10 and NDCG@10. UNSL#2 came close to this model, surpassing it in some metrics when reading more than 500 posts. The rest were not able to achieve good enough results. In this task, we observe the same behavior with NDCG@100 as with task T1: when the number of posts processed increased, the ranking measure improved. EARLIEST was the only model that did not show this behavior, which could be related to the model underfitting the data.

Table 6
Ranking-based evaluation results for Task T2. Results are reported according to the three ranking metrics obtained after processing 1, 100, 500, and 1000 posts, respectively.

Ranking      Metric      UNSL#0  UNSL#1  UNSL#2  UNSL#3  UNSL#4
1 post       P@10         0.60    0.80    0.70    0.10    0.10
             NDCG@10      0.40    0.88    0.68    0.06    0.12
             NDCG@100     0.36    0.46    0.50    0.15    0.05
100 posts    P@10         0.20    0.60    0.50    0.40    0.00
             NDCG@10      0.13    0.73    0.39    0.27    0.00
             NDCG@100     0.46    0.64    0.55    0.43    0.03
500 posts    P@10         0.30    0.60    0.70    0.30    0.20
             NDCG@10      0.28    0.73    0.73    0.21    0.19
             NDCG@100     0.43    0.66    0.61    0.42    0.07
1000 posts   P@10         0.60    0.60    0.70    0.30    0.00
             NDCG@10      0.72    0.71    0.73    0.21    0.00
             NDCG@100     0.45    0.66    0.61    0.42    0.04

5. Conclusions

In this paper we briefly presented the models proposed for early risk detection by our team from Universidad Nacional de San Luis for tasks T1 and T2 of the eRisk 2022 laboratory. The differences with our previous participation are described in detail and the final results are presented.
Furthermore, we show a summary of the datasets used, both the provided and the generated ones. Even though we improved last year’s models, the performance obtained was not as good as in the last edition. Nonetheless, our team obtained the best score for ERDE50 for T1 and T2. Also, we were the third-best team with respect to the 𝐹latency metric for both tasks. Finally, considering the ranking-based metrics, our results for T1 were among the best of all the teams.

References

[1] J. Parapar, P. Martín-Rodilla, D. E. Losada, F. Crestani, Overview of eRisk 2022: Early risk prediction on the internet, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction. 13th International Conference of the CLEF Association, CLEF 2022, Springer, 2022.
[2] D. E. Losada, F. Crestani, J. Parapar, eRisk 2017: CLEF lab on early risk prediction on the internet: experimental foundations, in: International Conference of the Cross-Language Evaluation Forum for European Languages, Springer, 2017, pp. 346–360.
[3] D. E. Losada, F. Crestani, J. Parapar, Overview of eRisk: early risk prediction on the internet, in: International Conference of the Cross-Language Evaluation Forum for European Languages, Springer, 2018, pp. 343–361.
[4] J. Parapar, P. Martín-Rodilla, D. E. Losada, F. Crestani, Overview of eRisk 2021: Early risk prediction on the internet, in: International Conference of the Cross-Language Evaluation Forum for European Languages, Springer, 2021, pp. 324–344.
[5] D. E. Losada, F. Crestani, A test collection for research on depression and language use, in: Proc. of Conference and Labs of the Evaluation Forum (CLEF 2016), Evora, Portugal, 2016, pp. 28–39.
[6] F. Sadeque, D. Xu, S. Bethard, Measuring the latency of depression detection in social media, in: Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, 2018, pp. 495–503.
[7] J. M. Loyola, S. Burdisso, H. Thompson, L. Cagnina, M.
Errecalde, UNSL at eRisk 2021: A comparison of three early alert policies for early risk detection, in: Working Notes of CLEF 2021 – Conference and Labs of the Evaluation Forum, Bucharest, Romania, 2021.
[8] J. M. Loyola, M. L. Errecalde, H. J. Escalante, M. Montes-y Gómez, Learning when to classify for early text classification, in: Argentine Congress of Computer Science, Springer, 2017, pp. 24–34.
[9] T. Hartvigsen, C. Sen, X. Kong, E. Rundensteiner, Adaptive-halting policy network for early classification, in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019, pp. 101–110.
[10] S. G. Burdisso, M. Errecalde, M. Montes-y Gómez, 𝜏-SS3: A text classifier with dynamic n-grams for early risk detection over text streams, Pattern Recognition Letters 138 (2020) 130–137. doi:https://doi.org/10.1016/j.patrec.2020.07.001.
[11] W. Tai, H. Kung, X. L. Dong, M. Comiter, C.-F. Kuo, exBERT: Extending pre-trained models with domain-specific vocabulary under constrained training resources, in: Findings of the Association for Computational Linguistics: EMNLP 2020, 2020, pp. 1433–1439.
[12] S. G. Burdisso, M. Errecalde, M. Montes-y Gómez, PySS3: A python package implementing a novel text classifier with visualization tools for explainable AI, arXiv preprint arXiv:1912.09322 (2019).

Figure 2: Decision tree classifier learned for task T2 in the context of the LearnedDecisionTreeStopCriterion decision policy.