NLP-UNED at eRisk 2024: Approximate Nearest Neighbors with Encoding Refinement for Early Detecting Signs of Anorexia
Notebook for the eRisk Lab at CLEF 2024

Hermenegildo Fabregat1,3, Daniel Deniz3, Andres Duque1,2,*, Lourdes Araujo1,2 and Juan Martinez-Romo1,2
1 NLP & IR Group, Dpto. Lenguajes y Sistemas Informáticos, Universidad Nacional de Educación a Distancia (UNED), Juan del Rosal 16, Madrid 28040, Spain
2 IMIENS: Instituto Mixto de Investigación, Escuela Nacional de Sanidad, Monforte de Lemos 5, Madrid 28019, Spain
3 Avature Machine Learning, Marqués de Valdeiglesias, 3, Madrid 28004, Spain


                                      Abstract
                                      This paper describes our participation in Task 2 (Early Detection of Signs of Anorexia) of the CLEF 2024 eRisk
                                      Workshop, which aims to detect early signs of anorexia in social media users through the analysis of their posts.
                                      A relabelling step based on Approximate Nearest Neighbors (ANN) is performed to generate a training dataset
                                      annotated at message level instead of user level, and contrastive learning techniques are then applied to refine
                                      the previously generated vector representations of the messages. ANN is also used for classification,
                                      combined with rules and heuristics that expand the number of messages from
                                      the user considered when making the final decision. Our system obtains the best results in both the decision-based evaluation,
                                      with 9 percentage points over the second best system in terms of latency-weighted F1, and the ranking-based
                                      evaluation, with the best scores for 11 out of the 12 metrics employed.

                                      Keywords
                                      Early risk detection, Anorexia, Approximate Nearest Neighbors, Contrastive Learning


                         1. Introduction
                         In recent years, the analysis of social media for early detection of health risks has become an intriguing
                         and significant area of research. Within this research field, the eRisk workshop, part of the Conference
                         and Labs of the Evaluation Forum (CLEF) since 2017, has played a pivotal role. This workshop
                         fosters collaborative efforts to develop innovative methodologies and practical solutions for the early
                         identification of various health concerns, including eating disorders, self-harm, pathological gambling
                         and depression, through the analysis of textual content on social media platforms. By analyzing social
                         media posts and messages, researchers can obtain valuable insights to identify individuals at risk.
                            This paper details our approach to tackling Task 2 of the eRisk 2024 Workshop [1, 2]: Early Detection
                         of Signs of Anorexia. In this task, systems must sequentially process messages posted by different
                         users in Reddit forums, searching for early traces of anorexia, that is, detecting as soon as possible
                         whether a user is at risk of suffering from anorexia. The task is a continuation of Task 2 of the eRisk
                         2018 Workshop [3] and Task 1 of the eRisk 2019 Workshop [4].
   Building upon our previous work in the detection of pathological gambling [5, 6, 7], we have refined
our system by incorporating contrastive learning techniques for fine-tuning the encoded representations
of text messages written by the analyzed users. Additional heuristics have also been included in the
system in order to expand the context of the user’s messages, thus taking into account a larger
number of previous messages when making the final decision on whether the user is at risk. These
improvements have proven to enhance the system’s accuracy and reliability in detecting potential cases
of anorexia from social media content.

CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
* Corresponding author.
gildo.fabregat@lsi.uned.es (H. Fabregat); daniel.deniz@avature.es (D. Deniz); aduque@lsi.uned.es (A. Duque);
lurdes@lsi.uned.es (L. Araujo); juaner@lsi.uned.es (J. Martinez-Romo)
0000-0001-9820-2150 (H. Fabregat); 0000-0002-0313-2127 (D. Deniz); 0000-0002-0619-8615 (A. Duque);
0000-0002-7657-4794 (L. Araujo); 0000-0002-6905-7051 (J. Martinez-Romo)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

   The rest of the paper is structured as follows: Section 2 reviews previous research
on early risk detection, as well as systems participating in previous eRisk competitions.
A brief description of the addressed task, together with the dataset and evaluation metrics involved, is presented
in Section 3. The different components of the proposed system are described in Section 4, and the
results obtained by this system are shown and analyzed in Section 5. Finally, Section 6 presents some
conclusions about the work, together with possible future lines of work regarding this research.


2. Related Work
The automatic detection of mental health issues is currently a hot research topic within machine
learning, specifically regarding natural language processing. The availability of information sources
with large amounts of data, such as social media, is enabling the development of new systems aimed
at the early detection of these types of issues. Within this context, different evaluation frameworks
and campaigns such as CLEF’s eRisk [8], CLPsych [9] or IberLEF’s MentalRiskES [10, 11] represent a
significant effort by the scientific community to support the development and dissemination of these
types of systems.
   Anorexia nervosa (AN) is a severe eating disorder characterized by an inability to maintain a healthy
body weight, often falling below 85% of the ideal weight. Individuals with AN obsess over weight gain,
perceive their bodies as larger than they are, and engage in behaviors to sustain weight loss. This illness
profoundly affects both mind and body, with sufferers placing significant importance on their shape
and weight, intertwining their self-esteem with their body image [12]. The 2018 and 2019 CLEF eRisk
competitions addressed the automatic detection of signs of anorexia in Social Media posts, encouraging
the participating systems to develop techniques for determining whether a user can be classified as at
risk of suffering from this illness. Although the stage of development of neural models was nowhere
near the current level when the last edition of this task was held (2019), some of the best participating
systems at that time used such models for their predictions. An ensemble approach with different neural
attention-based models is used in [13] for feature extraction, and then combined with Support Vector
Machines to determine the final decision. Deep learning models are also used in [14] for developing
a time series dataset representing the evolution of the user’s mood through time. Then, Bayesian
inference is employed for performing the final classification. Other approaches obtained good results
in the competition by using more classic machine learning methods such as statistical word-based
techniques [15], or Support Vector Machines with customized feature sets based on emotions derived
from the text [16] or content-based features from phrases with personal pronouns [17]. In general, and
also based on the results obtained by our own participations in early risk detection tasks, systems not
relying on deep learning techniques or large language models are also able to achieve good results [7].
   Contrastive learning techniques can be defined as methods that aim to learn and refine effective
representations of data by pulling semantically close neighbors together and pushing dissimilar ones
apart [18]. One of the most important characteristics of contrastive learning is that the model learns by
comparison, that is, the instances whose representations are to be refined do not need to be
accompanied by their corresponding labels. Instead, these approaches only need to define the similarity
distribution. This way, the model should learn to map together similar instances, while separating
dissimilar instances in the embedding space [19]. These techniques have been successfully applied to
computer vision problems [20] and natural language processing tasks [21], as well as to other domains
such as audio or reinforcement learning [22]. Considering our system presented in previous eRisk
competitions, based on approximate nearest neighbors with vector representations of text messages,
exploring these techniques seems like a logical step for its improvement.


3. Task 2: Early Detection of Signs of Anorexia
As previously mentioned, we have participated in Task 2 of the eRisk 2024 competition, denoted “Early
Detection of Signs of Anorexia”. In this task, participants have access to a training dataset containing
the whole history of writings (Reddit posts) for a set of users. These users are annotated depending on
whether they have explicitly mentioned having been diagnosed with anorexia (positive users) or not
(negative or control users). In the test stage, systems are asked to determine, as soon as possible, whether
a new user is at risk of suffering from anorexia according to the user’s writing history. In particular, for
each new message of a user, systems must determine whether the user is positive or negative. Once
a user is labelled as positive, the decision is considered to be final, and hence all subsequent labels
assigned to this user are ignored. Systems must also assign, after each message, a score measuring the
user’s risk of suffering from anorexia. This score is considered for evaluation purposes even after a user
has been labelled as positive.
   The statistics of the test dataset used for evaluating systems participating in this task are shown in
Table 1:

Table 1
Main statistics of test collection for task 2: Early detection of signs of anorexia.
                                                                           Anorexia    Control
                  Num. subjects                                               92         692
                  Num. submissions (posts & comments)                       28,043     338,843
                  Avg num. of submissions per subject                        304.8      489.6
                  Avg num. of days from first to last submission            ≈ 482       ≈ 971
                  Avg num. words per submission                              28.5        21.4

  System evaluation is conducted using two different paradigms: decision-based evaluation and ranking-
based evaluation. Complete information about the employed metrics can be found in [23].

    • Decision-based evaluation: This type of evaluation only attends to the label assigned by the
      system to each user (positive or negative), as well as the delay in determining that a positive user
      is indeed at risk of suffering from anorexia. For this aim, standard metrics used for classification
      such as precision, recall and F-Measure are combined with metrics that take into account this
      delay information. The early risk detection error metric ERDE [24] is also used, although its
      values have low interpretability. To overcome this, other metrics regarding the latency and speed
      of detecting true positives are also proposed, and a final latency-weighted F1 measure is computed
      by weighting the F-Measure with these delay-related metrics.
    • Ranking-based evaluation: The score assigned to each user by the system, after analyzing each
      received message, is used in this evaluation for computing ranking-based metrics. That is, users
      are ranked after 𝐾 messages according to this score, and then standard ranking metrics such as
      𝑃@𝐾 and 𝑁𝐷𝐶𝐺@𝐾 are applied for measuring the performance of the systems.
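
To make the latency-weighted F1 more concrete, the following minimal sketch assumes the definition commonly used in eRisk evaluations (the complete definitions are given in [23]): a true positive detected after reading k messages receives a penalty that grows with k, the speed is one minus the median penalty, and F1 is multiplied by this speed. The growth rate `P_DEFAULT = 0.0078` and the example delays are assumptions introduced for illustration; they are not taken from this paper.

```python
import math
from statistics import median

P_DEFAULT = 0.0078  # assumed penalty growth rate, not stated in this paper


def latency_penalty(k, p=P_DEFAULT):
    """Penalty in [0, 1) for a true positive detected after reading k messages."""
    return -1.0 + 2.0 / (1.0 + math.exp(-p * (k - 1)))


def latency_weighted_f1(f1, true_positive_delays, p=P_DEFAULT):
    """Scale F1 by the speed factor, i.e. one minus the median penalty."""
    speed = 1.0 - median(latency_penalty(k, p) for k in true_positive_delays)
    return f1 * speed


# Hypothetical delays (in messages) whose median is 14.
score = latency_weighted_f1(0.79, [3, 14, 40])
```

Under these assumptions, a decision taken on the very first message carries no penalty, so the latency-weighted F1 equals the plain F1 only for systems that alert immediately.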

   Finally, the time taken by the system to process the whole test dataset is also
measured and reported, in order to illustrate the efficiency of the proposed systems.


4. Proposed System
The system developed for performing early detection of signs of anorexia is presented in this Section.
In particular, the different components that constitute the complete system pipeline are enumerated
and described in detail. The main differences with the original research, based on dataset relabelling
and approximate nearest neighbors techniques, presented in [5], are the use of a contrastive learning
technique for fine-tuning the embedding representations of the user’s messages (Section 4.3), as well
as the development of a set of heuristics for considering previous messages for the final classification,
instead of only taking into account the last message received (Section 4.4).

4.1. Data representation
The encoder used in this work for obtaining embeddings representing each of the messages of a
particular user is the Universal Sentence Encoder [25]. Through its use, all messages in the training
dataset are transformed into 512-dimensional embeddings. The specific model used in the encoding is
based on a Deep Average Network (DAN) [26], trained on different sources of data written in English,
and normally used for generating vector representations of texts longer than words, i.e., sentences,
phrases or short paragraphs.

4.2. Relabelling process
The relabelling process has been described in previous works [5, 7]. Its main objective is to generate
a training dataset labelled at message-level, starting from the user-level annotation provided by the
organizers. The intuition behind this decision, already tested in previous eRisk competitions devoted to
detecting pathological gambling, is that message-level annotations can help the system to emit accurate
alerts about the risk of a user of suffering from anorexia by analyzing the user’s individual messages.
   In this stage a technique for generating indexes based on approximate nearest neighbors (ANN) is
applied, this way creating a data structure that allows us to obtain the 𝑁 most similar messages to
a specific one. Two different ANN approaches have been explored in this work. The first, Annoy [27], is
a partitioning method that recursively divides the search space using hyperplanes with
random directions. The generated index has the shape of a binary tree, from which the most
similar elements to a query can be easily retrieved. The second, the Hierarchical Navigable
Small World (HNSW) method, implemented by the Non-Metric Space Library (NMSLIB) [28], is a
graph-based ANN technique. In this case, the search index has the form of a proximity graph in which
nodes correspond to particular instances (in our case, messages), and edges define the neighborhood
relationship. The main idea behind the use of this technique is that a neighbor’s neighbor is likely to
also be a neighbor of a particular instance. Nearest neighbor retrieval is then performed by using a
best-first search strategy on the graph.
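
The partitioning idea behind tree-based indexes can be illustrated with a simplified, pure-Python sketch. It is not the Annoy library nor the index used in our system: it builds a single tree, chooses each hyperplane as the one equidistant from two randomly sampled points, and searches only one leaf with Euclidean distance. The names `build_tree` and `query`, the leaf size and the random toy data are assumptions made for this example.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def build_tree(vectors, ids, rng, leaf_size=4):
    """Recursively split ids with a hyperplane equidistant from two random points."""
    if len(ids) <= leaf_size:
        return {"leaf": ids}
    a, b = (vectors[i] for i in rng.sample(ids, 2))
    normal = [x - y for x, y in zip(a, b)]
    offset = dot(normal, [(x + y) / 2 for x, y in zip(a, b)])
    left = [i for i in ids if dot(normal, vectors[i]) <= offset]
    right = [i for i in ids if dot(normal, vectors[i]) > offset]
    if not left or not right:  # degenerate split, e.g. duplicated points
        return {"leaf": ids}
    return {"normal": normal, "offset": offset,
            "left": build_tree(vectors, left, rng, leaf_size),
            "right": build_tree(vectors, right, rng, leaf_size)}


def query(tree, vectors, q, n):
    """Descend to a single leaf and rank its members by distance to q."""
    node = tree
    while "leaf" not in node:
        side = "left" if dot(node["normal"], q) <= node["offset"] else "right"
        node = node[side]
    return sorted(node["leaf"], key=lambda i: math.dist(vectors[i], q))[:n]


# Toy data: 200 random 8-dimensional vectors standing in for message embeddings.
rng = random.Random(0)
points = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(200)]
tree = build_tree(points, list(range(len(points))), rng)
neighbors = query(tree, points, points[17], n=3)
```

Because only one leaf is inspected, true neighbors lying on the other side of a hyperplane can be missed, which is why the results are approximate; practical libraries reduce this effect, for instance by combining several trees.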
   Once the selected index has been built on all the messages composing the training dataset, we
are able to retrieve all the desired nearest neighbors given a particular message. In the first iteration of
the relabelling process, all messages are labelled as belonging to the same class (positive or negative)
as the user that created them. Then, for each positive message 𝑀 in the training dataset, a set of its
𝐾 nearest neighbors is retrieved from the index. The message will be relabelled as negative only if at
least 𝐽 of those 𝐾 nearest messages belong to the negative class. In our implementation, only positive
messages can be relabelled as negative. This is due to the fact that only positive users can have negative
messages, because if negative users had any positive message they would have been labelled as positive.
Only messages containing title information, that is, messages representing the opening of a Reddit
thread, are taken into account for generating our training dataset. This filtering allows us to focus
on discussions originally initiated by the analyzed user, which are more likely to contain information
about particular worries or calls for help from the user. Moreover, this also reduces the computational
complexity of the system, while the final results do not significantly differ from those obtained by using
the complete set of messages. The relabelling step is iteratively repeated until convergence is reached,
that is, until no new relabellings are made during an iteration. A random sample of 33% of the users in the
original training dataset is employed for validation purposes, allowing us to explore the optimal values
of the 𝐾 and 𝐽 parameters. Through this validation step, these values have been set to 𝐾 = 10 and
𝐽 = 6.
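
The relabelling loop described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than our actual implementation: an exact cosine-distance search replaces the ANN index, a message is excluded from its own neighbor list, and labels are updated from a snapshot taken at the start of each iteration. The function names and the toy vectors are hypothetical.

```python
import math


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def nearest_neighbors(vectors, i, k):
    """Exact k nearest neighbors of message i (stand-in for the ANN index)."""
    others = [m for m in range(len(vectors)) if m != i]
    return sorted(others, key=lambda m: cosine_distance(vectors[i], vectors[m]))[:k]


def relabel(vectors, user_labels, k=10, j=6):
    """Flip positive messages to negative until an iteration changes nothing."""
    labels = list(user_labels)  # first iteration: message label = user label
    neighbors = [nearest_neighbors(vectors, i, k) for i in range(len(vectors))]
    changed = True
    while changed:
        changed = False
        snapshot = list(labels)
        for i, label in enumerate(snapshot):
            negatives = sum(1 for m in neighbors[i] if snapshot[m] == 0)
            if label == 1 and negatives >= j:  # only positives can be relabelled
                labels[i] = 0
                changed = True
    return labels


# Toy example: message 6 comes from a positive user but resembles control messages.
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.95, 0.05],
               [0.0, 1.0], [0.1, 0.9], [0.05, 0.95], [0.12, 0.88]]
toy_user_labels = [1, 1, 1, 0, 0, 0, 1]
toy_result = relabel(toy_vectors, toy_user_labels, k=3, j=2)
```

The loop always terminates, since each iteration can only reduce the number of positive messages.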

4.3. Contrastive Learning
After completing the relabelling process, we propose an additional technique in the encoding step
of our system based on fine-tuning the generated embeddings representing the different messages.
This fine-tuning relies on a contrastive learning technique [29], a method employed for maximizing
the distance between embeddings of messages belonging to different classes and minimizing it when
the messages belong to the same class. In particular, in our system this is achieved by retraining the
Universal Sentence Encoder used for generating the initial representations of the messages. However,
during this retraining, we employ a particular type of loss function, known as triplet loss [30]. For each
message in the training dataset, either labelled as positive or negative, a triplet (𝑎, 𝑝, 𝑛) is created, where
𝑎 is the original message, 𝑝 is a message belonging to the same class, and 𝑛 is a message belonging to the
opposite class. The triplet loss function used in our retraining is ℒ = 𝑚𝑎𝑥(𝑑(𝑎, 𝑝) − 𝑑(𝑎, 𝑛) + 𝛼, 0),
where 𝑑 is a function measuring the distance between the generated embeddings. The distance function
employed for this work is cosine distance. This implies that the main aim of the training process will be
to minimize the distance between messages belonging to the same class and maximize the distance
between messages belonging to different classes. An additional parameter 𝛼 is included into the loss
function in order to determine the minimum desired distance between positive and negative instances,
considering 𝑎 as reference instance.
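
As a worked illustration of this loss, the following sketch evaluates it for hand-picked two-dimensional vectors using cosine distance. It only computes the loss value for a single triplet; the retraining of the encoder itself is not shown, and the example vectors are assumptions chosen so that the distances are easy to verify by hand.

```python
import math


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def triplet_loss(anchor, positive, negative, margin=0.15):
    """max(d(a, p) - d(a, n) + margin, 0) with cosine distance as d."""
    gap = cosine_distance(anchor, positive) - cosine_distance(anchor, negative)
    return max(gap + margin, 0.0)


anchor = [1.0, 0.0]
near = [0.8, 0.6]   # cosine distance 0.2 from the anchor
far = [0.6, 0.8]    # cosine distance 0.4 from the anchor

# Same-class instance already closer than the opposite-class one by more than the margin.
satisfied = triplet_loss(anchor, positive=near, negative=far)
# Same-class instance farther than the opposite-class one: the triplet is penalized.
violated = triplet_loss(anchor, positive=far, negative=near)
```

In the first case the loss is zero, so the triplet contributes no gradient; in the second case the loss is positive (0.4 − 0.2 + 0.15), which pushes the encoder to bring 𝑝 closer to 𝑎 and move 𝑛 away.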
   The main idea behind the contrastive learning process is illustrated in Figure 1.


Figure 1: Contrastive learning with triplet loss: training is oriented to maximizing the distance between same
class (𝑝) and opposite class (𝑛) instances with respect to a given anchor instance (𝑎).


  The different hyperparameters employed in the contrastive learning process are the following:
    • Number of instances: 20 triplets (𝑎, 𝑝, 𝑛) are generated for each message 𝑎, by randomly
      selecting positive (same class) and negative (opposite class) instances.
    • Batch size: Batch size value is 32.
    • Learning rate: The learning rate is set to 1𝑒−5 .
    • Epochs: The number of epochs is 4.
    • Steps per epoch: The number of steps per epoch is 128.
    • Margin: The triplet loss margin (𝛼) is set to 0.15 (normalized values are used for distances and
      margin).
   With this configuration, a maximum of 128*32 (steps per epoch times batch size) instances are fed
to the network in each epoch. This implies that 128*32*4 = 16,384 instances are used for training.
Hence, given the size of the training dataset, only a fraction of the generated triplets are effectively
used for training. Also, not all instances are seen the same number of times.
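
The triplet generation and the training budget can be sketched as follows. The sampling details are assumptions made for illustration (sampling with replacement, anchor excluded from its own same-class candidates, fixed random seed), and the name `sample_triplets` and the toy labels are hypothetical.

```python
import random


def sample_triplets(labels, per_anchor=20, seed=0):
    """Return (anchor, same-class, opposite-class) index triplets for every message."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    triplets = []
    for a, label in enumerate(labels):
        same = [i for i in by_class[label] if i != a]
        other = [i for lab, ids in by_class.items() if lab != label for i in ids]
        if not same or not other:
            continue  # no valid triplet can be built for this anchor
        for _ in range(per_anchor):
            triplets.append((a, rng.choice(same), rng.choice(other)))
    return triplets


STEPS_PER_EPOCH, BATCH_SIZE, EPOCHS = 128, 32, 4
budget = STEPS_PER_EPOCH * BATCH_SIZE * EPOCHS  # triplets consumed during training

toy_labels = [1, 1, 1, 0, 0, 0, 0]  # toy message-level labels after relabelling
toy_triplets = sample_triplets(toy_labels)
```

With 20 triplets per message, the number of generated triplets exceeds the budget as soon as the training set contains more than about 820 messages (16,384 / 20), which is consistent with only a fraction of the triplets being used.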

4.4. Final classification
Once the representation of the text messages has been refined using contrastive learning techniques, the
final classification step is somewhat similar to the relabelling process described in Section 4.2. However,
some additional heuristics have been added to this stage in order to consider more than one individual
message for determining whether a user is at risk of suffering from anorexia.
   Two new 𝐾 and 𝐽 parameters are calculated in this step for performing the final classification. Each
time a new message 𝑀 is received, the 𝐾 nearest neighbors are retrieved. If at least 𝐽 of those 𝐾
neighbors are positive, the message, and hence the user, is directly classified as positive. Through the
use of the aforementioned validation split, the values of these parameters have been set to 𝐾 = 19 and
𝐽 = 19 for the classification step.
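
The voting rule can be illustrated with the short sketch below, which takes the labels of the retrieved neighbors as input and therefore leaves the index lookup aside. The function names and the toy neighbor labels are assumptions made for this example.

```python
def is_positive(neighbor_labels, j=19):
    """A message is positive when at least j of its retrieved neighbors are positive."""
    return sum(1 for label in neighbor_labels if label == 1) >= j


def first_alert(neighbor_labels_per_message, j=19):
    """1-based position of the first message that triggers an alert, or None."""
    for position, neighbor_labels in enumerate(neighbor_labels_per_message, start=1):
        if is_positive(neighbor_labels, j):
            return position  # the decision is final once a user is labelled positive
    return None


# Toy stream with K = 19 neighbors per message.
stream = [
    [1] * 18 + [0],  # 18 positive neighbors: not enough when J = 19
    [1] * 19,        # unanimous vote: the user is flagged at this message
    [0] * 19,
]
alert_position = first_alert(stream)
```

Note that with 𝐾 = 19 and 𝐽 = 19 the rule amounts to a unanimous vote: a single negative neighbor is enough to prevent a direct positive classification.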
    As previously mentioned, we are also interested in analyzing whether the history of previous messages
from the user can be useful for performing a more accurate classification. With this purpose, we have
explored in more depth how assigning risk scores to the user after analyzing each message can affect
the final classification. Besides the classification of the user as positive or negative, and regarding the
ranking-based evaluation, a score is expected to be assigned to the user after receiving each message,
representing the user’s risk of suffering from anorexia. In our system, this score is computed by
calculating the average distance between a received message 𝑀 and all its nearest neighbors labelled as
positive, $\mathit{val} = \frac{1}{k} \sum_{x=1}^{k} \mathit{distance}(U_x, M)$, where $U_x$ is a message within the set of 𝐾 nearest neighbors
that is labelled as positive and $k$ is the number of such neighbors. The distance function employed returns values between 0 and 2, and hence
the scoring assigned to the user is 𝑠𝑐𝑜𝑟𝑒 = (2 − 𝑣𝑎𝑙). This way, a message really close to its positive
neighbors would receive a distance value of 𝑣𝑎𝑙 ≈ 0 and hence its score would be 𝑠𝑐𝑜𝑟𝑒 ≈ 2. This
score is calculated for test messages classified as positive, but also for those classified as negative, and a
buffer containing the scores of the 𝑁 previous messages from the user is stored. The buffer is originally
filled with zeros. Hence, if the system initially classifies a message as negative, the average score value
for the last 𝑁 messages is calculated, and the message (and user) will be classified as positive if this
average is over a particular threshold 𝑆. The optimal values of 𝑁 and 𝑆 (that is, the message window
considered and the score threshold) are also determined using the validation split and vary depending
on the submitted run (see Section 5.1).
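
The scoring and buffering heuristic can be sketched as follows. Several details are assumptions made for illustration, since they are not fixed by the description above: the buffer includes the score of the current message, a message with no positive neighbors receives a score of 0, and the average must be strictly greater than 𝑆. The class name `UserState` and the toy distances are hypothetical.

```python
from collections import deque


def risk_score(positive_neighbor_distances):
    """score = 2 - mean distance to the positive neighbors (distances lie in [0, 2])."""
    if not positive_neighbor_distances:
        return 0.0  # assumption: no positive neighbor means no evidence of risk
    val = sum(positive_neighbor_distances) / len(positive_neighbor_distances)
    return 2.0 - val


class UserState:
    """Keeps the scores of the last `window` messages of one user."""

    def __init__(self, window=7, threshold=1.0):
        self.scores = deque([0.0] * window, maxlen=window)  # buffer starts with zeros
        self.threshold = threshold
        self.alerted = False

    def update(self, knn_positive, positive_neighbor_distances):
        """Process one message; return (user flagged so far, score of this message)."""
        score = risk_score(positive_neighbor_distances)
        self.scores.append(score)
        if not self.alerted:
            average = sum(self.scores) / len(self.scores)
            self.alerted = knn_positive or average > self.threshold
        return self.alerted, score


# Toy run with N = 3 and S = 1.0; neither message passes the neighbor vote.
state = UserState(window=3, threshold=1.0)
first = state.update(False, [0.2, 0.4])  # buffer [0, 0, 1.7]: average below S
second = state.update(False, [0.3])      # buffer [0, 1.7, 1.7]: average above S
```

Since the buffer starts with zeros, at least ⌈𝑁/2⌉ high-scoring messages are needed before the average can exceed 𝑆 = 1.0, which limits alerts based on a single isolated message.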


5. Results and Discussion
The main results achieved by the proposed system are presented in this Section. Experiments using the
validation split are described first in order to justify the configurations selected for the submitted runs.
Only decision-based evaluation, and more particularly, latency-weighted F1 values, were taken into
account for tuning the hyperparameters through the validation split. Then, results obtained on the test
dataset by the 5 different configurations selected are shown.

5.1. Validation and Selected Runs
As previously mentioned, a random split of 33% of the users in the training dataset is employed for
validation purposes. Through these experiments we have confirmed that the use of the contrastive
learning technique is able to improve all previous results obtained when using the Universal Sentence
Encoder with no modifications for generating the embeddings. In particular, the latency-weighted F1
value of the best performing configuration that uses the original encoder is around 6% lower than the
best performing system in our validation process. For this reason, we decided to use the contrastive
learning encoder in all the submitted runs. In general, applying the relabelling method also improves
the results with respect to not using it (that is, labelling all messages from a positive user as positive and
all messages from a negative user as negative). However, we included a run that does not perform any
relabelling in the test configurations, in order to compare results. The remaining parameters (values 𝐾
and 𝐽 in either relabelling or classification, and values 𝑁 and 𝑆) have been adjusted by selecting the
best performing configurations in the validation phase. As already stated, values of 𝐾 = 10 and 𝐽 = 6
during relabelling and 𝐾 = 19 and 𝐽 = 19 during classification showed the best results in this stage.
   Table 2 shows the configurations of the proposed system, for each of the five runs allowed to be
submitted in the task.
   Column “ANN system” indicates the technique employed for building the nearest neighbor index:
Annoy or NMSLIB. The type of encoder employed is always the one that refines the Universal Sentence
Encoding with contrastive learning (CL_USE). Column “Relabel” indicates whether the relabelling step
has been followed or not, while column “Heuristics” shows values for parameters 𝑁 (window size)
and 𝑆 (decision threshold) in case the rules described in Section 4.4 have been employed, and “None”
otherwise. It can be noticed how the best value for parameter 𝑆 is always set to 1.0, that is, half the
maximum scoring value that the average score for the 𝑁 last messages can reach. Finally, we can
observe how the latency-weighted F1 metric is quite similar in this validation for all the proposed
configurations, except for R3, which does not include the relabelling step.

Table 2
Validation results: Configurations selected for the test phase.
 Run     ANN system        Encoder      Relabel      Heuristics         Latency-weighted F1 (validation)
  R0       Annoy           CL_USE        YES        𝑁 = 7, 𝑆 = 1.0                  0.6967
  R1       Annoy           CL_USE        YES            None                        0.6862
  R2       Annoy           CL_USE        YES        𝑁 = 5, 𝑆 = 1.0                  0.6863
  R3       Annoy           CL_USE         NO            None                        0.6506
  R4      NMSLIB           CL_USE        YES        𝑁 = 7, 𝑆 = 1.0                  0.6915

5.2. Test results
The following tables illustrate the main results achieved by our system regarding the two types of
evaluations considered, as well as the comparison with the other teams participating in the task. In
particular, Table 3 shows results according to the decision-based evaluation.

Table 3
Test results: Results of the decision-based evaluation for task T2. Bold indicates the best result for each considered
metric.
 Team              Run     P      R      F1    ERDE5     ERDE50     Latency TP    Speed    Latency-weighted F1
 NLP-UNED           0     0.64   0.97   0.77    0.09       0.04        13.00       0.95          0.73 (1)
 NLP-UNED           1     0.67   0.97   0.79    0.09       0.04        14.00       0.95          0.75 (1)
 NLP-UNED           2     0.63   0.97   0.76    0.09       0.04        12.00       0.96          0.73 (1)
 NLP-UNED           3     0.63   0.98   0.77    0.09       0.03        11.00       0.96          0.74 (1)
 NLP-UNED           4     0.63   0.97   0.76    0.09       0.04        14.00       0.95          0.72 (1)
 BioNLP-IISERB      4     0.73   0.62   0.67    0.08       0.05         4.00       0.99          0.66 (2)
 Riewe-Perla        0     0.45   0.97   0.62    0.07       0.02         6.00       0.98          0.60 (3)
 ELiRF-UPV          0     0.43   0.99   0.60    0.10       0.04        12.00       0.96          0.57 (4)
 UNSL               2     0.42   0.97   0.59    0.14       0.03        12.00       0.96          0.56 (5)
 SINAI              0     0.21   0.92   0.34    0.10       0.07         3.00       0.99          0.34 (6)
 APB-UC3M           0     0.17   0.99   0.28    0.15       0.08         9.00       0.97          0.28 (7)
 UMUTeam            1     0.15   0.99   0.26    0.19       0.09        27.00       0.90          0.24 (8)
 GVIS               1     0.12   1.00   0.22    0.12       0.10        1.00        1.00          0.22 (9)
 COS-470-Team-2     0     0.00   0.00   0.00    0.12       0.12                                    (10)

   As the table shows, all the configurations proposed for our system outperform all other participating
systems in terms of latency-weighted F1. In particular, our best performing run, R1, is 9 percentage points ahead of
the second best performing team. Although some other teams obtain slightly better results regarding
precision and recall, the F1 and latency-weighted F1 values show that our proposal is the most robust
across the considered metrics. Our system also obtains good results for some of the early risk detection
metrics. In particular, it achieves the third best ERDE5 and second best ERDE50 values, although the
latency and speed values are somewhat worse. It is particularly noticeable how all the proposed runs
are able to obtain good results. This probably indicates that the main improvement proposed, which is
the use of a contrastive learning technique for refining the embeddings representing text messages,
has a powerful impact on the performance of our system. On the other hand, the use of heuristics
for increasing the amount of information considered before classifying a message does not seem to
have that much impact on the final results. However, in the validation stage we observed that when
contrastive learning is not performed on the original embeddings, the use of these heuristics does
positively influence the results. Therefore, future efforts should be focused on improving these rules.
   Table 4 shows the main results on the ranking-based evaluation.
   Once again, our system ranks first in this type of evaluation, for almost all the considered metrics,
and for any of the proposed configurations. In particular, we are able to achieve perfect scores for 𝑃@10
and 𝑁𝐷𝐶𝐺@10 after receiving 1, 100, 500 and 1000 messages, and the best results for 𝑁𝐷𝐶𝐺@100
Table 4
Test results: Results of the ranking-based evaluation for task T2. Bold indicates the best result for each considered
metric.
                                  1 writing                     100 writings                   500 writings                   1000 writings
 Team                Run   P@10  NDCG@10   NDCG@100        P@10 NDCG@10   NDCG@100        P@10 NDCG@10   NDCG@100        P@10 NDCG@10   NDCG@100
 NLP-UNED            R0    1.00     1.00       0.44        1.00    1.00       0.89        1.00    1.00       0.91        1.00    1.00       0.91
 NLP-UNED            R1    1.00     1.00       0.44        1.00    1.00       0.89        1.00    1.00       0.92        1.00    1.00       0.92
 NLP-UNED            R2    1.00     1.00       0.44        1.00    1.00       0.89        1.00    1.00       0.91        1.00    1.00       0.91
 NLP-UNED            R3    1.00     1.00       0.45        1.00    1.00       0.91        1.00    1.00       0.91        1.00    1.00       0.91
 NLP-UNED            R4    1.00     1.00       0.44        1.00    1.00       0.89        1.00    1.00       0.91        1.00    1.00       0.91
 UNSL                R1    1.00     1.00       0.69        1.00    1.00       0.80        0.90    0.81       0.69        0.80    0.88       0.72
 Riewe-Perla         R0    0.50     0.47       0.17        0.70    0.62       0.74        0.70    0.62       0.74        0.70    0.62       0.75
 GVIS                R1    0.40     0.37       0.40        0.30    0.32       0.42        0.00    0.00       0.00        0.00    0.00       0.00
 ELiRF-UPV           R0    0.20     0.12       0.14        0.20    0.13       0.14        0.20    0.13       0.14        0.20    0.13       0.14
 UMUTeam             R1    0.20     0.12       0.14        0.10    0.06       0.03        0.00    0.00       0.05        0.20    0.21       0.12
 BioNLP-IISERB       R4    0.20     0.21       0.10        0.00    0.00       0.00        0.00    0.00       0.00        0.00    0.00       0.00
 ABP-UC3M            R0    0.00     0.00       0.03        0.40    0.56       0.26        0.00    0.00       0.09        0.00    0.00       0.13
 SINAI               R3    0.00     0.00       0.07        0.10    0.07       0.06        0.00    0.00       0.07        0.00    0.00       0.07
 COS-470-Team-2      R0    0.00     0.00       0.00        0.00    0.00       0.00        0.00    0.00       0.00        0.00    0.00       0.00


after receiving 100, 500 and 1000. Only the UNSL team is able to beat our system for the 𝑁 𝐷𝐶𝐺@100
after seeing only the first message of each user. Together with our latency and speed values in the
decision-based evaluation, this fact indicates that our system could be improved in terms of speed in
finding true positives, this is, determining that a user is at risk of suffering from anorexia.
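
   For reference, the ranking metrics used in this evaluation can be computed as in the following minimal
Python sketch. It assumes binary relevance (1 for a user at risk, 0 otherwise) and a list of labels already
sorted by decreasing system score; the function names are illustrative and are not taken from the official
evaluation scripts.

```python
import math


def precision_at_k(relevance, k):
    """Fraction of the top-k ranked users that are true positives."""
    return sum(relevance[:k]) / k


def dcg_at_k(relevance, k):
    """Discounted cumulative gain over the first k ranked users."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevance[:k]))


def ndcg_at_k(relevance, k):
    """DCG normalized by the DCG of the ideal ordering of the same labels."""
    ideal = dcg_at_k(sorted(relevance, reverse=True), k)
    return dcg_at_k(relevance, k) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical ranking: labels of users ordered by decreasing risk score.
    ranked_labels = [1, 1, 0, 1, 0, 0]
    print(precision_at_k(ranked_labels, 3))
    print(ndcg_at_k(ranked_labels, 3))
```

   A ranking that places every positive user ahead of every negative one reaches a score of 1.0 on both
metrics at any cutoff not larger than the number of positive users.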
   Finally, Table 5 shows some information regarding the number of runs submitted by the participating
teams, the number of total writings processed by each team, and the total time employed in processing
the messages.

Table 5
Participating teams, number of runs, number of user writings processed by the team, and lapse of time taken
for the entire process.
  Team                 #Runs       #User Writings Processed                     Lapse of Time (from 1st to last response)
  BioNLP-IISERB          5                    10                                                  09:39
  GVIS                   5                    352                                             3 days 12:36
  Riewe-Perla            5                   2001                                             2 days 11:25
  UNSL                   3                   2001                                                 07:00
  UMUTeam                5                   2001                                                 06:34
  COS-470-Team-2         5                     1                                                    -
  ELiRF-UPV              4                   2001                                                 12:27
  NLP-UNED               5                   2001                                                 09:40
  SINAI                  5                   2001                                             3 days 23:49
  APB-UC3M               2                   2001                                             6 days 21:34

  Among the participating systems that processed the complete set of user writings, our
system is the third fastest, with a processing time in the order of hours, similar to that of
the fastest teams.


6. Conclusions and Future Work
This paper presents our participation in Task 2 of the CLEF eRisk 2024 competition: Early Detection of
Signs of Anorexia. The developed system is a new version of the system designed for previous editions
of the competition, in which a relabelling method based on the use of approximate nearest neighbors
(ANN) is applied on the training dataset, and the same ANN techniques are then used for classifying
new messages and determining whether a user is at risk of suffering from a mental problem, in this
case anorexia. The new improvements incorporated into the system are the use of contrastive learning
techniques for fine-tuning the embeddings of the text messages, initially generated through a Universal
Sentence Encoder, and an increase in the amount of information employed for classification by
including a set of rules or heuristics that consider a message window of 𝑁 previous messages. The
developed system obtains the best results among the participating systems in terms of F-Measure
and latency-weighted F1 (decision-based evaluation), as well as in terms of ranking-based evaluation
metrics. In particular, all the tested configurations of the system outperform the second best participating
team by around 9% in latency-weighted F1. In general, the main results indicate that the refinement
of the vector representations obtained through contrastive learning techniques has been crucial for a
better discrimination between positive and negative messages, thus leading the system to effectively
determine when a message may indicate that the user is at risk of suffering from anorexia. On the other
hand, expanding the message window considered for performing the final classification has not shown
significant impact on the test results, although during the validation stage those configurations using
these heuristics were able to obtain better overall results with respect to configurations only using one
message for making a decision.
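
    As a reminder of how latency enters the decision-based evaluation, the sketch below computes a
latency-weighted F1 under the definition commonly used in the eRisk overviews: a sigmoid-shaped penalty
on the number of messages seen before each true-positive alert, with the median penalty turned into a
speed factor. The formula and the constant p = 0.0078 are assumptions here, since they are not restated
in this section, and the function names are illustrative.

```python
import math
from statistics import median


def latency_penalty(k, p=0.0078):
    """Penalty for a true positive detected after k messages (0 when k == 1)."""
    return -1.0 + 2.0 / (1.0 + math.exp(-p * (k - 1)))


def speed(delays, p=0.0078):
    """1 minus the median penalty over the delays of all true positives."""
    return 1.0 - median(latency_penalty(k, p) for k in delays)


def latency_weighted_f1(f1, delays, p=0.0078):
    """F1 scaled by the speed factor; equals F1 when every alert is immediate."""
    return f1 * speed(delays, p)


if __name__ == "__main__":
    # Hypothetical values: an F1 of 0.80 and the number of messages read
    # before each correctly flagged user.
    print(latency_weighted_f1(0.80, [1, 5, 12, 30, 60]))
```

    Under this definition, two systems with the same F1 are separated only by how early their correct
alerts are emitted.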
    As mentioned in Section 5.1, future lines of work should focus on improving the rules designed for
considering the history of messages before classifying a user. A trade-off must be found between the
latency (that is, the number of messages analyzed before emitting an alert) and the amount of information
that should be gathered before making a decision. Also, the treatment of these previous messages can
be improved: for instance, the current rules underestimate the weight of similar positive messages
when few messages have been received, since the buffer of previous scores is initialized with zeros. This
implies that even if a message is quite similar to positive messages its score is going to decrease when
it is one of the first analyzed messages for a user. The current decision of selecting only the nearest
positive messages for calculating the score can also be detrimental for the final results. More research
should be done on the type of functions that better model the similarity of a given message with both
positive and negative nearest neighbors, and its influence on the classification decision.
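
    The effect of the zero-initialized buffer can be illustrated with a small Python sketch. It assumes,
purely for illustration, that the window score is the mean of the last N per-message scores; the actual
aggregation rule of the system may differ, and both function names are hypothetical. The second function
shows one possible alternative that averages only over the messages actually received.

```python
from collections import deque


def windowed_scores_zero_init(scores, window):
    """Mean over a fixed-size buffer that starts filled with zeros."""
    buffer = deque([0.0] * window, maxlen=window)
    out = []
    for score in scores:
        buffer.append(score)  # the oldest entry (initially a zero) is dropped
        out.append(sum(buffer) / window)
    return out


def windowed_scores_warmup(scores, window):
    """Mean over the messages seen so far, capped at the window size."""
    buffer = deque(maxlen=window)
    out = []
    for score in scores:
        buffer.append(score)
        out.append(sum(buffer) / len(buffer))
    return out


if __name__ == "__main__":
    # Hypothetical per-message similarity scores for the first messages of a user.
    scores = [0.9, 0.9, 0.9]
    print(windowed_scores_zero_init(scores, window=3))  # starts low, grows with each message
    print(windowed_scores_warmup(scores, window=3))     # stays close to 0.9 from the first message
```

    With the zero-initialized buffer, a highly similar first message only contributes one N-th of its
score, which delays alerts for users whose earliest messages are already indicative.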
    An additional future line of research involves further refinement of the embeddings used for
representing users’ messages. In particular, the hyperparameters used in the contrastive learning phase,
described in Section 4.3, can be studied in greater depth through validation techniques in order to search
for optimal values. Additionally, different encoding models beyond the Universal Sentence Encoder
could also be considered, exploring issues such as multilingualism or models that have already used
contrastive learning techniques in their original training, like E5 [31].


Acknowledgments
This work has been partially supported by the Spanish Ministry of Science and Innovation within the
DOTT-HEALTH Project (MCI/AEI/FEDER, UE) under Grant PID2019-106942RB-C32, OBSER-MENH
Project (MCIN/AEI/10.13039 and NextGenerationEU”/PRTR) under Grant TED2021-130398B-C21 and
EDHER-MED Project under grant PID2022-136522OB-C21, as well as by the Universidad Nacional de
Educación a Distancia (UNED) within project SICAMESP (2023-VICE-0029).


References
 [1] J. Parapar, P. Martín Rodilla, D. E. Losada, F. Crestani, Overview of erisk 2024: Early risk prediction
     on the internet, Experimental IR Meets Multilinguality, Multimodality, and Interaction. 15th
     International Conference of the CLEF Association, CLEF 2024. Springer International Grenoble,
     France. (2024).
 [2] J. Parapar, P. Martín Rodilla, D. E. Losada, F. Crestani, Overview of erisk 2024: Early risk prediction
     on the internet (extended overview), Working Notes of the Conference and Labs of the Evaluation
     Forum CLEF 2024, Grenoble, France, September 9th to 12th, 2024, CLEF 2024. CEUR Workshop
     Proceedings (2024).
 [3] D. E. Losada, F. Crestani, J. Parapar, Overview of erisk: Early risk prediction on the internet
     (extended lab overview), in: L. Cappellato, N. Ferro, J. Nie, L. Soulier (Eds.), Working Notes of
     CLEF 2018 - Conference and Labs of the Evaluation Forum, Avignon, France, September 10-14,
     2018, volume 2125 of CEUR Workshop Proceedings, CEUR-WS.org, 2018. URL:
     https://ceur-ws.org/Vol-2125/invited_paper_1.pdf.
 [4] D. E. Losada, F. Crestani, J. Parapar, Overview of erisk at CLEF 2019: Early risk prediction on
     the internet (extended overview), in: L. Cappellato, N. Ferro, D. E. Losada, H. Müller (Eds.),
     Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum, Lugano, Switzerland,
     September 9-12, 2019, volume 2380 of CEUR Workshop Proceedings, CEUR-WS.org, 2019. URL:
     https://ceur-ws.org/Vol-2380/paper_248.pdf.
 [5] H. Fabregat, A. Duque, L. Araujo, J. Martínez-Romo, UNED-NLP at erisk 2022: Analyzing gambling
     disorders in social media using approximate nearest neighbors, in: G. Faggioli, N. Ferro, A. Hanbury,
     M. Potthast (Eds.), Proceedings of the Working Notes of CLEF 2022 - Conference and Labs of the
     Evaluation Forum, Bologna, Italy, September 5th to 8th, 2022, volume 3180 of CEUR Workshop
     Proceedings, CEUR-WS.org, 2022, pp. 894–904. URL: https://ceur-ws.org/Vol-3180/paper-71.pdf.
 [6] H. Fabregat, A. Duque, L. Araujo, J. Martínez-Romo, NLP-UNED-2 at erisk 2023: Detect-
     ing pathological gambling in social media through dataset relabeling and neural networks,
     in: M. Aliannejadi, G. Faggioli, N. Ferro, M. Vlachos (Eds.), Working Notes of the Confer-
     ence and Labs of the Evaluation Forum (CLEF 2023), Thessaloniki, Greece, September 18th to
     21st, 2023, volume 3497 of CEUR Workshop Proceedings, CEUR-WS.org, 2023, pp. 672–683. URL:
     https://ceur-ws.org/Vol-3497/paper-056.pdf.
 [7] H. Fabregat, A. Duque, L. Araujo, J. Martínez-Romo, A re-labeling approach based on approximate
     nearest neighbors for identifying gambling disorders in social media, in: A. Arampatzis, E. Kanoulas,
     T. Tsikrika, S. Vrochidis, A. Giachanou, D. Li, M. Aliannejadi, M. Vlachos, G. Faggioli, N. Ferro
     (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction - 14th International
     Conference of the CLEF Association, CLEF 2023, Thessaloniki, Greece, September 18-21, 2023,
     Proceedings, volume 14163 of Lecture Notes in Computer Science, Springer, 2023, pp. 174–185. URL:
     https://doi.org/10.1007/978-3-031-42448-9_15. doi:10.1007/978-3-031-42448-9_15.
 [8] J. Parapar, P. Martín Rodilla, D. E. Losada, F. Crestani, Overview of erisk 2023: Early risk prediction
     on the internet, Experimental IR Meets Multilinguality, Multimodality, and Interaction. 14th
     International Conference of the CLEF Association, CLEF 2023, Thessaloniki, Greece (2023).
 [9] J. Chim, A. Tsakalidis, D. Gkoumas, D. Atzil-Slonim, Y. Ophir, A. Zirikly, P. Resnik, M. Liakata,
     Overview of the CLPsych 2024 shared task: Leveraging large language models to identify evidence
     of suicidality risk in online posts, in: A. Yates, B. Desmet, E. Prud’hommeaux, A. Zirikly, S. Bedrick,
     S. MacAvaney, K. Bar, M. Ireland, Y. Ophir (Eds.), Proceedings of the 9th Workshop on Com-
     putational Linguistics and Clinical Psychology (CLPsych 2024), Association for Computational
     Linguistics, St. Julians, Malta, 2024, pp. 177–190. URL: https://aclanthology.org/2024.clpsych-1.15.
[10] A. M. Mármol-Romero, A. Moreno-Muñoz, F. M. P. del Arco, M. D. Molina-González, M. T. M.
     Valdivia, L. A. U. López, A. Montejo-Ráez, Overview of mentalriskes at iberlef 2023: Early
     detection of mental disorders risk in spanish, Proces. del Leng. Natural 71 (2023) 329–350. URL:
     http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6564.
[11] A. M. Mármol-Romero, A. Moreno-Muñoz, F. M. P. del Arco, M. D. Molina-González, M. T. M.
     Valdivia, L. A. U. López, A. Montejo-Ráez, Mentalriskes: A new corpus for early detection of
     mental disorders in spanish, in: N. Calzolari, M. Kan, V. Hoste, A. Lenci, S. Sakti, N. Xue (Eds.),
     Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language
     Resources and Evaluation, LREC/COLING 2024, 20-25 May, 2024, Torino, Italy, ELRA and ICCL,
     2024, pp. 11204–11214. URL: https://aclanthology.org/2024.lrec-main.978.
[12] C. M. Bulik, L. Reba, A.-M. Siega-Riz, T. Reichborn-Kjennerud, Anorexia nervosa: definition,
     epidemiology, and cycle of risk, International Journal of Eating Disorders 37 (2005) S2–S9.
[13] E. Mohammadi, H. Amini, L. Kosseim, Quick and (maybe not so) easy detection of anorexia in
     social media posts, in: L. Cappellato, N. Ferro, D. E. Losada, H. Müller (Eds.), Working Notes of
     CLEF 2019 - Conference and Labs of the Evaluation Forum, Lugano, Switzerland, September 9-12,
     2019, volume 2380 of CEUR Workshop Proceedings, CEUR-WS.org, 2019. URL:
     https://ceur-ws.org/Vol-2380/paper_74.pdf.
[14] W. Ragheb, J. Azé, S. Bringay, M. Servajean, Attentive multi-stage learning for early risk detection of
     signs of anorexia and self-harm on social media, in: L. Cappellato, N. Ferro, D. E. Losada, H. Müller
     (Eds.), Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum, Lugano,
     Switzerland, September 9-12, 2019, volume 2380 of CEUR Workshop Proceedings, CEUR-WS.org,
     2019. URL: https://ceur-ws.org/Vol-2380/paper_126.pdf.
[15] S. G. Burdisso, M. Errecalde, M. Montes-y-Gómez, UNSL at erisk 2019: a unified approach for
     anorexia, self-harm and depression detection in social media, in: L. Cappellato, N. Ferro, D. E.
     Losada, H. Müller (Eds.), Working Notes of CLEF 2019 - Conference and Labs of the Evaluation
     Forum, Lugano, Switzerland, September 9-12, 2019, volume 2380 of CEUR Workshop Proceedings,
     CEUR-WS.org, 2019. URL: https://ceur-ws.org/Vol-2380/paper_103.pdf.
[16] M. E. Aragón, A. P. López-Monroy, M. Montes-y-Gómez, INAOE-CIMAT at erisk 2019: Detecting
     signs of anorexia using fine-grained emotions, in: L. Cappellato, N. Ferro, D. E. Losada, H. Müller
     (Eds.), Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum, Lugano,
     Switzerland, September 9-12, 2019, volume 2380 of CEUR Workshop Proceedings, CEUR-WS.org,
     2019. URL: https://ceur-ws.org/Vol-2380/paper_113.pdf.
[17] R. M. Ortega-Mendoza, D. I. H. Farías, M. Montes-y-Gómez, Ltl-inaoe’s participation at erisk
     2019: Detecting anorexia in social media through shared personal information, in: L. Cappellato,
     N. Ferro, D. E. Losada, H. Müller (Eds.), Working Notes of CLEF 2019 - Conference and Labs of the
     Evaluation Forum, Lugano, Switzerland, September 9-12, 2019, volume 2380 of CEUR Workshop
     Proceedings, CEUR-WS.org, 2019. URL: https://ceur-ws.org/Vol-2380/paper_75.pdf.
[18] R. Hadsell, S. Chopra, Y. LeCun, Dimensionality reduction by learning an invariant mapping, in:
     2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR
     2006), 17-22 June 2006, New York, NY, USA, IEEE Computer Society, 2006, pp. 1735–1742. URL:
     https://doi.org/10.1109/CVPR.2006.100. doi:10.1109/CVPR.2006.100.
[19] P. H. Le-Khac, G. Healy, A. F. Smeaton, Contrastive representation learning: A framework and
     review, IEEE Access 8 (2020) 193907–193934. URL: https://doi.org/10.1109/ACCESS.2020.3031549.
     doi:10.1109/ACCESS.2020.3031549.
[20] T. Chen, S. Kornblith, M. Norouzi, G. E. Hinton, A simple framework for contrastive learning of
     visual representations, in: Proceedings of the 37th International Conference on Machine Learning,
     ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research,
     PMLR, 2020, pp. 1597–1607. URL: http://proceedings.mlr.press/v119/chen20j.html.
[21] T. Gao, X. Yao, D. Chen, Simcse: Simple contrastive learning of sentence embeddings, CoRR
     abs/2104.08821 (2021). URL: https://arxiv.org/abs/2104.08821. arXiv:2104.08821.
[22] A. van den Oord, Y. Li, O. Vinyals, Representation learning with contrastive predictive coding,
     CoRR abs/1807.03748 (2018). URL: http://arxiv.org/abs/1807.03748. arXiv:1807.03748.
[23] J. Parapar, P. Martín-Rodilla, D. E. Losada, F. Crestani, Overview of erisk at CLEF 2021: Early risk
     prediction on the internet (extended overview), Proceedings of the Working Notes of CLEF 2021
     - Conference and Labs of the Evaluation Forum, Bucharest, Romania, 2021 2936 (2021) 864–887.
     URL: http://ceur-ws.org/Vol-2936/paper-72.pdf.
[24] D. E. Losada, F. Crestani, A test collection for research on depression and language use, in:
     N. Fuhr, P. Quaresma, T. Gonçalves, B. Larsen, K. Balog, C. Macdonald, L. Cappellato, N. Ferro
     (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction - 7th International
     Conference of the CLEF Association, CLEF 2016, Évora, Portugal, September 5-8, 2016, Proceedings,
     volume 9822 of Lecture Notes in Computer Science, Springer, 2016, pp. 28–39. URL:
     https://doi.org/10.1007/978-3-319-44564-9_3. doi:10.1007/978-3-319-44564-9_3.
[25] D. Cer, Y. Yang, S. Kong, N. Hua, N. Limtiaco, R. S. John, N. Constant, M. Guajardo-Cespedes,
     S. Yuan, C. Tar, Y. Sung, B. Strope, R. Kurzweil, Universal sentence encoder, CoRR abs/1803.11175
     (2018). URL: http://arxiv.org/abs/1803.11175. arXiv:1803.11175.
[26] M. Iyyer, V. Manjunatha, J. Boyd-Graber, H. Daumé III, Deep unordered composition rivals syntactic
     methods for text classification, in: Proceedings of the 53rd Annual Meeting of the Association
     for Computational Linguistics and the 7th International Joint Conference on Natural Language
     Processing (Volume 1: Long Papers), Association for Computational Linguistics, Beijing, China,
     2015, pp. 1681–1691. URL: https://aclanthology.org/P15-1162. doi:10.3115/v1/P15-1162.
[27] E. Bernhardsson, Annoy: Approximate Nearest Neighbors in C++/Python, 2018. URL:
     https://pypi.org/project/annoy/. Python package version 1.13.0.
[28] Y. A. Malkov, D. A. Yashunin, Efficient and robust approximate nearest neighbor search using
     hierarchical navigable small world graphs, CoRR abs/1603.09320 (2016). URL:
     http://arxiv.org/abs/1603.09320. arXiv:1603.09320.
[29] N. Rethmeier, I. Augenstein, A primer on contrastive pretraining in language processing: Methods,
     lessons learned, and perspectives, ACM Comput. Surv. 55 (2023) 203:1–203:17. URL:
     https://doi.org/10.1145/3561970. doi:10.1145/3561970.
[30] K. Q. Weinberger, J. Blitzer, L. K. Saul, Distance metric learning for large margin near-
     est neighbor classification, in: Advances in Neural Information Processing Systems 18
     [Neural Information Processing Systems, NIPS 2005, December 5-8, 2005, Vancouver, British
     Columbia, Canada], 2005, pp. 1473–1480. URL: https://proceedings.neurips.cc/paper/2005/hash/
     a7f592cef8b130a6967a90617db5681b-Abstract.html.
[31] L. Wang, N. Yang, X. Huang, B. Jiao, L. Yang, D. Jiang, R. Majumder, F. Wei, Text embeddings by
     weakly-supervised contrastive pre-training, CoRR abs/2212.03533 (2022). URL:
     https://doi.org/10.48550/arXiv.2212.03533. doi:10.48550/ARXIV.2212.03533. arXiv:2212.03533.