<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Conference and Labs of the Evaluation Forum, September</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>NLP-UNED at eRisk 2024: Approximate Nearest Neighbors with Encoding Refinement for Early Detecting Signs of Anorexia</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hermenegildo Fabregat</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniel Deniz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andres Duque</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lourdes Araujo</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Juan Martinez-Romo</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Avature Machine Learning, Marqués de Valdeiglesias</institution>
          ,
          <addr-line>3, Madrid 28004</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>IMIENS: Instituto Mixto de Investigación, Escuela Nacional de Sanidad</institution>
          ,
          <addr-line>Monforte de Lemos 5, Madrid 28019</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>NLP &amp; IR Group, Dpto. Lenguajes y Sistemas Informáticos, Universidad Nacional de Educación a Distancia (UNED)</institution>
          ,
          <addr-line>Juan del Rosal 16, Madrid 28040</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>0</volume>
      <fpage>9</fpage>
      <lpage>12</lpage>
      <abstract>
        <p>This paper describes our participation in Task 2 (Early Detection of Signs of Anorexia) of the CLEF 2024 eRisk Workshop, which aims at detecting early signs of anorexia in social media users through the analysis of their posts. A relabelling step based on Approximate Nearest Neighbors (ANN) is performed to generate a training dataset annotated at message level instead of user level, and contrastive learning techniques are then applied to refine the previously generated vector representations of the messages. ANN techniques are also used for classification, combined with rules and heuristics that expand the number of messages from the user considered when making the final decision. Our system obtains the best results in both the decision-based evaluation, with 9 percentage points over the second best system in terms of latency-weighted F1, and the ranking-based evaluation, with the best scores for 11 out of the 12 metrics employed.</p>
      </abstract>
      <kwd-group>
        <kwd>Early risk detection</kwd>
        <kwd>Anorexia</kwd>
        <kwd>Approximate Nearest Neighbors</kwd>
        <kwd>Contrastive Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In recent years, the analysis of social media for early detection of health risks has become an intriguing
and significant area of research. Within this research field, the eRisk workshop, part of the
Conference and Labs of the Evaluation Forum (CLEF) since 2017, has played a pivotal role. This workshop
fosters collaborative efforts to develop innovative methodologies and practical solutions for the early
identification of various health concerns, including eating disorders, self-harm, pathological gambling
and depression, through the analysis of textual content on social media platforms. By analyzing social
media posts and messages, researchers can obtain valuable insights to identify individuals at risk.</p>
      <p>
        This paper details our approach to tackling Task 2 of the eRisk 2024 Workshop [
        <xref ref-type="bibr" rid="ref1">1, 2</xref>
        ]: Early Detection
of Signs of Anorexia. In this task, systems must sequentially process messages posted by different
users in Reddit forums, searching for early traces of anorexia, that is, detecting as soon as possible
whether a user is at risk of suffering from anorexia. The task is a continuation of Task 2 of the eRisk
2018 Workshop [3] and Task 1 of the eRisk 2019 Workshop [4].
      </p>
      <p>Building upon our previous work in the detection of pathological gambling [5, 6, 7], we have refined
our system by incorporating contrastive learning techniques for fine-tuning the encoded representations
of text messages written by the analyzed users. Additional heuristics have also been included in the
system in order to expand the context of the user’s messages, thereby taking into account a larger
number of previous messages when making the final decision on whether the user is at risk. These
improvements have proven to enhance the system’s accuracy and reliability in detecting potential cases
of anorexia from social media content.</p>
      <p>The rest of the paper is structured as follows: Section 2 gathers information about previous research
works related to early detection of risks, as well as systems participating in previous eRisk competitions.
A brief description of the addressed task, together with the dataset and evaluation metrics involved, is presented
in Section 3. The different components of the proposed system are described in Section 4, and the
results obtained by this system are shown and analyzed in Section 5. Finally, Section 6 presents some
conclusions about the work, together with possible future lines of work regarding this research.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>The automatic detection of mental health issues is currently a hot research topic within machine
learning, specifically regarding natural language processing. The availability of information sources
with large amounts of data, such as social media, is enabling the development of new systems aimed
at the early detection of these types of issues. Within this context, different evaluation frameworks
and campaigns such as CLEF’s eRisk [8], CLPsych [9] or IberLEF’s MentalRiskES [10, 11] represent a
significant effort by the scientific community to support the development and dissemination of these
types of systems.</p>
      <p>Anorexia nervosa (AN) is a severe eating disorder characterized by an inability to maintain a healthy
body weight, often falling below 85% of the ideal weight. Individuals with AN obsess over weight gain,
perceive their bodies as larger than they are, and engage in behaviors to sustain weight loss. This illness
profoundly affects both mind and body, with sufferers placing significant importance on their shape
and weight, intertwining their self-esteem with their body image [12]. The 2018 and 2019 CLEF eRisk
competitions addressed the automatic detection of signs of anorexia in Social Media posts, encouraging
the participating systems to develop techniques for determining whether a user can be classified as at
risk of suffering from this illness. Although the stage of development of neural models was nowhere
near the current level when the last edition of this task was held (2019), some of the best participating
systems at that time used such models for their predictions. An ensemble approach with different neural
attention-based models is used in [13] for feature extraction, and then combined with Support Vector
Machines to determine the final decision. Deep learning models are also used in [14] for developing
a time series dataset representing the evolution of the user’s mood through time. Then, Bayesian
inference is employed for performing the final classification. Other approaches obtained good results
in the competition by using more classic machine learning methods such as statistical word-based
techniques [15], or Support Vector Machines with customized feature sets based on emotions derived
from the text [16] or content-based features from phrases with personal pronouns [17]. In general, and
also based on the results obtained by our own participations in early risk detection tasks, systems not
relying on deep learning techniques or large language models are also able to achieve good results [7].</p>
      <p>Contrastive learning techniques can be defined as methods that aim to learn and refine effective
representations of data by pulling semantically close neighbors together and pushing dissimilar ones
apart [18]. One of the most important characteristics of contrastive learning is that the model learns by
comparison, that is, it is not necessary for the instances whose representations are to be refined to be
accompanied by their corresponding labels. Instead, these approaches only need to define the similarity
distribution. This way, the model should learn to map together similar instances, while separating
dissimilar instances in the embedding space [19]. These techniques have been successfully applied to
computer vision problems [20] and natural language processing tasks [21], as well as to other domains
such as audio or reinforcement learning [22]. Considering our system presented in previous eRisk
competitions, based on approximate nearest neighbors with vector representations of text messages,
exploring these techniques seems like a logical step for its improvement.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Task 2: Early Detection of Signs of Anorexia</title>
      <p>As previously mentioned, we have participated in Task 2 of the eRisk 2024 competition, entitled “Early
Detection of Signs of Anorexia”. In this task, participants have access to a training dataset containing
the whole history of writings (Reddit posts) for a set of users. These users are annotated depending on
whether they have explicitly mentioned having been diagnosed with anorexia (positive users) or not
(negative or control users). In the test stage, systems are asked to determine, as soon as possible, whether
a new user is at risk of suffering from anorexia according to the user’s writing history. In particular, for
each new message of a user, systems must determine whether the user is positive or negative. Once
a user is labelled as positive, the decision is considered final, and hence all subsequent labels
assigned to this user are ignored. Systems must also assign, after each message, a score measuring the
user’s risk of suffering from anorexia. This score is considered for evaluation purposes even after a user
has been labelled as positive.</p>
      <p>The statistics of the test dataset used for evaluating systems participating in this task are shown in
Table 1.</p>
      <p>System evaluation is conducted using two different paradigms: decision-based evaluation and
ranking-based evaluation. Complete information about the employed metrics can be found in [23].</p>
      <list list-type="bullet">
        <list-item>
          <p>Decision-based evaluation: This type of evaluation only considers the label assigned by the
system to each user (positive or negative), as well as the delay in determining that a positive user
is indeed at risk of suffering from anorexia. To this end, standard classification metrics
such as precision, recall and F-Measure are combined with metrics that take this
delay into account. The early risk detection error metric ERDE [24] is also used, although its
values have low interpretability. To overcome this, other metrics regarding the latency and speed
of detecting true positives are also proposed, and a final latency-weighted F1 measure is computed
by weighting the F-Measure with these delay-related metrics.</p>
        </list-item>
        <list-item>
          <p>Ranking-based evaluation: The score assigned to each user by the system, after analyzing each
received message, is used in this evaluation for computing ranking-based metrics. That is, users
are ranked after a given number of messages according to this score, and then standard ranking metrics such as
P@<italic>K</italic> and NDCG@<italic>K</italic> are applied to measure the performance of the systems.</p>
        </list-item>
      </list>
      <p>Finally, the lapse of time employed by the system for processing the whole test dataset is also
measured and reported, in order to illustrate the efficiency of the proposed systems.</p>
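      <p>As a rough illustration of how the latency-weighted F1 combines classification quality with delay, the following sketch follows the definition given in the eRisk overviews as we understand it: each true positive flagged after reading a given number of writings receives a penalty that grows with that number, speed is one minus the median penalty, and F1 is multiplied by speed. The function names and the penalty growth rate of 0.0078 are assumptions of this sketch and should be checked against [23].</p>
      <preformat><![CDATA[
```python
import math
from statistics import median

# Assumed growth rate of the latency penalty; not stated in this paper (see [23]).
P_RATE = 0.0078


def penalty(k, p=P_RATE):
    """Penalty in [0, 1) for a true positive flagged after reading k writings."""
    return -1.0 + 2.0 / (1.0 + math.exp(-p * (k - 1)))


def latency_weighted_f1(decisions, golds, delays, p=P_RATE):
    """decisions/golds hold 0/1 per user; delays[i] is the number of writings
    read before the decision for user i was emitted."""
    pairs = list(zip(decisions, golds))
    tp = sum(1 for d, g in pairs if d == 1 and g == 1)
    fp = sum(1 for d, g in pairs if d == 1 and g == 0)
    fn = sum(1 for d, g in pairs if d == 0 and g == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    # Speed only looks at correctly flagged users.
    tp_penalties = [penalty(k, p) for (d, g), k in zip(pairs, delays)
                    if d == 1 and g == 1]
    speed = 1.0 - median(tp_penalties)
    return f1 * speed
```
]]></preformat>
      <p>Under these assumptions, a system that flags every positive user on the first writing keeps its plain F1, while the same decisions emitted later are scaled down by the speed factor.</p>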
    </sec>
    <sec id="sec-4">
      <title>4. Proposed System</title>
      <p>The system developed for performing early detection of signs of anorexia is presented in this Section.
In particular, the different components that constitute the complete system pipeline are enumerated
and described in detail. The main differences with the original research, based on dataset relabelling
and approximate nearest neighbors techniques, presented in [5], are the use of a contrastive learning
technique for fine-tuning the embedding representations of the user’s messages (Section 4.3), as well
as the development of a set of heuristics for considering previous messages for the final classification,
instead of only taking into account the last message received (Section 4.4).</p>
      <sec id="sec-4-1">
        <title>4.1. Data representation</title>
        <p>The encoder used in this work for obtaining embeddings representing each of the messages of a
particular user is the Universal Sentence Encoder [25]. Through its use, all messages in the training
dataset are transformed into 512-dimensional embeddings. The specific model used in the encoding is
based on a Deep Averaging Network (DAN) [26], trained on different sources of data written in English,
and normally used for generating vector representations of texts longer than words, i.e., sentences,
phrases or short paragraphs.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Relabelling process</title>
        <p>The relabelling process has been described in previous works [5, 7]. Its main objective is to generate
a training dataset labelled at message-level, starting from the user-level annotation provided by the
organizers. The intuition behind this decision, already tested in previous eRisk competitions devoted to
detecting pathological gambling, is that message-level annotations can help the system to emit accurate
alerts about a user’s risk of suffering from anorexia by analyzing the user’s individual messages.</p>
        <p>In this stage, a technique for generating indexes based on approximate nearest neighbors (ANN) is
applied, thereby creating a data structure that allows us to obtain the <italic>k</italic> most similar messages to
a specific one. Two different ANN approaches have been explored in this work. First, Annoy [27] is
a partitioning method based on hyperplanes with random directions that recursively divide the search space.
The generated index has the shape of a binary tree, and through its use the most
similar elements to a query can be easily retrieved. Second, the Hierarchical Navigable
Small World (HNSW) method, implemented by the Non-Metric Space Library (NMSLIB) [28], is a
graph-based ANN technique. In this case, the search index has the form of a proximity graph in which
nodes correspond to particular instances (in our case, messages), and edges define the neighborhood
relationship. The main idea behind the use of this technique is that a neighbor’s neighbor is likely to
also be a neighbor of a particular instance. Nearest neighbor retrieval is then performed by using a
best-first search strategy on the graph.</p>
        <p>Once the selected index has been built on all the messages composing the training dataset, we
are able to retrieve the desired nearest neighbors of any particular message. In the first iteration of
the relabelling process, all messages are labelled as belonging to the same class (positive or negative)
as the user that created them. Then, for each positive message <italic>m</italic> in the training dataset, the set of its
<italic>k</italic> nearest neighbors is retrieved from the index. The message is relabelled as negative only if at
least <italic>n</italic> of those <italic>k</italic> nearest messages belong to the negative class. In our implementation, only positive
messages can be relabelled as negative. This is because only positive users can have negative
messages: if negative users had any positive message, they would have been labelled as positive.
Only messages containing title information, that is, messages representing the opening of a Reddit
thread, are taken into account for generating our training dataset. This filtering allows us to focus
on discussions originally initiated by the analyzed user, which are more likely to contain information
about particular worries or calls for help from the user. Moreover, this also reduces the computational
complexity of the system, while the final results do not significantly differ from those obtained by using
the complete set of messages. The relabelling step is iteratively repeated until convergence is reached,
that is, until no new relabellings are done during an iteration. A random sample of 33% of the users in the
original training dataset is employed for validation purposes, allowing us to explore the optimal values
of the <italic>k</italic> and <italic>n</italic> parameters. Through this validation step, these values have been set to <italic>k</italic> = 10 and
<italic>n</italic> = 6.</p>
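        <p>The relabelling loop described above can be sketched as follows. This is a minimal illustration rather than our actual implementation: an exact brute-force cosine search stands in for the Annoy or NMSLIB index, the query message is assumed to be excluded from its own neighborhood, relabellings are assumed to be applied at the end of each iteration, and all function and variable names are hypothetical. The defaults correspond to the neighborhood size (10) and the minimum number of negative neighbors (6) selected in validation.</p>
        <preformat><![CDATA[
```python
import math


def cosine_distance(u, v):
    """Cosine distance in [0, 2] between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def nearest_neighbors(embeddings, query_id, k):
    """Exact search standing in for an Annoy/NMSLIB query (assumption:
    the query message itself is excluded from its own neighborhood)."""
    query = embeddings[query_id]
    candidates = [m for m in embeddings if m != query_id]
    candidates.sort(key=lambda m: cosine_distance(query, embeddings[m]))
    return candidates[:k]


def relabel(embeddings, user_labels, k=10, n=6):
    """Iteratively flip positive messages to negative until nothing changes."""
    labels = dict(user_labels)  # start from the label of the authoring user
    while True:
        flips = []
        for msg_id, label in labels.items():
            if label != 1:
                continue  # only positive messages may be relabelled
            neighbors = nearest_neighbors(embeddings, msg_id, k)
            negatives = sum(1 for m in neighbors if labels[m] == 0)
            if negatives >= n:
                flips.append(msg_id)
        if not flips:
            return labels  # convergence: no relabelling in this iteration
        for msg_id in flips:
            labels[msg_id] = 0


# Toy 2-D data: p3 was written by a positive user but sits among negatives.
toy_embeddings = {
    "p1": (1.0, 0.0), "p2": (1.0, 0.1), "p3": (0.1, 1.0),
    "n1": (0.0, 1.0), "n2": (0.05, 1.0), "n3": (-0.1, 1.0),
}
toy_labels = {"p1": 1, "p2": 1, "p3": 1, "n1": 0, "n2": 0, "n3": 0}
relabelled = relabel(toy_embeddings, toy_labels, k=2, n=2)
```
]]></preformat>
        <p>In the toy data, only the positive-user message that lies among negative messages is relabelled, and the loop terminates because the set of positive messages can only shrink.</p>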
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Contrastive Learning</title>
        <p>After completing the relabelling process, we propose an additional technique in the encoding step
of our system based on fine-tuning the generated embeddings representing the different messages.
This fine-tuning relies on a contrastive learning technique [29], a method employed for maximizing
the distance between embeddings of messages belonging to different classes and minimizing it when
the messages belong to the same class. In particular, in our system this is achieved by retraining the
Universal Sentence Encoder used for generating the initial representations of the messages.
During this retraining, we employ a particular loss function, known as triplet loss [30]. For each
message in the training dataset, either labelled as positive or negative, a triplet (<italic>a</italic>, <italic>x</italic><sup>+</sup>, <italic>x</italic><sup>−</sup>) is created, where
<italic>a</italic> is the original message, <italic>x</italic><sup>+</sup> is a message belonging to the same class, and <italic>x</italic><sup>−</sup> is a message belonging to the
opposite class. The triplet loss function used in our retraining is
<disp-formula><tex-math>\mathcal{L} = \max\left(d(a, x^{+}) - d(a, x^{-}) + \alpha,\ 0\right)</tex-math></disp-formula>
where <italic>d</italic> is a function measuring the distance between the generated embeddings. The distance function
employed for this work is cosine distance. This implies that the main aim of the training process is
to minimize the distance between messages belonging to the same class and maximize the distance
between messages belonging to different classes. An additional parameter <italic>α</italic> (the margin) is included in the loss
function in order to determine the minimum desired distance between positive and negative instances,
considering <italic>a</italic> as the reference instance.</p>
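        <p>For a single triplet, the loss described above can be computed as in the following minimal sketch. It operates on plain Python tuples instead of the retrained encoder, uses unnormalized cosine distance, and takes the margin of 0.15 from the system configuration as its default; the function names are illustrative only.</p>
        <preformat><![CDATA[
```python
import math


def cosine_distance(u, v):
    """Cosine distance in [0, 2] between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def triplet_loss(anchor, positive, negative, margin=0.15):
    """max(d(anchor, positive) - d(anchor, negative) + margin, 0).

    Note: this sketch does not apply any normalization to the distances."""
    d_pos = cosine_distance(anchor, positive)
    d_neg = cosine_distance(anchor, negative)
    return max(d_pos - d_neg + margin, 0.0)


anchor = (1.0, 0.0)
# Same-class message coincides with the anchor, opposite-class one is orthogonal.
easy = triplet_loss(anchor, positive=(1.0, 0.0), negative=(0.0, 1.0))
# Roles swapped: the opposite-class message is closer than the same-class one.
hard = triplet_loss(anchor, positive=(0.0, 1.0), negative=(1.0, 0.0))
```
]]></preformat>
        <p>The first triplet already satisfies the margin and contributes no loss, whereas the second one is penalized, which is what drives the encoder to pull same-class messages together.</p>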
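        <p>The main idea behind the contrastive learning process is illustrated in Figure 1.</p>
        <p>The configuration used for retraining is the following:</p>
        <list list-type="bullet">
          <list-item>
            <p>Number of instances: 20 triplets (<italic>a</italic>, <italic>x</italic><sup>+</sup>, <italic>x</italic><sup>−</sup>) are generated for each message <italic>a</italic>, by randomly
selecting positive (same class) and negative (opposite class) instances.</p>
          </list-item>
          <list-item>
            <p>Batch size: The batch size is 32.</p>
          </list-item>
          <list-item>
            <p>Learning rate: The learning rate is set to 1e-5.</p>
          </list-item>
          <list-item>
            <p>Epochs: The number of epochs is 4.</p>
          </list-item>
          <list-item>
            <p>Steps per epoch: The number of steps per epoch is 128.</p>
          </list-item>
          <list-item>
            <p>Margin: The triplet loss margin (<italic>α</italic>) is set to 0.15 (normalized values are used for distances and
margin).</p>
          </list-item>
        </list>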
        <p>With this configuration, a maximum of 128 × 32 = 4,096 instances (steps per epoch times batch size) are fed
to the network in each epoch. This implies that 128 × 32 × 4 = 16,384 instances are used for training.
Hence, given the size of the training dataset, only a fraction of the generated triplets are effectively
used for training. Also, not all instances are seen the same number of times.</p>
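        <p>The triplet generation and the resulting training budget can be illustrated with the short sketch below. It assumes uniform random sampling with a fixed seed and excludes the anchor from its own same-class candidates; both choices, as well as the names used, are assumptions of this sketch rather than details of our implementation.</p>
        <preformat><![CDATA[
```python
import random

BATCH_SIZE = 32
STEPS_PER_EPOCH = 128
EPOCHS = 4
TRIPLETS_PER_MESSAGE = 20


def build_triplets(labels, per_message=TRIPLETS_PER_MESSAGE, seed=0):
    """Return (anchor, same_class, opposite_class) message-id triplets."""
    rng = random.Random(seed)
    by_class = {0: [], 1: []}
    for msg_id, label in labels.items():
        by_class[label].append(msg_id)
    triplets = []
    for anchor, label in labels.items():
        # Assumption: a message is never its own same-class example.
        same = [m for m in by_class[label] if m != anchor]
        other = by_class[1 - label]
        if not same or not other:
            continue
        for _ in range(per_message):
            triplets.append((anchor, rng.choice(same), rng.choice(other)))
    return triplets


toy_labels = {"m1": 1, "m2": 1, "m3": 0, "m4": 0}
toy_triplets = build_triplets(toy_labels)

# Upper bound on the triplets the network sees with the reported settings.
instances_per_epoch = STEPS_PER_EPOCH * BATCH_SIZE
instances_total = instances_per_epoch * EPOCHS
```
]]></preformat>
        <p>Since 20 triplets are generated per message, any training set with more than about 820 messages produces more triplets than the 16,384 that are actually consumed.</p>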
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Final classification</title>
        <p>Once the representation of the text messages has been refined using contrastive learning techniques, the
final classification step is somewhat similar to the relabelling process described in Section 4.2. However,
some additional heuristics have been added to this stage in order to consider more than one individual
message for determining whether a user is at risk of suffering from anorexia.</p>
        <p>New values of the <italic>k</italic> and <italic>n</italic> parameters are determined in this step for performing the final classification. Each
time a new message <italic>m</italic> is received, its <italic>k</italic> nearest neighbors are retrieved. If at least <italic>n</italic> of those <italic>k</italic>
neighbors are positive, the message, and hence the user, is directly classified as positive. Through the
use of the aforementioned validation split, the values of these parameters have been set to <italic>k</italic> = 19 and
<italic>n</italic> = 19 for the classification step.</p>
        <p>As previously mentioned, we are also interested in analyzing whether the history of previous messages
from the user can be useful for performing a more accurate classification. For this purpose, we have
explored in more depth how assigning risk scores to the user after analyzing each message can affect
the final classification. Besides the classification of the user as positive or negative, and regarding the
ranking-based evaluation, a score is expected to be assigned to the user after receiving each message,
representing the user’s risk of suffering from anorexia. In our system, this score is computed from
the average distance between a received message <italic>m</italic> and all its nearest neighbors labelled as
positive,
<disp-formula><tex-math>\bar{d} = \frac{1}{P} \sum_{i=1}^{P} d(m, p_i)</tex-math></disp-formula>
where <italic>p</italic><sub><italic>i</italic></sub> is a message within the set of <italic>k</italic> nearest neighbors
that is labelled as positive, and <italic>P</italic> is the number of such messages. The distance function employed returns values between 0 and 2, and hence
the score assigned to the user is <inline-formula><tex-math>s = 2 - \bar{d}</tex-math></inline-formula>. This way, a message really close to its positive
neighbors would receive a distance value of <inline-formula><tex-math>\bar{d} \approx 0</tex-math></inline-formula> and hence its score would be <inline-formula><tex-math>s \approx 2</tex-math></inline-formula>. This
score is calculated for test messages classified as positive, but also for those classified as negative, and a
buffer containing the scores of the <italic>w</italic> previous messages from the user is stored. The buffer is initially
filled with zeros. Hence, if the system initially classifies a message as negative, the average score
for the last <italic>w</italic> messages is calculated, and the message (and user) is classified as positive if this
average is over a particular threshold <italic>t</italic>. The optimal values of <italic>w</italic> and <italic>t</italic> (that is, the message window
considered and the score threshold) are also determined using the validation split and vary depending
on the submitted run (see Section 5.1).</p>
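        <p>The decision rule for a single incoming message can be sketched as follows. The sketch assumes that the neighbors have already been retrieved as (distance, label) pairs, that the score of the current message is part of the averaged window, and that a message with no positive neighbor receives a score of 0; these points, and all names used, are assumptions made for illustration.</p>
        <preformat><![CDATA[
```python
from collections import deque


class UserState:
    """Per-user memory: a fixed-size buffer of recent scores, zero-initialized."""

    def __init__(self, window, threshold):
        self.scores = deque([0.0] * window, maxlen=window)
        self.threshold = threshold
        self.flagged = False  # once True, the decision is final


def process_message(neighbors, state, n=19):
    """neighbors: (distance, label) pairs for the k nearest training messages,
    with cosine distances in [0, 2] and labels 1 (positive) / 0 (negative)."""
    positive_distances = [d for d, label in neighbors if label == 1]
    if positive_distances:
        score = 2.0 - sum(positive_distances) / len(positive_distances)
    else:
        score = 0.0  # assumption: no positive neighbor means minimum risk
    state.scores.append(score)  # assumption: the window includes this message
    voted_positive = len(positive_distances) >= n
    window_average = sum(state.scores) / len(state.scores)
    if voted_positive or window_average > state.threshold:
        state.flagged = True
    return state.flagged, score


# Toy run with k = 3 neighbors, n = 2, window of 2 and threshold 1.0.
state = UserState(window=2, threshold=1.0)
message = [(0.2, 1), (0.4, 0), (0.5, 0)]
first = process_message(message, state, n=2)
second = process_message(message, state, n=2)
```
]]></preformat>
        <p>In the toy run, the first message scores 1.8 but is averaged with the initial zero and stays below the threshold, while the second identical message raises the alert. This reproduces the effect of the zero-initialized buffer on the first messages of a user.</p>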
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results and Discussion</title>
      <p>The main results achieved by the proposed system are presented in this section. Experiments using the
validation split are described first in order to justify the configurations selected for the submitted runs.
Only the decision-based evaluation, and more particularly the latency-weighted F1 values, were taken into
account for tuning the hyperparameters through the validation split. Then, the results obtained on the test
dataset by the 5 different configurations selected are shown.</p>
      <sec id="sec-5-1">
        <title>5.1. Validation and Selected Runs</title>
        <p>As previously mentioned, a random split of 33% of the users in the training dataset is employed for
validation purposes. Through these experiments we have confirmed that the use of the contrastive
learning technique is able to improve all previous results obtained when using the Universal Sentence
Encoder with no modifications for generating the embeddings. In particular, the latency-weighted F1
value of the best performing configuration that uses the original encoder is around 6% lower than the
best performing system in our validation process. For this reason, we decided to use the contrastive
learning encoder in all the submitted runs. In general, applying the relabelling method also improves
the results with respect to not using it (that is, labelling all messages from a positive user as positive and
all messages from a negative user as negative). However, we included a run that does not perform any
relabelling in the test configurations, in order to compare results. The remaining parameters (the number of neighbors <italic>k</italic>
and the minimum number of agreeing neighbors <italic>n</italic> in either relabelling or classification, and the message window <italic>w</italic> and score threshold <italic>t</italic>) have been adjusted by selecting the
best performing configurations in the validation phase. As already stated, values of <italic>k</italic> = 10 and <italic>n</italic> = 6
during relabelling and <italic>k</italic> = 19 and <italic>n</italic> = 19 during classification showed the best results in this stage.</p>
        <p>Table 2 shows the configurations of the proposed system, for each of the five runs allowed to be
submitted in the task.</p>
        <p>Column “ANN system” indicates the technique employed for building the nearest neighbor index:
Annoy or NMSLIB. The encoder employed is always the one that refines the Universal Sentence
Encoder with contrastive learning (CL_USE). Column “Relabel” indicates whether the relabelling step
has been applied or not, while column “Heuristics” shows the values of parameters <italic>w</italic> (window size)
and <italic>t</italic> (decision threshold) when the rules described in Section 4.4 have been employed, and “None”
otherwise. Note that the best value for parameter <italic>t</italic> is always 1.0, that is, half the
maximum value that the average score for the last <italic>w</italic> messages can reach. Finally, we can
observe that the latency-weighted F1 metric is quite similar in this validation for all the proposed
configurations, except for R3, which does not include the relabelling step.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Test results</title>
        <p>The following tables illustrate the main results achieved by our system regarding the two types of
evaluations considered, as well as the comparison with the other teams participating in the task. In
particular, Table 3 shows results according to the decision-based evaluation.</p>
        <p>As the table shows, all the configurations proposed for our system outperform all other
participating systems in terms of latency-weighted F1. In particular, our best performing run, R1, is 9 percentage points ahead of
the second best performing team. Although some other teams obtain slightly better results regarding
precision and recall, the F1 and latency-weighted F1 values show that our proposal is the most robust
across the considered metrics. Our system also obtains good results for some of the early risk detection
metrics. In particular, it achieves the third best ERDE<sub>5</sub> and second best ERDE<sub>50</sub> values, although the
latency and speed values are somewhat worse. It is particularly noticeable that all the proposed runs
obtain good results. This probably indicates that the main improvement proposed, namely
the use of a contrastive learning technique for refining the embeddings representing text messages,
has a strong impact on the performance of our system. On the other hand, the use of heuristics
for increasing the amount of information considered before classifying a message does not seem to
have much impact on the final results. However, in the validation stage we observed that when
contrastive learning is not performed on the original embeddings, the use of these heuristics does
positively influence the results. Therefore, future efforts should focus on improving these rules.</p>
        <p>Table 4 shows the main results on the ranking-based evaluation.</p>
        <p>Once again, our system ranks first in this type of evaluation, for almost all the considered metrics
and for any of the proposed configurations. In particular, we achieve perfect scores for P@10
and NDCG@10 after receiving 1, 100, 500 and 1000 messages, and the best results for NDCG@100
after receiving 100, 500 and 1000 messages. Only the UNSL team beats our system for NDCG@100
after seeing only the first message of each user. Together with our latency and speed values in the
decision-based evaluation, this fact indicates that our system could be improved in terms of speed in
finding true positives, that is, in determining that a user is at risk of suffering from anorexia.</p>
        <p>Finally, Table 5 shows some information regarding the number of runs submitted by the participating
teams, the number of total writings processed by each team, and the total time employed in processing
the messages.</p>
        <p>Compared to the other participating systems that processed the complete set of user writings, our
system is the third best performing regarding execution times, the time interval being in the order of
hours, in a similar manner to the best performing teams.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions and Future Work</title>
      <p>This paper presents our participation in Task 2 of the CLEF eRisk 2024 competition: Early Detection of
Signs of Anorexia. The developed system is a new version of the system designed for previous editions
of the competition, in which a relabelling method based on the use of approximate nearest neighbors
(ANN) is applied on the training dataset, and the same ANN techniques are then used for classifying
new messages and determining whether a user is at risk of suffering from a mental problem, in this
case anorexia. The improvements incorporated into the system are the use of contrastive learning
techniques for fine-tuning the embeddings of the text messages, initially generated through a Universal
Sentence Encoder, and an increase in the amount of information employed for classification through
a set of rules or heuristics that consider a message window of <italic>w</italic> previous messages. The
developed system obtains the best results among the participating systems in terms of F-Measure
and latency-weighted F1 (decision-based evaluation), as well as in terms of ranking-based evaluation
metrics. In particular, all the tested configurations of the system outperform the second best participating
team by around 9 percentage points in latency-weighted F1. In general, the main results indicate that the refinement
of the vector representations obtained through contrastive learning techniques has been crucial for a
better discrimination between positive and negative messages, thus leading the system to effectively
determine when a message may indicate that the user is at risk of suffering from anorexia. On the other
hand, expanding the message window considered for performing the final classification has not shown
a significant impact on the test results, although during the validation stage the configurations using
these heuristics obtained better overall results than configurations using only one
message for making a decision.</p>
      <p>As mentioned in Section 5.2, future lines of work should focus on improving the rules designed for
considering the history of messages before classifying a user. A trade-off must be found between the
latency (that is, the number of messages analyzed before emitting an alert) and the amount of information
that should be gathered before making a decision. Also, the treatment of these previous messages can
be improved: for instance, the current rules underestimate the weight of similar positive messages
when few messages have been received, since the buffer of previous scores is initialized with zeros. This
implies that even if a message is quite similar to positive messages, its averaged score is lowered when
it is one of the first analyzed messages for a user. The current decision of selecting only the nearest
positive messages for calculating the score can also be detrimental for the final results. More research
should be done on the type of functions that better model the similarity of a given message with both
positive and negative nearest neighbors, and its influence on the classification decision.</p>
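      <p>The effect of the zero-initialized buffer can be illustrated with a minimal sketch. It assumes, purely for illustration, that the decision rule averages the last few per-message scores; the actual rules of the system may differ. The function name windowed_score, the window size of 3 and the score values are hypothetical and are not taken from the experiments.</p>
      <preformat>
```python
from collections import deque


def windowed_score(scores, window, zero_init=True):
    """Return the running window average after each incoming message score."""
    # Hypothetical rule (an assumption): average the last `window` scores.
    # With zero_init=True the buffer starts full of zeros, as described in the text;
    # with zero_init=False only the scores observed so far are averaged.
    buffer = deque([0.0] * window if zero_init else [], maxlen=window)
    averaged = []
    for score in scores:
        buffer.append(score)
        averaged.append(sum(buffer) / len(buffer))
    return averaged


# Three strongly positive early messages (illustrative values, not from the paper).
early_scores = [0.9, 0.9, 0.9]
zero_padded = windowed_score(early_scores, window=3, zero_init=True)
observed_only = windowed_score(early_scores, window=3, zero_init=False)
print(zero_padded)
print(observed_only)
```
      </preformat>
      <p>Under these assumptions, the zero-padded variant reports roughly 0.3 and 0.6 for the first two messages and only reaches 0.9 once the buffer is full, whereas averaging over the observed scores alone stays at about 0.9 from the first message. Normalizing by the number of messages actually received is one possible way of removing this early penalty, at the cost of making early decisions depend on fewer observations.</p>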
      <p>An additional future line of research involves further refinement of the embeddings used for
representing users’ messages. In particular, the hyperparameters used in the contrastive learning phase,
described in Section 4.3, can be studied in greater depth through validation techniques in order to search
for optimal values. Additionally, different encoding models beyond the Universal Sentence Encoder
could also be considered, exploring issues such as multilingualism or models that have already used
contrastive learning techniques in their original training, like E5 [31].</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This work has been partially supported by the Spanish Ministry of Science and Innovation within the
DOTT-HEALTH Project (MCI/AEI/FEDER, UE) under Grant PID2019-106942RB-C32, OBSER-MENH
Project (MCIN/AEI/10.13039 and NextGenerationEU”/PRTR) under Grant TED2021-130398B-C21 and
EDHER-MED Project under Grant PID2022-136522OB-C21, as well as by the Universidad Nacional de
Educación a Distancia (UNED) within project SICAMESP (2023-VICE-0029).</p>
      <p>[2] J. Parapar, P. Martín Rodilla, D. E. Losada, F. Crestani, Overview of erisk 2024: Early risk prediction
on the internet (extended overview), Working Notes of the Conference and Labs of the Evaluation
Forum CLEF 2024, Grenoble, France, September 9th to 12th, 2024, CLEF 2024. CEUR Workshop
Proceedings (2024).
[3] D. E. Losada, F. Crestani, J. Parapar, Overview of erisk: Early risk prediction on the internet
(extended lab overview), in: L. Cappellato, N. Ferro, J. Nie, L. Soulier (Eds.), Working Notes of
CLEF 2018 - Conference and Labs of the Evaluation Forum, Avignon, France, September 10-14,
2018, volume 2125 of CEUR Workshop Proceedings, CEUR-WS.org, 2018. URL:
https://ceur-ws.org/Vol-2125/invited_paper_1.pdf.
[4] D. E. Losada, F. Crestani, J. Parapar, Overview of erisk at CLEF 2019: Early risk prediction on
the internet (extended overview), in: L. Cappellato, N. Ferro, D. E. Losada, H. Müller (Eds.),
Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum, Lugano, Switzerland,
September 9-12, 2019, volume 2380 of CEUR Workshop Proceedings, CEUR-WS.org, 2019. URL:
https://ceur-ws.org/Vol-2380/paper_248.pdf.
[5] H. Fabregat, A. Duque, L. Araujo, J. Martínez-Romo, UNED-NLP at erisk 2022: Analyzing gambling
disorders in social media using approximate nearest neighbors, in: G. Faggioli, N. Ferro, A. Hanbury,
M. Potthast (Eds.), Proceedings of the Working Notes of CLEF 2022 - Conference and Labs of the
Evaluation Forum, Bologna, Italy, September 5th to 8th, 2022, volume 3180 of CEUR Workshop
Proceedings, CEUR-WS.org, 2022, pp. 894–904. URL: https://ceur-ws.org/Vol-3180/paper-71.pdf.
[6] H. Fabregat, A. Duque, L. Araujo, J. Martínez-Romo, NLP-UNED-2 at erisk 2023:
Detecting pathological gambling in social media through dataset relabeling and neural networks,
in: M. Aliannejadi, G. Faggioli, N. Ferro, M. Vlachos (Eds.), Working Notes of the
Conference and Labs of the Evaluation Forum (CLEF 2023), Thessaloniki, Greece, September 18th to
21st, 2023, volume 3497 of CEUR Workshop Proceedings, CEUR-WS.org, 2023, pp. 672–683. URL:
https://ceur-ws.org/Vol-3497/paper-056.pdf.
[7] H. Fabregat, A. Duque, L. Araujo, J. Martínez-Romo, A re-labeling approach based on approximate
nearest neighbors for identifying gambling disorders in social media, in: A. Arampatzis, E. Kanoulas,
T. Tsikrika, S. Vrochidis, A. Giachanou, D. Li, M. Aliannejadi, M. Vlachos, G. Faggioli, N. Ferro
(Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction - 14th International
Conference of the CLEF Association, CLEF 2023, Thessaloniki, Greece, September 18-21, 2023,
Proceedings, volume 14163 of Lecture Notes in Computer Science, Springer, 2023, pp. 174–185. URL:
https://doi.org/10.1007/978-3-031-42448-9_15. doi:10.1007/978-3-031-42448-9_15.
[8] J. Parapar, P. Martín Rodilla, D. E. Losada, F. Crestani, Overview of erisk 2023: Early risk prediction
on the internet, Experimental IR Meets Multilinguality, Multimodality, and Interaction. 14th
International Conference of the CLEF Association, CLEF 2023, Thessaloniki, Greece (2023).
[9] J. Chim, A. Tsakalidis, D. Gkoumas, D. Atzil-Slonim, Y. Ophir, A. Zirikly, P. Resnik, M. Liakata,
Overview of the CLPsych 2024 shared task: Leveraging large language models to identify evidence
of suicidality risk in online posts, in: A. Yates, B. Desmet, E. Prud’hommeaux, A. Zirikly, S. Bedrick,
S. MacAvaney, K. Bar, M. Ireland, Y. Ophir (Eds.), Proceedings of the 9th Workshop on
Computational Linguistics and Clinical Psychology (CLPsych 2024), Association for Computational
Linguistics, St. Julians, Malta, 2024, pp. 177–190. URL: https://aclanthology.org/2024.clpsych-1.15.
[10] A. M. Mármol-Romero, A. Moreno-Muñoz, F. M. P. del Arco, M. D. Molina-González, M. T. M.
Valdivia, L. A. U. López, A. Montejo-Ráez, Overview of mentalriskes at iberlef 2023: Early
detection of mental disorders risk in spanish, Proces. del Leng. Natural 71 (2023) 329–350. URL:
http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6564.
[11] A. M. Mármol-Romero, A. Moreno-Muñoz, F. M. P. del Arco, M. D. Molina-González, M. T. M.
Valdivia, L. A. U. López, A. Montejo-Ráez, Mentalriskes: A new corpus for early detection of
mental disorders in spanish, in: N. Calzolari, M. Kan, V. Hoste, A. Lenci, S. Sakti, N. Xue (Eds.),
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language
Resources and Evaluation, LREC/COLING 2024, 20-25 May, 2024, Torino, Italy, ELRA and ICCL,
2024, pp. 11204–11214. URL: https://aclanthology.org/2024.lrec-main.978.
[12] C. M. Bulik, L. Reba, A.-M. Siega-Riz, T. Reichborn-Kjennerud, Anorexia nervosa: definition,
epidemiology, and cycle of risk, International Journal of Eating Disorders 37 (2005) S2–S9.
[13] E. Mohammadi, H. Amini, L. Kosseim, Quick and (maybe not so) easy detection of anorexia in
social media posts, in: L. Cappellato, N. Ferro, D. E. Losada, H. Müller (Eds.), Working Notes of
CLEF 2019 - Conference and Labs of the Evaluation Forum, Lugano, Switzerland, September 9-12,
2019, volume 2380 of CEUR Workshop Proceedings, CEUR-WS.org, 2019. URL:
https://ceur-ws.org/Vol-2380/paper_74.pdf.
[14] W. Ragheb, J. Azé, S. Bringay, M. Servajean, Attentive multi-stage learning for early risk detection of
signs of anorexia and self-harm on social media, in: L. Cappellato, N. Ferro, D. E. Losada, H. Müller
(Eds.), Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum, Lugano,
Switzerland, September 9-12, 2019, volume 2380 of CEUR Workshop Proceedings, CEUR-WS.org,
2019. URL: https://ceur-ws.org/Vol-2380/paper_126.pdf.
[15] S. G. Burdisso, M. Errecalde, M. Montes-y-Gómez, UNSL at erisk 2019: a unified approach for
anorexia, self-harm and depression detection in social media, in: L. Cappellato, N. Ferro, D. E.
Losada, H. Müller (Eds.), Working Notes of CLEF 2019 - Conference and Labs of the Evaluation
Forum, Lugano, Switzerland, September 9-12, 2019, volume 2380 of CEUR Workshop Proceedings,
CEUR-WS.org, 2019. URL: https://ceur-ws.org/Vol-2380/paper_103.pdf.
[16] M. E. Aragón, A. P. López-Monroy, M. Montes-y-Gómez, INAOE-CIMAT at erisk 2019: Detecting
signs of anorexia using fine-grained emotions, in: L. Cappellato, N. Ferro, D. E. Losada, H. Müller
(Eds.), Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum, Lugano,
Switzerland, September 9-12, 2019, volume 2380 of CEUR Workshop Proceedings, CEUR-WS.org,
2019. URL: https://ceur-ws.org/Vol-2380/paper_113.pdf.
[17] R. M. Ortega-Mendoza, D. I. H. Farías, M. Montes-y-Gómez, Ltl-inaoe’s participation at erisk
2019: Detecting anorexia in social media through shared personal information, in: L. Cappellato,
N. Ferro, D. E. Losada, H. Müller (Eds.), Working Notes of CLEF 2019 - Conference and Labs of the
Evaluation Forum, Lugano, Switzerland, September 9-12, 2019, volume 2380 of CEUR Workshop
Proceedings, CEUR-WS.org, 2019. URL: https://ceur-ws.org/Vol-2380/paper_75.pdf.
[18] R. Hadsell, S. Chopra, Y. LeCun, Dimensionality reduction by learning an invariant mapping, in:
2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR
2006), 17-22 June 2006, New York, NY, USA, IEEE Computer Society, 2006, pp. 1735–1742. URL:
https://doi.org/10.1109/CVPR.2006.100. doi:10.1109/CVPR.2006.100.
[19] P. H. Le-Khac, G. Healy, A. F. Smeaton, Contrastive representation learning: A framework and
review, IEEE Access 8 (2020) 193907–193934. URL: https://doi.org/10.1109/ACCESS.2020.3031549.
doi:10.1109/ACCESS.2020.3031549.
[20] T. Chen, S. Kornblith, M. Norouzi, G. E. Hinton, A simple framework for contrastive learning of
visual representations, in: Proceedings of the 37th International Conference on Machine Learning,
ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research,
PMLR, 2020, pp. 1597–1607. URL: http://proceedings.mlr.press/v119/chen20j.html.
[21] T. Gao, X. Yao, D. Chen, Simcse: Simple contrastive learning of sentence embeddings, CoRR
abs/2104.08821 (2021). URL: https://arxiv.org/abs/2104.08821. arXiv:2104.08821.
[22] A. van den Oord, Y. Li, O. Vinyals, Representation learning with contrastive predictive coding,
CoRR abs/1807.03748 (2018). URL: http://arxiv.org/abs/1807.03748. arXiv:1807.03748.
[23] J. Parapar, P. Martín-Rodilla, D. E. Losada, F. Crestani, Overview of erisk at CLEF 2021: Early risk
prediction on the internet (extended overview), Proceedings of the Working Notes of CLEF 2021
- Conference and Labs of the Evaluation Forum, Bucharest, Romania, 2021, volume 2936 (2021) 864–887.
URL: http://ceur-ws.org/Vol-2936/paper-72.pdf.
[24] D. E. Losada, F. Crestani, A test collection for research on depression and language use, in:
N. Fuhr, P. Quaresma, T. Gonçalves, B. Larsen, K. Balog, C. Macdonald, L. Cappellato, N. Ferro
(Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction - 7th International
Conference of the CLEF Association, CLEF 2016, Évora, Portugal, September 5-8, 2016, Proceedings,
volume 9822 of Lecture Notes in Computer Science, Springer, 2016, pp. 28–39. URL:
https://doi.org/10.1007/978-3-319-44564-9_3. doi:10.1007/978-3-319-44564-9_3.
[25] D. Cer, Y. Yang, S. Kong, N. Hua, N. Limtiaco, R. S. John, N. Constant, M. Guajardo-Cespedes,
S. Yuan, C. Tar, Y. Sung, B. Strope, R. Kurzweil, Universal sentence encoder, CoRR abs/1803.11175
(2018). URL: http://arxiv.org/abs/1803.11175. arXiv:1803.11175.
[26] M. Iyyer, V. Manjunatha, J. Boyd-Graber, H. Daumé III, Deep unordered composition rivals syntactic
methods for text classification, in: Proceedings of the 53rd Annual Meeting of the Association
for Computational Linguistics and the 7th International Joint Conference on Natural Language
Processing (Volume 1: Long Papers), Association for Computational Linguistics, Beijing, China,
2015, pp. 1681–1691. URL: https://aclanthology.org/P15-1162. doi:10.3115/v1/P15-1162.
[27] E. Bernhardsson, Annoy: Approximate Nearest Neighbors in C++/Python, 2018. URL:
https://pypi.org/project/annoy/, python package version 1.13.0.
[28] Y. A. Malkov, D. A. Yashunin, Efficient and robust approximate nearest neighbor search using
hierarchical navigable small world graphs, CoRR abs/1603.09320 (2016). URL:
http://arxiv.org/abs/1603.09320. arXiv:1603.09320.
[29] N. Rethmeier, I. Augenstein, A primer on contrastive pretraining in language processing: Methods,
lessons learned, and perspectives, ACM Comput. Surv. 55 (2023) 203:1–203:17. URL:
https://doi.org/10.1145/3561970. doi:10.1145/3561970.
[30] K. Q. Weinberger, J. Blitzer, L. K. Saul, Distance metric learning for large margin
nearest neighbor classification, in: Advances in Neural Information Processing Systems 18
[Neural Information Processing Systems, NIPS 2005, December 5-8, 2005, Vancouver, British
Columbia, Canada], 2005, pp. 1473–1480. URL:
https://proceedings.neurips.cc/paper/2005/hash/a7f592cef8b130a6967a90617db5681b-Abstract.html.
[31] L. Wang, N. Yang, X. Huang, B. Jiao, L. Yang, D. Jiang, R. Majumder, F. Wei, Text embeddings by
weakly-supervised contrastive pre-training, CoRR abs/2212.03533 (2022). URL:
https://doi.org/10.48550/arXiv.2212.03533. doi:10.48550/ARXIV.2212.03533. arXiv:2212.03533.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Martín Rodilla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          ,
          <article-title>Overview of erisk 2024: Early risk prediction on the internet</article-title>
          ,
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. 15th International Conference of the CLEF Association</source>
          , CLEF 2024, Grenoble, France, Springer International (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>