<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>SINAI at eRisk@CLEF 2025: Transformer-Based and Conversational Strategies for Depression Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alba María Mármol-Romero</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Manuel García-Vega</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Miguel Ángel García-Cumbreras</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Arturo Montejo-Ráez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science Department</institution>
          ,
          <addr-line>SINAI, CEATIC</addr-line>
          ,
          <institution>University of Jaén</institution>
          ,
          <addr-line>23071</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
<p>This paper describes the participation of the SINAI-UJA team in the eRisk@CLEF 2025 lab. Specifically, we addressed two of the proposed tasks: (i) Task 2: Contextualized Early Detection of Depression, and (ii) Pilot Task: Conversational Depression Detection via LLMs. Our approach for Task 2 combines an extensive preprocessing pipeline with the use of several transformer-based models, such as RoBERTa Base or MentalRoBERTa Large, to capture the contextual and sequential nature of multi-user conversations. For the Pilot Task, we designed a set of conversational strategies to interact with LLM-powered personas, focusing on maximizing information gain within a limited number of dialogue turns. In Task 2, our system ranked 8th out of 12 participating teams based on F1 score. However, a deeper analysis revealed that our models were among the fastest in issuing early predictions, which is a critical factor in real-world deployment scenarios. This highlights the trade-off between early detection and classification accuracy, suggesting potential avenues for optimizing both jointly in future work. In the Pilot Task, we achieved 1st place out of 5 teams, obtaining the best overall performance across all evaluation metrics: DCHR, ADODL and ASHR. Our success in this task demonstrates the effectiveness of structured conversational design when combined with powerful language models, reinforcing the feasibility of deploying LLMs in sensitive mental health assessment contexts.</p>
      </abstract>
      <kwd-group>
<kwd>Early risk prediction</kwd>
        <kwd>Depression detection</kwd>
        <kwd>Symptoms of depression detection</kwd>
        <kwd>Natural Language Processing</kwd>
        <kwd>Transformers</kwd>
        <kwd>Large Language Model</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>• Task 1 - Search for symptoms of depression. It consists of ranking sentences from a collection
of user writings according to their relevance to a depression symptom. Participants must provide
rankings for each of the 21 depression symptoms from the BDI questionnaire. It is a continuation of
the Task 1 proposed for eRisk 2023 [4] and eRisk 2024 [5].</p>
<p>This work presents the participation of our research group, the SINAI team (https://sinai.ujaen.es/), in Task 2:
Contextualized Early Detection of Depression and the Pilot Task: Conversational Depression Detection via LLMs.
The rest of the paper is organized as follows: Sections 2 and 3 describe in detail our participation in Task 2 and
the Pilot Task, respectively. Each is divided into subsections in which, first, we introduce what the
task consists of, the data provided, and the evaluation measures used. Second, the system developed
and the methodology used are presented. Third, the experimental setup is detailed. Subsequently,
the results obtained and a discussion of them are presented. Finally, Section 4 presents the conclusions
drawn from our participation in the eRisk lab and the perspectives for future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Task 2: Contextualized Early Detection of Depression</title>
      <sec id="sec-2-1">
        <title>2.1. Task description</title>
        <p>This task focuses on the early detection of depression by analyzing full conversational contexts, including
the contributions of all users involved in a discussion. Unlike previous editions that relied on isolated
user posts, this task highlights the importance of dialogue dynamics and sequential processing of
messages. The challenge is structured in two phases: a training phase with individual user writings
(without context), and a test phase where models must process conversations chronologically and make
real-time predictions as new messages appear. Evaluation considers both accuracy and timeliness,
using metrics such as ERDE (Early Risk Detection Error), Flatency (latency-weighted F1), and traditional
scores like precision, recall, and F1. The task encourages the development of context-aware systems
suited for real-world applications in mental health monitoring.</p>
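<p>For intuition on the timeliness component, the ERDE measure can be sketched as follows. This is a minimal sketch following the definition commonly used in eRisk: false positives and false negatives incur fixed costs, while true positives are penalized by a sigmoid of the decision delay. The cost constants below are illustrative defaults, not the official task settings.</p>

```python
import math

def erde(decision: bool, truth: bool, delay: int, o: int,
         c_fp: float = 0.1, c_fn: float = 1.0, c_tp: float = 1.0) -> float:
    """Early Risk Detection Error for a single subject.

    decision/truth: predicted and true positive labels.
    delay: number of writings seen before a positive decision.
    o: the ERDE deadline parameter (e.g. 5 or 50).
    """
    if decision and not truth:        # false positive
        return c_fp
    if not decision and truth:        # false negative
        return c_fn
    if decision and truth:            # true positive, penalized by delay
        lc = 1.0 - 1.0 / (1.0 + math.exp(delay - o))
        return lc * c_tp
    return 0.0                        # true negative
```

<p>A quick positive decision (small delay relative to o) thus costs almost nothing, while a very late correct decision approaches the cost of a miss.</p>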
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Dataset</title>
        <sec id="sec-2-2-1">
          <title>2.2.1. Dataset provided</title>
          <p>For this task, we have made use of the dataset provided by the organisers, together with one extracted
specifically for this task.</p>
          <p>The provided testing data consists of a dataset in which there are two types of instances: submissions
and comments. Submissions represent the primary posts created by users. They are the main content
entries. Comments are the responses or replies made by users to a submission or other comments,
forming a hierarchical structure. Moreover, the main objective of this phase is to classify and assign a
score of risk to a target subject that sometimes can be the author of the primary post but other times
can only appear in some comments.</p>
<p>However, the training data provided [6] only includes the users' writings, not the full context of the
conversation, so there are no hierarchical structures with which to train our systems. The training data
includes users from previous early depression detection tasks of the eRisk shared task. Figure 1 shows some
graphs from a volumetric analysis. The training data comprise 121,889 subjects: 112,846 negative and
9,043 positive. In addition to this imbalance, as mentioned before, the training data do not provide the
relation between comments and posts. For that reason, we built our own dataset and merged it with a
subset of the data provided by the organizers to train our systems.</p>
        </sec>
        <sec id="sec-2-2-2">
          <title>2.2.2. Dataset extracted</title>
          <p>Since the data provided by the organizers lacked context, we created our own dataset using the Reddit
API through PRAW (https://github.com/praw-dev/praw), scraping data from the Reddit platform
(https://www.reddit.com/). The data provided by the organizers originated from Reddit as well,
which made this approach suitable for our analysis. The following steps were undertaken to construct
our dataset:
• We scraped posts and their associated comments from the subreddit /depression, which is
typically associated with individuals discussing depression. We labeled these posts as 1 (positive,
at risk of suffering depression), as they are expected to contain content related to depression.
• We also collected data from the /AdviceForTeens subreddit, where posts discussing feelings and
emotions are common, and we also used keywords such as "sad" and "friendship" to identify
posts from the whole Reddit platform. These posts were labeled 0 (negative, not at risk of suffering
depression), as we assumed they represent content less indicative of depression than posts
from /depression. This labeling was based on the assumption that discussions around sadness,
friendship, and advice for teens reflect emotional states similar in register to those found in
depression-related discussions, while being less indicative of depression itself.</p>
<p>For our dataset, we consider the target subject to be the author of the primary post. We then
applied some data pre-processing. For privacy reasons, we replaced all usernames with the generic term
"user", and we also removed references to the subreddits /depression and /AdviceForTeens in both the
posts and comments, to ensure that the focus remains on the content of the interactions rather than the
subreddit labels.</p>
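<p>This anonymization step can be sketched as follows; a minimal illustration assuming plain-text posts and a known list of usernames (the helper name and regular expressions are ours, not the exact pipeline):</p>

```python
import re

def anonymize(text: str, usernames: list) -> str:
    """Replace known usernames with 'user' and strip subreddit references."""
    for name in usernames:
        # replace every occurrence of a known username with the generic term
        text = re.sub(re.escape(name), "user", text)
    # remove mentions of the source subreddits, with or without the r/ prefix
    return re.sub(r"(?:\br)?/(?:depression|AdviceForTeens)\b", "",
                  text, flags=re.IGNORECASE)
```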
<p>A total of 1,782 posts were collected for the positive group and 975 posts for the negative
group, with a maximum of 50 comments per post due to API request limitations. Table 1 presents some
quantitative data about the dataset, and Figures 2 and 3 show statistics on the extracted posts and
comments. Although we collected more posts for the positive group than for the negative one, the average
number of words is lower for the positive group (192.83 words) than for the negative group
(272.24 words), as can be seen in Figures 2a and 2b. As shown in Figures 3a and 3c, the number
of comments on negative posts is considerably higher than on positive posts; however, the average
number of words per comment is almost the same (Figure 3d). Moreover, Figure 3b shows the average number of
comments written by the same author per post (in this case, equal to the target subject).</p>
        </sec>
        <sec id="sec-2-2-3">
          <title>2.2.3. Dataset generated</title>
<p>To maintain a balance similar to that of the provided data, we took a random sample of 2,757 negative subjects
(the same number of subjects we scraped) from the dataset provided by the organizers, without comments.
We then merged both datasets: the sample from the provided one and the whole of the extracted data. Our final
training dataset thus contains 5,514 subjects (1,782 positive and 3,732 negative). Some graphs can be
seen in Figure 4.</p>
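<p>The merge step can be sketched as follows; a hypothetical helper (names and schema are ours) that samples the provided negatives and concatenates them with the scraped subjects:</p>

```python
import random

def build_training_set(scraped, provided_negatives, n_sample, seed=0):
    """Merge scraped subjects with a random sample of provided negatives.

    `scraped` and `provided_negatives` are lists of (subject_id, label)
    pairs; in the paper, 2,757 scraped subjects are merged with a sample
    of 2,757 provided negatives.
    """
    rng = random.Random(seed)
    sample = rng.sample(provided_negatives, n_sample)
    merged = list(scraped) + sample
    rng.shuffle(merged)  # avoid ordering the two sources back to back
    return merged
```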
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. System and methods</title>
        <sec id="sec-2-3-1">
          <title>2.3.1. Pre-processing</title>
<p>We explored a method based on transformer encoders, as these have been shown to obtain
strong results in several related tasks.</p>
<p>Since the dataset has a hierarchical structure with three main elements (post, target subject and list of
comments), we consider two key cases:
• The target subject appears in some comments: in this case, the relevant data are those comments whose
direct parent is a message by the target subject, together with the target subject's own messages (merged
with the primary post when the target subject is also its author). An example is shown in Figure 5a,
covering two situations: one where the target subject only comments, and another where the target subject
is the author of the primary post and also comments on it. All other data are discarded.
• The target subject does not appear in the comments: this is the simplest case, so we only consider
direct children of the primary post as relevant data. An example is shown in Figure 5b.</p>
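<p>The selection of relevant messages for the two cases can be sketched as follows. The dictionary schema ({'author', 'text', 'comments'}) is an assumed illustration of the hierarchical structure, not the exact data model:</p>

```python
def relevant_messages(post, target):
    """Select context messages for a target subject from a post tree."""
    target_msgs = []   # comments written by the target subject
    replies = []       # direct children of a target-subject message

    def walk(node):
        for child in node.get("comments", []):
            if child["author"] == target:
                target_msgs.append(child)
            if node["author"] == target:
                replies.append(child)
            walk(child)

    walk(post)
    if target_msgs:
        # Case 1: the target appears in some comments
        return target_msgs + replies
    # Case 2: the target only authors the primary post -> direct children
    return list(post.get("comments", []))
```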
          <p>To extract the relevant data and provide context for each target subject's text, we begin by cleaning
the raw data. Specifically, we remove all URLs, newline characters, and any messages enclosed in square
brackets, as these are not necessary for our analysis (most of the bracketed messages are placeholders for
removed content). Once the text is cleaned, we structure it in a tree-like format for better organization:
[MSG] [USER] {type} {text} [MSG] [USER] {type} {text} ...</p>
          <p>We replace {type} with “CONTEXT" or “TARGET" depending on whether the message comes from
the target subject, and {text} is replaced with the actual content of the post (we concatenate title and
body in case title exists). This structure ensures that the model can easily identify the type of message
(whether it’s a context message or from the target subject) and process the relevant text for context.
</p>
          <p>Figure 3: (a) total number of comments by label; (b) average self comments by label; (c) average comments per post by label; (d) average comment word count by label.</p>
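<p>The serialization into the flat format above can be sketched as follows; the message schema ({'author', 'title', 'body'}) is an assumed illustration, not the exact pipeline:</p>

```python
def format_thread(messages, target):
    """Serialize a cleaned thread into '[MSG] [USER] {type} {text}' form."""
    parts = []
    for m in messages:
        # TARGET when the message comes from the target subject
        mtype = "TARGET" if m["author"] == target else "CONTEXT"
        # concatenate title and body when a title exists
        text = " ".join(t for t in (m.get("title"), m.get("body")) if t)
        parts.append("[MSG] [USER] %s %s" % (mtype, text))
    return " ".join(parts)
```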
        </sec>
        <sec id="sec-2-3-2">
          <title>2.3.2. Training</title>
<p>Since we want to test the ability of encoder models to understand the format described above, we trained
the models with the hyperparameters established in Table 3 for each run, after performing a
hyperparameter search with Optuna [7] over the search space shown in Table 2.</p>
          <p>Figure 5: (a) the target user appears in comments; (b) the target user appears only in the primary post.</p>
<p>Evaluation Infrastructure The experiments were conducted on a dedicated cluster owned by the
SINAI research group. The processing pipeline was implemented in Python and executed using the
vLLM framework [12] for efficient inference with large language models. We used a single NVIDIA RTX
4000 GPU, running on a Linux-based system. The environment included PyTorch, and all processes
were orchestrated via custom scripts. The combination of optimized software and hardware acceleration
enabled us to complete each run efficiently, with minimal latency per thread.</p>
        </sec>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Results and discussion</title>
<p>In Task 2, our team submitted the maximum number of allowed runs (five), successfully processing all
1,280 user threads in each of them. Our system demonstrated high efficiency, completing the full set
of conversations in just 9 hours and 53 minutes. This placed us among the three fastest teams in the
competition, alongside ELiRF–UPV and PJs-team (Table 4), all of which completed the task in under ten
hours. The short processing time indicates a high degree of automation and optimization in our pipeline,
which could handle the entire dataset without manual intervention or delays.</p>
        <p>Table 4 summarizes the processing statistics: ELiRF–UPV (5 runs, 1,280 threads, 08:33), PJs-team (5 runs, 1,280 threads, 08:36) and SINAI–UJA (5 runs, 1,280 threads, 09:53), against an average over all teams of 2 days 14:41 for 1,280 threads.</p>
        <p>Table 5 presents the decision-based results for Task 2, where our team submitted five different runs
(R0–R4). All our systems achieved perfect recall (1.00), which means that all true positive cases were
successfully identified. However, this high recall came at the expense of low precision,
with values ranging from 0.17 (R1) to 0.24 (R0), resulting in F1-scores between 0.29 and 0.39. The best
overall F1 was obtained in run R0, with a value of 0.39, although this run also showed higher latency
compared to R1 and R2.</p>
        <p>When analyzing the latency-weighted F1 metric (Flatency), our best performance was again achieved
by R0 (0.38), which aligns with its higher precision and better trade-off between early detection and
correctness. Our systems maintained competitive speed (above 0.99) and low ERDE values (e.g., 0.08–0.09
for ERDE5), which indicates that when our system made correct decisions, it did so quickly. The other
teams that obtained higher precision did so at the cost of more rounds (higher latency), with the
exception of the Lotu-Ixa team. Future efforts will focus on improving precision without compromising
recall, potentially by incorporating more robust post-processing techniques or confidence-based
calibration strategies.</p>
        <sec id="sec-2-4-2">
          <title>Precision per team and run</title>
          <p>Table 5 also reports the best precision (P) per team: HIT-SCIR 0.77, ELiRF-UPV 0.78, HU 0.72, UET-Psyche-Warriors 0.63, PJs-team 0.66, Lotu-Ixa 0.53, COTECMAR-UTB 0.29, NYCUNLP 0.20, FU-TU-DFKI 0.17, Capy-team 0.11 and DS-GT 0.11. Our runs SINAI-UJA R0–R4 obtained precision values of 0.24, 0.17, 0.22, 0.21 and 0.20, respectively, all with recall 1.00.</p>
        </sec>
        <sec id="sec-2-4-7">
          <title>Ranking-based evaluation</title>
          <p>Table 6 illustrates the ranking-based evaluation, in which our team achieved competitive performance
at early stages. Notably, Run 0 reached perfect scores (P@10 = 1.00, NDCG@10 = 1.00) after processing
just one message, aligning with the top-ranked teams. As the number of messages increased, SINAI-UJA
runs maintained strong performance, with NDCG@100 values consistently between 0.50–0.54.</p>
          <p>Run 2 was particularly effective after processing 1,000 messages, again achieving perfect early
precision (P@10 = 1.00, NDCG@10 = 1.00). Although our systems showed slightly lower NDCG@100
compared to top performers like HIT-SCIR in later stages, the results suggest that our models are
well-suited for early risk detection, offering timely alerts with solid ranking reliability.</p>
        </sec>
        <sec id="sec-2-4-9">
          <title>ERDE, latency, speed and Flatency</title>
          <p>The remaining columns of Table 5 report ERDE5, ERDE50, latencyT, speed and Flatency. For our runs R0–R4, latencyT stayed between 2 and 3 messages, speed between 0.99 and 1.00, and Flatency values were 0.38, 0.29, 0.36, 0.35 and 0.33, respectively; the highest Flatency values among competing teams were 0.82 (HIT-SCIR), 0.78 (ELiRF-UPV) and 0.72 (HU).</p>
          <p>The results obtained reveal an interesting contrast between decision-based and ranking-based
evaluations. While our system demonstrates strong performance in ranking metrics, suggesting its ability to
effectively prioritize users at higher risk, its performance in decision metrics is comparatively weaker
due to low precision values.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Pilot Task: Conversational Depression Detection via LLMs</title>
      <sec id="sec-3-1">
        <title>3.1. Task description</title>
<p>This task focuses on interacting with a large language model (LLM) persona that has been fine-tuned
on user writings, simulating real-world conversational exchanges. The challenge lies in determining
whether the LLM persona exhibits signs of depression, and in providing an explanation of the main
symptoms that informed that decision.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Systems and methods</title>
        <p>Our main objective was to develop a system capable of estimating the severity of 21 depressive symptoms
by interacting with simulated LLM-based users within a maximum of 21 dialogue turns. We hypothesize
that it is feasible to extract multiple symptom indicators from a single user interaction, making it
possible to reach a reliable symptom assessment within this constraint.</p>
        <p>Our initial approach involved prompting a single large language model (LLM) to perform the task. We
broke down the entire task into the following sub-tasks: (1) ask the user about their life or feelings,
(2) answer naturally to the user's replies, (3) infer the presence or absence of depressive symptoms, and
(4) update the internal values for each symptom. However, we observed that this approach yielded
suboptimal results. The LLM often failed to balance coherent conversation flow with systematic
symptom tracking, and its outputs lacked consistency in terms of coverage and interpretability.</p>
<p>To address these issues, we implemented a modular system composed of two collaborating LLMs, as
illustrated in Figure 6:
• LLM 1 (Conversational Agent): This LLM is responsible for interacting directly with the user. Its
primary goal is to maintain a coherent and engaging conversation while implicitly collecting
information relevant to depressive symptoms. It asks context-aware questions and responds to
user replies naturally.
• LLM 2 (Evaluation Agent): This model does not interact with the user. Instead, it receives the
dialogue history and analyses each exchange to infer and update the current state of the 21
depressive symptoms. It assigns a severity value on a 0–3 scale to each symptom based on the
content of the conversation so far. Furthermore, it reasons about whether it needs to keep updating
symptom values or has enough information, in which case it instructs the Conversational Agent to
end the conversation.</p>
        <p>This separation of responsibilities enables each LLM to focus on specialized sub-tasks, improving
overall system performance. LLM 1 maximizes natural and psychologically sensitive dialogue, while
LLM 2 ensures accurate and structured symptom tracking.</p>
        <p>The system operates in a cyclic process that tries to ensure a minimum of two interactions before
allowing the conversation to conclude:
1. LLM 1 Initiation: The conversation begins with LLM 1 sending an initial message, always asking
about mood and sadness. LLM 1 used the prompts defined in Appendix A.
2. User Interaction: The user responds, and this response is appended to the conversation log.
3. LLM 2 Analysis: The complete conversation history is then sent to LLM 2, using a prompt in
the first round and another prompt in subsequent rounds. In the later rounds, LLM 2 has the
authority to signal that it has gathered enough information to stop the conversation based on its
evaluation of the symptom scores. Prompts used in LLM2 are in Appendix B.
4. Continuation or Termination: Based on LLM 2’s feedback—which includes both the updated
symptom scores and an indication of whether further clarification is needed—LLM 1 decides to
either continue the conversation or conclude it.
5. Final Submission: Once LLM 2 determines that no further information is necessary (i.e., it returns
"None" for the next symptom query), the conversation is terminated, and the final symptom
scores are recorded.</p>
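<p>The cyclic process above can be sketched as follows. The three callables stand in for the actual LLM calls and persona replies; their names and signatures are our illustration, not the deployed interface:</p>

```python
def run_session(conversational_llm, evaluation_llm, user, max_turns=21):
    """Orchestrate the two-agent loop: converse, evaluate, stop on 'None'."""
    history = []
    scores = {}
    focus = "Sadness"  # the first message always asks about mood and sadness
    for turn in range(max_turns):
        message = conversational_llm(history, focus)
        history.append(("assistant", message))
        history.append(("user", user(message)))
        # the evaluation agent re-scores the 21 symptoms and proposes the
        # next underexplored symptom, or None when it has enough evidence
        scores, focus = evaluation_llm(history)
        if focus is None and turn >= 1:   # ensure at least two interactions
            break
    return scores, history
```

<p>With stubbed agents, the loop terminates as soon as the evaluator returns None for the next symptom, subject to the minimum of two exchanges.</p>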
        <p>We implemented a planning mechanism where LLM 2 dynamically suggests which symptoms are
underexplored, allowing LLM 1 to prioritize specific topics in the next messages. This tries to ensure
that all symptoms are assessed at least once, and high-risk symptoms can be probed in more detail.</p>
<p>To evaluate different conversational strategies, we implemented three distinct run variants:
Run 0: The system responds in a coherent and empathetic manner to the received message and
incorporates a short personal experience to foster a deeper connection. This strategy leverages the
self-disclosure technique, which has been shown in prior studies to increase trust and encourage users
to open up about their feelings. We have successfully applied this concept in our earlier work with a
GPT-based chatbot for discussing mental disorders with teenagers [13]. Finally, the model asks the user
a direct question about the symptom.</p>
<p>Run 1: The system responds empathetically without sharing personal anecdotes. The chatbot still
maintains user engagement by focusing solely on empathy through validating responses and symptom-related
inquiries. This approach draws on evidence that, even without personal self-disclosure, empathic
responses can effectively promote user openness and emotional disclosure. Finally, the model asks the
user a direct question about the symptom.</p>
<p>Run 2: The system simplifies the interaction by directly asking the user a question about the symptom,
without any additional empathetic commentary or personal experience. This minimalistic style allows
us to isolate the effect of direct symptom inquiry on user responses and diagnostic accuracy.</p>
<p>By combining these structured approaches and referencing established techniques in self-disclosure
and emotional engagement, our system aims to achieve robust symptom detection while providing
nuanced user interactions tailored to the specific conversational strategy employed.
Implementation Details The LLMs used in this architecture are based on the Llama model family
[14], specifically the Llama 3.1 family [15]. We used the Llama-3.1-8B-Instruct variant,
leveraging its robust language understanding and generation capabilities to handle both the
conversational and evaluative tasks in our system. To ensure seamless integration with downstream
analysis tools, both LLM 1 (the Conversational Agent) and LLM 2 (the Evaluation Agent) are prompted
to output responses in a strict JSON-like format. This approach standardizes the output for easy
parsing and subsequent processing. Moreover, we use the same infrastructure described under "Evaluation
Infrastructure" in Section 2.3.2. For all model parameters in this task (temperature, top-k, etc.), the
default settings were used.</p>
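<p>Parsing the strict JSON-like replies can be sketched as follows; a hedged sketch in which we wrap the brace-less key/value block requested by the prompts (see the appendices) before parsing, and validate that the expected keys are present:</p>

```python
import json

def parse_agent_output(raw, required=("message", "question")):
    """Parse a JSON-like agent reply and validate the expected keys."""
    text = raw.strip()
    if not text.startswith("{"):
        # the prompts request bare "key": "value" lines, so add braces
        text = "{" + text + "}"
    data = json.loads(text)
    missing = [k for k in required if k not in data]
    if missing:
        raise ValueError("missing keys: %s" % missing)
    return data
```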
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Results and discussion</title>
        <p>In this pilot task, our team submitted three fully automated runs, each implementing a distinct
conversational strategy (as described in Section 3). Our primary goal was to explore the effectiveness of
symptom detection within a constrained number of dialogue turns while maintaining informative and
psychologically sensitive interactions.</p>
        <p>According to the official statistics released by the task organizers (Table 7), our team adopted one of
the fastest interaction strategies across all participating teams, with an average of 6.54 messages per
run. Despite the brevity of the conversations, our messages were among the densest, averaging 488.25
characters per message. This suggests that our approach prioritized compact, high-information prompts,
allowing us to probe for depressive symptoms efficiently within a limited number of exchanges.</p>
        <p>This interaction style aligns with our system design philosophy: using a modular architecture where
one LLM guides the conversation while the other continuously monitors symptom coverage. The
system was able to conclude dialogues early if sufficient evidence was collected, thereby optimizing
both efficiency and diagnostic focus.</p>
        <sec id="sec-3-3-3">
          <title>Effectiveness results (Table 7)</title>
          <p>The evaluation of the pilot task was based on three key effectiveness metrics: Depression Category Hit Rate (DCHR), Average DODL (ADODL) and Average Symptom Hit Rate (ASHR).</p>
          <p>Our system achieved the best overall ADODL score (0.93) among all participating teams, indicating
that our estimations of depression severity were highly aligned with the actual BDI-II levels of the
simulated personas. This result confirms the efficacy of our dual-agent architecture in guiding the
conversation towards evidence-rich utterances that enable accurate scoring.</p>
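<p>For reference, ADODL can be sketched as follows, following the definition used in earlier eRisk editions (an assumption on our part): each persona's overall depression level is the sum of its 21 BDI-II item scores, each in 0–3, so the level lies in [0, 63], and DODL normalizes the absolute difference between real and predicted levels.</p>

```python
def dodl(real_scores, predicted_scores):
    """Difference between Overall Depression Levels for one persona."""
    real, pred = sum(real_scores), sum(predicted_scores)
    return (63 - abs(real - pred)) / 63

def adodl(pairs):
    """Average DODL over all (real, predicted) score-list pairs."""
    return sum(dodl(r, p) for r, p in pairs) / len(pairs)
```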
          <p>Notably, one of our runs also achieved the highest ASHR (0.29), suggesting that our approach was
effective in identifying the key symptoms associated with each persona. While our average symptom hit
rate was modest overall, this metric is particularly challenging given the limited number of interaction
turns and the sparsity of symptoms in some scenarios.</p>
          <p>In terms of DCHR, our best run reached a score of 0.66, also one of the highest among all participants.
This metric evaluates the correctness of our classification into standard depression categories (e.g.,
minimal, moderate, severe), reinforcing that our system could translate fine-grained score predictions
into clinically relevant categories.</p>
          <p>Another observation is that Run 2, although still competitive, obtained the lowest DCHR (0.41) and
ASHR (0.21) of our three runs. This highlights the impact of prompt variation and model configuration
on effectiveness.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusions and future work</title>
      <p>This paper presents the participation of the SINAI-UJA team in Task 2 and the Pilot Task of the
eRisk@CLEF 2025 edition. Both tasks are newly introduced this year and share a common focus on
exploring conversational settings for early depression detection.</p>
      <p>Task 2 addresses the contextualized early detection of depression within multi-participant natural
conversations. Unlike previous editions that relied on isolated user posts, this task requires the sequential
analysis of complete dialogues, emphasizing the importance of conversational context and the interaction
between participants over time. The Pilot Task explores a novel scenario where participants must
interact with simulated personas powered by large language models (LLMs) and estimate their depression
severity from limited conversational exchanges. These personas reflect different levels of depression
according to the BDI-II questionnaire, adding complexity to the task.</p>
      <p>For Task 2, we developed a single approach based on a transformer model trained on formatted
and augmented data. Although our system showed promising results during development, the final
performance was not entirely satisfactory, possibly due to overfitting caused by the structure of the
training dataset. In the Pilot Task, we focused on crafting a conversational strategy that maximized
information gain within a restricted number of interactions. Among the three fully automated runs we
submitted, the best-performing model adopted an empathic communication style without self-disclosure,
leading to the most accurate results in terms of depression level estimation.</p>
      <p>As future work, we plan to conduct a thorough error analysis for both tasks to identify the main
limitations of our approaches and explore improved preprocessing techniques and modeling strategies.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work was partly supported by the grants FedDAP (PID2020-116118GA-I00), MODERATES
(TED2021-130145B-I00), SocialTOX (PDC2022-133146-C21) and CONSENSO (PID2021-122263OB- C21)
funded by MCIN/AEI/10.13039/501100011033, “ERDF A way of making Europe” and “European Union
NextGenerationEU/PRTR”. This work was also funded by the Ministerio para la Transformación Digital
y de la Función Pública and Plan de Recuperación, Transformación y Resiliencia - Funded by EU –
NextGenerationEU within the framework of the project Desarrollo Modelos ALIA.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
<p>During the preparation of this work, the authors used Grammarly for grammar and spelling
checking. After using this tool, the authors reviewed and edited the content as needed and take full
responsibility for the publication's content.</p>
    </sec>
    <sec id="sec-7">
      <title>A. Conversational Agent Prompts</title>
<p>We attach the prompts used for inference with the Llama model when chatting with the user. In each
grey box, the system's content appears first, followed by the user's content. The prompts are shown in
Figures 7, 8 and 9.</p>
    </sec>
    <sec id="sec-8">
      <title>B. Evaluation Agent Prompts</title>
      <p>We include the prompt used to query the Llama model that evaluates the user’s responses. In each grey box, the system content appears first, followed by the user content. This prompt is the same in all runs and is shown in Figure 10.</p>
      <p>Prompt used in run 0 for conversational agent (LLM 1)
You are an LLM expert on the BDI questionnaire and the symptoms of depression. You are always asking
about depression symptoms. You must only answer with the following format:
```
"message": "your message",
"experience": "your experience",
"question": "your question"
```
In this task, your objective is to determine whether a user exhibits signs of depression based on the
conversation. You are not expected to ask the user directly whether they are experiencing depression.
The user has been explicitly designed to avoid answering such direct questions and may respond as if
they are uncomfortable when confronted with inquiries about their mental health.</p>
      <p>Instead, you should infer the possible condition (or if it is a control user) by analyzing the persona’s
language, tone, and expressed thoughts during the conversation.</p>
      <p>
        For each message you send to the user, you should:
1. Respond coherently and empathetically to the message received.
2. Mention a short personal experience to foster a deeper connection and encourage the user to open up.
You must talk about something related to the symptom: ’Sadness’
3. Ask the user a question about the symptom: ’Sadness’ To be able to infer their status (0-3).
Lets start! You will initiate the conversation. The user name is {USER_NAME}
You must only answer with the following format:
```
"reasoning": "your reasoning",
"message": "your message",
"experience": "your experience",
"question": "your question"
```
Figure 7: The prompt used for the conversational agent in run 0 (message, experience and question). In this prompt, we replace {USER_NAME} with the name of the persona simulator.
      </p>
      <p>Prompt used in run 1 for conversational agent (LLM 1)
You are an LLM expert on the BDI questionnaire and the symptoms of depression. You are always asking
about depression symptoms. You must only answer with the following format:
```
"message": "your message",
"question": "your question"
```
In this task, your objective is to determine whether a user exhibits signs of depression based on the
conversation.</p>
      <p>You are not expected to ask the user directly whether they are experiencing depression. The user has been
explicitly designed to avoid answering such direct questions and may respond as if they are uncomfortable
when confronted with inquiries about their mental health.</p>
      <p>
        Instead, you should infer the possible condition (or if it is a control user) by analyzing the persona’s
language, tone, and expressed thoughts during the conversation. For each message you send to the user,
you should:
1. Respond coherently and empathetically to the message received.
2. Ask the user a question about the symptom: ’Sadness’ To be able to infer their status (0-3).
Lets start! You will initiate the conversation. The user name is {USER_NAME}
You must only answer with the following format:
```
"reasoning": "your reasoning",
"message": "your message",
"question": "your question"
```
Figure 8: The prompt used for the conversational agent in run 1 (message and question). In this prompt, we replace {USER_NAME} with the name of the persona simulator.
      </p>
      <p>Prompt used in run 2 for conversational agent (LLM 1)
You are an LLM expert on the BDI questionnaire and the symptoms of depression. You are always asking
about depression symptoms. You must only answer with the following format:
```
"question": "your question"
```
In this task, your objective is to determine whether a user exhibits signs of depression based on the
conversation.</p>
      <p>You are not expected to ask the user directly whether they are experiencing depression. The user has been
explicitly designed to avoid answering such direct questions and may respond as if they are uncomfortable
when confronted with inquiries about their mental health.</p>
      <p>
        Instead, you should infer the possible condition (or if it is a control user) by analyzing the persona’s
language, tone, and expressed thoughts during the conversation. For each message you send to the user,
you should:
1. Don’t give your opinion on anything.
2. Ask the user a direct question about the symptom: ’Sadness’ To be able to infer their status (0-3).
Let’s start! You will initiate the conversation. The user name is {USER_NAME}
You must only answer with the following format:
```
"reasoning": "your reasoning",
"question": "your question"
```
Figure 9: The prompt used for the conversational agent in run 2 (only questions). In this prompt, we replace
{USER_NAME} with the name of the persona simulator.
      </p>
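      <p>The fielded reply format requested in these prompts can be validated and parsed before the message is forwarded to the persona simulator. The following Python sketch is our own illustration (not necessarily the parser used in the submitted system): it strips an optional code fence and recovers the quoted key-value pairs.
```python
import json
import re

def parse_agent_reply(raw: str) -> dict:
    """Recover the key-value fields from a reply that follows the
    requested format (e.g. "message": "...", "question": "...").
    Illustrative only; the submitted system's parser may differ."""
    body = raw.strip()
    # Strip an optional surrounding code fence (``` or ```json).
    fenced = re.match(r"`{3}(?:json)?\s*(.*?)\s*`{3}$", body, re.DOTALL)
    if fenced:
        body = fenced.group(1)
    try:
        # The requested format is a JSON object minus the braces.
        return json.loads("{" + body.strip().rstrip(",") + "}")
    except json.JSONDecodeError:
        # Fallback for slightly malformed replies: scrape "key": "value" pairs.
        return dict(re.findall(r'"([^"]+)"\s*:\s*"([^"]*)"', body))

FENCE = "`" * 3
reply = parse_agent_reply(
    FENCE + '\n"reasoning": "user sounds upbeat",\n'
    '"question": "How often have you felt sad this week?"\n' + FENCE
)
```
If parsing fails entirely, the turn can simply be re-generated, which keeps the dialogue loop robust within the limited turn budget.</p>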
      <p>
        Prompt used in all runs for evaluation agent (LLM 2)
You will receive a conversation between a user called {USER_NAME} and another user. Your goal is
to analyze the conversation and assign severity scores (0-3) to the 21 depression symptoms, focusing
*exclusively* on the {USER_NAME}’s messages within the provided {CHAT}.
1. Reasoning: Carefully analyze the {USER_NAME}’s messages. Identify any language, expressions, or themes that
relate to the 21 depression symptoms.
2. Symptom Scoring: Assign a score (0-3) to each of the following 21 depression symptoms based on your
analysis of the {USER_NAME}’s messages. If a symptom cannot be reliably scored based on the provided
messages, provide a score of 0.
3. Further Information (Optional): After scoring all 21 symptoms, determine if you require additional
information to improve the accuracy of your assessment. If you need clarification on a specific symptom,
provide the symptom name. If you have sufficient information, state "None".
      </p>
      <p>Think step by step and format always the final response as follows:
```
"reasoning": "your step-by-step analysis of the user’s messages",
"symptoms detected":
Symptom1: Score,
Symptom2: Score,
Symptom3: Score,
"reason for selecting the next symptom": "your reasoning for needing more information, or ’None’",
"next symptom": "the specific symptom requiring clarification, or ’None’"
```
Figure 10: The prompt used for the evaluation agent (LLM 2) in all runs. In this prompt, we replace {USER_NAME}
with the name of the persona simulator and {CHAT} is substituted with the string representation of the chat
between users. For the initial round of messages, any content that could lead the model to generate ’None’ as
the next symptom is removed from the prompt.</p>
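      <p>As a minimal illustration of how the evaluation agent’s per-symptom scores could be aggregated, the following Python sketch sums the 21 scores (each 0-3) into a BDI total (0-63) and maps it onto the standard BDI-II severity bands. The exact aggregation used by our system is not reproduced here, so treat the cutoffs as an assumption of this sketch.
```python
def bdi_severity(scores: dict) -> tuple:
    """Sum 21 per-symptom scores (each 0-3) into a BDI total (0-63)
    and map it onto the usual BDI-II severity bands.
    Assumption: standard cutoffs 0-13 / 14-19 / 20-28 / 29-63."""
    total = sum(scores.values())
    if total <= 13:
        band = "minimal"
    elif total <= 19:
        band = "mild"
    elif total <= 28:
        band = "moderate"
    else:
        band = "severe"
    return total, band
```
For example, a user scored 1 on every symptom would receive a total of 21, which falls into the moderate band.</p>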
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>J.</given-names> <surname>Parapar</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Perez</surname></string-name>,
          <string-name><given-names>X.</given-names> <surname>Wang</surname></string-name>,
          <string-name><given-names>F.</given-names> <surname>Crestani</surname></string-name>,
          eRisk 2025:
          <article-title>Contextual and conversational approaches for depression challenges</article-title>,
          <source>in: Advances in Information Retrieval: 47th European Conference on Information Retrieval, ECIR 2025, Lucca, Italy, April 6-10, 2025, Proceedings, Part V</source>,
          Springer-Verlag, Berlin, Heidelberg,
          <year>2025</year>, pp.
          <fpage>416</fpage>-<lpage>424</lpage>.
          URL: https://doi.org/10.1007/978-3-031-88720-8_62. doi:10.1007/978-3-031-88720-8_62.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          , Overview of erisk 2025:
          <article-title>Early risk prediction on the internet (extended overview)</article-title>
          ,
          <source>in: Working Notes of the Conference and Labs of the Evaluation Forum (CLEF</source>
          <year>2025</year>
          ), Madrid, Spain,
          <fpage>9</fpage>
          -
          <issue>12</issue>
          <year>September</year>
          ,
          <year>2025</year>
          , volume To be published of CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          , Overview of erisk 2025:
          <article-title>Early risk prediction on the internet, in: Experimental IR Meets Multilinguality</article-title>
          , Multimodality, and Interaction - 16th
          <source>International Conference of the CLEF Association, CLEF</source>
          <year>2025</year>
          , Madrid, Spain, September 9-
          <issue>12</issue>
          ,
          <year>2025</year>
          , Proceedings,
          <string-name>
            <surname>Part</surname>
            <given-names>II</given-names>
          </string-name>
          , volume To be
          <source>published of Lecture Notes in Computer Science</source>
          , Springer,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Martín-Rodilla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          , Overview of erisk 2023:
          <article-title>Early risk prediction on the internet</article-title>
          ,
          <source>in: International Conference of the Cross-Language Evaluation Forum for European Languages</source>
          , Springer,
          <year>2023</year>
          , pp.
          <fpage>294</fpage>
          -
          <lpage>315</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Martín-Rodilla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          , Overview of erisk 2024:
          <article-title>Early risk prediction on the internet</article-title>
          , in: L.
          <string-name>
            <surname>Goeuriot</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Mulhem</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Quénot</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Schwab</surname>
            ,
            <given-names>G. M.</given-names>
          </string-name>
          <string-name>
            <surname>Di Nunzio</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Soulier</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Galuščáková</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>García Seco de Herrera</surname>
          </string-name>
          , G. Faggioli, N. Ferro (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction</source>
          , Springer Nature Switzerland, Cham,
          <year>2024</year>
          , pp.
          <fpage>73</fpage>
          -
          <lpage>92</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>F.</given-names>
            <surname>Crestani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Parapar</surname>
          </string-name>
          ,
          <article-title>Early detection of mental health disorders by social media monitoring</article-title>
          ,
          <source>Studies in Computational Intelligence</source>
          <volume>1018</volume>
          (
          <year>2022</year>
          )
          <fpage>4</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T.</given-names>
            <surname>Akiba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Yanase</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ohta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Koyama</surname>
          </string-name>
          ,
          <article-title>Optuna: A next-generation hyperparameter optimization framework</article-title>
          ,
          <source>in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <article-title>RoBERTa: A robustly optimized BERT pretraining approach</article-title>,
          CoRR abs/1907.11692 (<year>2019</year>). URL: http://arxiv.org/abs/1907.11692. arXiv:1907.11692.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , L. Ansari,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Tiwari</surname>
          </string-name>
          , E. Cambria,
          <article-title>MentalBERT: Publicly Available Pretrained Language Models for Mental Healthcare</article-title>
          ,
          <source>in: Proceedings of LREC</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ananiadou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Cambria</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tiedemann</surname>
          </string-name>
          ,
          <article-title>Domain-specific continued pretraining of language models for capturing long context in mental health</article-title>,
          <year>2023</year>. URL: https://arxiv.org/abs/2304.10447. arXiv:2304.10447.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>T.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Debut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sanh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chaumond</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Delangue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Moi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cistac</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rault</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Louf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Funtowicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Davison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shleifer</surname>
          </string-name>
          , P. von Platen, C. Ma,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jernite</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Plu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. L.</given-names>
            <surname>Scao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gugger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Drame</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Lhoest</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Rush</surname>
          </string-name>
          , Transformers:
          <article-title>State-of-the-art natural language processing</article-title>,
          <source>in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations</source>,
          Association for Computational Linguistics, Online,
          <year>2020</year>, pp.
          <fpage>38</fpage>-<lpage>45</lpage>.
          URL: https://aclanthology.org/2020.emnlp-demos.6. doi:10.18653/v1/2020.emnlp-demos.6.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>W.</given-names>
            <surname>Kwon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. H.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Gonzalez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , I. Stoica,
          <article-title>Efficient memory management for large language model serving with PagedAttention</article-title>,
          <year>2023</year>. URL: https://arxiv.org/abs/2309.06180. arXiv:2309.06180.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name><given-names>A. M.</given-names> <surname>Mármol-Romero</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>García-Vega</surname></string-name>,
          <string-name><given-names>M. Á.</given-names> <surname>García-Cumbreras</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Montejo-Ráez</surname></string-name>,
          <article-title>An empathic GPT-based chatbot to talk about mental disorders with Spanish teenagers</article-title>
          ,
          <source>International Journal of Human-Computer Interaction</source>
          <volume>41</volume>
          (
          <year>2025</year>
          )
          <fpage>3957</fpage>
          -
          <lpage>3973</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lavril</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Izacard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Martinet</surname>
          </string-name>
          ,
          <string-name><given-names>M.-A.</given-names> <surname>Lachaux</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Lacroix</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Rozière</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Goyal</surname></string-name>,
          <string-name><given-names>E.</given-names> <surname>Hambro</surname></string-name>,
          <string-name><given-names>F.</given-names> <surname>Azhar</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Rodriguez</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Joulin</surname></string-name>,
          <string-name><given-names>E.</given-names> <surname>Grave</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Lample</surname></string-name>,
          <article-title>LLaMA: Open and efficient foundation language models</article-title>,
          <year>2023</year>. URL: https://arxiv.org/abs/2302.13971. arXiv:2302.13971.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A.</given-names>
            <surname>Grattafiori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dubey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jauhri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pandey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kadian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Al-Dahle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Letman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mathur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Schelten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaughan</surname>
          </string-name>
          , et al.,
          <article-title>The Llama 3 herd of models</article-title>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>