TUA1 at eRisk 2022: Exploring Affective Memories
for Early Detection of Depression
Xin Kang¹, Rongyu Dou¹ and Haitao Yu²
¹ Tokushima University, 2-1, Minamijyousanjima, Tokushima, 770-8506, Japan
² University of Tsukuba, 1-2, Kasuga, Tsukuba, Ibaraki, 305-8550, Japan
Abstract
This paper describes the participation of the Tokushima University A1 (TUA1) group in the Early Detection of Depression task at the CLEF eRisk 2022 Lab. We propose a Time-Aware Affective Memories (TAM) network for early detection of user depression risk, based on the stream of a user's postings on the Internet. The TAM network regularly maintains an enriched memory of the affective state of each user with a Time-Aware LSTM (T-LSTM) model. The embedding of the affective memory and that of the new post are then integrated through a Transformer Decoder for predicting the user's depression risk. To encourage early detection of the depression risk, we propose a latency penalty on the risk predictions during training. The model raises a risk decision based on the binary classification result and estimates a risk-ranking score based on the difference between the positive and negative probabilities. Our experimental results show that the proposed affective memory is effective for Early Detection of Depression, and our system achieves two state-of-the-art results in the ranking-based evaluation.
Keywords
Time-aware affective memory, affective state, latency penalty, Early Detection of Depression
1. Introduction
Early risk prediction on the Internet (eRisk) has been a long-running Lab at CLEF [1, 2, 3, 4, 5, 6],
which aims at exploring early detection technologies to predict potential risks to Internet
users’ health and safety. This year, the Early Detection of Depression task at the CLEF eRisk
2022 Lab [6] focuses on predicting the depression risk in users based on their social media
postings. A user is depression-positive if an explicit mention of being diagnosed with depression
was made by the user [1, 2]. By observing the posts of a user from the very beginning, a detection
system needs to raise the risk decision as early as possible if the user is depression-positive and
to estimate a risk-ranking score indicating the level of depression.
Early studies of language usage in depression patients [7, 8, 9, 10, 11] suggest that depression and language usage are internally correlated, while recent psychological studies of depression [12, 13] indicate that depression is a complex emotional state that is highly associated with several negative emotions [14, 15], such as sadness and anxiety. These findings
CLEF 2022: Conference and Labs of the Evaluation Forum, September 5–8, 2022, Bologna, Italy
kang-xin@is.tokushima-u.ac.jp (X. Kang); c502047004@tokushima-u.ac.jp (R. Dou); yuhaitao@slis.tsukuba.ac.jp (H. Yu)
ORCID: 0000-0001-6024-3598 (X. Kang)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
have inspired recent studies to explore linguistic features [16, 17, 18, 19, 20, 21], emotions,
and sentiments [22, 23, 24] in user posts for detecting depression and several related mental
disorders, such as suicide ideation [25].
We extend these studies by exploring the history of user affective states, based on the connection between depression and the long-term negative affect reflected in one's posts, that is, the difficulty of removing negative feelings from one's working memory [15, 26]. We consider the affective state to be the embedding of user emotion in a post, which is retrieved by a pre-trained DistilBERT-Emotion model. A Time-Aware Affective Memories (TAM) network is proposed to maintain the memory of an Internet user's affective state, which is updated with the user's latest affective state and the time interval $\Delta\tau_t$ between the user's latest ($\tau_t$) and last ($\tau_{t-1}$) postings. This affective memory is fed together with the semantic information of the latest post to a Transformer Decoder, and TAM uses the decoded information to predict a user's depression risk.
To encourage early detection of the depression risk, we propose a latency penalty that
penalizes the latency of the first-positive predictions for the depression-positive users. Our
initial experiment suggests that the latency penalty is effective for reducing the Early Risk Detection
Error (ERDE) score for the Early Detection of Depression task.
The rest of this paper is organized as follows. Section 2 briefly reviews recent depression
detection studies. Section 3 describes the TAM network and the latency penalty for the Early
Detection of Depression task. Section 4 details our submissions to the task and presents the
results. Section 5 concludes our work.
2. Related Work
The Early Detection of Depression task was first proposed by Losada and Crestani [27], who built a test collection on depression and language use and proposed ERDE for systematically evaluating early detection algorithms in terms of the accuracy and latency of depression-positive predictions. The task received 30 and 45 contributions in CLEF 2017 [1] and CLEF 2018 [2],
respectively.
In Early Detection of Depression at CLEF 2017, Trotzek et al. [28] explored various linguistic and meta features, such as personal and possessive pronouns, past-tense verbs, the word "I", and text readability measures. These features were combined with n-gram features for training a logistic regression classifier. This work won the best precision, F1, and ERDE5 scores by
averaging the logistic probabilities obtained from the above classifier and from a paragraph-
embedding based logistic regression classifier. In Early Detection of Depression at CLEF 2018,
Trotzek et al. [29] achieved the best ERDE50 and F1 scores by setting a threshold on the logistic regression probabilities. Funez et al. [30] employed a Flexible Temporal Variation of Terms (FTVT) approach, which utilized sequential information about the variation of terms among different post chunks. This approach obtained the best ERDE5 in the same task. Paul et al. [31] employed an AdaBoost classifier with Bag of Words (BoW) features and obtained the best precision score.
Figure 1: Overview of the Time-Aware Affective Memories (TAM) network for Early Detection of De-
pression. Modules for affective processing and semantic processing are indicated in the light green and
light blue squares, respectively. An affective state is stored in the T-LSTM network for each user.
3. TAM Network for Early Detection of Depression
3.1. Time-Aware Affective Memories Network
To explore the history of user affective states for Early Detection of Depression, we propose a
Time-Aware Affective Memories (TAM) network as shown in Fig. 1. TAM is composed of an
affective processing module and a semantic processing module, which are indicated in the light
green and the light blue squares, respectively.
First, the affective processing module expects the latest post $x_t^{(i)}$ from user $i$ at step $t$ and the time interval $\Delta\tau_t^{(i)}$ between the user's latest and last postings as input. We concatenate the title and body of a post into $x_t^{(i)}$, with the user-sensitive and task-insensitive information replaced with special tokens¹. The time interval $\Delta\tau_t^{(i)}$ is given by

$\Delta\tau_t^{(i)} = \tau_t^{(i)} - \tau_{t-1}^{(i)},$   (1)

where $\tau_t^{(i)}$ and $\tau_{t-1}^{(i)}$ are the time logs of the user's latest and last postings.
Second, the user's emotion in post $x_t^{(i)}$ is mapped into an affective state $A_t^{(i)}$ by a pre-trained DistilBERT Emotion classification model $\mathrm{DistilBERT}_E$². The mapping is given by

$A_t^{(i)} = \phi\left(\mathrm{DistilBERT}_E(x_t^{(i)})\right),$   (2)

where the affective state $A_t^{(i)} \in \mathbb{R}^{d_{\mathrm{BERT}}}$ corresponds to a $\phi$-pooled activation of the pre-classification layer in $\mathrm{DistilBERT}_E$ with input $x_t^{(i)}$, $\phi(\cdot)$ corresponds to either a mean-pooling or a CLS-pooling along the first dimension of a tensor, and $d_{\mathrm{BERT}}$ is the DistilBERT model dimension. $\mathrm{DistilBERT}_E$ is pre-trained on an English Twitter Emotion dataset [32], which classifies user postings into joy, love, surprise, sadness, anger, and fear. The pre-trained DistilBERT is slightly inferior to BERT in emotion classification but is over two times faster in processing speed.

¹ User-sensitive email addresses and phone numbers are replaced with 〈EMAIL〉 and 〈PHONE〉, and task-insensitive numbers and currency symbols are replaced with 〈NUMBER〉 and 〈CUR〉, respectively, using clean-text.
² https://huggingface.co/bhadresh-savani/distilbert-base-uncased-emotion
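As a concrete illustration, the following is a minimal sketch, assuming the HuggingFace transformers and PyTorch libraries, of how a $\phi$-pooled affective state could be extracted from the public DistilBERT-Emotion checkpoint named above. Loading the checkpoint with AutoModel keeps only the encoder, so its last hidden states stand in for the pre-classification activation of Eq. 2; the pooling choice mirrors $\phi(\cdot)$.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# DistilBERT-Emotion checkpoint referenced in footnote 2; AutoModel drops the
# emotion classification head, so last_hidden_state approximates the
# pre-classification activation used for the affective state in Eq. (2).
MODEL_NAME = "bhadresh-savani/distilbert-base-uncased-emotion"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)

def affective_state(post: str, pooling: str = "mean") -> torch.Tensor:
    """Return the phi-pooled affective state A_t of one post (d_BERT = 768)."""
    inputs = tokenizer(post, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state      # (1, l_TEXT, d_BERT)
    if pooling == "mean":
        return hidden.mean(dim=1).squeeze(0)              # mean-pooling over tokens
    return hidden[:, 0, :].squeeze(0)                     # CLS-pooling

A_t = affective_state("I just feel empty and tired today.")
print(A_t.shape)  # torch.Size([768])
```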
Third, the affective states $A^{(i)}$ of user $i$ are remembered by a Time-Aware LSTM (T-LSTM) network [33]. T-LSTM takes the affective state $A_t^{(i)} \in \mathbb{R}^{d_{\mathrm{BERT}}}$ for the current post $x_t^{(i)}$ as the first input and discounts its internal affective memory $C \in \mathbb{R}^{d_{\mathrm{MEM}}}$ with the time interval $\Delta\tau_t^{(i)} \in \mathbb{R}_{>0}$ as the second input. In the following description we omit the user index $i$ for brevity. Given the internal memory $C_{t-1} \in \mathbb{R}^{d_{\mathrm{MEM}}}$ and the hidden state $h_{t-1} \in \mathbb{R}^{d_{\mathrm{MEM}}}$ at the last step $t-1$, as well as the inputs $A_t$ and $\Delta\tau_t$ at the latest step $t$, T-LSTM updates its internal memory and hidden state by

$C_{t-1}^{S} = \tanh(W_d C_{t-1} + b_d)$   (Short-term memory)
$\hat{C}_{t-1}^{S} = C_{t-1}^{S} \cdot g_*(\Delta\tau_t)$   (Discounted short-term memory)
$C_{t-1}^{L} = C_{t-1} - C_{t-1}^{S}$   (Long-term memory)
$C_{t-1}^{A} = C_{t-1}^{L} + \hat{C}_{t-1}^{S}$   (Adjusted previous memory)
$f_t = \sigma(W_f A_t + U_f h_{t-1} + b_f)$   (Forget gate)
$i_t = \sigma(W_i A_t + U_i h_{t-1} + b_i)$   (Input gate)
$o_t = \sigma(W_o A_t + U_o h_{t-1} + b_o)$   (Output gate)
$\tilde{C}_t = \tanh(W_c A_t + U_c h_{t-1} + b_c)$   (Candidate current memory)
$C_t = f_t \odot C_{t-1}^{A} + i_t \odot \tilde{C}_t$   (Current memory)
$h_t = o_t \odot \tanh(C_t),$   (Current hidden state)
where $W_d \in \mathbb{R}^{d_{\mathrm{MEM}} \times d_{\mathrm{MEM}}}$ and $b_d \in \mathbb{R}^{d_{\mathrm{MEM}}}$ are parameters for decomposing the memory, and $W_f, W_i, W_o, W_c \in \mathbb{R}^{d_{\mathrm{BERT}} \times d_{\mathrm{MEM}}}$, $U_* \in \mathbb{R}^{d_{\mathrm{MEM}} \times d_{\mathrm{MEM}}}$, and $b_f, b_i, b_o, b_c \in \mathbb{R}^{d_{\mathrm{MEM}}}$ are parameters for calculating the forget, input, and output gates and the candidate current memory, respectively. $g_*$ is a set of discount functions that monotonically decrease with the time interval $\Delta\tau_t$. We employ two discount functions, $g_{\mathrm{slog}}$ and $g_{\mathrm{flex}}$, for the detection of depression task. $g_{\mathrm{slog}}$ is the reciprocal of the logarithm of the interval seconds

$g_{\mathrm{slog}}(\Delta\tau) = 1 / \log(\Delta\tau + \epsilon),$   (3)

with a hyper-parameter $\epsilon$ of 1.0, and $g_{\mathrm{flex}}$ is a flexible power function of the interval seconds inspired by [34]

$g_{\mathrm{flex}}(\Delta\tau) = \frac{q_1}{a\,\Delta\tau} + \frac{q_2}{1 + (\Delta\tau/b)^{c}},$   (4)

with trainable parameters $q_1, q_2, a, b, c \in \mathbb{R}$.
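A minimal PyTorch sketch of a single T-LSTM update step following the equations above is given below, using the $g_{\mathrm{slog}}$ discount of Eq. 3; packing the four gates into one linear layer and the chosen dimensions are implementation assumptions, not the exact TAM code.

```python
import torch
import torch.nn as nn

class TLSTMCell(nn.Module):
    """One Time-Aware LSTM step as described above, with the g_slog discount (Eq. 3)."""

    def __init__(self, d_bert: int, d_mem: int):
        super().__init__()
        self.decomp = nn.Linear(d_mem, d_mem)                 # W_d, b_d
        self.gates = nn.Linear(d_bert + d_mem, 4 * d_mem)     # W_*, U_*, b_* packed together

    @staticmethod
    def g_slog(delta_tau: torch.Tensor, eps: float = 1.0) -> torch.Tensor:
        return 1.0 / torch.log(delta_tau + eps)               # Eq. (3)

    def forward(self, a_t, delta_tau, h_prev, c_prev):
        c_short = torch.tanh(self.decomp(c_prev))             # short-term memory
        c_short_hat = c_short * self.g_slog(delta_tau)        # discounted short-term memory
        c_long = c_prev - c_short                             # long-term memory
        c_adj = c_long + c_short_hat                          # adjusted previous memory

        f, i, o, g = self.gates(torch.cat([a_t, h_prev], dim=-1)).chunk(4, dim=-1)
        f, i, o = torch.sigmoid(f), torch.sigmoid(i), torch.sigmoid(o)
        c_t = f * c_adj + i * torch.tanh(g)                   # current memory
        h_t = o * torch.tanh(c_t)                             # current hidden state
        return h_t, c_t

cell = TLSTMCell(d_bert=768, d_mem=256)
h, c = torch.zeros(1, 256), torch.zeros(1, 256)
a_t = torch.randn(1, 768)                                     # affective state A_t
h, c = cell(a_t, delta_tau=torch.tensor([[3600.0]]), h_prev=h, c_prev=c)
```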
Last, a linear layer is employed to map the T-LSTM hidden state $h_t^{(i)} \in \mathbb{R}^{d_{\mathrm{MEM}}}$ to an affective memory $M_t^{(i)} \in \mathbb{R}^{d_{\mathrm{BERT}}}$ by

$M_t^{(i)} = W_M h_t^{(i)} + b_M,$   (5)

with parameters $W_M \in \mathbb{R}^{d_{\mathrm{MEM}} \times d_{\mathrm{BERT}}}$ and $b_M \in \mathbb{R}^{d_{\mathrm{BERT}}}$. To enrich the memorization of a user's affective states for TAM, we concatenate the most recent $l_{\mathrm{MEM}}$ affective memories by

$\hat{M}_t^{(i)} = \mathrm{Concat}(M_{t-l_{\mathrm{MEM}}+1}^{(i)}, \ldots, M_t^{(i)}),$   (6)

where $\hat{M}_t^{(i)} \in \mathbb{R}^{l_{\mathrm{MEM}} \times d_{\mathrm{BERT}}}$ is the enriched affective memory.
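A brief sketch of Eqs. 5 and 6 follows, keeping the most recent $l_{\mathrm{MEM}}$ affective memories of a user in a simple deque; the dimensions and the use of a deque are illustrative assumptions.

```python
from collections import deque

import torch
import torch.nn as nn

d_mem, d_bert, l_mem = 256, 768, 30            # l_MEM = 30 as used by TUA1#0 and TUA1#2

to_memory = nn.Linear(d_mem, d_bert)           # W_M, b_M in Eq. (5)
buffer = deque(maxlen=l_mem)                   # per-user buffer of recent memories

def update_memory(h_t: torch.Tensor) -> torch.Tensor:
    """Map the T-LSTM hidden state to M_t and return the enriched memory (Eq. 6)."""
    m_t = to_memory(h_t)                       # (1, d_BERT)
    buffer.append(m_t)
    return torch.cat(list(buffer), dim=0)      # (min(t, l_MEM), d_BERT)

m_hat = update_memory(torch.randn(1, d_mem))
print(m_hat.shape)  # torch.Size([1, 768]) after the first post
```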
The semantic processing module takes the latest post $x_t^{(i)}$ from user $i$ at time step $t$ as input, similarly to the affective processing module, and encodes it into a semantic embedding $S_t^{(i)}$ with a pre-trained DistilBERT model³ by

$S_t^{(i)} = \mathrm{DistilBERT}(x_t^{(i)}),$   (7)

where $S_t^{(i)} \in \mathbb{R}^{l_{\mathrm{TEXT}} \times d_{\mathrm{BERT}}}$ corresponds to the activation of the pre-classification layer in DistilBERT, $l_{\mathrm{TEXT}}$ corresponds to the length of $x_t^{(i)}$, and $d_{\mathrm{BERT}}$ is the DistilBERT model dimension.

³ https://huggingface.co/distilbert-base-uncased
To integrate the enriched affective memory $\hat{M}_t^{(i)}$ and the semantic embedding $S_t^{(i)}$ in TAM, we employ a Transformer Decoder network as shown in Fig. 1. We denote $\hat{M}_t^{(i)}$ and $S_t^{(i)}$ as $A$ and $B$ below to illustrate the integration mechanism. A Transformer Decoder is a multi-head cross-attention architecture, each head of which makes queries for elements from an input sequence $A$ and retrieves new values from a reference input sequence $B$, based on the element-wise similarity between $A$ and $B$. Specifically, the cross-attention $\mathrm{MultiHead}(A, B)$ is the concatenation of $\mathrm{head}_1, \ldots, \mathrm{head}_H$ with

$\mathrm{MultiHead}(A, B) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_H)\, W^O,$   (8)
$\mathrm{head}_h = \mathrm{Attention}(A W_h^Q, B W_h^K, B W_h^V),$   (9)
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\intercal}}{\sqrt{d_2}}\right) V,$   (10)

where $A \in \mathbb{R}^{n \times d_1}$ and $B \in \mathbb{R}^{m \times d_1}$ are sequences of $n$ and $m$ embeddings and $d_1$ is the embedding dimension. To empower the attention mechanism, $A$ and $B$ are first mapped from the $d_1$-dimensional space to the query $Q_h \in \mathbb{R}^{n \times d_2}$, key $K_h \in \mathbb{R}^{m \times d_2}$, and value $V_h \in \mathbb{R}^{m \times d_2}$ in a larger $d_2$-dimensional space through linear projections with parameters $W_h^Q, W_h^K, W_h^V \in \mathbb{R}^{d_1 \times d_2}$, where $h$ is the index of attention heads. Each $\mathrm{head}_h \in \mathbb{R}^{n \times d_2}$ is then calculated by the Attention function with the corresponding query, key, and value as input. Last, the concatenated attention heads are mapped from $H d_2$ back to $d_1$ dimensions with the projection parameter $W^O \in \mathbb{R}^{H d_2 \times d_1}$.
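The cross-attention of Eqs. 8–10 can be written directly from the formulas; the sketch below is a bare-bones version with illustrative sizes for $d_1$, $d_2$, and the number of heads, and it omits the feed-forward and normalization sublayers of a full Transformer Decoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttention(nn.Module):
    """Multi-head cross-attention of Eqs. (8)-(10): queries from A, keys/values from B."""

    def __init__(self, d1: int, d2: int, heads: int):
        super().__init__()
        self.d2 = d2
        self.w_q = nn.ModuleList(nn.Linear(d1, d2, bias=False) for _ in range(heads))
        self.w_k = nn.ModuleList(nn.Linear(d1, d2, bias=False) for _ in range(heads))
        self.w_v = nn.ModuleList(nn.Linear(d1, d2, bias=False) for _ in range(heads))
        self.w_o = nn.Linear(heads * d2, d1, bias=False)        # W^O

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        heads = []
        for wq, wk, wv in zip(self.w_q, self.w_k, self.w_v):
            q, k, v = wq(a), wk(b), wv(b)                       # (n,d2), (m,d2), (m,d2)
            scores = q @ k.transpose(-2, -1) / self.d2 ** 0.5   # Eq. (10)
            heads.append(F.softmax(scores, dim=-1) @ v)         # head_h: (n, d2)
        return self.w_o(torch.cat(heads, dim=-1))               # (n, d1), Eq. (8)

attn = CrossAttention(d1=768, d2=64, heads=8)
A = torch.randn(30, 768)    # e.g. enriched affective memory M_hat_t
B = torch.randn(120, 768)   # e.g. semantic embedding S_t
print(attn(A, B).shape)     # torch.Size([30, 768])
```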
We propose to integrate the affective memory and the semantic embedding with either one Transformer Decoder by

$H_t^{(i)} = \mathrm{mean}\!\left(\mathrm{MultiHead}(\hat{M}_t^{(i)}, S_t^{(i)})\right),$   (11)

or two Transformer Decoders by

$H_t^{(i)} = \mathrm{mean}\!\left(\mathrm{MultiHead}\!\left(\mathrm{mean}(\hat{M}_t^{(i)}), S_t^{(i)}\right)\right) + \mathrm{mean}\!\left(\mathrm{MultiHead}\!\left(\phi(S_t^{(i)}), \hat{M}_t^{(i)}\right)\right),$   (12)

where $\mathrm{mean}(\cdot)$ indicates a mean-pooling along the first dimension of a tensor while $\phi(\cdot)$ corresponds to either a mean-pooling or a CLS-pooling. Both decoding strategies render an integration $H_t^{(i)} \in \mathbb{R}^{d_{\mathrm{BERT}}}$.
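Below is a minimal sketch of the two integration strategies in Eqs. 11 and 12, using torch.nn.MultiheadAttention as a stand-in for the Transformer Decoder cross-attention (the actual decoder also contains feed-forward and normalization sublayers); $\phi$ is taken as mean-pooling here.

```python
import torch
import torch.nn as nn

d_bert, n_heads = 768, 8

dec_one = nn.MultiheadAttention(d_bert, n_heads, batch_first=True)
dec_two_a = nn.MultiheadAttention(d_bert, n_heads, batch_first=True)
dec_two_b = nn.MultiheadAttention(d_bert, n_heads, batch_first=True)

def integrate_one_decoder(m_hat: torch.Tensor, s: torch.Tensor) -> torch.Tensor:
    """Eq. (11): queries from the enriched memory, keys/values from the semantic embedding."""
    out, _ = dec_one(query=m_hat, key=s, value=s)   # (1, l_MEM, d_BERT)
    return out.mean(dim=1).squeeze(0)               # mean-pool to (d_BERT,)

def integrate_two_decoders(m_hat: torch.Tensor, s: torch.Tensor) -> torch.Tensor:
    """Eq. (12): sum of two mean-pooled cross-attentions (phi = mean-pooling)."""
    first, _ = dec_two_a(query=m_hat.mean(dim=1, keepdim=True), key=s, value=s)
    second, _ = dec_two_b(query=s.mean(dim=1, keepdim=True), key=m_hat, value=m_hat)
    return (first.mean(dim=1) + second.mean(dim=1)).squeeze(0)

m_hat_t = torch.randn(1, 30, d_bert)   # enriched affective memory (batch of one user)
s_t = torch.randn(1, 120, d_bert)      # semantic embedding of the latest post
print(integrate_one_decoder(m_hat_t, s_t).shape)    # torch.Size([768])
print(integrate_two_decoders(m_hat_t, s_t).shape)   # torch.Size([768])
```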
The depression probability $p_t^{(i)}$ and its logit $\gamma_t^{(i)}$ are predicted by a Risk Classification network, based on $H_t^{(i)}$ and a $\phi$-pooled semantic embedding $\phi(S_t^{(i)})$. Specifically, the concatenation of $H_t^{(i)}$ and $\phi(S_t^{(i)})$ is passed through a linear layer with layer normalization and ReLU activation, a dropout layer, and a final classification layer of the Risk Classification network. The outputs are a score $\hat{s}_t^{(i)}$ that indicates the level of depression

$\hat{s}_t^{(i)} = p_t^{(i)} - (1 - p_t^{(i)}) = 2 p_t^{(i)} - 1,$   (13)

and a risk decision $\hat{y}_t^{(i)}$

$\hat{y}_t^{(i)} = \mathbb{1}\{\gamma_t^{(i)} > 0\},$   (14)

where $\mathbb{1}\{\cdot\}$ is an indicator function.
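A sketch of the Risk Classification network as described in the text is given below; the hidden width and dropout rate are assumed values.

```python
import torch
import torch.nn as nn

class RiskClassifier(nn.Module):
    """Linear + LayerNorm + ReLU, dropout, then a final classification layer."""

    def __init__(self, d_bert: int = 768, d_hidden: int = 256, p_drop: float = 0.1):
        super().__init__()
        self.proj = nn.Linear(2 * d_bert, d_hidden)
        self.norm = nn.LayerNorm(d_hidden)
        self.drop = nn.Dropout(p_drop)
        self.out = nn.Linear(d_hidden, 1)

    def forward(self, h_t: torch.Tensor, pooled_s: torch.Tensor):
        z = torch.cat([h_t, pooled_s], dim=-1)         # concatenation of H_t and phi(S_t)
        z = self.drop(torch.relu(self.norm(self.proj(z))))
        gamma = self.out(z).squeeze(-1)                # logit gamma_t
        p = torch.sigmoid(gamma)                       # depression probability p_t
        score = 2.0 * p - 1.0                          # Eq. (13)
        decision = (gamma > 0).long()                  # Eq. (14)
        return gamma, score, decision

head = RiskClassifier()
gamma, score, decision = head(torch.randn(1, 768), torch.randn(1, 768))
```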
Besides the stepwise risk classification, we employ a score accumulation technique [35] that accumulates the historical risk scores into the current score by

$\tilde{s}_t^{(i)} = \sum_{t'=1}^{t} \hat{s}_{t'}^{(i)},$   (15)

and predicts the risk decision by

$\tilde{y}_t^{(i)} = \mathbb{1}\left\{\tilde{s}_t^{(i)} > \mathrm{median}\left(\tilde{s}_{[1:t]}^{(i)}\right) + \gamma\,\mathrm{MAD}\left(\tilde{s}_{[1:t]}^{(i)}\right)\right\},$   (16)

where $\tilde{s}_{[1:t]}^{(i)}$ is the list of accumulated scores for user $i$ up to time step $t$ and $\mathrm{median}(\cdot)$ renders the median value of a list. The MAD function is given by

$\mathrm{MAD}\left(\tilde{s}_{[1:t]}^{(i)}\right) = \mathrm{median}\left(\left|\tilde{s}_{[1:t]}^{(i)} - \mathrm{median}\left(\tilde{s}_{[1:t]}^{(i)}\right)\right|\right),$   (17)

which evaluates the Median Absolute Deviation of the accumulated scores $\tilde{s}_{[1:t]}^{(i)}$.
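The accumulated-score decision of Eqs. 15–17 amounts to comparing the running sum of step scores against the median plus a $\gamma$-scaled MAD of all running sums seen so far; a short NumPy sketch follows, with the coefficient value as an assumption.

```python
import numpy as np

def accumulated_decision(step_scores, gamma_coef: float = 1.0) -> int:
    """Decision rule of Eqs. (15)-(17); gamma_coef is an assumed value for gamma."""
    acc = np.cumsum(step_scores)                   # s_tilde over steps 1..t, Eq. (15)
    med = np.median(acc)
    mad = np.median(np.abs(acc - med))             # Eq. (17)
    return int(acc[-1] > med + gamma_coef * mad)   # Eq. (16)

print(accumulated_decision([-0.2, 0.1, 0.4, 0.6]))  # -> 1 once positive scores accumulate
```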
3.2. Latency Penalty
We propose a latency penalty $\psi$ that penalizes TAM for the latency of the first-positive predictions for the depression-positive users. The latency penalty for user $i$ at time step $t$ is given by

$\psi\left(y^{(i)}, \gamma_t^{(i)}, \gamma_{\max(t)}^{(i)}, t; \alpha, o\right) = \sigma(\gamma_t^{(i)}) \cdot y^{(i)} \cdot lc\left(t \cdot \tanh\left(\alpha \cdot \mathrm{ReLU}\left(\gamma_t^{(i)}\right) \cdot \mathrm{ReLU}\left(-\gamma_{\max(t)}^{(i)}\right)\right); o\right),$   (18)

where $y^{(i)} \in \{0, 1\}$ is the ground-truth label, $\gamma_t^{(i)} \in \mathbb{R}$ is the current predicted logit, $\gamma_{\max(t)}^{(i)} = \max_{t'=1}^{t-1} \gamma_{t'}^{(i)}$ is the maximum logit up to step $t-1$, and $t \in \mathbb{Z}$ indicates the current time step. $\alpha$ and $o$ are two hyper-parameters, which control the latency sensitivity and the time step at which the latency cost grows most quickly, as described below. $\sigma$ is the sigmoid function. The latency cost function $lc$ was first proposed in the ERDE metric [27] and is given by

$lc(t; o) = 1 - \frac{1}{1 + e^{\,t - o}},$   (19)

with input $t$ denoting the latency step of a true-positive prediction. The latency cost $lc \in (0, 1)$ monotonically grows with the latency step $t$ and grows most quickly at step $o$, where the latency cost is 0.5. In practice, $o$ is usually set to 5 or 50; the latter is employed for training the proposed TAM network.
In Eq. 18, we obtain the latency of the first-positive prediction for user $i$ through a series of neural activation functions applied to the sequence of logit predictions $\gamma_{[1:t]}^{(i)}$. Specifically, $\mathrm{ReLU}(\gamma_t^{(i)})$ renders a positive value $\gamma_t^{(i)}$ if the logit for the latest ($t$) posting is positive, and renders 0 otherwise. Similarly, $\mathrm{ReLU}(-\gamma_{\max(t)}^{(i)})$ renders a positive value $-\gamma_{\max(t)}^{(i)}$ if all logits up to the last ($t-1$) posting are negative, and renders 0 otherwise. We scale their product with the latency sensitivity $\alpha = 10{,}000$ and feed the result to $\tanh(\cdot)$. The output turns out to be an indicator that takes a value close to 1 if the model renders a positive prediction for the latest posting of user $i$ and all-negative predictions before that, and takes the value 0 otherwise. By multiplying the latest time step $t$ with this indicator, we obtain the step of the first-positive prediction, that is, the latency, and feed it to the latency cost function in Eq. 19. The latency penalty $\psi$ is finally given by the product of the depression probability $\sigma(\gamma_t^{(i)})$, the ground-truth label $y^{(i)}$, and the latency cost $lc$.
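A direct PyTorch transcription of Eqs. 18 and 19 is sketched below for a single user and time step; the example values are illustrative.

```python
import torch
import torch.nn.functional as F

def latency_cost(t: torch.Tensor, o: float = 50.0) -> torch.Tensor:
    """lc(t; o) of Eq. (19)."""
    return 1.0 - 1.0 / (1.0 + torch.exp(t - o))

def latency_penalty(y, gamma_t, gamma_max, t, alpha: float = 10_000.0, o: float = 50.0):
    """psi of Eq. (18); gamma_max is the maximum logit over steps 1 .. t-1."""
    # ~1 iff the current logit is positive and all previous logits were negative
    first_positive = torch.tanh(alpha * F.relu(gamma_t) * F.relu(-gamma_max))
    return torch.sigmoid(gamma_t) * y * latency_cost(t * first_positive, o)

psi = latency_penalty(
    y=torch.tensor(1.0),           # depression-positive user
    gamma_t=torch.tensor(0.8),     # first positive logit
    gamma_max=torch.tensor(-1.2),  # all previous logits were negative
    t=torch.tensor(60.0),          # the first positive decision arrives late (t > o)
)
print(float(psi))  # roughly sigmoid(0.8) ~ 0.69, i.e. a strong penalty
```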
We add the latency penalty in Eq. 18 to a cross-entropy loss to produce the final training target for Early Detection of Depression:

$\ell(y, \gamma; \alpha, o) = \sum_{t=1}^{T} \sum_{i=1}^{N} \left[ -\left( y^{(i)} \log \sigma(\gamma_t^{(i)}) + (1 - y^{(i)}) \log\left(1 - \sigma(\gamma_t^{(i)})\right) \right) + \psi\left(y^{(i)}, \gamma_t^{(i)}, \gamma_{\max(t)}^{(i)}, t; \alpha, o\right) \right],$   (20)

where $N$ and $T$ are the number of users and the number of time steps in the training data, respectively.
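Assuming the latency_penalty helper sketched above is in scope, the training target of Eq. 20 for one time step and a batch of users could look as follows.

```python
import torch
import torch.nn.functional as F

def training_loss(y, gamma_t, gamma_max, t, alpha: float = 10_000.0, o: float = 50.0):
    """Cross-entropy plus latency penalty, summed over users for one time step (Eq. 20).
    Relies on the latency_penalty helper from the previous sketch."""
    bce = F.binary_cross_entropy_with_logits(gamma_t, y, reduction="sum")
    psi = latency_penalty(y, gamma_t, gamma_max, t, alpha, o).sum()
    return bce + psi

y = torch.tensor([1.0, 0.0, 1.0])             # ground-truth labels of three users
gamma = torch.tensor([0.5, -2.0, -0.3])       # current logits
gamma_max = torch.tensor([-0.8, -1.0, 0.4])   # maximum logits up to step t-1
print(float(training_loss(y, gamma, gamma_max, t=torch.tensor(12.0))))
```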
4. Experiment
The training data of Early Detection of Depression at the CLEF eRisk 2022 Lab [6] consists of
the training and test data of CLEF eRisk 2017 Lab and the test data of CLEF eRisk 2018 Lab. The
details can be found in Table 1.
The test data of Early Detection of Depression at the CLEF eRisk 2022 Lab [6] consists of 1,400 users.
Table 1
Number of positive and negative users in the training data of Early Detection of Depression at the CLEF eRisk 2022 Lab.
Year   Positive   Negative
2017   135        752
2018   79         741
Total  214        1,493
Table 2
Distinctive configurations of the submitted models.
Configuration TUA1#0 TUA1#1 TUA1#2 TUA1#3 TUA1#4
Balance Strategy Balance All Balance All Balance
φ_DistilBERT Mean CLS Mean CLS N/A
φ_DistilBERT-Emo CLS Mean CLS Mean N/A
Max Memory Len 30 1 30 1 Full
Discount Function g_slog g_flex g_slog g_flex N/A
Decoder Num 1 2 1 2 N/A
Score Accumulation False False True True True
The posts of these users are accessible in an interactive manner during the test phase; that is, the server replies with one post per user at step $t$ only after receiving the depression predictions for all users at step $t-1$. Posts at step 0 from all users are accessible at the very beginning.
We submit five groups of risk decisions and risk scores for 2,000 steps in this interactive
manner, which takes around 16.5 hours. Among all participants in Early Detection of Depression, our system turns out to be the most efficient.
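To make the interactive protocol concrete, the following is a schematic sketch of a client loop; get_current_posts, submit_run, and predict are hypothetical stubs standing in for the eRisk server API and the TAM forward pass, since the actual endpoints, tokens, and payload formats are defined by the lab.

```python
from typing import Dict, List, Tuple

def get_current_posts() -> Dict[str, str]:
    """HYPOTHETICAL stub: one new post per user, released by the eRisk server for this step."""
    return {"subject001": "example post text"}

def submit_run(step: int, decisions: List[Tuple[str, int]], scores: List[Tuple[str, float]]) -> None:
    """HYPOTHETICAL stub: send decisions and ranking scores back to the server."""
    print(f"step {step}: {len(decisions)} decisions submitted")

def predict(user_id: str, post: str) -> Tuple[int, float]:
    """HYPOTHETICAL stand-in for the TAM forward pass (risk decision, ranking score)."""
    return 0, 0.0

def run_interactive_phase(n_steps: int = 2000) -> None:
    for step in range(n_steps):
        posts = get_current_posts()                 # server releases one post per user
        decisions, scores = [], []
        for user_id, post in posts.items():
            decision, score = predict(user_id, post)
            if step < 2:                            # halt positive decisions at the first two steps
                decision = 0
            decisions.append((user_id, decision))
            scores.append((user_id, score))
        submit_run(step, decisions, scores)         # the next posts arrive only after this submission
```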
The distinctive configurations of the submitted models are shown in Table 2. Specifically, Balance Strategy indicates the way of selecting positive and negative users from the training data: Balance indicates that as many users as there are positive users are randomly selected from the negative set, while All indicates that all users are utilized. $\phi_{\mathrm{DistilBERT}}$ and $\phi_{\mathrm{DistilBERT-Emo}}$ indicate whether a Mean-pooling or a CLS-pooling is used for $\phi(\cdot)$ in Eq. 12 and Eq. 2, respectively. Max Memory Len corresponds to $l_{\mathrm{MEM}}$, the length of the enriched affective memory $\hat{M}$. Discount Function indicates the utilization of either $g_{\mathrm{slog}}$ or $g_{\mathrm{flex}}$ for discounting the short-term memory $\hat{C}^S$. Decoder Num specifies the number of Transformer Decoders in the TAM network for integrating the affective memory $\hat{M}$ and the semantic embedding $S$. Score Accumulation indicates whether the depression scores and risk decisions are predicted by accumulating the historical risk scores. TUA1#0 to TUA1#3 correspond to TAM-based models with distinctive configurations, while TUA1#4 is an SS3-based model [35]. Configurations which are not applicable to a model are denoted as N/A. To avoid making reckless risk decisions, we halt the positive predictions by producing all-zero decisions in the first two time steps for all models.
Table 3 shows the decision-based evaluation results. First, we find that Score Accumulation in the TAM-based models obtains similar decision-based evaluation scores.
Table 3
Decision-based evaluation for the Early Detection of Depression task. Results obtained by our models
and the best performing models on each metric are included.
Model P R F1 ERDE5 ERDE50 latencyTP speed Flatency
TUA1#0 0.155 0.806 0.260 0.055 0.037 3.0 0.922 0.258
TUA1#1 0.129 0.816 0.223 0.053 0.041 3.0 0.992 0.221
TUA1#2 0.155 0.806 0.260 0.055 0.037 3.0 0.992 0.258
TUA1#3 0.129 0.816 0.223 0.053 0.041 3.0 0.992 0.221
TUA1#4 0.159 0.959 0.272 0.052 0.036 3.0 0.992 0.270
CYUT#2 0.106 0.867 0.189 0.056 0.047 1.0 1.000 0.189
LauSAn#0 0.137 0.827 0.235 0.041 0.038 1.0 1.000 0.235
LauSAn#4 0.201 0.724 0.315 0.039 0.033 1.0 1.000 0.315
BLUE#2 0.106 1.000 0.192 0.074 0.048 4.0 0.988 0.190
NLPGroup-IISERB#0 0.682 0.745 0.712 0.055 0.032 9.0 0.969 0.690
Sunday-Rocker2#0 0.091 1.000 0.167 0.080 0.053 4.0 0.988 0.165
Sunday-Rocker2#4 0.108 1.000 0.195 0.082 0.047 6.0 0.981 0.191
SCIR2#3 0.316 0.847 0.460 0.079 0.026 44.0 0.834 0.383
E8-IJS#0 0.684 0.133 0.222 0.061 0.061 1.0 1.000 0.144
This is possibly because the TAM network already maintains a long-term memory of the affective states through T-LSTM as well as an enriched affective memory. Next, TUA1#0 and TUA1#2 achieve better Precision, F1, ERDE50, and Flatency scores than TUA1#1 and TUA1#3, which indicates that a long affective memory and balanced training data could be helpful for improving the decision predictions in TAM. Our results also suggest the importance of exploring language usage patterns for predicting the depression decisions. Last, it is reasonable to speculate that halting positive predictions for the first two time steps could be an important factor that reduces the latency-sensitive metric scores, such as ERDE5, ERDE50, latencyTP, and Flatency, in our results.
Table 4 shows the ranking-based evaluation results. First, the ranking-based decisions of TUA1#0 and TUA1#2 achieve state-of-the-art results in P@10 and NDCG@10 based on only 1 user post. This result suggests that the TAM network with a long affective memory could effectively recognize the users' depression risk at a very early stage. It also implies that removing the decision-halting strategy from TAM might further improve the decision-based evaluation results. Next, TUA1#1 obtains better results than TUA1#3, which indicates that Score Accumulation might not be necessary for the ranking-based prediction in TAM. TUA1#0 and TUA1#2 generally obtain better P@10, NDCG@10, and NDCG@100 scores for 1 post, 100 posts, 500 posts, and 1,000 posts, which suggests that a long affective memory and balanced data are also helpful in improving the ranking-based predictions for TAM. Last, the TAM-based models significantly outperform the SS3-based model in terms of the ranking-based metrics.
5. Conclusion
In this paper, we propose a Time-Aware Affective Memories (TAM) network with a latency-
penalized cross-entropy loss for Early Detection of Depression at the CLEF eRisk 2022 Lab.
Both decision- and ranking-based evaluation results indicate that affective state is an important
Table 4
Ranking-based evaluation for the Early Detection of Depression task. Results obtained by our models
and the best performing models on each metric are included.
Model | 1 post: P@10 NDCG@10 NDCG@100 | 100 posts: P@10 NDCG@10 NDCG@100 | 500 posts: P@10 NDCG@10 NDCG@100 | 1000 posts: P@10 NDCG@10 NDCG@100
TUA1#0 0.80 0.88 0.44 0.60 0.72 0.52 0.60 0.67 0.52 0.70 0.80 0.57
TUA1#1 0.70 0.77 0.44 0.50 0.54 0.39 0.50 0.56 0.42 0.50 0.65 0.43
TUA1#2 0.80 0.88 0.44 0.60 0.72 0.52 0.60 0.67 0.52 0.70 0.80 0.57
TUA1#3 0.60 0.69 0.43 0.50 0.54 0.39 0.50 0.56 0.42 0.50 0.65 0.43
TUA1#4 0.50 0.37 0.35 0.00 0.00 0.36 0.00 0.00 0.36 0.20 0.12 0.31
CYUT#3 0.10 0.07 0.12 0.70 0.70 0.57 0.70 0.72 0.59 0.80 0.74 0.60
CYUT#4 0.10 0.06 0.12 0.60 0.68 0.55 0.60 0.69 0.59 0.80 0.84 0.61
BLUE#0 0.80 0.88 0.54 0.60 0.56 0.59 0.80 0.81 0.66 0.80 0.80 0.68
BLUE#1 0.80 0.88 0.54 0.70 0.64 0.67 0.80 0.84 0.74 0.80 0.86 0.72
BLUE#2 0.80 0.75 0.46 0.40 0.40 0.30 0.30 0.35 0.20 0.30 0.38 0.16
NLPGroup-IISERB#0 0.00 0.00 0.02 0.90 0.92 0.30 0.90 0.92 0.33 0.00 0.00 0.00
NLPGroup-IISERB#1 0.30 0.32 0.13 0.90 0.81 0.27 0.80 0.84 0.33 0.00 0.00 0.00
NLPGroup-IISERB#4 0.00 0.00 0.04 0.90 0.93 0.66 0.90 0.92 0.69 0.00 0.00 0.00
UNED-MED#3 0.80 0.82 0.29 0.60 0.44 0.31 0.80 0.73 0.36 0.40 0.51 0.30
Sunday-Rocker2#1 0.70 0.81 0.39 0.90 0.93 0.66 0.90 0.88 0.65 0.00 0.00 0.00
Sunday-Rocker2#3 0.80 0.88 0.41 0.50 0.50 0.23 0.60 0.69 0.34 0.00 0.00 0.00
UNSL#1 0.80 0.88 0.46 0.60 0.73 0.64 0.60 0.73 0.66 0.60 0.71 0.66
Both decision- and ranking-based evaluation results indicate that the affective state is an important indicator of depression and that a long affective memory is crucial for TAM to explore the users' affective states. Our initial experiment suggests that adding a latency penalty to the cross-entropy loss is effective for training early detection models. Among all participants, our system turns out to be the most efficient and achieves two state-of-the-art results in terms of the ranking-based evaluation. Our results also suggest that language usage patterns, such as n-grams, could be an important feature for depression detection. Integrating language usage patterns into the TAM network could be a promising direction for future work.
Acknowledgments
This research has been supported by JSPS KAKENHI Grant Number 19H04215.
References
[1] D. E. Losada, F. Crestani, J. Parapar, Clef 2017 erisk overview: Early risk prediction on the
internet: Experimental foundations., in: CLEF (Working Notes), 2017.
[2] D. E. Losada, F. Crestani, J. Parapar, Overview of erisk: early risk prediction on the
internet, in: International conference of the cross-language evaluation forum for european
languages, Springer, 2018, pp. 343–361.
[3] D. E. Losada, F. Crestani, J. Parapar, Overview of erisk 2019 early risk prediction on
the internet, in: International Conference of the Cross-Language Evaluation Forum for
European Languages, Springer, 2019, pp. 340–357.
[4] D. E. Losada, F. Crestani, J. Parapar, Overview of erisk at clef 2020: Early risk prediction
on the internet (extended overview)., CLEF (Working Notes) (2020).
[5] J. Parapar, P. Martín-Rodilla, D. E. Losada, F. Crestani, Overview of erisk at clef 2021: Early
risk prediction on the internet (extended overview), CLEF (Working Notes) (2021).
[6] J. Parapar, P. Martín-Rodilla, D. E. Losada, F. Crestani, Evaluation report of erisk 2022:
Early risk prediction on the internet, CLEF (Working Notes) (2022).
[7] M. Park, D. McDonald, M. Cha, Perception differences between the depressed and non-
depressed users in twitter, in: Proceedings of the International AAAI Conference on Web
and Social Media, volume 7, 2013, pp. 476–485.
[8] M. De Choudhury, S. Counts, E. Horvitz, Social media as a measurement tool of depression
in populations, in: Proceedings of the 5th annual ACM web science conference, 2013, pp.
47–56.
[9] J. Parapar, D. E. Losada, A. Barreiro, A learning-based approach for the identification of
sexual predators in chat logs., in: CLEF (Online working notes/labs/workshop), volume
1178, 2012.
[10] M. A. Moreno, L. A. Jelenchick, K. G. Egan, E. Cox, H. Young, K. E. Gannon, T. Becker,
Feeling bad on facebook: Depression disclosures by college students on a social networking
site, Depression and anxiety 28 (2011) 447–455.
[11] S. Rude, E.-M. Gortner, J. Pennebaker, Language use of depressed and depression-vulnerable
college students, Cognition & Emotion 18 (2004) 1121–1133.
[12] S. J. Blatt, Experiences of depression: Theoretical, clinical, and research perspectives.,
American Psychological Association, 2004.
[13] A. T. Beck, B. A. Alford, Depression: Causes and Treatment, University of Pennsylvania Press, 2014. URL: https://doi.org/10.9783/9780812290882. doi:10.9783/9780812290882.
[14] J. Rottenberg, Mood and emotion in major depression, Current Directions in Psychological
Science 14 (2005) 167–170.
[15] J. Joormann, C. H. Stanton, Examining emotion regulation in depression: A review and
future directions, Behaviour research and therapy 86 (2016) 35–49.
[16] M. Stankevich, V. Isakov, D. Devyatkin, I. V. Smirnov, Feature engineering for depression
detection in social media., in: ICPRAM, 2018, pp. 426–431.
[17] T. Shen, J. Jia, G. Shen, F. Feng, X. He, H. Luan, J. Tang, T. Tiropanis, T. S. Chua, W. Hall,
Cross-domain depression detection via harvesting social media, International Joint Con-
ferences on Artificial Intelligence, 2018.
[18] S. Tsugawa, Y. Kikuchi, F. Kishino, K. Nakajima, Y. Itoh, H. Ohsaki, Recognizing depression
from twitter activity, in: Proceedings of the 33rd annual ACM conference on human
factors in computing systems, 2015, pp. 3187–3196.
[19] A. Yates, A. Cohan, N. Goharian, Depression and self-harm risk assessment in online
forums, arXiv preprint arXiv:1709.01848 (2017).
[20] M. M. Tadesse, H. Lin, B. Xu, L. Yang, Detection of depression-related posts in reddit social
media forum, IEEE Access 7 (2019) 44883–44893.
[21] A. Rinaldi, J. E. F. Tree, S. Chaturvedi, Predicting depression in screening interviews from
latent categorization of interview prompts, in: Proceedings of the 58th Annual Meeting of
the Association for Computational Linguistics, 2020, pp. 7–18.
[22] S. Ji, C. P. Yu, S.-f. Fung, S. Pan, G. Long, Supervised learning for suicidal ideation detection
in online user content, Complexity 2018 (2018).
[23] S. Ji, X. Li, Z. Huang, E. Cambria, Suicidal ideation and mental disorder detection with
attentive relation networks, Neural Computing and Applications (2021) 1–11.
[24] L. Ansari, S. Ji, Q. Chen, E. Cambria, Ensemble hybrid learning methods for automated
depression detection, IEEE Transactions on Computational Social Systems (2022).
[25] C. Yang, Y. Zhang, S. Muresan, Weakly-supervised methods for suicide risk assessment:
Role of related domains, arXiv preprint arXiv:2106.02792 (2021).
[26] J. Joormann, I. H. Gotlib, Updating the contents of working memory in depression:
interference from irrelevant negative material., Journal of abnormal psychology 117 (2008)
182.
[27] D. E. Losada, F. Crestani, A test collection for research on depression and language
use, in: International Conference of the Cross-Language Evaluation Forum for European
Languages, Springer, 2016, pp. 28–39.
[28] M. Trotzek, S. Koitka, C. M. Friedrich, Linguistic metadata augmented classifiers at the
clef 2017 task for early detection of depression., in: CLEF (Working Notes), 2017.
[29] M. Trotzek, S. Koitka, C. M. Friedrich, Word embeddings and linguistic metadata at the
clef 2018 tasks for early detection of depression and anorexia., in: CLEF (Working Notes),
2018.
[30] D. G. Funez, M. J. G. Ucelay, M. P. Villegas, S. Burdisso, L. C. Cagnina, M. Montes-y Gómez,
M. Errecalde, Unsl’s participation at erisk 2018 lab., in: CLEF (Working Notes), 2018.
[31] S. Paul, S. K. Jandhyala, T. Basu, Early detection of signs of anorexia and depression over
social media using effective machine learning frameworks., in: CLEF (Working notes),
2018.
[32] E. Saravia, H.-C. T. Liu, Y.-H. Huang, J. Wu, Y.-S. Chen, CARER: Contextualized affect
representations for emotion recognition, in: Proceedings of the 2018 Conference on Empir-
ical Methods in Natural Language Processing, Association for Computational Linguistics,
Brussels, Belgium, 2018, pp. 3687–3697. URL: https://www.aclweb.org/anthology/D18-1404.
doi:10.18653/v1/D18-1404.
[33] I. M. Baytas, C. Xiao, X. Zhang, F. Wang, A. K. Jain, J. Zhou, Patient subtyping via time-
aware lstm networks, in: Proceedings of the 23rd ACM SIGKDD international conference
on knowledge discovery and data mining, 2017, pp. 65–74.
[34] D. Zhang, J. Thadajarassiri, C. Sen, E. Rundensteiner, Time-aware transformer-based net-
work for clinical notes series prediction, in: Machine Learning for Healthcare Conference,
PMLR, 2020, pp. 566–588.
[35] S. G. Burdisso, M. Errecalde, M. Montes-y Gómez, A text classification framework for
simple and effective early depression detection over social media streams, Expert Systems
with Applications 133 (2019) 182–197.