<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">An End-to-End Set Transformer for User-Level Classification of Depression and Gambling Disorder</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Ana-Maria</forename><surname>Bucur</surname></persName>
							<email>ana-maria.bucur@drd.unibuc.ro</email>
							<affiliation key="aff0">
								<orgName type="department">Interdisciplinary School of Doctoral Studies</orgName>
								<orgName type="institution">University of Bucharest</orgName>
								<address>
									<country key="RO">Romania</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="department">PRHLT Research Center</orgName>
								<orgName type="institution">Universitat Politècnica de València</orgName>
								<address>
									<country key="ES">Spain</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Adrian</forename><surname>Cosma</surname></persName>
							<email>cosma.i.adrian@gmail.com</email>
							<affiliation key="aff2">
								<orgName type="institution">Politehnica University of Bucharest</orgName>
								<address>
									<country key="RO">Romania</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Liviu</forename><forename type="middle">P</forename><surname>Dinu</surname></persName>
							<email>ldinu@fmi.unibuc.ro</email>
							<affiliation key="aff3">
								<orgName type="department">Faculty of Mathematics and Computer Science</orgName>
								<orgName type="institution">University of Bucharest</orgName>
								<address>
									<country key="RO">Romania</country>
								</address>
							</affiliation>
							<affiliation key="aff4">
								<orgName type="department">Human Language Technologies Research Center</orgName>
								<orgName type="institution">University of Bucharest</orgName>
								<address>
									<country key="RO">Romania</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Paolo</forename><surname>Rosso</surname></persName>
							<email>prosso@dsic.upv.es</email>
							<affiliation key="aff1">
								<orgName type="department">PRHLT Research Center</orgName>
								<orgName type="institution">Universitat Politècnica de València</orgName>
								<address>
									<country key="ES">Spain</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">An End-to-End Set Transformer for User-Level Classification of Depression and Gambling Disorder</title>
					</analytic>
					<monogr>
						<meeting>Evaluation Forum
							<address>
								<addrLine>September 5-8, 2022</addrLine>
								<settlement>Bologna</settlement>
								<country key="IT">Italy</country>
							</address>
						</meeting>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">973F9F248C416677FB88F6373E06A926</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T03:25+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>set transformer</term>
					<term>sentence encoder</term>
					<term>gambling disorder detection</term>
					<term>depression detection</term>
					<term>social media</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This work proposes a transformer architecture for user-level classification of gambling addiction and depression that is trainable end-to-end. As opposed to other methods that operate at the post level, we process a set of social media posts from a particular individual to make use of the interactions between posts and to eliminate label noise at the post level. We exploit the fact that, without positional encodings, multi-head attention is permutation invariant, and we process randomly sampled sets of a user's texts after encoding them with a modern pretrained sentence encoder (RoBERTa / MiniLM). Moreover, our architecture is interpretable with modern feature attribution methods and allows for automatic dataset creation by identifying discriminating posts in a user's text-set. We perform ablation studies on hyper-parameters and evaluate our method on the eRisk 2022 Lab tasks on early detection of signs of pathological gambling and early risk detection of depression. The method proposed by our team BLUE obtained the best ERDE5 score of 0.015 and the second-best ERDE50 score of 0.009 for pathological gambling detection. For the early detection of depression, we obtained the second-best ERDE50 of 0.027.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>How much can one know about someone from their social media interactions? Billions of people use social media sites like Facebook, Instagram, Twitter, and Reddit every day. While sites like Facebook and Instagram encourage users to use their real names, websites such as Reddit are often praised for enabling users to hide behind a pseudonym, offering the illusion of privacy. Under the guise of anonymity, users tend to post more personal information related to their lives and everyday struggles, instead of striving to maintain an image and a persona as they do when their identities are open <ref type="bibr" target="#b0">[1]</ref>. Many aspects of a user's personal life can be uncovered in their posting history. No single post is all-encompassing; rather, the information is scattered across many unrelated comments and posts. For instance, on the r/relationship_advice<ref type="foot" target="#foot_0">2</ref> subreddit a user might reveal their gender and age when discussing intimate relationship struggles, while on r/depression<ref type="foot" target="#foot_1">3</ref> a user might provide clues to their internal conflicts and experiences.</p><p>In the task of detecting mental health disorders from social media text, many approaches operate at the post level <ref type="bibr" target="#b1">[2,</ref><ref type="bibr" target="#b2">3,</ref><ref type="bibr" target="#b3">4]</ref>, on the assumption that, for instance, if a user is depressed, then all their posts might contain some information regarding this issue. However, we posit that post-level classification is unsuitable: many posts are unrelated and uninformative for the particular task. Their interaction, however, might contain clues to the mental well-being of a user.</p><p>As such, we propose an architecture that performs user-level classification by processing a set of posts from a user. 
We exploit the fact that the multi-head attention operation in transformers is permutation invariant and feed multiple texts from a single user into the network, modeling their interaction and classifying the user. This approach has several advantages: (i) it is trainable end-to-end, mitigating the need for hand-crafted construction of global user features; (ii) it is robust to label noise, since some posts might be uninformative and the network learns to ignore them in the decision; and (iii) it is interpretable: using feature attribution methods <ref type="bibr" target="#b4">[5]</ref>, we can extract the most important posts for the decision.</p><p>The Early Risk Prediction on the Internet (eRisk) <ref type="foot" target="#foot_2">4</ref> Lab started in 2017 with one pilot task and has since tackled the early risk detection of several mental illnesses: depression, self-harm, eating disorders, and pathological gambling. This work showcases team BLUE's proposed approach for Tasks 1 and 2 of the eRisk 2022 Lab <ref type="bibr" target="#b5">[6]</ref>, on gambling and depression detection, respectively.</p><p>The paper makes the following contributions:</p><p>1. We propose a set-based transformer architecture for user-level classification, which makes a decision by processing multiple texts of a particular user. 2. We show that our architecture is robust to label noise and is interpretable with modern feature attribution methods, allowing it to be used as a dataset filtering tool. 3. We obtained promising results on the eRisk 2022 tasks on early risk detection of pathological gambling (best ERDE 5<ref type="foot" target="#foot_3">5</ref> score of 0.015 and the second-best ERDE 50 score of 0.009) and depression detection (second-best ERDE 50 of 0.027).</p></div>
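The permutation invariance relied on above is easy to verify numerically. Below is a minimal NumPy sketch (single-head self-attention, random vectors standing in for sentence embeddings; not the authors' implementation) showing that shuffling the posts in a text-set leaves the mean-pooled user representation unchanged:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    """Scaled dot-product self-attention without positional encodings.
    X: (n_texts, d) matrix of text embeddings."""
    d = X.shape[1]
    A = softmax(X @ X.T / np.sqrt(d), axis=-1)  # post-to-post attention weights
    return A @ X

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 16))   # 8 hypothetical post embeddings of dim 16
perm = rng.permutation(8)      # an arbitrary re-ordering of the posts

pooled = self_attention(X).mean(axis=0)          # user-level representation
pooled_perm = self_attention(X[perm]).mean(axis=0)

# Attention is permutation-equivariant, so mean pooling makes the
# user-level representation permutation-invariant.
assert np.allclose(pooled, pooled_perm)
```

Because re-ordering the inputs only permutes the rows of the attention output, any symmetric pooling (here, the mean) erases the ordering entirely.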
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>Pathological Gambling For the detection of gambling disorder, the eRisk Lab is the first to use social media data for the assessment of gambling risk. Automated methods usually rely on data from behavioral markers <ref type="bibr" target="#b6">[7,</ref><ref type="bibr" target="#b7">8]</ref> or personality biomarkers <ref type="bibr" target="#b8">[9]</ref>. In the first iteration of the task for gambling addiction detection, the best-performing systems were developed by Maupomé et al. <ref type="bibr" target="#b9">[10]</ref> and Loyola et al. <ref type="bibr" target="#b10">[11]</ref>. Maupomé et al. <ref type="bibr" target="#b9">[10]</ref> used a user-level approach based on the similarity distance between the vector of topic probabilities of the texts of the users to be assessed for pathological gambling risk and testimonials or items from a self-evaluation questionnaire for compulsive gamblers. With this method, the authors obtained the best ERDE 5 of 0.048. Loyola et al. <ref type="bibr" target="#b10">[11]</ref> attained the best ERDE 50 (0.020) and latency-weighted F1 (0.693) through a post-level rule-based early alert policy on bag-of-words text representations classified with an SVM.</p><p>Depression Depression detection from social media data is an interdisciplinary topic, and efforts have been made by researchers from both NLP and Psychology to detect different markers of depression found in the online discourse of individuals. 
Some depression cues found in language are: greater use of the first-person singular pronoun "I" <ref type="bibr" target="#b11">[12]</ref>, lesser use of the first-person plural "we" <ref type="bibr" target="#b12">[13]</ref>, increased use of negative or absolutist terms (e.g., "never", "forever") <ref type="bibr" target="#b13">[14]</ref>, and greater use of past-tense verbs <ref type="bibr" target="#b14">[15]</ref>.</p><p>For the task of early detection of depression, the best systems from the first iteration of the task (eRisk 2017) used as input linguistic meta-information extracted from the texts, such as LIWC <ref type="bibr" target="#b15">[16]</ref>, readability, and hand-crafted features <ref type="bibr" target="#b16">[17]</ref>, obtaining the best ERDE 5 (12.70%), or a combination of linguistic information and the temporal variation of terms in users' posts <ref type="bibr" target="#b17">[18]</ref>, achieving the best ERDE 50 (9.68%). The best-performing systems from eRisk 2018 were those of Funez et al. <ref type="bibr" target="#b18">[19]</ref> and Trotzek et al. <ref type="bibr" target="#b19">[20]</ref>. Funez et al. <ref type="bibr" target="#b18">[19]</ref> proposed a user-level approach using an SVM classifier on semantic representations that take into account the temporal variation of terms between the users' posts, achieving an ERDE 5 of 8.78%. On the other hand, the best ERDE 50 (6.44%) was attained by Trotzek et al. <ref type="bibr" target="#b19">[20]</ref> with a chunk-level<ref type="foot" target="#foot_4">6</ref> approach using an ensemble of logistic regression classifiers on bag-of-words features. 
The dataset from the eRisk depression detection task became an important resource, later used in research articles tackling the detection problem with approaches such as a neural network architecture on topic modeling features <ref type="bibr" target="#b20">[21]</ref>, SVM or deep learning architectures using fine-grained emotion features <ref type="bibr" target="#b21">[22]</ref>, and deep learning methods using content, writing style, and emotion features <ref type="bibr" target="#b22">[23]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Method</head><p>The transformer encoder, as proposed by Vaswani et al. <ref type="bibr" target="#b23">[24]</ref>, essentially consists of multiple sequential layers of multi-head attention. Scaled dot-product attention of a query 𝑄 relative to a set of keys 𝐾 and a set of values 𝑉 is computed with the following equation, where 𝑑_𝑘 is the dimensionality of the queries and keys:</p><formula xml:id="formula_0">Attention(𝑄, 𝐾, 𝑉) = softmax(𝑄𝐾^𝑇 / √𝑑_𝑘) 𝑉<label>(1)</label></formula><p>Multi-head attention consists of multiple applications of the attention mechanism to the same input:</p><formula xml:id="formula_1">MultiHead(𝑄, 𝐾, 𝑉) = Concat(head_1, head_2, …, head_ℎ) 𝑊^𝑂, where head_𝑖 = Attention(𝑄𝑊_𝑖^𝑄, 𝐾𝑊_𝑖^𝐾, 𝑉𝑊_𝑖^𝑉)<label>(2)</label></formula><p>In this formulation, multi-head attention is permutation invariant; the standard way to inject temporal information into the input sequence is to employ positional encodings <ref type="bibr" target="#b24">[25]</ref>. This is useful when processing sequential data such as text. However, by omitting positional encodings, the transformer essentially acts as a set encoder. Lee et al. <ref type="bibr" target="#b25">[26]</ref> introduced the Set Transformer, proving that multi-head attention is permutation invariant and that the Set Transformer is a universal approximator of permutation-invariant functions. We make use of this fact to perform user-level classification by processing sets of texts (in the form of social media posts) from a particular user. The intuition behind processing a set of texts from a user is that no single social media post is sufficiently informative for a classifier decision; rather, it is their interaction and the user's behavior as a whole that are informative. 
Moreover, through mean pooling, the inevitable noise (in the form of unrelated posts) is dampened, which aids classification in weakly-supervised scenarios such as ours, in which the user is labeled rather than each individual post.</p><p>We consider a user 𝑖 to have multiple social media posts 𝑈_𝑖. A set of 𝐾 texts 𝑡 is randomly sampled from 𝑈_𝑖, which defines our text-set 𝑆_𝑖 = {𝑡_𝑗 ∼ 𝑈_𝑖, 𝑗 ∈ (1 … 𝐾)}. We sample 𝐾 posts from the user's history instead of processing all of them due to memory limitations: some individuals have thousands of posts, while others have only on the order of tens. Sampling also introduces stochasticity into the training procedure, which helps prevent overfitting. For training, an input batch of size 𝑛 is defined by the concatenation of 𝑛 such text-sets: 𝐵 = {𝑆_𝑏1, 𝑆_𝑏2, …, 𝑆_𝑏𝑛}. We do not consider the relative order of the texts of a particular user, and text-sets are fed into the transformer encoder without positional encodings. Since some users have fewer than 𝐾 texts in total, creating a batch of text-sets is impossible without padding and masking. To sidestep this problem, we trained with an effective batch size of 1 and employed gradient accumulation to simulate a larger batch size.</p><p>Figure <ref type="figure" target="#fig_0">1</ref> showcases our proposed model architecture for user-level classification. Each text in a text-set is embedded into a fixed-size vector using an available pretrained sentence encoder (i.e., RoBERTa / MiniLM). The text embeddings are fed into the transformer encoder network, and after processing, we perform mean pooling and output the decision. We compute binary cross-entropy at the user level, for a text-set. The pretrained sentence encoder is frozen and not updated during training.</p><p>Baytas et al. <ref type="bibr" target="#b26">[27]</ref> proposed a T-LSTM to process social media posts sequentially as a time series. 
The authors modify the LSTM architecture to include a relative time component. In our case, however, it is unclear how to incorporate such a mechanism into the transformer architecture, aside from using a relative positional encoding <ref type="bibr" target="#b27">[28]</ref>, which ignores long-range dependencies between posts. As such, we chose to ignore the temporal order of the posts and process them directly as a set. The main reason for considering the posts as a set is that, in a user's post history, many posts are uninformative for the modeling task; by processing a set of texts, label noise is reduced naturally as a direct consequence of the attention mechanism, which assigns more importance to informative posts. Training with a sufficiently large dataset might achieve a similar effect, but previous attempts at post-level classification have proven ineffective <ref type="bibr" target="#b3">[4]</ref>.</p><p>To assess the impact of the sentence representations, we chose two different sentence encoders: RoBERTa <ref type="bibr" target="#b28">[29]</ref> and MiniLM <ref type="bibr" target="#b29">[30]</ref>. We chose RoBERTa since it is one of the best-performing English language models on downstream tasks <ref type="bibr" target="#b28">[29]</ref>, and MiniLM, a multilingual model, since some users have social media posts in languages other than English. Figure <ref type="figure" target="#fig_1">2</ref> showcases the performance gap between the two sentence encoders, averaged across multiple values of 𝐾. RoBERTa yields consistently superior performance across training steps. Similarly, to assess the impact of the text-set size 𝐾, we performed an ablation study, as shown in Figure <ref type="figure" target="#fig_2">3</ref>. We kept the sentence encoder fixed to RoBERTa and varied the number of texts per user, 𝐾 ∈ {4, 8, 16, 32, 64, 128}. 
The best performance was achieved with 𝐾 = 16 and 𝐾 = 32 for Tasks 1 and 2, respectively.</p><p>In our final submission, we chose RoBERTa as the sentence encoder and sampled 𝐾 = 16 texts per user for Task 1 and 𝐾 = 32 for Task 2. We used the standard formulation of the transformer network <ref type="bibr" target="#b23">[24]</ref>, with 4 encoder layers, 8 attention heads each, and a dimensionality of 256. Both networks were trained for 120 epochs with the AdamW optimizer <ref type="bibr" target="#b30">[31]</ref> and a cyclical learning rate <ref type="bibr" target="#b31">[32]</ref> ranging from 0.00001 to 0.0001 across 6 epochs, with a batch size of 128. To account for class imbalance, we computed balanced class weights with respect to each dataset and adjusted the loss function accordingly. Finally, we opted for a very high threshold when predicting the final decision.</p><p>Our proposed architecture is easily interpretable using modern explainability methods for feature attribution <ref type="bibr" target="#b32">[33,</ref><ref type="bibr" target="#b33">34,</ref><ref type="bibr" target="#b4">5]</ref>, such as Integrated Gradients <ref type="bibr" target="#b4">[5]</ref>. This allows us to automatically identify social media posts containing signs of mental health disorders and to filter out uninformative posts.</p></div>
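Putting the pieces of this section together, the forward pass (sample K posts from the user's history, embed each with a frozen sentence encoder, run the set through attention without positional encodings, mean-pool, classify) can be sketched as below. This is a deliberately simplified single-head, single-layer NumPy illustration with a hash-seeded stub in place of the frozen RoBERTa / MiniLM encoder; all weights and names are hypothetical, not the trained system:

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 16, 4                       # embedding dim, texts sampled per user

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def encode(texts):
    """Stand-in for a frozen pretrained sentence encoder (RoBERTa / MiniLM):
    deterministic hash-seeded embeddings, one D-dim vector per text."""
    return np.stack([np.random.default_rng(abs(hash(t)) % 2**32).normal(size=D)
                     for t in texts])

def set_transformer_forward(E, Wq, Wk, Wv, w_out):
    """One attention layer over a text-set (no positional encodings),
    mean pooling, then a logistic user-level decision."""
    d = Wq.shape[1]
    Q, Km, V = E @ Wq, E @ Wk, E @ Wv
    H = softmax(Q @ Km.T / np.sqrt(d)) @ V       # post-to-post interactions
    u = H.mean(axis=0)                           # permutation-invariant pooling
    return float(1 / (1 + np.exp(-u @ w_out)))   # P(user in positive class)

posts = [f"post number {i}" for i in range(30)]      # a user's post history
idx = rng.choice(len(posts), size=K, replace=False)  # sample a text-set
sample = [posts[i] for i in idx]

Wq, Wk, Wv = (rng.normal(size=(D, D)) * 0.1 for _ in range(3))
w_out = rng.normal(size=D) * 0.1
p = set_transformer_forward(encode(sample), Wq, Wk, Wv, w_out)
assert 0.0 < p < 1.0
```

Since no positional information is injected and the set is mean-pooled, presenting the same K posts in any order yields the same user-level probability.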
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Interpretability</head><p>Since our model operates on sets of social media texts from a particular user, we can employ model explainability methods to assess the importance of a piece of text to the model's decision. Through this, the posts most indicative of a user's condition can be automatically filtered and selected for use in dataset creation. This idea is similar to that of Ríssola et al. <ref type="bibr" target="#b2">[3]</ref>, who employed a series of heuristics to recognize posts portraying depression symptoms, in order to construct a post-level training set from existing depression datasets annotated at the user level. As such, we use Integrated Gradients <ref type="bibr" target="#b4">[5]</ref> to compute attribution scores for a text-set. The integrated gradients method has been used in NLP to explore the contribution of individual words and phrases to a decision made by a classifier. Since we operate not on words but on whole texts, the method identifies the texts most important to the classifier's decision. Figure <ref type="figure" target="#fig_3">4</ref> showcases selected samples ordered by their attribution score from the validation set of each task. All samples belong to the same user for each task, and the attribution scores are computed within the respective text-set. Posts with a high positive contribution to the decision contain more explicit descriptions of symptoms, while posts with more negative contributions are mainly unrelated to the particular mental illness. We use the integrated gradients method in one of our runs to select the most important posts in the user's history. However, we emphasize that the best application of this approach is automatic dataset creation in scenarios of weak supervision, which we aim to explore in future work.</p></div>
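As a concrete illustration of the attribution computation, the sketch below implements Integrated Gradients by hand for a toy differentiable stand-in for the classifier (mean pooling followed by a logistic unit), using a zero baseline and a midpoint Riemann approximation of the path integral. The model, weights, and names are illustrative assumptions, not the actual trained network:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def model(S, w):
    """Toy differentiable stand-in for the classifier: mean-pool the
    text-set S of shape (K, D), then apply a logistic unit."""
    return sigmoid(w @ S.mean(axis=0))

def grad_model(S, w):
    """Analytic gradient of model(S, w) w.r.t. each text embedding."""
    p = model(S, w)
    K = S.shape[0]
    g_pool = p * (1 - p) * w            # d model / d pooled vector
    return np.tile(g_pool / K, (K, 1))  # each text receives a 1/K share

def integrated_gradients(S, w, steps=200):
    """Midpoint Riemann approximation of IG with an all-zeros baseline;
    returns one attribution score per text in the set."""
    alphas = (np.arange(steps) + 0.5) / steps
    avg_grad = np.mean([grad_model(a * S, w) for a in alphas], axis=0)
    return ((S - 0.0) * avg_grad).sum(axis=1)   # sum over embedding dims

rng = np.random.default_rng(0)
S = rng.normal(size=(5, 8))                     # a text-set of 5 encoded posts
w = rng.normal(size=8)
attr = integrated_gradients(S, w)

# Completeness axiom: attributions sum to f(input) - f(baseline).
assert np.isclose(attr.sum(), model(S, w) - model(np.zeros_like(S), w), atol=1e-3)
```

Ranking the per-text scores in `attr` produces the kind of ordering shown in Figure 4: the texts with the largest positive attributions are the ones driving a positive decision.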
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Results</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Evaluation</head><p>Two kinds of evaluation are used to measure the performance of the systems: decision-based and ranking-based. The decision-based evaluation quantifies the capacity of a system to perform binary classification, predicting whether a user belongs to the positive class (i.e., pathological gambling or depression) or the negative one. It comprises standard classification measures (Precision, Recall, F1) and measures specific to early detection that consider the delay and speed of the decision. The early risk detection error (ERDE) <ref type="bibr" target="#b34">[35]</ref> scores correct predictions with a late-decision penalty (for predictions taken after the first 5 or 50 submissions of a user). To overcome the limitations of this metric <ref type="bibr" target="#b35">[36]</ref>, the latency-weighted F1 score <ref type="bibr" target="#b36">[37]</ref> was also proposed for measuring the performance of early risk detection. Latency measures the delay in detecting true positives based on the median number of submissions seen by the system before taking a decision. The speed of a system that correctly predicts true positives from the first submission is equal to 1, while the speed of a slow system that decides only after processing hundreds of texts approaches 0. The latency-weighted F1 combines the F1 score with the delay in decision-taking for true positives; a perfect system would achieve a latency-weighted F1 of 1. Besides the binary classification decisions, the participating teams were also asked to submit a score estimating each user's risk for the ranking-based evaluation. These scores are used to rank users by their risk of pathological gambling or depression. Standard IR metrics (P@10, NDCG@10, and NDCG@100) are used to measure the models' ranking-based performance after processing 1, 100, 500, or 1000 submissions.</p></div>
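To make the late-decision penalty concrete, the sketch below follows our reading of the ERDE formulation of Losada and Crestani [35]: false positives and false negatives pay fixed costs, while true positives pay a sigmoid latency cost lc_o(k) = 1 - 1/(1 + e^(k-o)) that grows with the number k of writings seen before the decision. The cost values used here are illustrative assumptions, not the official lab configuration:

```python
import math

def erde(decision, truth, k, o, c_fp=0.05, c_fn=1.0, c_tp=1.0):
    """Early risk detection error for a single user.
    decision/truth: predicted and true labels (booleans);
    k: number of writings seen before the decision; o: deadline (5 or 50).
    Cost values are illustrative, not the official lab settings."""
    if decision and not truth:
        return c_fp                           # false positive: fixed cost
    if not decision and truth:
        return c_fn                           # false negative: fixed cost
    if decision and truth:
        lc = 1 - 1 / (1 + math.exp(k - o))    # latency penalty in (0, 1)
        return lc * c_tp                      # true positive: cost grows with delay
    return 0.0                                # true negative: no cost

# A correct alert after 1 writing is cheap; the same alert after 10
# writings costs almost as much as missing the user entirely.
assert erde(True, True, k=1, o=5) < erde(True, True, k=10, o=5)
assert erde(True, True, k=10, o=5) < erde(False, True, k=10, o=5)
```

This monotone penalty is what rewards the perfect latency and speed scores discussed in the next sections: a true positive fired at the first writing incurs almost no cost.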
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Task 1: Early Detection of Signs of Pathological Gambling</head><p>The first task addresses the detection of gambling addiction from social media data. As this was the second edition of the task, the organizers provided last year's test data for training the systems. The dataset was collected from Reddit, following the methodology described by Losada and Crestani <ref type="bibr" target="#b34">[35]</ref>, and contains a chronological sequence of posts from each user. The training dataset comprised 164 pathological gamblers, with a total of 54,674 submissions, and 2,184 control users with 1,073,883 submissions. The test dataset contains 81 users with gambling addiction, totaling 14,627 posts, and 1,998 control users with a total of 1,014,122 posts. For the testing phase, the submissions of users were released sequentially: the systems proposed by the participating teams received one submission at a time from all the users. We submitted three runs for the early detection of pathological gambling: Run 0 uses the text-set transformer model on the most recent 𝐾 = 16 posts for prediction; Run 1 uses the same text-set transformer model with, as input, the set of 𝐾 = 16 texts that are most important in a user's history, selected with Integrated Gradients; Run 2 is a baseline run, using the proposed model architecture to predict at the post level, on one sample at a time.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 1</head><p>Decision-based evaluation on Task 1: Early Detection of Signs of Pathological Gambling. We show the performance of our systems compared to the best-performing run from each team. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 2</head><p>Ranking-based evaluation on Task 1: Early Detection of Signs of Pathological Gambling. Each cell lists P@10 / NDCG@10 / NDCG@100.</p><p>Team, Run ID | 1 writing | 100 writings | 500 writings | 1000 writings
BLUE 0 | 1.00 / 1.00 / 0.76 | 1.00 / 1.00 / 0.81 | 1.00 / 1.00 / 0.89 | 1.00 / 1.00 / 0.89
BLUE 1 | 1.00 / 1.00 / 0.76 | 1.00 / 1.00 / 0.89 | 1.00 / 1.00 / 0.91 | 1.00 / 1.00 / 0.91
BLUE 2 | 1.00 / 1.00 / 0.69 | 1.00 / 1.00 / 0.40 | 0.00 / 0.00 / 0.02 | 0.00 / 0.00 / 0.01
UNED-NLP 4 | 1.00 / 1.00 / 0.56 | 1.00 / 1.00 / 0.88 | 1.00 / 1.00 / 0.95 | 1.00 / 1.00 / 0.95
UNSL 0 | 1.00 / 1.00 / 0.68 | 1.00 / 1.00 / 0.90 | 1.00 / 1.00 / 0.93 | 1.00 / 1.00 / 0.95</p><p>Table <ref type="table">1</ref> showcases the performance of the systems measured using the decision-based metrics. Regarding ERDE, our first run (Run 0), using the transformer architecture on the most recent texts from each user, achieves the best ERDE 5 score of 0.015 and the second-best ERDE 50 score of 0.009, demonstrating that the system could detect the true positive cases early. The perfect scores for latency TP and speed show that our models were successful at detecting the true positive cases after the first writing. As expected, the baseline run using a post-level approach (Run 2) has the lowest performance. Regarding Run 1, we expected it to achieve the best performance among our submitted runs, as this approach is more aggressive in taking decisions by using for classification the most informative posts from users' history. Furthermore, our best run from this year's task surpasses all the runs from our participation in the first iteration of the task in 2021 <ref type="bibr" target="#b3">[4]</ref>, showing that a user-level approach considering a set of texts from each individual is more suitable than a post-level one. In Table <ref type="table">2</ref> we show the results of the ranking-based evaluation, in which each team had to submit rankings of users' risk of pathological gambling. 
Our team obtained excellent results for P@10 and NDCG in all settings (after 1, 100, 500, and 1000 writings).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.">Task 2: Early Detection of Depression</head><p>This year marks the third iteration of the early detection of depression task, continuing the 2017 T1 and 2018 T2 tasks. The organizers provided the data from the previous two editions for training the models. Users from the depression class were labeled based on a mention of a diagnosis in their Reddit posts (e.g., "I was diagnosed with depression"). In contrast, users from the control class have no mention of a diagnosis in their posts <ref type="bibr" target="#b34">[35]</ref>. The training dataset comprises 214 users diagnosed with depression, with 270,666 submissions, and 1,493 control users with a total of 2,959,080 submissions. The test set contains 98 users with depression, with 35,332 posts, and 1,302 users in the control group with a total of 687,228 posts. The texts for the testing phase were released sequentially, and the systems of the participating teams had to decide between issuing a decision for a specific user or waiting for more data. We submitted three runs for the early detection of depression: Run 0 is the text-set transformer model using the most recent 𝐾 = 32 posts for prediction; for Run 1 we employ the same text-set transformer model with, as input, the set of 𝐾 = 32 texts that are most important in a user's history, selected with Integrated Gradients; Run 2 is a baseline run, using the proposed model architecture to predict at the post level, on one sample at a time.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 3</head><p>Decision-based evaluation on Task 2: Early Detection of Depression. We show the performance of our systems compared to the best-performing run from each team. In Table <ref type="table">3</ref> we present the performance of the systems using the decision-based metrics. Our best-performing run is the transformer architecture using the most recent texts from users (Run 0), followed by the system that considers only the most informative submissions from each user for the model's decisions (Run 1). The post-level system (Run 2) has the worst performance. Our three submitted runs achieve high Recall at the expense of lower Precision scores. The precision of our models could be improved by incorporating a mechanism for weighting user posts according to the prevalence of signs of depression <ref type="bibr" target="#b37">[38]</ref>; with such a mechanism, a text-set containing only a few posts with signs of depression would not induce a positive prediction. Regarding the early detection evaluation, our team has the second-best score on the ERDE 50 metric (0.027), while our ERDE 5 score is close to the best one. Compared to the best metrics from the 2018 edition of this task, when the best ERDE 5 and ERDE 50 were 0.087 and 0.064, respectively, current systems surpass these scores due to more training data being available and to the advancements in the field of machine learning in the last few years. Regarding the standard classification metrics, a slight improvement was made in terms of F1 score, from 0.64 in 2018 to 0.71 in 2022. The ranking-based evaluation performance in Table <ref type="table">4</ref> shows that for 1 and 1000 writings, our systems attain some of the best scores for P@10 and NDCG.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 4</head><p>Ranking-based evaluation on Task 2: Early Detection of Depression.</p><p>1 writing 100 writings 500 writings 1000 writings Team Run ID P@10 NDCG@10 NDCG@100 P@10 NDCG@10 NDCG@100 P@10 NDCG@10 NDCG@100 P@10 NDCG@10 NDCG@100</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusion</head><p>In this work, we proposed a transformer architecture that performs user-level classification for gambling addiction and depression detection. For each individual, the transformer processes a set of texts encoded by a pretrained sentence encoder, modeling the interactions between posts and mitigating noise in the dataset. Our network is interpretable and allows for automatic dataset creation by filtering uninformative posts from a user's history. Our method is a promising approach, especially for social media text processing, where a user has many texts, some informative and some unrelated to the particular modeling task, yet whose interaction is indicative of the user's mental state. We attained the best ERDE 5 score of 0.015 and the second-best ERDE 50 score of 0.009 for pathological gambling detection. For the early detection of depression, we obtained the second-best ERDE 50 (0.027).</p><p>For future work, we aim to extend our method and construct a mechanism for encoding the relative order of a user's posts with a modified version of relative positional embeddings <ref type="bibr" target="#b38">[39]</ref>. While we chose an approach that ignores temporal ordering and processes posts as a set, preserving order is a natural way to increase the expressive power in modeling a user's entire social media history, similar to architectures such as the time-aware LSTM <ref type="bibr" target="#b26">[27]</ref>.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Proposed model architecture. We perform user-level classification by operating on a sample of K texts from a user. Texts are encoded with a pretrained sentence encoder and processed by a permutation-invariant transformer network. 
Binary cross-entropy loss is applied at the user level for a text-set.</figDesc><graphic coords="4,89.29,84.19,416.70,95.61" type="bitmap" /></figure>
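The user-level classification pipeline described above can be sketched in a few lines. This is a minimal, numpy-only illustration, not the authors' implementation: the attention-pooling weights, dimensions, and random embeddings are stand-ins for the pretrained sentence encoder and the full set transformer. It shows the key property the architecture relies on: pooling over the set makes the user-level score invariant to the order of the K posts.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def user_score(post_embeddings, w_attn, w_clf):
    """Score one user from a set of K post embeddings of shape (K, d).

    Each post gets a scalar relevance, the pooled vector is the
    relevance-weighted mean of the set, and a linear head produces the
    user-level logit. Because pooling sums over the set, the score is
    permutation-invariant, mirroring the set-transformer idea.
    """
    scores = post_embeddings @ w_attn          # (K,) per-post relevance
    alpha = softmax(scores)                    # attention weights over the set
    pooled = alpha @ post_embeddings           # (d,) weighted mean of posts
    return float(pooled @ w_clf)               # user-level logit

rng = np.random.default_rng(0)
K, d = 16, 8                                   # K texts per user, embedding dim
posts = rng.normal(size=(K, d))                # stand-in for encoder output
w_attn, w_clf = rng.normal(size=d), rng.normal(size=d)

logit = user_score(posts, w_attn, w_clf)
shuffled = posts[rng.permutation(K)]           # reorder the user's posts
assert np.isclose(logit, user_score(shuffled, w_attn, w_clf))
```

In training, this logit would be passed through a sigmoid and optimized with binary cross-entropy against the user-level label, as in the figure.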
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Performance of our model across training steps, in terms of F 1 score, for different sentence encoders (RoBERTa / MiniLM). We show the mean and standard deviation of the F 1 score across multiple values of 𝐾. For both tasks, RoBERTa yields consistently superior performance compared to MiniLM. Best viewed in color.</figDesc><graphic coords="5,89.29,255.96,416.69,111.73" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Performance of our model across training steps, in terms of validation F 1 score, for RoBERTa sentence embeddings, varying 𝐾, the number of texts per user. For Tasks 1 and 2, the best performance is attained with 𝐾 = 16 and 𝐾 = 32, respectively. Best viewed in color.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Texts from a particular user, relatively ranked by their attribution scores (contribution to a positive decision by the model), computed with the Integrated Gradients method. For each task, all texts belong to a single text-set of one user. The model is able to identify posts with clear discriminative information for each task. Best viewed in color. Examples have been paraphrased for anonymity.</figDesc><graphic coords="6,89.29,425.74,416.70,157.01" type="bitmap" /></figure>
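The attribution ranking in Figure 4 relies on Integrated Gradients <ref type="bibr" target="#b4">[5]</ref>. The following is a self-contained numpy illustration, not the paper's code: the quadratic model and its analytic gradient are hypothetical stand-ins for the real network and autograd. Attributions are the path integral of the gradient from a baseline to the input, approximated by a Riemann sum, and by the completeness axiom they sum to the difference in model output.

```python
import numpy as np

def model(x):
    # Hypothetical differentiable model: f(x) = sum(x_i^2)
    return np.sum(x ** 2)

def grad_model(x):
    # Analytic gradient of f (a real setup would use autograd)
    return 2.0 * x

def integrated_gradients(x, baseline, steps=200):
    """Riemann-sum approximation of Integrated Gradients:
    IG_i = (x_i - b_i) * integral_0^1 df(b + a*(x - b))/dx_i da."""
    alphas = (np.arange(steps) + 0.5) / steps   # midpoint rule over [0, 1]
    grads = np.stack([grad_model(baseline + a * (x - baseline)) for a in alphas])
    return (x - baseline) * grads.mean(axis=0)  # per-feature attribution

x = np.array([1.0, -2.0, 3.0])
b = np.zeros(3)                                 # all-zeros baseline
attr = integrated_gradients(x, b)
# Completeness axiom: attributions sum to f(x) - f(baseline)
assert np.isclose(attr.sum(), model(x) - model(b), atol=1e-3)
```

In the paper's setting, each feature would instead be a post's embedding, and per-post attributions (summed over embedding dimensions) give the relative ranking shown in Figure 4.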
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_0">https://www.reddit.com/r/relationship_advice/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_1">https://www.reddit.com/r/depression/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_2">https://erisk.irlab.org/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_3">Early Risk Detection Error, introduced in Section 5.1</note>
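The ERDE scores reported in the conclusion can be sketched as follows. This is an illustrative implementation of the standard eRisk formulation (a sigmoid latency cost on late true positives); the cost constants here, in particular c_fp = 0.13, are assumptions, since eRisk sets c_fp from the proportion of positive users in each dataset.

```python
import math

def erde(decision, truth, k, o, c_fp=0.13, c_fn=1.0, c_tp=1.0):
    """Early Risk Detection Error for one user.

    decision/truth: 1 = at risk, 0 = not at risk
    k: number of writings seen before a positive decision
    o: deadline parameter (5 or 50 for ERDE 5 / ERDE 50)
    c_fp is an assumed constant; eRisk derives it from the data.
    """
    if decision == 1 and truth == 1:
        # True positive: penalise delay with a sigmoid latency cost
        latency_cost = 1.0 - 1.0 / (1.0 + math.exp(k - o))
        return latency_cost * c_tp
    if decision == 1 and truth == 0:
        return c_fp                              # false positive
    if decision == 0 and truth == 1:
        return c_fn                              # false negative
    return 0.0                                   # true negative

# An early correct alarm (k=2, o=5) costs far less than a late one (k=40)
assert erde(1, 1, k=2, o=5) < erde(1, 1, k=40, o=5)
```

The reported ERDE 5 and ERDE 50 values are the mean of this per-user cost over all test users.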
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_4">In 2018, the test data was released in chunks of posts, not one post at a time as is the case in this year's tasks.</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>The work of Ana-Maria Bucur was carried out in the framework of the research project NPRP13S-0206-200281. The work of Paolo Rosso was carried out in the framework of the research project PROMETEO/2019/121 (DeepPattern), funded by the Generalitat Valenciana. The authors also thank the EU-FEDER Comunitat Valenciana 2014-2020 grant IDIFEDER/2018/025.</p></div>
			</div>


			<div type="funding">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>CEUR Workshop Proceedings (CEUR-WS.org)</p><p>1 https://www.statista.com/statistics/272014/global-social-networks-ranked-by-number-of-users/</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Mental health discourse on Reddit: Self-disclosure, social support, and anonymity</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">De</forename><surname>Choudhury</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>De</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Eighth International AAAI Conference on Weblogs and Social Media</title>
				<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Detection of suicide ideation in social media forums using deep learning</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">M</forename><surname>Tadesse</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Yang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Algorithms</title>
		<imprint>
			<biblScope unit="volume">13</biblScope>
			<biblScope unit="page">7</biblScope>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">A dataset for research on depression in social media</title>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">A</forename><surname>Ríssola</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">A</forename><surname>Bahrainian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Crestani</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 28th ACM Conference on User Modeling, Adaptation and Personalization</title>
				<meeting>the 28th ACM Conference on User Modeling, Adaptation and Personalization</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="338" to="342" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Early risk detection of pathological gambling, self-harm and depression using BERT</title>
		<author>
			<persName><forename type="first">A.-M</forename><surname>Bucur</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Cosma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">P</forename><surname>Dinu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CLEF (Working Notes)</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Axiomatic attribution for deep networks</title>
		<author>
			<persName><forename type="first">M</forename><surname>Sundararajan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Taly</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Yan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International conference on machine learning</title>
				<meeting><address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="3319" to="3328" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Overview of erisk 2022: Early risk prediction on the internet</title>
		<author>
			<persName><forename type="first">J</forename><surname>Parapar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">M</forename><surname>Rodilla</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">E</forename><surname>Losada</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">A</forename><surname>Crestani</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Experimental IR Meets Multilinguality, Multimodality, and Interaction. 13th International Conference of the CLEF Association</title>
				<meeting><address><addrLine>CLEF</addrLine></address></meeting>
		<imprint>
			<publisher>Springer International Publishing</publisher>
			<date type="published" when="2022">2022. 2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Identifying high-risk online gamblers: A comparison of data mining procedures</title>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">S</forename><surname>Philander</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Gambling Studies</title>
		<imprint>
			<biblScope unit="volume">14</biblScope>
			<biblScope unit="page" from="53" to="63" />
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Applying data science to behavioral analysis of online gambling</title>
		<author>
			<persName><forename type="first">X</forename><surname>Deng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lesch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Clark</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Current Addiction Reports</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="page" from="159" to="164" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Personality biomarkers of pathological gambling: A machine learning study</title>
		<author>
			<persName><forename type="first">A</forename><surname>Cerasa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Lofaro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Cavedini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Martino</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Bruni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sarica</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Mauro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Merante</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Rossomanno</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Rizzuto</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of neuroscience methods</title>
		<imprint>
			<biblScope unit="volume">294</biblScope>
			<biblScope unit="page" from="7" to="14" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Early detection of signs of pathological gambling, self-harm and depression through topic extraction and neural networks</title>
		<author>
			<persName><forename type="first">D</forename><surname>Maupomé</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">D</forename><surname>Armstrong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Rancourt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Soulas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-J</forename><surname>Meurs</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CLEF (Working Notes)</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">UNSL at eRisk 2021: A comparison of three early alert policies for early risk detection</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">M</forename><surname>Loyola</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Burdisso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Thompson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Cagnina</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Errecalde</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CLEF (Working Notes)</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Language use of depressed and depression-vulnerable college students</title>
		<author>
			<persName><forename type="first">S</forename><surname>Rude</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E.-M</forename><surname>Gortner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Pennebaker</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Cognition &amp; Emotion</title>
		<imprint>
			<biblScope unit="volume">18</biblScope>
			<biblScope unit="page" from="1121" to="1133" />
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">A psychologically informed part-of-speech analysis of depression in social media</title>
		<author>
			<persName><forename type="first">A.-M</forename><surname>Bucur</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><forename type="middle">R</forename><surname>Podină</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">P</forename><surname>Dinu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP)</title>
				<meeting>the International Conference on Recent Advances in Natural Language Processing (RANLP)</meeting>
		<imprint>
			<date type="published" when="2021">2021. 2021</date>
			<biblScope unit="page" from="199" to="207" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">The internet-a new source of data on suicide, depression and anxiety: a preliminary study</title>
		<author>
			<persName><forename type="first">S</forename><surname>Fekete</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Archives of Suicide Research</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="page" from="351" to="361" />
			<date type="published" when="2002">2002</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Language patterns discriminate mild depression from normal sadness and euthymic state</title>
		<author>
			<persName><forename type="first">D</forename><surname>Smirnova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Cumming</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Sloeva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Kuvshinova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Romanov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Nosachev</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Frontiers in psychiatry</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="page">105</biblScope>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Linguistic inquiry and word count: Liwc 2001</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">W</forename><surname>Pennebaker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">E</forename><surname>Francis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">J</forename><surname>Booth</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Mahwah, NJ</title>
		<imprint>
			<biblScope unit="volume">71</biblScope>
			<date type="published" when="2001">2001</date>
			<publisher>Lawrence Erlbaum Associates</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Linguistic metadata augmented classifiers at the CLEF 2017 task for early detection of depression</title>
		<author>
			<persName><forename type="first">M</forename><surname>Trotzek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Koitka</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">M</forename><surname>Friedrich</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CLEF (Working Notes)</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Temporal variation of terms as concept space for early risk prediction</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">L</forename><surname>Errecalde</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">P</forename><surname>Villegas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">G</forename><surname>Funez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">J G</forename><surname>Ucelay</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">C</forename><surname>Cagnina</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CLEF (Working Notes)</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">G</forename><surname>Funez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">J G</forename><surname>Ucelay</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">P</forename><surname>Villegas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Burdisso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">C</forename><surname>Cagnina</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Montes-Y Gómez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Errecalde</surname></persName>
		</author>
		<title level="m">Unsl&apos;s participation at erisk 2018 lab</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note>CLEF (Working Notes)</note>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Word embeddings and linguistic metadata at the CLEF 2018 tasks for early detection of depression and anorexia</title>
		<author>
			<persName><forename type="first">M</forename><surname>Trotzek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Koitka</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">M</forename><surname>Friedrich</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CLEF (Working Notes)</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Detecting early onset of depression from social media text using learned confidence scores</title>
		<author>
			<persName><forename type="first">A</forename><surname>Bucur</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">P</forename><surname>Dinu</surname></persName>
		</author>
		<ptr target="CEUR-WS.org" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Seventh Italian Conference on Computational Linguistics, CLiC-it 2020</title>
		<title level="s">CEUR Workshop Proceedings</title>
		<meeting>the Seventh Italian Conference on Computational Linguistics, CLiC-it 2020<address><addrLine>Bologna, Italy</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2020">March 1-3, 2021. 2020</date>
			<biblScope unit="volume">2769</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Detecting mental disorders in social media through emotional patterns-the case of anorexia and depression</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">E</forename><surname>Aragon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">P</forename><surname>Lopez-Monroy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L.-C</forename><forename type="middle">G</forename><surname>Gonzalez-Gurrola</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Montes</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Affective Computing</title>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">An emotion and cognitive based analysis of mental health disorders from social media data</title>
		<author>
			<persName><forename type="first">A.-S</forename><surname>Uban</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Chulvi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Rosso</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Future Generation Computer Systems</title>
		<imprint>
			<biblScope unit="volume">124</biblScope>
			<biblScope unit="page" from="480" to="494" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Attention is all you need</title>
		<author>
			<persName><forename type="first">A</forename><surname>Vaswani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Shazeer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Parmar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Uszkoreit</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Jones</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">N</forename><surname>Gomez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ł</forename><surname>Kaiser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Polosukhin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in neural information processing systems</title>
		<imprint>
			<biblScope unit="volume">30</biblScope>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Convolutional sequence to sequence learning</title>
		<author>
			<persName><forename type="first">J</forename><surname>Gehring</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Auli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Grangier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Yarats</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">N</forename><surname>Dauphin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Machine Learning</title>
				<meeting><address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="1243" to="1252" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Set transformer: A framework for attention-based permutation-invariant neural networks</title>
		<author>
			<persName><forename type="first">J</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kosiorek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Choi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">W</forename><surname>Teh</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Machine Learning</title>
				<meeting><address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="3744" to="3753" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Patient subtyping via time-aware LSTM networks</title>
		<author>
			<persName><forename type="first">I</forename><forename type="middle">M</forename><surname>Baytas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Xiao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">K</forename><surname>Jain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Zhou</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining</title>
				<meeting>the 23rd ACM SIGKDD international conference on knowledge discovery and data mining</meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="65" to="74" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">M</forename><surname>Kazemi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Goel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Eghbali</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ramanan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sahota</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Thakur</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Smyth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Poupart</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Brubaker</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1907.05321</idno>
		<title level="m">Time2vec: Learning a vector representation of time</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b28">
	<monogr>
		<title level="m" type="main">RoBERTa: A robustly optimized BERT pretraining approach</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ott</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Du</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Joshi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Levy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Stoyanov</surname></persName>
		</author>
		<idno>CoRR abs/1907.11692</idno>
		<ptr target="http://arxiv.org/abs/1907.11692" />
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<analytic>
		<title level="a" type="main">MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers</title>
		<author>
			<persName><forename type="first">W</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Dong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Bao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zhou</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="page" from="5776" to="5788" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<monogr>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">P</forename><surname>Kingma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ba</surname></persName>
		</author>
		<ptr target="http://arxiv.org/abs/1412.6980" />
		<title level="m">Adam: A method for stochastic optimization</title>
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
	<note>ICLR (Poster)</note>
</biblStruct>

<biblStruct xml:id="b31">
	<analytic>
		<title level="a" type="main">Cyclical learning rates for training neural networks</title>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">N</forename><surname>Smith</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2017 IEEE Winter Conference on Applications of Computer Vision (WACV)</title>
		<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="464" to="472" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b32">
	<analytic>
		<title level="a" type="main">A unified approach to interpreting model predictions</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">M</forename><surname>Lundberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S.-I</forename><surname>Lee</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in neural information processing systems</title>
		<imprint>
			<biblScope unit="volume">30</biblScope>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b33">
	<analytic>
		<title level="a" type="main">&quot;Why should I trust you?&quot; Explaining the predictions of any classifier</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">T</forename><surname>Ribeiro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Singh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Guestrin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining</title>
				<meeting>the 22nd ACM SIGKDD international conference on knowledge discovery and data mining</meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="1135" to="1144" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b34">
	<analytic>
		<title level="a" type="main">A test collection for research on depression and language use</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">E</forename><surname>Losada</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Crestani</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference of the Cross-Language Evaluation Forum for European Languages</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="28" to="39" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b35">
	<monogr>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">E</forename><surname>Losada</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">A</forename><surname>Crestani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Parapar</surname></persName>
		</author>
		<title level="m">Overview of eRisk at CLEF 2019: Early risk prediction on the Internet (extended overview)</title>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note>CLEF (Working Notes)</note>
</biblStruct>

<biblStruct xml:id="b36">
	<analytic>
		<title level="a" type="main">Measuring the latency of depression detection in social media</title>
		<author>
			<persName><forename type="first">F</forename><surname>Sadeque</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bethard</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining</title>
				<meeting>the Eleventh ACM International Conference on Web Search and Data Mining</meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="495" to="503" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b37">
	<analytic>
		<title level="a" type="main">A dataset for research on depression in social media</title>
		<author>
			<persName><forename type="first">E</forename><surname>Ríssola</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">A</forename><surname>Bahrainian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Crestani</surname></persName>
		</author>
		<idno type="DOI">10.1145/3340631.3394879</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 28th ACM Conference on User Modeling, Adaptation and Personalization</title>
				<meeting>the 28th ACM Conference on User Modeling, Adaptation and Personalization</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="338" to="342" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b38">
	<analytic>
		<title level="a" type="main">Explore better relative position embeddings from encoding perspective for transformer models</title>
		<author>
			<persName><forename type="first">A</forename><surname>Qu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Niu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Mo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing</title>
				<meeting>the 2021 Conference on Empirical Methods in Natural Language Processing</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="2989" to="2997" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
