Early detection of depression with linear models using hand-crafted and contextual features

Ilija Tavchioski 1,3, Blaž Škrlj 1, Senja Pollak 1,2 and Boshko Koloski 1,2
1 Jožef Stefan Institute, Jamova cesta 39, 1000 Ljubljana, Slovenia
2 International Postgraduate School Jožef Stefan Institute, Jamova cesta 39, 1000 Ljubljana, Slovenia
3 Faculty of Computer and Information Sciences, Večna Pot 113, 1000 Ljubljana, Slovenia

Abstract
Depression is a mental illness that affects millions of people, and its early detection is of great importance for diagnosis and treatment. In this work, we describe our solution submitted to the shared task on Early Detection of Depression organized at CLEF, which placed 8th out of 13 teams in terms of F1-score while achieving the best precision. The result was obtained with one of the least computationally expensive approaches. Our approach relies on linear models, such as logistic regression, trained on different representations of the input documents.

Keywords: Depression detection, Document classification, Natural Language Processing, Machine Learning, Social Media

1. Introduction
This work presents our solution to the problem of early depression detection proposed as a shared task in [1]. Since the appearance of social media and their exponential growth in users throughout the past decade, social media have become an important part of our daily lives and have opened new ways to express ourselves. A study [2] has shown that our behavior on social media does not differ greatly from our behavior in real life, which opens the possibility of inferring health-related problems, especially mental illnesses such as depression, from social media posts. Motivated by this, we use machine learning and natural language processing techniques to detect signs of depression as early as possible.

The remainder of this paper is structured as follows: in Section 2 we present the background and related work on detecting depression from texts using machine learning, in Section 3 we describe the data provided by the task's organizers, in Section 4 we describe our proposed methodology, in Sections 5 and 6 we present the final classification procedure and the evaluation measures, in Section 7 we present our results on the internal and official test sets, and in Section 8 we present conclusions and further work.

CLEF 2022: Conference and Labs of the Evaluation Forum, September 5–8, 2022, Bologna, Italy
ilijatavchioski@gmail.com (I. Tavchioski); blaz.skrlj@ijs.si (B. Škrlj); senja.pollak@ijs.si (S. Pollak); boshko.koloski@ijs.si (B. Koloski)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2. Background and Related work
The problem addressed in this work is defined as follows: given a user and a list of their posts on the social platform Reddit, decide from which post onwards the user is depressed. Our goal is to develop a method that sequentially processes each round of posts from all users and determines whether a user is depressed. If some post is labeled as indicating depression, we consider the user depressed; the assessment is final and their additional posts are not taken into account. Several solutions using machine learning and natural language processing have been proposed in the past few years for similar or almost identical tasks, prior to this year's edition.
Analyzing and profiling users on social media represents an active research area. Argamon et al. [3] did pioneering work in author profiling on the British National Corpus and concluded that men tend to write in a less formal way. Litvinova et al. [4] showed that the distinction between male and female authors can be captured based on a person's use of adjectives. This line of work led to several shared tasks aimed at author profiling [5, 6, 7, 8, 9, 10, 11, 12]. These shared tasks profiled users along various dimensions: gender, age, occupation, whether the users are potential spreaders of fake or hateful news, and so on. The highest-scoring approaches to these tasks included models based on simpler linguistic features classified with either Support Vector Machines or Logistic Regression. Martinc et al. [13] focused on creating TF-IDF weighted n-gram features based on both word and character n-grams, classified via Support Vector Machines. Koloski et al. [14] improved this approach by introducing singular value decomposition of the n-gram space. The proposed representation performed competitively in a multilingual setting for the task of fake news spreader identification [15]. In the domain of depression detection, Basile et al. [16] proposed a solution based on Hierarchical Attention Networks, constructed over the 20,000 most frequent words and initialized with GloVe [17] embeddings. Campillo et al. [18] proposed a solution using TF-IDF weighted representations derived from word features, with a Support Vector Machine model for classification. A key feature of this work was that the authors also included the position of a given post in the series of posts, hypothesizing that a post earlier in the sequence can have a higher impact on the class prediction. BERT-based architectures are also commonly used for the detection of depression. One solution using large pre-trained models is the method by Castaño et al. [19], where the authors used XLM-RoBERTa [20] with an additional classification head. Transfer learning has recently gained traction as a paradigm for reusing the knowledge a language model acquired by solving an unsupervised task. Spartalis et al. [21] followed this paradigm by combining SBERT [22] representations with feature extraction and classical machine learning.

3. Data set
The shared task consisted of two stages: a development stage and a test stage. In each stage we were given up to N users and their corresponding k posts, accompanied by a binary label (depressed or not depressed). In the development stage, the organizers provided training data consisting of the training and test data from the eRisk 2017 edition and the test data from the eRisk 2018 edition. A total of 1,618 users were given, with up to 2,000 posts per user. The data was skewed towards the non-depressed class, with more than 90% of the users labeled as non-depressed. In order to build and evaluate models internally, we split the data into a training set (80% of the provided training data) and a development set (20%), preserving the class distribution. Table 1 shows the data distribution per split. For learning and modelling purposes, we first pair every given post of a user with that user's binary label (depressed or not depressed). In this manner we acquire 843,554 training data points and 204,874 development data points, the latter serving as our internal test set.
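As a minimal sketch of this pairing step (the input format and helper name below are hypothetical illustrations, not the organizers' actual data format), user-level data can be flattened into post-level training points as follows:

def flatten_users(users):
    # Pair every post of a user with that user's binary label.
    # `users` is assumed to be a list of dicts of the (hypothetical) form
    # {"label": 0 or 1, "posts": ["text of post 1", ...]}.
    texts, labels = [], []
    for user in users:
        for post in user["posts"]:
            texts.append(post)
            labels.append(user["label"])
    return texts, labels

# Example with two users: one depressed (1) and one not depressed (0).
users = [
    {"label": 1, "posts": ["first post", "second post"]},
    {"label": 0, "posts": ["another post"]},
]
texts, labels = flatten_users(users)  # three post-level data points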
Table 1: Data distribution. The users columns indicate the number of users per split, while the writings columns give the total number of posts per split.

                  Training data                      Development data                  Test data
Label             Users            Writings          Users           Writings          Users        Writings
Depressed         135 (10.57%)     75,976 (9.01%)    42 (12.31%)     17,916 (8.74%)    98 (7%)      35,332
Not depressed     1,142 (89.42%)   767,578 (90.99%)  299 (87.68%)    186,958 (91.25%)  1,302 (93%)  687,228
All               1,277            843,554           341             204,874           1,400        722,560

4. Methodology
We treat the problem as a binary document classification problem. In this section we describe the chosen document representations, followed by the classifier description, and finally we explain our final task modelling.

4.1. Document representation
We consider two different document representation methods: one based on Latent Semantic Analysis (see Section 4.1.1) and one contextual, based on sentence transformers (see Section 4.1.2). We use the implementation by Koloski et al. [23] from the c19 Python package (https://github.com/bkolosk1/c19_rep). For the classification model we use the Logistic Regression implementation from scikit-learn [24].

4.1.1. Latent Semantic Analysis
For our first representation method we use the LSA implementation of [23], which builds n-gram features and reduces them via SVD into a latent space of lower dimensionality. The method has two hyper-parameters: n, the total number of n-gram features, and d, the dimension of the latent space. The method first pre-processes the documents by removing punctuation, hashtags, URLs and stop-words. Next, POS tags are extracted with the NLTK library [25]. The method then constructs n/2 features on the basis of TF-IDF weighted word uni-grams and bi-grams, and n/2 features of TF-IDF weighted character bi-grams and tri-grams. Finally, SVD is applied to create the new latent space and simultaneously reduce the dimensionality to d. We searched extensively through the parameter space n ∈ {500, 1000, 1500, 2000} and d ∈ {64, 128, 256, 512}; the best-performing hyper-parameters were n = 1000 and d = 256.

4.1.2. Contextual Features
For our second method we use the distilbert-base-nli-mean-tokens model from the sentence-transformers library [22] to map the writings to a dense vector space of 768 dimensions. Using the obtained vector representations, we then classify the writings with the aforementioned linear model. Sentence-BERT is a BERT [26] based model used to derive semantically meaningful vector representations of documents at the sentence level; it adds a pooling operation in order to aggregate token representations into a single vector. We considered the following variants of the SBERT model: DistilBERT [27], RoBERTa [28] and XLM-RoBERTa [20].

5. Final Classification
We train a Logistic Regression classifier on top of the aforementioned features, with the regularization parameter C set to 1. For each representation, we learn a classifier on the training set and evaluate it on the development set, using the F1-score as the evaluation measure.

5.1. User classification
We process a user's posts sequentially, starting with the first one. If a post is predicted as depressive by our system, we label the user as depressed; the assessment is final and we automatically return this answer for every subsequent post until the end of the run. If a post is not predicted as depressive, we return that the user is not depressed and proceed to classify their next post. If no post is predicted as depressive, we return that the user is not depressed at every step up to the last query to our system.
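To make the pipeline from Sections 4 and 5 more concrete, the following sketch approximates the LSA representation from Section 4.1.1 with standard scikit-learn components. It is a simplified stand-in for the c19 implementation [23]: it omits the pre-processing and POS-tagging steps, and the function names are ours.

from scipy.sparse import hstack
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

n, d = 1000, 256  # best-performing hyper-parameters from Section 4.1.1

# n/2 TF-IDF weighted word uni- and bi-gram features ...
word_vec = TfidfVectorizer(analyzer="word", ngram_range=(1, 2), max_features=n // 2)
# ... and n/2 TF-IDF weighted character bi- and tri-gram features.
char_vec = TfidfVectorizer(analyzer="char", ngram_range=(2, 3), max_features=n // 2)

def fit_lsa(train_texts):
    # Fit the TF-IDF vectorizers and the SVD projection on the training texts.
    features = hstack([word_vec.fit_transform(train_texts),
                       char_vec.fit_transform(train_texts)])
    svd = TruncatedSVD(n_components=d)
    train_repr = svd.fit_transform(features)  # dense d-dimensional LSA space
    return svd, train_repr

def transform_lsa(svd, texts):
    # Project new texts into the learned latent space.
    features = hstack([word_vec.transform(texts), char_vec.transform(texts)])
    return svd.transform(features)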
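A second sketch shows how the contextual representation (Section 4.1.2), the logistic regression classifier (Section 5) and the sequential per-user decision rule (Section 5.1) fit together. It assumes the sentence-transformers and scikit-learn packages; the helper names are ours and this is an illustration rather than our released code.

from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

encoder = SentenceTransformer("distilbert-base-nli-mean-tokens")  # 768-dim embeddings (Section 4.1.2)
clf = LogisticRegression(C=1.0, max_iter=1000)  # C=1 as in Section 5; max_iter is our convenience setting

def train(train_texts, train_labels):
    # Encode the training posts and fit the linear classifier.
    clf.fit(encoder.encode(train_texts), train_labels)

def classify_user(posts):
    # Sequential per-user decision rule from Section 5.1: posts are processed
    # one by one; as soon as one post is predicted as depressive, the user is
    # labelled depressed and the decision is final for all remaining steps.
    answers, depressed = [], False
    for post in posts:
        if not depressed:
            depressed = bool(clf.predict(encoder.encode([post]))[0])
        answers.append(int(depressed))
    return answers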
6. Measures
We evaluate our models with the following measures:

• Precision is the percentage of the positive predictions that are actually positive in the test set.
• Recall is the proportion of the positive instances in the test set that are predicted as positive by our model.
• F1-score is the harmonic mean of precision and recall.
• ERDE [29] is a metric, provided by the task's organizers, that penalizes late responses for correct depression predictions.
• Speed measures how quickly the model issues a positive prediction for the instances that are labeled as depressed.

7. Experiments and Evaluation Results
In this section we describe the evaluation setting with the corresponding measures and the evaluation results, both on the internal and on the official test set.

7.1. Internal evaluation
In this subsection we explain our internal experimental setup, performed on the internal data split defined in Section 3, followed by the results presented in Table 2. As described in the previous sections, we first construct a representation (in the case of LSA) or obtain it directly (in the case of SBERT) on the training set, and then train a classifier that we evaluate on the development set. The LSA representation outscored the three other models in terms of precision, achieving a score of 0.5385. DistilBERT was next in line, trailing by 0.2301 in precision, followed by XLM-RoBERTa and RoBERTa. In terms of recall, the best-performing model was RoBERTa with a score of 0.9524, followed by XLM-RoBERTa, while LSA came last with a recall of 0.1667. The best-performing model in terms of F1-score was DistilBERT with a score of 0.4430, followed by XLM-RoBERTa, RoBERTa and finally LSA. A more granular evaluation of the DistilBERT model is presented in Figure 1. We chose the LSA model for our first submission, as it had the highest precision, and the DistilBERT model for our second submission, as it produced the predictions with the highest F1-score.

Table 2: Internal evaluation results.

Method         Precision   Recall   F1-score
DistilBERT     0.3084      0.7857   0.4430
RoBERTa        0.1961      0.9524   0.3252
XLM-RoBERTa    0.2031      0.9286   0.3333
LSA            0.5385      0.1667   0.2545

Figure 1: Confusion matrix of the predictions of the DistilBERT model on the internal test set.

7.2. Official Test Set Results
Table 3 presents the results achieved on the official test set provided by the organizers. In addition to our results, we include the top three results from other teams that also processed all 2,000 writings, for comparison. Our LSA solution achieved the best score in terms of precision, 0.684, while our second run, based on DistilBERT, achieved a recall of 0.959, only 0.041 behind the best recall score. In terms of latencyTP and speed we achieved the best scores. Overall, we ranked 8th out of 13 teams.

Table 3: Official test results.

Method           Precision   Recall   F1      ERDE50   Speed   Writings processed
LSA              0.684       0.133    0.222   0.061    1.000   2000/2000
CF distillBERT   0.242       0.959    0.387   0.036    0.924   2000/2000
SCIR2-run3       0.316       0.847    0.460   0.026    0.834   2000/2000
UNSL-run2        0.400       0.755    0.523   0.026    0.992   2000/2000
BLUE-run0        0.395       0.898    0.548   0.027    0.984   2000/2000
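As a side note, the precision, recall, F1-score and confusion matrix reported in Tables 2 and 3 and Figure 1 correspond to the standard scikit-learn measures; a minimal sketch with hypothetical label vectors is given below. ERDE and speed are computed by the official eRisk evaluation and are not reproduced here.

from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical gold labels and model predictions (1 = depressed, 0 = not depressed).
y_true = [1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1]

print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("Recall:", recall_score(y_true, y_pred))         # TP / (TP + FN)
print("F1:", f1_score(y_true, y_pred))                 # harmonic mean of the two
print(confusion_matrix(y_true, y_pred))                # the basis for Figure 1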
8. Conclusion and further work
In our attempt to solve this shared task organized at CLEF, we considered light machine learning models, such as logistic regression, on different input representations based on LSA and on contextual features. Although we achieved a decent performance in terms of F1-score (0.387), we still lag behind the top results in this task. On the other hand, we achieved the best precision with the first method and an almost perfect recall with the second method, which indicates the potential of our methods for detecting depressed users early, as the second method also performed well on the ERDE50 metric. The official results indicate that the model produces many false negatives (low recall), while false positives are rare (high precision). This is not necessarily optimal for a practical application, where either F1-score or recall can have greater practical relevance; however, it could indicate the method's usefulness in particular scenarios aimed at identifying existing patients that were mis-treated. It is also worth mentioning that we were one of the 7 teams that processed all the writings, thanks to the low computational cost of our model, which also allowed us to achieve the best speed on the track. Currently, we make predictions based on a single writing only; this points towards further improving these methods by using multiple writings to obtain a more meaningful signal from users' writings. Additional further work could improve the performance of our system by building ensembles of classifiers, including background knowledge, or testing AutoML systems for automatic feature construction and classifier selection.

9. Availability
The code can be found at: https://gitlab.com/teletton/erisk-task2-depression.

10. Acknowledgments
This work was supported by the Slovenian Research Agency (ARRS) grants for the core programme Knowledge technologies (P2-0103) and the project Computer-assisted multilingual news discourse analysis with contextual embeddings (CANDAS, J6-2581).

References
[1] J. Parapar, P. Martín-Rodilla, D. E. Losada, F. Crestani, Overview of eRisk 2021: Early risk prediction on the internet, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction: 12th International Conference of the CLEF Association, CLEF 2021, Virtual Event, September 21–24, 2021, Proceedings, Springer-Verlag, Berlin, Heidelberg, 2021, pp. 324–344. URL: https://doi.org/10.1007/978-3-030-85251-1_22. doi:10.1007/978-3-030-85251-1_22.
[2] T. C. Marriott, T. Buchanan, The true self online: Personality correlates of preference for self-expression online, and observer ratings of personality online and offline, Comput. Hum. Behav. 32 (2014) 171–177.
[3] S. Argamon, M. Koppel, J. Fine, A. R. Shimoni, Gender, genre, and writing style in formal written texts, Text & Talk 23 (2003) 321–346.
[4] T. Litvinova, O. Zagorovskaya, O. Litvinova, P. Seredin, Profiling a set of personality traits of a text's author: a corpus-based approach, in: International Conference on Speech and Computer, Springer, 2016, pp. 555–562.
[5] F. Rangel, P. Rosso, M. Koppel, E. Stamatatos, G.
Inches, Overview of the author profiling task at PAN 2013, in: CLEF Conference on Multilingual and Multimodal Information Access Evaluation, CELCT, 2013, pp. 352–365.
[6] F. Rangel, P. Rosso, I. Chugur, M. Potthast, M. Trenkmann, B. Stein, B. Verhoeven, W. Daelemans, Overview of the 2nd author profiling task at PAN 2014, in: CLEF 2014 Evaluation Labs and Workshop Working Notes Papers, Sheffield, UK, 2014, pp. 1–30.
[7] F. M. Rangel Pardo, F. Celli, P. Rosso, M. Potthast, B. Stein, W. Daelemans, Overview of the 3rd author profiling task at PAN 2015, in: CLEF 2015 Evaluation Labs and Workshop Working Notes Papers, 2015, pp. 1–8.
[8] F. Rangel, P. Rosso, B. Verhoeven, W. Daelemans, M. Potthast, B. Stein, Overview of the 4th author profiling task at PAN 2016: cross-genre evaluations, in: Working Notes Papers of the CLEF 2016 Evaluation Labs, CEUR Workshop Proceedings, K. Balog et al. (Eds.), 2016, pp. 750–784.
[9] F. Rangel, P. Rosso, M. Potthast, B. Stein, Overview of the 5th author profiling task at PAN 2017: Gender and language variety identification in Twitter, Working Notes Papers of the CLEF (2017) 1613–0073.
[10] F. Rangel, P. Rosso, M. Montes-y Gómez, M. Potthast, B. Stein, Overview of the 6th author profiling task at PAN 2018: multimodal gender identification in Twitter, Working Notes Papers of the CLEF (2018) 1–38.
[11] F. Rangel, P. Rosso, Overview of the 7th author profiling task at PAN 2019: bots and gender profiling in Twitter, in: Working Notes Papers of the CLEF 2019 Evaluation Labs, volume 2380 of CEUR Workshop Proceedings, 2019.
[12] F. Rangel, A. Giachanou, B. H. H. Ghanem, P. Rosso, Overview of the 8th author profiling task at PAN 2020: Profiling fake news spreaders on Twitter, in: CEUR Workshop Proceedings, volume 2696, Sun SITE Central Europe, 2020, pp. 1–18.
[13] M. Martinc, I. Skrjanec, K. Zupan, S. Pollak, PAN 2017: Author profiling - gender and language variety prediction, in: CLEF (Working Notes), 2017.
[14] B. Koloski, S. Pollak, B. Skrlj, Know your neighbors: Efficient author profiling via follower tweets, in: CLEF (Working Notes), 2020.
[15] B. Koloski, S. Pollak, B. Skrlj, Multilingual detection of fake news spreaders via sparse matrix factorization, in: CLEF (Working Notes), 2020.
[16] A. Basile, M. Chinea-Rios, A.-S. Uban, T. Müller, L. Rössler, S. Yenikent, B. Chulví, P. Rosso, M. Franco-Salvador, UPV-Symanto at eRisk 2021: Mental health author profiling for early risk prediction on the internet, Working Notes of CLEF (2021) 21–24.
[17] J. Pennington, R. Socher, C. Manning, GloVe: Global vectors for word representation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Doha, Qatar, 2014, pp. 1532–1543. URL: https://aclanthology.org/D14-1162. doi:10.3115/v1/D14-1162.
[18] E. Campillo-Ageitos, H. Fabregat, L. Araujo, J. Martinez-Romo, NLP-UNED at eRisk 2021: self-harm early risk detection with TF-IDF and linguistic features, Working Notes of CLEF (2021) 21–24.
[19] R. Martínez-Castaño, A. Htait, L. Azzopardi, Y. Moshfeghi, BERT-based transformers for early detection of mental health illnesses, in: K. S. Candan, B. Ionescu, L. Goeuriot, B. Larsen, H. Müller, A. Joly, M. Maistro, F. Piroi, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction, Springer International Publishing, Cham, 2021, pp. 189–200.
[20] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V.
Stoyanov, Unsupervised cross-lingual representation learning at scale, CoRR abs/1911.02116 (2019). URL: http://arxiv.org/abs/1911.02116. arXiv:1911.02116.
[21] C. Spartalis, G. Drosatos, A. Arampatzis, Transfer learning for automated responses to the BDI questionnaire, Working Notes of CLEF (2021) 21–24.
[22] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using Siamese BERT-networks, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2019. URL: http://arxiv.org/abs/1908.10084.
[23] B. Koloski, T. Stepišnik-Perdih, S. Pollak, B. Škrlj, Identification of COVID-19 related fake news via neural stacking, in: T. Chakraborty, K. Shu, H. R. Bernard, H. Liu, M. S. Akhtar (Eds.), Combating Online Hostile Posts in Regional Languages during Emergency Situation, Springer International Publishing, Cham, 2021, pp. 177–188.
[24] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.
[25] S. Bird, E. Klein, E. Loper, Natural language processing with Python: analyzing text with the natural language toolkit, O'Reilly Media, Inc., 2009.
[26] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, CoRR abs/1810.04805 (2018). URL: http://arxiv.org/abs/1810.04805. arXiv:1810.04805.
[27] V. Sanh, L. Debut, J. Chaumond, T. Wolf, DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, CoRR abs/1910.01108 (2019). URL: http://arxiv.org/abs/1910.01108. arXiv:1910.01108.
[28] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, CoRR abs/1907.11692 (2019). URL: http://arxiv.org/abs/1907.11692. arXiv:1907.11692.
[29] D. E. Losada, F. A. Crestani, A test collection for research on depression and language use, in: CLEF, 2016.