Transfer Learning for Automated Responses to the
BDI Questionnaire
Christoforos Spartalis1 , George Drosatos2 and Avi Arampatzis1
1 Department of Electrical and Computer Engineering, Democritus University of Thrace, Xanthi 67100, Greece
2 Institute for Language and Speech Processing, Athena Research Center, Xanthi 67100, Greece


Abstract
This paper describes the participation of the DUTH-ATHENA team of Democritus University of Thrace and Athena Research Center in the eRisk 2021 task, which focuses on measuring the level of depression based on Reddit users' posts. We address this task using both feature-based and fine-tuning strategies for applying BERT-based representations. In the feature-based approaches, we examine the potential of an SBERT model based on RoBERTa, pre-trained on Natural Language Inference (NLI) data and fine-tuned on the STS benchmark (STSb) dataset, to transfer learned representations to depression-level estimation, and we achieve promising results. One of our runs ranks first in Average Hit Rate (AHR), while the others rank among the best four in the remaining evaluation metrics. For the fine-tuning strategy, we propose two predictive models built upon RoBERTa, which provide directions for future optimization.

Keywords
Transfer Learning, SBERT, RoBERTa, Depression Level, Social Media, Reddit.

CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania
Emails: chrispar@ee.duth.gr (C. Spartalis); gdrosato@athenarc.gr (G. Drosatos); avi@ee.duth.gr (A. Arampatzis)
Homepages: https://www.drosatos.info (G. Drosatos); http://www.aviarampatzis.com (A. Arampatzis)
ORCID: 0000-0001-8228-4235 (C. Spartalis); 0000-0002-8130-5775 (G. Drosatos); 0000-0003-2415-4592 (A. Arampatzis)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org, ISSN 1613-0073).




1. Introduction
In the last decade, social interactions taking place in the digital world have increased [1].
This development expands the potential of monitoring systems that detect users who suffer
from mental health conditions. Several studies have pursued this goal using data from social
media platforms, such as Facebook [2], Twitter [3], Reddit [4], and others. CLEF eRisk
(https://erisk.irlab.org) contributes to this direction.
   CLEF's eRisk lab [5] was launched in 2017, introducing the test collection and evaluation metrics
proposed in [6]. Since 2017, the eRisk shared tasks have paved the way for the early detection of
signs of depression, self-harm, and anorexia [7]. Recently, a new challenge concerning
pathological gambling was introduced.
   Since 2019, eRisk has organized a task oriented toward automatically filling in a depression
questionnaire based on user interactions in social media. Beck's Depression Inventory (BDI)
[8] consists of 21 questions that assess the presence of feelings and mental states, such as:

    • Sadness, pessimism, agitation, irritability, guilt, and feelings of punishment.
    • Self-dislike, self-criticalness, worthlessness, tiredness, and indecisiveness.
    • Changes in sleep patterns and appetite.
    • Loss of pleasure and energy, loss of interest in sex and in general.
    • Crying, failure, concentration difficulty, suicidal thoughts and wishes.

   The performance of the approaches proposed in previous years to handle this task can be
found in [9, 10]. The approaches discussed in this paper are based on modifications of
the BERT model [11]. We addressed the eRisk task as a downstream task and deployed both
of the existing strategies for applying pre-trained language representations to it (i.e., feature-
based and fine-tuning). Regarding the former, we extract Reddit post representations from an
SBERT pre-trained model [12], on the basis of which we build the predictive models. Regarding
fine-tuning, we update the parameters of a RoBERTa pre-trained model [13] using the datasets
provided by eRisk in previous years.
   This paper is structured as follows. In Section 2, an overview of related works is provided.
In Section 3, we describe the given eRisk datasets. In Section 4, we present our approaches to
measuring the severity of depression signs. In Section 5, we present and discuss our scores
in comparison with the best ones. Finally, in Section 6, we summarize our contributions and
present some thoughts for future work.


2. Related Work
Some previous contributions [14, 15, 16, 17] to the eRisk shared tasks employed standard
machine learning models, such as SS3 [18], topic modeling algorithms (LDA [19] and Anchor
[20]), and neural models (Contextualizer [21], Deep Averaging Network [22], RNN [23], CNN
[24, 25, 26], and BiLSTM [27, 28, 29]).
   Several studies (e.g., [30]) suggest that pre-training a language model on a large corpus can
provide widely applicable representations of words, which can be used in related tasks. These
language models encode textual data into high-dimensional vector representations, which are
known as embeddings. In this way, the problem of scarce or inadequate task-dedicated training
data can be alleviated.
   Some authors [31, 32] took advantage of these methods to automatically extract signals from
social media activity concerning depression and anorexia. Some of them included pre-trained
representations, extracted from GloVe [33], BERT [11] or Universal Sentence Encoder [34], as
additional features to their task-specific architectures, whilst others [35, 36] fine-tuned OpenAI
GPT [37] and XLM [38] pre-trained models.


3. Dataset
Task 3 of eRisk 2021 is a continuation of 2019's Task 3 and 2020's Task 2. The datasets of
the two previous years are annotated and provided by the organizers upon request. In other
words, the subjects' answers to the 21 questions of the BDI questionnaire are known. Thus, we
utilized these data to evaluate our methods and select the best-performing ones for the eRisk
2021 challenge. Moreover, the approaches that require training or fine-tuning are based solely on
the eRisk 2019 and 2020 datasets. Table 1 shows the number of subjects and their posts per year.
The depression categories to which the subjects belong vary from year to year, as shown in Figure 1.

Table 1
Makeup of the eRisk datasets.
                                                               # of subjects   # of posts
                                        eRisk 2019 (Task 3)         20              10,941
                                        eRisk 2020 (Task 2)         70              35,562
                                        eRisk 2021 (Task 3)         80              30,787



                # of subjects per depression category and year:

                           minimal    mild    moderate    severe
                2019          4         4         4          8
                2020         10        23        19         18
                2021          6        13        27         34

Figure 1: Statistics on eRisk subjects' depression categories.




4. Methods
Our approaches to leverage transfer learning are based on Bidirectional Encoder Representations
from Transformers (BERT) [11]. The BERT model has been pre-trained on BookCorpus [39] and
English Wikipedia with two objectives: Masked Language Model (MLM) [40] and Next Sentence
Prediction (NSP) [41, 42]. Furthermore, it is available in two different architectures:

    • BERTBASE (number of layers=12, hidden size=768, number of self-attention heads=12)
    • BERTLARGE (number of layers=24, hidden size=1024, number of self-attention heads=16)
Table 2
Experiments with SBERT models using the eRisk 2019-20 datasets. The names of the models are en-
coded as follows: (base model)-(architecture)-(data used for pre-training)-(optional: data used for fine-
tuning)-(pooling strategy). The evaluation measures are defined in Section 5.
           SBERT model                                 AHR       ACR      ADODL       DCHR
           bert-base-nli-cls-token                    24.13%    60.07%     74.92%     24.44%
           bert-base-nli-max-tokens                   28.47%    61.90%     80.00%     24.44%
           bert-base-nli-mean-tokens                  25.40%    59.70%     74.87%     17.78%
           bert-base-nli-stsb-mean-tokens             27.67%    60.78%     76.72%     21.11%
           bert-large-nli-cls-token                   26.93%    61.66%     77.21%     21.11%
           bert-large-nli-max-tokens                  28.10%    59.19%     78.84%     22.22%
           bert-large-nli-mean-tokens                 26.98%    60.83%     77.44%     25.56%
           distilbert-base-nli-stsb-mean-tokens       28.25%    61.57%     78.15%     25.56%
           distilbert-base-nli-stsb-quora-ranking     26.88%    59.42%     79.91%     26.67%
           roberta-base-nli-stsb-mean-tokens          28.10%    64.78%     79.52%     27.78%
           roberta-large-nli-stsb-mean-tokens         26.98%    63.81%     81.27%     36.67%
           best scores                                28.47%    64.78%     81.27%     36.67%


   Every input sequence to BERT consists of tokens derived from the WordPiece [43] algorithm.
The key advantage of this language representation model is that it overcomes the unidirection-
ality constraint of previous ones (e.g., OpenAI GPT [37] and GloVe [33]). Moreover, it has
proved effective in both fine-tuning and feature-based approaches [11]. We examine both of
these strategies in our proposed approaches for Task 3 of eRisk 2021.
   The Robustly Optimized BERT Pretraining Approach (RoBERTa) [13] has been trained longer,
on extended sequences, with bigger batches, and over more data. More specifically, its pre-
training corpus also includes CC-News (a portion of the CommonCrawl News dataset [44]),
OpenWebText [45], and Stories [46]. Furthermore, there are some modifications to the training
procedure (dynamic masking instead of static, no NSP loss, large mini-batches, and a larger
byte-level Byte-Pair Encoding (BPE) vocabulary [47]).
   Sentence-BERT (SBERT) [12] is an adaptation of pre-trained BERT and RoBERTa networks
that aims to capture better sentence embeddings. For this purpose, it adds a pooling operation
to the output of these models. To determine the most appropriate SBERT model,
we evaluate the performance of various SBERT models with respect to the predictive model
described in Section 4.1, using the eRisk 2019 and 2020 datasets. The results are presented
in Table 2. Our findings led us to adopt an SBERT pre-trained model based
on RoBERTaLARGE, pre-trained on the combination of the Stanford NLI [48] and
Multi-Genre NLI [49] corpora and then fine-tuned on the STS benchmark dataset [50], with a
mean-pooling layer on the output to map subjects' posts to a vector space.
   Overall, we mainly propose three approaches:
   1. Feature-based transfer learning without using any training data
   2. Feature-based transfer learning in combination with machine learning classification
   3. Transfer learning with fine-tuning
  The details of these approaches are presented in the following subsections.
4.1. Feature-based transfer learning without using any training data
In this approach, we use the aforementioned SBERT pre-trained model (max input sequence=128
tokens, i.e., padding shorter sequences and truncating the end of longer ones) to obtain the vector
representations of the Reddit posts that belong to the eRisk 2021 subjects. Similarly, we encode
the possible responses to the BDI questionnaire into embeddings.
   Next, we map subjects to the same vector space by calculating the mean of each feature over
their post embeddings. Finally, we compare the vector of each subject with the vectors of the possible
responses to each question and select the one with the maximum cosine similarity. The
flowchart of this approach is shown in Figure 2.

Figure 2: A flowchart of the feature-based transfer learning approach without using any training data.
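A minimal sketch of this zero-training procedure is given below, assuming the sentence-transformers library; the function name fill_bdi and the input format for the BDI response options are illustrative assumptions, not the exact implementation used for our runs.

    # A minimal sketch of the zero-training approach of Section 4.1; names
    # and input formats are illustrative.
    import numpy as np
    from sentence_transformers import SentenceTransformer
    from sklearn.metrics.pairwise import cosine_similarity

    model = SentenceTransformer("roberta-large-nli-stsb-mean-tokens")
    model.max_seq_length = 128  # pad/truncate input sequences to 128 tokens

    def fill_bdi(posts, bdi_options):
        """posts: a subject's Reddit posts; bdi_options: 21 lists, each with
        the texts of the possible responses to one BDI question."""
        post_emb = model.encode(posts)                      # one vector per post
        subject_emb = post_emb.mean(axis=0, keepdims=True)  # mean over posts
        answers = []
        for options in bdi_options:
            opt_emb = model.encode(options)
            sims = cosine_similarity(subject_emb, opt_emb)[0]
            answers.append(int(np.argmax(sims)))  # most similar response
        return answers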



4.2. Feature-based transfer learning in combination with machine learning
     classification
This approach is quite similar to the previous one. We initially follow the same procedure to
obtain the embeddings of the eRisk 2019 and 2020 subjects. However, this time, we use them as a
training set to perform machine learning classification. The target variables for each subject are the
21 values (ranging over 0–3 or 0–6, depending on the question) corresponding to their
responses to the BDI questionnaire. Then, we feed the eRisk 2021 subjects' embeddings into
the trained model to make our predictions by filling in the BDI questionnaire per subject.
The flowchart of this approach is shown in Figure 3.
  The best-performing machine learning algorithms for this approach were selected utilizing
the eRisk 2019 and 2020 datasets with 10-fold cross-validation. Our experiments with
various well-known classifiers are shown in Table 3. Slightly superior results are achieved with the
AdaBoost [51], Linear SVM [52], and Naive Bayes [53] classifiers. Thus, we decided to employ
the former two for our submitted runs; a code sketch of this training setup follows Table 3.
Figure 3: A flowchart of the feature-based transfer learning approach in combination with machine learning classification.


Table 3
Experiments with various known classifiers using eRisk 2019-20 datasets and 10-fold cross-validation.
Evaluation metrics are defined in Section 5.
                         Classifier                     AHR            ACR         ADODL         DCHR
                         Nearest Neighbors          35.29%          68.14%         78.64%        28.33%
                         Linear SVM                 40.95%          72.00%         80.64%        28.00%
                         RBF SVM                    38.75%          69.81%         76.87%        20.76%
                         Gaussian Process           34.16%          67.76%         80.82%        26.67%
                         Decision Tree              31.79%          66.10%         81.91%        25.67%
                         Random Forest              39.19%          71.24%         81.55%        30.67%
                         Neural Net                 36.59%          70.10%         82.48%        34.67%
                         AdaBoost                   35.17%          68.76%         83.21%        35.00%
                         Naive Bayes                37.20%          69.67%         82.34%        36.67%
                         best scores                    40.95%         72.00%      83.21%        36.67%
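As noted above, a sketch of the per-question classification setup with scikit-learn follows; the helper names and the exact classifier configuration are illustrative assumptions rather than our full pipeline.

    # A minimal sketch of Section 4.2: one classifier per BDI question,
    # trained on mean-pooled subject embeddings (illustrative).
    import numpy as np
    from sklearn.svm import LinearSVC

    def train_per_question(X_train, y_train):
        """X_train: (n_subjects, emb_dim) embeddings of eRisk 2019-20
        subjects; y_train: (n_subjects, 21) integer BDI responses."""
        models = []
        for q in range(21):
            clf = LinearSVC()  # Linear SVM, one of the selected classifiers
            clf.fit(X_train, y_train[:, q])
            models.append(clf)
        return models

    def predict_bdi(models, X_test):
        # Predict the 21 responses for each eRisk 2021 subject
        return np.stack([m.predict(X_test) for m in models], axis=1)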


4.3. Transfer learning with fine-tuning
In this approach, we employ the eRisk 2019 and 2020 datasets as the training set to fine-tune
(epochs=3, batch size=32, and learning rate=2e-5) the RoBERTaBASE pre-trained model (max
input sequence=128 tokens, i.e., padding shorter sequences and truncating the end of longer
ones) for a classification task. To this end, we assigned each subject's BDI responses to each
of their posts as target variables. A different fine-tuned model is derived for each question; it
outputs, for each eRisk 2021 post, the probability of being related to each response.
   Finally, in order to make the transition from post-level to subject-level predictions, we apply
two different methods. In the first method, we calculate the mean probabilities per subject and
select the response with the maximum mean probability, while in the second one, we simply
select the response with the maximum probability per subject. The flowchart of this approach
is shown in Figure 4.

Figure 4: A flowchart of the fine-tuning approach.
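The following sketch, assuming the Hugging Face transformers library, illustrates the fine-tuning setup for a single question together with the two aggregation methods; the data handling is simplified and the names are illustrative, not our actual training script.

    # A minimal sketch of the fine-tuning approach (one model per question),
    # assuming Hugging Face transformers; simplified and illustrative.
    import torch
    from torch.utils.data import DataLoader, TensorDataset
    from transformers import RobertaTokenizer, RobertaForSequenceClassification

    tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
    model = RobertaForSequenceClassification.from_pretrained(
        "roberta-base", num_labels=4)  # 4 (or 7) classes for this question

    def fine_tune(posts, labels, epochs=3, batch_size=32, lr=2e-5):
        """posts: eRisk 2019-20 posts; labels: each posting subject's BDI
        response to this question, assigned to every one of their posts."""
        enc = tokenizer(posts, padding="max_length", truncation=True,
                        max_length=128, return_tensors="pt")
        data = TensorDataset(enc["input_ids"], enc["attention_mask"],
                             torch.tensor(labels))
        optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
        model.train()
        for _ in range(epochs):
            for ids, mask, y in DataLoader(data, batch_size, shuffle=True):
                optimizer.zero_grad()
                loss = model(input_ids=ids, attention_mask=mask, labels=y).loss
                loss.backward()
                optimizer.step()

    def aggregate(post_probs, method="mean"):
        """post_probs: (n_posts, n_classes) class probabilities of one
        eRisk 2021 subject's posts under the fine-tuned model."""
        if method == "mean":                          # Method 1 (MeanFT)
            return int(post_probs.mean(axis=0).argmax())
        return int(post_probs.max(axis=0).argmax())   # Method 2 (MaxFT)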


5. Evaluation
In order to determine the subjects' depression level based on the BDI questionnaire, the responses
to each question (out of 21 questions in total) are associated with integer values (i.e., 0–3). The
sum of these 21 values determines the depression level of a subject. The depression
categories are associated with the depression levels in the following way (a short scoring sketch
is given after the list):

     • Minimal depression (depression levels 0–9)
     • Mild depression (depression levels 10–18)
     • Moderate depression (depression levels 19–29)
     • Severe depression (depression levels 30–63)
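
A minimal sketch of this scoring scheme:

    def depression_category(responses):
        """responses: the 21 integer response values (0-3) of a subject."""
        level = sum(responses)    # overall depression level, 0-63
        if level <= 9:
            return "minimal"
        if level <= 18:
            return "mild"
        if level <= 29:
            return "moderate"
        return "severe"           # levels 30-63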

   The evaluation measures used by the organizers of this eRisk task to assess the performance
of the submitted runs are as follows (a sketch of their computation is given after the list):

     • Average Hit Rate (AHR): Reflects the accuracy of the responses to the BDI questionnaire
       submitted by the participants.
     • Average Closeness Rate (ACR): Captures how close the submitted responses are to the
       real ones.
     • Average Difference between Overall Depression Levels (ADODL): Captures how close
       the sum of the submitted response values is to the actual sum.
     • Depression Category Hit Rate (DCHR): Reflects the accuracy of the depression category
       resulting from the sum of the submitted responses.
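
The sketch below computes the four measures as we understand them from the eRisk overview papers [9, 10], i.e., normalizing answer deviations by the maximum absolute difference of 3 and level deviations by the maximum of 63; it is an approximation written from those definitions, not the official evaluation script.

    # Approximate computation of AHR, ACR, ADODL, and DCHR (illustrative).
    import numpy as np

    def evaluate(real, submitted):
        """real, submitted: (n_subjects, 21) arrays of response values 0-3."""
        real, submitted = np.asarray(real), np.asarray(submitted)
        ahr = (real == submitted).mean()                 # exact answer matches
        acr = (1 - np.abs(real - submitted) / 3).mean()  # per-answer closeness
        r_lvl, s_lvl = real.sum(axis=1), submitted.sum(axis=1)
        adodl = (1 - np.abs(r_lvl - s_lvl) / 63).mean()  # level closeness
        category = lambda lvl: np.digitize(lvl, [10, 19, 30])
        dchr = (category(r_lvl) == category(s_lvl)).mean()
        return ahr, acr, adodl, dchr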
Table 4
Evaluation of DUTH-ATHENA's submissions (the Approach column refers to the approaches
numbered in Section 4). The best result across all participants for each measure is shown in
the last line for comparison.
        Run                          Approach    AHR       ACR      ADODL     DCHR
        DUTH_ATHENA MaxFT               3rd     31.43%    64.86%    74.46%    15.00%
        DUTH_ATHENA MeanFT              3rd     32.02%    65.63%    73.81%    12.50%
        DUTH_ATHENA MeanPosts           1st     25.06%    63.97%    80.28%    30.00%
        DUTH_ATHENA MeanPostsAB         2nd     33.04%    67.86%    80.32%    27.50%
        DUTH_ATHENA MeanPostsSVM        2nd     35.36%    67.18%    73.97%    15.00%
        best scores                             35.36%    73.17%    83.59%    41.25%


  We also used the aforementioned measures to evaluate our experiments on the eRisk 2019 and
2020 datasets. Thus, we arrived at the proposed runs for the first two approaches (Sections 4.1
and 4.2). The third approach was quite time-consuming due to its high computational cost, so we
could not afford to evaluate its configurations beforehand.

5.1. Results
The evaluation of DUTH-ATHENA's submissions on eRisk 2021 Task 3 is shown in Table 4.
The second approach with the SVM classifier (MeanPostsSVM) achieved the highest score in
terms of AHR, which is the most stringent measure. The same approach with the AdaBoost
classifier (MeanPostsAB) ranked third among the 35 runs in ACR and ADODL.
   Another promising finding was that the first approach (MeanPosts) performed well in
predicting the depression levels and categories. In fact, this run ranked fourth in ADODL and
DCHR among all submissions and first in DCHR among ours. This is remarkable since no
annotated, task-dedicated data was used, and the computational cost and execution time were
the lowest among our runs. Finally, regarding the third approach with fine-tuning (MaxFT and
MeanFT), the results are not directly comparable with those of the other two approaches because,
due to computational limitations on our side, we utilized a smaller model architecture, for which
poorer results were to be expected [13].


6. Conclusion
This paper presented our transfer learning approaches submitted to eRisk 2021 Task 3, utilizing
BERT-based pre-trained language models for a classification task that aims to automatically fill in a
depression questionnaire. The approaches employ both feature-based and fine-tuning strategies.
While our proposed models did not achieve high scores on all evaluation measures, we observed
that this is a widespread problem among most of the participants' submissions, which may
reflect the difficulty of the task. The modest performance of our third approach may be a result
of the smaller model architecture or of matching each subject's ground truth with all of their
posts.
   Nevertheless, we found that feature extraction from BERT-based pre-trained models achieved
the best accuracy compared to the other participants' approaches. This suggests that further
research in this direction could lead to promising outcomes. Future work should explore this
potential more thoroughly, for example, by experimenting with additional state-of-the-art
pre-trained language models, such as Big Bird [54], and/or with more machine learning classifiers.


References
 [1] A. Perrin, Social Media Usage: 2005-2015, Pew Research Center, 2015. URL: http://www.
     pewinternet.org/2015/10/08/2015/Social-Networking-Usage-2005-2015/.
 [2] J. C. Eichstaedt, R. J. Smith, R. M. Merchant, L. H. Ungar, P. Crutchley, D. Preoţiuc-
     Pietro, D. A. Asch, H. A. Schwartz, Facebook language predicts depression in medi-
     cal records, Proceedings of the National Academy of Sciences 115 (2018) 11203–11208.
     doi:10.1073/pnas.1802331115.
 [3] A. H. Orabi, P. Buddhitha, M. H. Orabi, D. Inkpen, Deep learning for depression detection
     of twitter users, in: Proceedings of the Fifth Workshop on Computational Linguistics and
     Clinical Psychology: From Keyboard to Clinic, 2018, pp. 88–97.
 [4] A. Yates, A. Cohan, N. Goharian, Depression and self-harm risk assessment in online
     forums, in: Proceedings of the 2017 Conference on Empirical Methods in Natural Language
     Processing, Association for Computational Linguistics, Copenhagen, Denmark, 2017, pp.
     2968–2978. doi:10.18653/v1/D17-1322.
 [5] J. Parapar, P. Martín-Rodilla, D. E. Losada, F. Crestani, eRisk 2021: Pathological Gambling,
     Self-harm and Depression Challenges, in: D. Hiemstra, M.-F. Moens, J. Mothe, R. Perego,
     M. Potthast, F. Sebastiani (Eds.), Advances in Information Retrieval, Lecture Notes in
     Computer Science, Springer International Publishing, Cham, 2021, pp. 650–656. doi:10.
     1007/978-3-030-72240-1_76.
 [6] D. E. Losada, F. Crestani, A test collection for research on depression and language use,
     in: N. Fuhr, P. Quaresma, T. Gonçalves, B. Larsen, K. Balog, C. Macdonald, L. Cappellato,
     N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction,
     7th International Conference of the CLEF Association, CLEF 2016, Springer International
     Publishing, Cham, 2016, pp. 28–39. doi:10.1007/978-3-319-44564-9_3.
 [7] D. E. Losada, F. Crestani, J. Parapar, eRisk 2020: Self-harm and Depression Challenges, in:
     J. M. Jose, E. Yilmaz, J. Magalhães, P. Castells, N. Ferro, M. J. Silva, F. Martins (Eds.), Ad-
     vances in Information Retrieval, Lecture Notes in Computer Science, Springer International
     Publishing, Cham, 2020, pp. 557–563. doi:10.1007/978-3-030-45442-5_72.
 [8] A. T. Beck, C. H. Ward, M. Mendelson, J. Mock, J. Erbaugh, An inventory for measuring
     depression, Archives of General Psychiatry 4 (1961) 561–571. doi:10.1001/archpsyc.
     1961.01710120031004.
 [9] D. E. Losada, F. Crestani, J. Parapar, Overview of erisk at CLEF 2019: Early risk prediction
     on the internet (extended overview), in: L. Cappellato, N. Ferro, D. E. Losada, H. Müller
     (Eds.), Conference and Labs of the Evaluation Forum (CLEF), 2019, p. 21.
[10] D. E. Losada, F. Crestani, J. Parapar, Overview of erisk 2020: Early risk prediction on
     the internet, in: A. Arampatzis, E. Kanoulas, T. Tsikrika, S. Vrochidis, H. Joho, C. Lioma,
     C. Eickhoff, A. Névéol, L. Cappellato, N. Ferro (Eds.), Experimental IR Meets Multilinguality,
     Multimodality, and Interaction, 11th International Conference of the CLEF Association,
     CLEF 2020, Springer International Publishing, Cham, 2020, pp. 272–287.
[11] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional
     transformers for language understanding, in: Proceedings of the 2019 Conference of the
     North American Chapter of the Association for Computational Linguistics: Human Lan-
     guage Technologies, Association for Computational Linguistics, Minneapolis, Minnesota,
     2019, pp. 4171–4186. doi:10.18653/v1/N19-1423.
[12] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using Siamese BERT-
     networks, in: Proceedings of the 2019 Conference on Empirical Methods in Natural
     Language Processing and the 9th International Joint Conference on Natural Language
     Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong,
     China, 2019, pp. 3982–3992. doi:10.18653/v1/D19-1410.
[13] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettle-
     moyer, V. Stoyanov, RoBERTa: A robustly optimized bert pretraining approach, 2019.
     arXiv:1907.11692.
[14] A. Trifan, P. Salgado, L. Oliveira, BioInfo@UAVR at eRisk 2020: on the use of psycholin-
     guistics features and machine learning for the classification and quantification of mental
     diseases, in: L. Cappellato, C. Eickhoff, N. Ferro, A. Neveol (Eds.), Working Notes of CLEF
     2020 – Conference and Labs of the Evaluation Forum, volume 2696, Thessaloniki, Greece,
     2020, p. 11.
[15] S. G. Burdisso, M. Errecalde, M. Montes-y Gómez, UNSL at eRisk 2019: a Unified Approach
     for Anorexia, Self-harm and Depression Detection in Social Media, in: L. Cappellato,
     N. Ferro, D. E. Losada, H. Müller (Eds.), Working Notes of CLEF 2019 – Conference and
     Labs of the Evaluation Forum, volume 2380, Lugano, Switzerland, 2019, p. 18.
[16] D. Maupome, M. D. Armstrong, R. Belbahar, J. Alezot, R. Balassiano, M. Queudot, S. Mosser,
     M.-J. Meurs, Early Mental Health Risk Assessment through Writing Styles, Topics and
     Neural Models, in: L. Cappellato, C. Eickhoff, N. Ferro, A. Neveol (Eds.), Working Notes of
     CLEF 2020 – Conference and Labs of the Evaluation Forum, volume 2696, Thessaloniki,
     Greece, 2020, p. 13.
[17] A. Madani, F. Boumahdi, A. Boukenaoui, C. Kritli, H. Hentabli, USDB at eRisk 2020: Deep
     learning models to measure the Severity of the Signs of Depression using Reddit Posts,
     in: L. Cappellato, C. Eickhoff, N. Ferro, A. Neveol (Eds.), Working Notes of CLEF 2020 –
     Conference and Labs of the Evaluation Forum, volume 2696, Thessaloniki, Greece, 2020,
     p. 9.
[18] S. G. Burdisso, M. Errecalde, M. Montes-y Gómez, A text classification framework for
     simple and effective early depression detection over social media streams, Expert Systems
     with Applications 133 (2019) 182–197. doi:10.1016/j.eswa.2019.05.023.
[19] D. M. Blei, A. Y. Ng, M. I. Jordan, Latent dirichlet allocation, The Journal of Machine
     Learning Research 3 (2003) 993–1022.
[20] S. Arora, R. Ge, Y. Halpern, D. Mimno, A. Moitra, D. Sontag, Y. Wu, M. Zhu, A Practical
     Algorithm for Topic Modeling with Provable Guarantees, in: S. Dasgupta, D. McAllester
     (Eds.), Proceedings of the 30th International Conference on Machine Learning, volume
     28(2), PMLR, 2013, pp. 280–288.
[21] L. Xian, S. D. Vickers, A. L. Giordano, J. Lee, I. K. Kim, L. Ramaswamy, #selfharm on
     Instagram: Quantitative Analysis and Classification of Non-Suicidal Self-Injury, in: 2019
     IEEE First International Conference on Cognitive Machine Intelligence (CogMI), 2019, pp.
     61–70. doi:10.1109/CogMI48466.2019.00017.
[22] M. Iyyer, V. Manjunatha, J. Boyd-Graber, H. Daumé III, Deep Unordered Composition Rivals
     Syntactic Methods for Text Classification, in: Proceedings of the 53rd Annual Meeting of
     the Association for Computational Linguistics and the 7th International Joint Conference
     on Natural Language Processing (Volume 1: Long Papers), Association for Computational
     Linguistics, Beijing, China, 2015, pp. 1681–1691. doi:10.3115/v1/P15-1162.
[23] D. Maupomé, M. Queudot, M.-J. Meurs, Inter and intra document attention for depression
     risk assessment, in: Canadian Conference on Artificial Intelligence, Springer, 2019, pp.
     333–341.
[24] Y. LeCun, Y. Bengio, Convolutional networks for images, speech, and time series, in: The
     handbook of brain theory and neural networks, MIT Press, Cambridge, MA, USA, 1998,
     pp. 255–258.
[25] J. L. Elman, Finding structure in time, Cognitive Science 14 (1990) 179–211. doi:10.1016/
     0364-0213(90)90002-E.
[26] A. Graves, J. Schmidhuber, Offline Handwriting Recognition with Multidimensional
     Recurrent Neural Networks, Advances in Neural Information Processing Systems 21
     (2008).
[27] C. Baziotis, N. Pelekis, C. Doulkeridis, DataStories at SemEval-2017 Task 4: Deep LSTM
     with Attention for Message-level and Topic-based Sentiment Analysis, in: Proceedings
     of the 11th International Workshop on Semantic Evaluation (SemEval-2017), Association
     for Computational Linguistics, Vancouver, Canada, 2017, pp. 747–754. doi:10.18653/v1/
     S17-2126.
[28] Y. Zhang, J. Wang, X. Zhang, YNU-HPCC at SemEval-2018 Task 1: BiLSTM with Attention
     based Sentiment Analysis for Affect in Tweets, in: Proceedings of The 12th Interna-
     tional Workshop on Semantic Evaluation, Association for Computational Linguistics, New
     Orleans, Louisiana, 2018, pp. 273–278. doi:10.18653/v1/S18-1040.
[29] P. Zhou, W. Shi, J. Tian, Z. Qi, B. Li, H. Hao, B. Xu, Attention-Based Bidirectional
     Long Short-Term Memory Networks for Relation Classification, in: Proceedings of the
     54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short
     Papers), Association for Computational Linguistics, Berlin, Germany, 2016, pp. 207–212.
     doi:10.18653/v1/P16-2034.
[30] J. Turian, L.-A. Ratinov, Y. Bengio, Word Representations: A Simple and General Method
     for Semi-Supervised Learning, in: Proceedings of the 48th Annual Meeting of the Associa-
     tion for Computational Linguistics, Association for Computational Linguistics, Uppsala,
     Sweden, 2010, pp. 384–394.
[31] M. Trotzek, S. Koitka, C. M. Friedrich, Word Embeddings and Linguistic Metadata at
     the CLEF 2018 Tasks for Early Detection of Depression and Anorexia, in: L. Cappellato,
     N. Ferro, J.-Y. Nie, L. Soulier (Eds.), Working Notes of CLEF 2018 – Conference and Labs of
     the Evaluation Forum, volume 2125, Avignon, France, 2018, p. 15.
[32] A.-S. Uban, P. Rosso, Deep learning architectures and strategies for early detection of self-
     harm and depression level prediction, in: L. Cappellato, C. Eickhoff, N. Ferro, A. Neveol
     (Eds.), Working Notes of CLEF 2020 – Conference and Labs of the Evaluation Forum,
     volume 2696, Thessaloniki, Greece, 2020, p. 12.
[33] J. Pennington, R. Socher, C. Manning, Glove: Global Vectors for Word Representation, in:
     Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing
     (EMNLP), Association for Computational Linguistics, Doha, Qatar, 2014, pp. 1532–1543.
     doi:10.3115/v1/D14-1162.
[34] D. Cer, Y. Yang, S.-y. Kong, N. Hua, N. Limtiaco, R. St. John, N. Constant, M. Guajardo-
     Cespedes, S. Yuan, C. Tar, B. Strope, R. Kurzweil, Universal sentence encoder for English, in:
     Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing:
     System Demonstrations, Association for Computational Linguistics, Brussels, Belgium,
     2018, pp. 169–174. doi:10.18653/v1/D18-2029.
[35] P. Abed-Esfahani, D. Howard, M. Maslej, S. Patel, V. Mann, S. Goegan, L. French, Transfer
     Learning for Depression: Early Detection and Severity Prediction from Social Media
     Postings, in: L. Cappellato, N. Ferro, D. E. Losada, H. Müller (Eds.), Working Notes of CLEF
     2019 – Conference and Labs of the Evaluation Forum, volume 2380, Lugano, Switzerland,
     2019, p. 9.
[36] R. Martínez-Castaño, A. Htait, L. Azzopardi, Y. Moshfeghi, Early Risk Detection of Self-
     Harm and Depression Severity using BERT-based Transformers, in: L. Cappellato, C. Eick-
     hoff, N. Ferro, A. Neveol (Eds.), Working Notes of CLEF 2020 – Conference and Labs of the
     Evaluation Forum, volume 2696, Thessaloniki, Greece, 2020, p. 16.
[37] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever, Improving Language Under-
     standing by Generative Pre-training, 2018. URL: https://cdn.openai.com/research-covers/
     language-unsupervised/language_understanding_paper.pdf.
[38] A. Conneau, G. Lample, Cross-lingual language model pretraining, in: H. Wallach,
     H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, R. Garnett (Eds.), Advances in
     Neural Information Processing Systems (NeurIPS 2019), volume 32, Curran Associates,
     Inc., Vancouver, Canada, 2019, p. 11.
[39] Y. Zhu, R. Kiros, R. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, S. Fidler, Aligning
     books and movies: Towards story-like visual explanations by watching movies and reading
     books, in: Proceedings of the 2015 IEEE International Conference on Computer Vision
     (ICCV), ICCV ’15, IEEE Computer Society, USA, 2015, p. 19–27. doi:10.1109/ICCV.2015.
     11.
[40] W. L. Taylor, “Cloze Procedure”: A New Tool for Measuring Readability, Journalism
     Quarterly 30 (1953) 415–433. doi:10.1177/107769905303000401.
[41] Y. Jernite, S. R. Bowman, D. Sontag, Discourse-based objectives for fast unsupervised
     sentence representation learning, 2017. arXiv:1705.00557v1.
[42] L. Logeswaran, H. Lee, An efficient framework for learning sentence representations, in:
     Proceedings of the 6th International Conference on Learning Representations (ICLR 2018),
     Vancouver, BC, Canada, 2018, p. 16.
[43] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao,
     Q. Gao, K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu, Ł. Kaiser, S. Gouws,
     Y. Kato, T. Kudo, H. Kazawa, K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young, J. Smith,
     J. Riesa, A. Rudnick, O. Vinyals, G. Corrado, M. Hughes, J. Dean, Google’s neural machine
     translation system: Bridging the gap between human and machine translation, CoRR
     abs/1609.08144 (2016).
[44] S. Nagel, CC-News, https://commoncrawl.org/2016/10/news-dataset-available/, 2016.
[45] A. Gokaslan, V. Cohen, Openwebtext corpus, http://Skylion007.github.io/
     OpenWebTextCorpus, 2019.
[46] T. H. Trinh, Q. V. Le, A simple method for commonsense reasoning, 2019.
     arXiv:1806.02847.
[47] R. Sennrich, B. Haddow, A. Birch, Neural Machine Translation of Rare Words with Subword
     Units, in: Proceedings of the 54th Annual Meeting of the Association for Computational
     Linguistics, Association for Computational Linguistics, Berlin, Germany, 2016, pp. 1715–
     1725. doi:10.18653/v1/P16-1162.
[48] S. R. Bowman, G. Angeli, C. Potts, C. D. Manning, A large annotated corpus for learning
     natural language inference, in: Proceedings of the 2015 Conference on Empirical Methods
     in Natural Language Processing (EMNLP), Association for Computational Linguistics,
     Lisbon, Portugal, 2015, p. 632–642.
[49] A. Williams, N. Nangia, S. Bowman, A broad-coverage challenge corpus for sentence
     understanding through inference, in: Proceedings of the 2018 Conference of the North
     American Chapter of the Association for Computational Linguistics: Human Language
     Technologies, Volume 1 (Long Papers), Association for Computational Linguistics, New
     Orleans, Louisiana, 2018, pp. 1112–1122. doi:10.18653/v1/N18-1101.
[50] D. Cer, M. Diab, E. Agirre, I. Lopez-Gazpio, L. Specia, SemEval-2017 Task 1: Semantic
     textual similarity multilingual and crosslingual focused evaluation, in: Proceedings of
     the 11th International Workshop on Semantic Evaluation (SemEval-2017), Association
     for Computational Linguistics, Vancouver, Canada, 2017, pp. 1–14. doi:10.18653/v1/
     S17-2001.
[51] Y. Freund, R. E. Schapire, A short introduction to boosting, in: In Proceedings of the
     Sixteenth International Joint Conference on Artificial Intelligence, Morgan Kaufmann,
     1999, pp. 1401–1406.
[52] N. Cristianini, E. Ricci, Support Vector Machines, in: M.-Y. Kao (Ed.), Encyclo-
     pedia of Algorithms, Springer US, Boston, MA, 2008, pp. 928–932. doi:10.1007/
     978-0-387-30162-4_415.
[53] G. I. Webb, Naïve Bayes, in: C. Sammut, G. I. Webb (Eds.), Encyclopedia of Machine Learn-
     ing, Springer US, Boston, MA, 2010, pp. 713–714. doi:10.1007/978-0-387-30164-8_
     576.
[54] M. Zaheer, G. P. Guruganesh, A. Dubey, J. Ainslie, C. Alberti, S. Ontanon, P. M. Pham,
     A. Ravula, Q. Wang, L. Yang, A. M. E. H. Ahmed, Big Bird: Transformers for Longer
     Sequences, in: Proceedings of the 34th Conference on Neural Information Processing
     Systems (NeurIPS 2020), Vancouver, Canada, 2020, p. 15.