uOttawa at eRisk 2021: Automatic Filling of the
Beck’s Depression Inventory Questionnaire using
Deep Learning
Diana Inkpen, Ruba Skaik, Prasadith Buddhitha, Dimo Angelov and
Maxwell Thomas Fredenburgh
University of Ottawa, School of Electrical Engineering and Computer Science, 800 King Edward Avenue, Ottawa, ON,
K1N 6N5, Canada


               Abstract
                This paper describes the University of Ottawa’s participation in Task 3 of the eRisk 2021 shared task
at CLEF 2021. We think that this task is important because it allows estimating the level of depression
of social media users as often as needed, without asking them to spend their time manually
filling in the Beck’s Depression Inventory questionnaire. Our methods focus on selecting the relevant
                posts for each question of the questionnaire and using pre-trained deep learning models with or without
                fine-tuning to make predictions for unseen users.

                Keywords
                depression detection, social media, deep learning, natural language processing




1. Introduction
This paper describes the uOttawa team’s participation in Task 3 of the eRisk 2021 [1] shared
task. The task is a continuation of Task 3 at eRisk 2019 and Task 2 at eRisk 2020, and the goal is
to automatically estimate the level of depression from a user’s social media postings.
   We believe that this task can be very useful for monitoring users in special situations, without
having to ask them to manually provide information. For example, a psychologist could
monitor their patients after their recovery, with their consent. Another example is people
who need to spend long periods in conditions of isolation, such as arctic researchers or astronauts
during long space flights (in the future).
   We employed deep learning techniques to classify information extracted from the postings.
Our focus was on selecting relevant posts for each type of information. Then, we either employed
zero-shot learning via pre-trained models or trained models based on sequence-to-sequence
models for Question Answering (QA).
   The rest of the paper is organized as follows. Section 2 gives more details about the task
and shows statistics about the class distribution in the training data. Section 3 describes the

CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania
" diana.inkpen@uottawa.ca (D. Inkpen); rskai034@uottawa.ca (R. Skaik); pkiri056@uottawa.ca (P. Buddhitha);
dange042@uottawa.ca (D. Angelov); mfred075@uottawa.ca (M. T. Fredenburgh)
~ https://www.site.uottawa.ca/~diana/ (D. Inkpen)
 0000-0002-0202-2444 (D. Inkpen); 0000-0002-8929-8602 (R. Skaik)
 © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
 CEUR Workshop Proceedings (CEUR-WS.org)
methods that we used. Section 4 presents and discusses the results. Section 5 concludes
and suggests directions of future work.


2. Task description
For each user, the participants in the shared task¹ were given the postings of that user for a
certain time period. The task was to automatically fill in a standard depression questionnaire:
the Beck’s Depression Inventory (BDI). The questionnaire has 21 questions which assess the
presence of feelings like sadness, pessimism, loss of energy, etc. Each question has 4 answers
(0, 1, 2, 3). Therefore, the task becomes a 4-class classification task for each question. The
answers target changes in a user’s life.
   There is variation in the expected answers. Two of the questions have seven answers instead
of four, namely: 0, 1a, 1b, 2a, 2b, 3a, 3b. Therefore these two classifiers will need to classify
into 7 classes. The two questions are question 16 (about sleep patterns) and question 18 (about
appetite).
The training dataset provided by the task organizers for Task 3 is composed of 43,514 Reddit
posts and comments written by 90 users who answered the 21 questions of the BDI questionnaire
during the past two years. The test dataset consists of 19,803 posts and comments written
by 80 users. More details about the construction of the dataset are available in [2].
   In Table 1, we show statistics about the class distribution in the training data. For the two
questions with 7 answers, the answers 1a and 1b were considered equivalent when performing
these counts; they were counted as class 1. We did the same for classes 2a, 2b and 3a, 3b,
because they contribute the same number of points when assessing the depression level of a
user. We can see from the statistics that class 0 is the largest (35%), while class 1 is not far behind
(33%). Choosing class 0 as the default class would therefore not necessarily be a good choice for
the classifiers. Class 2 is 19% and class 3 is the smallest (13%). If we look at the depression levels,
15% of the users have minimal depression, 30% mild depression, 24% moderate depression, and
30% severe depression. So, we can say that the data is not very imbalanced.
   The evaluation measures used in the shared task are: the Average Hit Rate (AHR), the Average
Closeness Rate (ACR), the Average Difference between Overall Depression Levels (ADODL),
and the Depression Category Hit Rate (DCHR).
   The Average Hit Rate (AHR) is the Hit Rate averaged over all the users. It is a strict measure
that computes the ratio of cases where the automatically filled questionnaire has exactly the
same answer as the real questionnaire.
   The Average Closeness Rate (ACR) is the Closeness Rate averaged over all users. The Closeness
Rate measure takes into account that the answers of the depression questionnaire represent an
ordinal scale, and not only separate options. For example, if the user answered "0", a system
whose answer is "3" should be penalised more than a system whose answer is "1".
   The Average Difference between Overall Depression Levels (ADODL) assesses the overall
depression level, computed as the sum of all the answers, looking at the depression level as a
whole instead of at the differences on each individual answer prediction.
It computes the difference between the overall depression level for the real and automated
    1
        https://erisk.irlab.org/
Table 1
Statistics about the class distribution in the training data

90 users
Question 1 : 0 : 27 (30%) 1 : 47 (52%) 2 : 11 (12%) 3 : 5 (5%)
Question 2 : 0 : 22 (24%) 1 : 34 (37%) 2 : 20 (22%) 3 : 14 (15%)
Question 3 : 0 : 22 (24%) 1 : 35 (38%) 2 : 18 (20%) 3 : 15 (16%)
Question 4 : 0 : 28 (31%) 1 : 33 (36%) 2 : 23 (25%) 3 : 6 (6%)
Question 5 : 0 : 34 (37%) 1 : 32 (35%) 2 : 12 (13%) 3 : 12 (13%)
Question 6 : 0 : 60 (66%) 1 : 13 (14%) 2 : 11 (12%) 3 : 6 (6%)
Question 7 : 0 : 28 (31%) 1 : 17 (18%) 2 : 23 (25%) 3 : 22 (24%)
Question 8 : 0 : 28 (31%) 1 : 27 (30%) 2 : 23 (25%) 3 : 12 (13%)
Question 9 : 0 : 41 (45%) 1 : 37 (41%) 2 : 7 (7%) 3 : 5 (5%)
Question 10 : 0 : 42 (46%) 1 : 23 (25%) 2 : 8 (8%) 3 : 17 (18%)
Question 11 : 0 : 37 (41%) 1 : 31 (34%) 2 : 14 (15%) 3 : 8 (8%)
Question 12 : 0 : 28 (31%) 1 : 32 (35%) 2 : 8 (8%) 3 : 22 (24%)
Question 13 : 0 : 38 (42%) 1 : 21 (23%) 2 : 16 (17%) 3 : 15 (16%)
Question 14 : 0 : 38 (42%) 1 : 21 (23%) 2 : 20 (22%) 3 : 11 (12%)
Question 15 : 0 : 17 (18%) 1 : 32 (35%) 2 : 28 (31%) 3 : 13 (14%)
Question 16 : 0 : 17 (18%) 1 : 36 (40%) 2 : 24 (26%) 3 : 13 (14%)
Question 17 : 0 : 38 (42%) 1 : 31 (34%) 2 : 16 (17%) 3 : 5 (5%)
Question 18 : 0 : 32 (35%) 1 : 30 (33%) 2 : 15 (16%) 3 : 13 (14%)
Question 19 : 0 : 29 (32%) 1 : 25 (27%) 2 : 25 (27%) 3 : 11 (12%)
Question 20 : 0 : 21 (23%) 1 : 34 (37%) 2 : 21 (23%) 3 : 14 (15%)
Question 21 : 0 : 51 (56%) 1 : 18 (20%) 2 : 11 (12%) 3 : 10 (11%)

Total:     0 : 678 (35%) 1 : 609 (32%) 2 : 354 (19%) 3 : 249 (13%)

Minimal depression 14 (15.5%)
Mild depression 27 (30%)
Moderate depression 22 (24.5%)
Severe depression 27 (30%)




questionnaire. Then, the absolute difference (ad) between the real and the automated score is
computed. Depression levels are integers between 0 and 63. The measure is normalised to be
between 0 and 1 with the formula (63 - ad)/63. Then the average over all the users is computed.
  The Depression Category Hit Rate (DCHR) measures the correctness of the estimation
achieved over all users according to the well-established depression categories in psychology,
with the following four categories of depression:

minimal depression (depression levels 0-9)
mild depression (depression levels 10-18)
moderate depression (depression levels 19-29)
severe depression (depression levels 30-63)

  See the task overview paper for more details about the evaluation measures [1].
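To make these measures concrete, below is a minimal sketch of how they could be computed from per-user answer vectors. The helper names are ours, and the closeness normalisation (dividing by the maximum per-question difference of 3) is an assumed formulation, not the official evaluation code.

```python
import numpy as np

LEVEL_BINS = [(0, 9), (10, 18), (19, 29), (30, 63)]  # minimal, mild, moderate, severe

def depression_category(score):
    # Map an overall BDI score (0-63) to one of the four depression categories.
    for cat, (lo, hi) in enumerate(LEVEL_BINS):
        if lo <= score <= hi:
            return cat

def evaluate(real, pred):
    """real, pred: lists of per-user answer vectors (21 integers in 0..3).
    Returns (AHR, ACR, ADODL, DCHR) as fractions. Hypothetical helper."""
    hit_rates, closeness_rates, adodl, dchr = [], [], [], []
    for r, p in zip(real, pred):
        r, p = np.asarray(r), np.asarray(p)
        hit_rates.append(np.mean(r == p))                         # exact answer matches
        closeness_rates.append(np.mean((3 - np.abs(r - p)) / 3))  # ordinal closeness (assumed formulation)
        ad = abs(int(r.sum()) - int(p.sum()))                     # absolute difference of overall scores
        adodl.append((63 - ad) / 63)
        dchr.append(depression_category(r.sum()) == depression_category(p.sum()))
    return (np.mean(hit_rates), np.mean(closeness_rates),
            np.mean(adodl), np.mean(dchr))
```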
3. Methods
Our methods consist of three steps: pre-processing the data, selecting relevant posts, and
classifying to predict the answer for each question, based on zero-shot learning or supervised
learning.

3.1. Preprocessing
Each post was preprocessed as follows: the title of the post and the post text were concatenated.
The contractions were expanded, and words between brackets were removed. Then punctuation
and special characters were cleaned, and all the text was lowercased. The posts related to the
forum monitoring (namely posts that notified users that they "broke the rules") were removed.
Finally, all posts with fewer than four characters were removed.
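As an illustration, below is a rough sketch of this preprocessing pipeline. The regular expressions and the small contraction map are only indicative; the actual runs used a fuller contraction list and cleaning rules that are not reproduced here.

```python
import re
import string

# Illustrative subset; the actual runs expanded a fuller list of contractions.
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "i'm": "i am",
                "it's": "it is", "don't": "do not", "didn't": "did not"}

def preprocess(title, text):
    post = f"{title} {text}".lower()               # concatenate title and body, lowercase
    for short, full in CONTRACTIONS.items():       # expand contractions
        post = post.replace(short, full)
    post = re.sub(r"\[.*?\]|\(.*?\)", " ", post)   # drop text between brackets
    post = re.sub(f"[{re.escape(string.punctuation)}]", " ", post)  # remove punctuation
    post = re.sub(r"[^a-z0-9\s]", " ", post)       # remove remaining special characters
    return re.sub(r"\s+", " ", post).strip()

def keep_post(post):
    # Discard forum-moderation notices and very short posts.
    return len(post) >= 4 and "broke the rules" not in post
```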

3.2. Selecting posts
For each question, we focused on selecting a subset of posts for each user. The goal was to keep
only posts that are relevant to each question, in order to increase the probability of finding
an answer about the topic of the question. We call this process filtering of the posts. We also
experimented with keeping all posts, with the caveat that training deep learning models on long
texts (the concatenation of all posts) is slow or sometimes problematic for BERT-like models.
   We employed two methods for filtering posts. The first method is similarity-based. The
similarity-based method utilized pre-trained sentence transformer models based on BERT or
RoBERTa to embed each post and all the BDI answers. Then, the relatedness of each post
to each BDI answer is measured by computing the cosine distance between the post embedding
$p$ and the answer embedding $a$, as shown in Equation 1.²

$$ d(p, a) \;=\; 1 - \frac{p \cdot a}{\lVert p \rVert_2 \, \lVert a \rVert_2}, \quad \text{where } \lVert x \rVert_2 \text{ is the 2-norm of } x \qquad (1) $$
   If the similarity value of a post with all questionnaires’ answers is less than 𝜃1 , then the post
is excluded, since it means that the post is not related to any of the BDI questionnaire questions.
In addition, if the difference between the maximum similarity and the minimum similarity is
less than 𝜃2 , then the post is excluded because it is considered a general post that does not assist
in answering any of the BDI questionnaire questions in particular. It should be noted that not all
the categories are discussed in the posts. Some categories appear more often than others. If
there are very limited posts for a user, we consider all the posts of that user for the learning
process (all the posts are included in the top n posts). Figure 1 shows the number of posts for
each BDI question, according to the RoBERTa similarity with 𝜃1 = 0.6. It shows that the posts related to
eating and sleeping habits are rare, and posts about guilt and punishment feelings are the most
frequent.
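A condensed sketch of this similarity-based filtering, using the sentence-transformers library and SciPy's cosine distance, is given below. The model name matches the 'roberta-base-nli-mean-tokens' model used in our runs and the thresholds are the ones discussed above; the function name and return format are ours.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("roberta-base-nli-mean-tokens")

def filter_posts(posts, bdi_answers, theta1=0.6, theta2=0.0):
    """Keep posts that are similar enough to at least one BDI answer
    and that are not equally (un)related to all answers."""
    post_emb = model.encode(posts)                      # (num_posts, dim)
    ans_emb = model.encode(bdi_answers)                 # (num_answers, dim)
    sim = 1.0 - cdist(post_emb, ans_emb, metric="cosine")  # cosine similarity (cf. Equation 1)
    kept = []
    for post, s in zip(posts, sim):
        if s.max() < theta1:
            continue                                    # unrelated to any BDI answer
        if s.max() - s.min() < theta2:
            continue                                    # too general, not specific to any answer
        kept.append((post, int(s.argmax()), float(s.max())))
    return kept
```

In our runs, 𝜃1 and 𝜃2 were tuned using the training labels, as noted in Section 3.3.1.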
   The second method is topic-based. We leveraged topic modeling to help identify relevant
posts for each question. We used top2vec [3] to divide all posts into topics that were then
used to find relevant posts for each question. The top2vec algorithm automatically finds the

   2
       https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cdist.html
Figure 1: Number of posts per BDI-question based on RoBERTa (𝜃1 = 0.6)


number of topics in a corpus. It finds topic vectors from jointly embedded document and word
vectors of a corpus. The main idea behind the algorithm is that it finds dense areas of documents
in the embedding space. The assumption of the algorithm is that the dense area of document
vectors represents an area of highly similar documents which are representative of a topic. A
topic vector is calculated from each dense area of documents as the centroid of those document
vectors. The topics are then described with the nearest word vectors to the topic vector. At the
end, each document is assigned to its nearest topic vector, allowing for the size of each topic to
be calculated. As an example, for the first question, we searched for the topics related to ’sad’
and ’feel’, using two relevant topics computed by top2vec, containing the following words:
   cry, crying, sad, cried, feeling, upset, emotional, . . .
   smile, eyes, myself, cry, face, laugh, beautiful, sad, beauty, wish, forgive, . . .
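A small sketch of how this topic-based selection could look with the top2vec package is shown below. The keyword queries mirror the example above; the exact API calls, parameters, and number of retrieved documents are assumptions rather than our production code.

```python
from top2vec import Top2Vec

# Train a joint document/word embedding and discover topics in all posts.
model = Top2Vec(documents=all_posts, speed="learn", workers=4)

# Find the topics most related to the keywords of BDI question 1 (sadness).
topic_words, word_scores, topic_scores, topic_nums = model.search_topics(
    keywords=["sad", "feel"], num_topics=2)

# Collect the posts assigned to (or closest to) those topics as candidates.
relevant_posts = []
for topic_num in topic_nums:
    docs, doc_scores, doc_ids = model.search_documents_by_topic(
        topic_num=topic_num, num_docs=50)
    relevant_posts.extend(docs)
```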

3.3. Classification
3.3.1. Zero-Shot Learning
The current dataset is relatively small for training 21 classifiers, one for each question of the BDI
questionnaire, especially if we consider only the posts that are related to each BDI question. In
addition, it is difficult to determine which posts answer which question. For this reason, we
decided to utilize transfer learning, more precisely zero-shot learning. Zero-shot learning is
a fast-emerging field in machine learning, with a broad range of computer vision and natural
language processing applications [4].
   We used a simple technique that relies on estimating the semantic similarity between the
BDI answer and each post of the user. We did make use of the training labels to choose the
thresholds 𝜃1 and 𝜃2 . Therefore the method is not totally unsupervised. Fixed thresholds could
be chosen if preferred.
   We used language models based on Sentence Transformers for deep contextual post represen-
tations: Sentence-BERT (SBERT) and Sentence-RoBERTa (SRoBERTa) [5]. SBERT/SRoBERTa
are modifications of the pre-trained BERT/RoBERTa networks that employ siamese and triplet
network architectures [6], as illustrated in Figure 2.
Figure 2: SBERT architecture. (https://www.sbert.net/).


   The following runs from Table 2 are based on zero-shot learning: uOttawa1_sim_BERT_base+,
uOttawa3_sim_BERT_large and uOttawa5_sim_RoBERTa+. uOttawa1_sim_BERT_base+ uses the
’bert-base-nli-mean-tokens’ model. The model starts by obtaining the word embeddings for
each word in a given sentence using the BERT base model, then calculates the average of the
word embeddings to produce SBERT embeddings. The weights are updated using siamese and
triplet networks to construct semantically relevant sentence embeddings that can be compared
using cosine similarity. The base model contains 12 layers, a hidden size of 768, 12 attention
heads, and 110M parameters [7]. In contrast, uOttawa3_sim_BERT_large uses the BERT large
model as the initial word embedding for each word in the sentence; it contains 24 layers, a
hidden size of 1024, 16 attention heads, and 340M parameters [7]. Similarly, uOttawa5_sim_RoBERTa+
is based on the ’roberta-base-nli-mean-tokens’ model.
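Putting the similarity scores together with a per-user vote (as used in the uOttawa1 run described in Section 4) gives a simple zero-shot predictor. The sketch below is illustrative rather than an exact reproduction of our runs: the model name and thresholds follow the uOttawa1 run description, the example answers and post come from Table 3, and the fallback when no post survives filtering is an assumption.

```python
from collections import Counter

import numpy as np
from scipy.spatial.distance import cdist
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("bert-base-nli-mean-tokens")

def zero_shot_answer(user_posts, bdi_answers, theta1=0.0, theta2=0.1):
    """Each sufficiently specific post votes for the BDI answer it is most
    similar to; the most frequent vote becomes the predicted answer."""
    post_emb = model.encode(user_posts)
    ans_emb = model.encode(bdi_answers)
    sim = 1.0 - cdist(post_emb, ans_emb, metric="cosine")
    votes = []
    for s in sim:
        if s.max() < theta1 or s.max() - s.min() < theta2:
            continue                               # unrelated or too general post
        votes.append(int(s.argmax()))
    if not votes:
        return 0                                   # hypothetical fallback when no post survives filtering
    return Counter(votes).most_common(1)[0][0]

# Answers of BDI question 1 (sadness) and the example post from Table 3.
answers_q1 = ["i do not feel sad",
              "i feel sad much of the time",
              "i am sad all the time",
              "i am so sad or unhappy that i can't stand it"]
print(zero_shot_answer(["this is so sad i cry every time"], answers_q1))
```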

3.3.2. BERT QA
One of the key limitations that BERT-based models [7] face is the length of the input sequence.
Due to the full attention mechanism, the computational memory requirement and the model
training time grow quadratically with the sequence length. In other words, if the sequence length
is 𝑛, the memory requirement when training the model will be proportional to 𝑛². For this reason,
the authors of the Transformer architecture [8] and of similar architectures, such as BERT, have
limited the sequence length to 512 tokens, which includes the special tokens [CLS] (classification
embedding) and [SEP] (sentence separator).
To overcome this sequence length limitation, many researchers have introduced different
architectures such as Longformer [9], Reformer [10] and BigBird [11]. Even though we conducted
several experiments in the form of multiple-choice question answering using the Longformer
architecture, we still found it resource-intensive, especially when the sequence length increases.
To obtain the results for the “uOttawa4_Ensemble_BERT_QA” run (see Table 2) and “uOttawa6”
(see Table 4), we conducted several preliminary experiments using the BigBird architecture in
the form of multiple-choice question answering. For the experiments in this section, we did
not filter the posts as mentioned in section 3.2 and tried to identify the impact of using the
content published by users soon before submitting the BDI questionnaire (i.e., results under
“uOttawa4_Ensemble_BERT_QA”). In addition, we also conducted several experiments by using
all the earliest Reddit posts from the collection of posts of users (for the “uOttawa6” run). After
pre-processing the data, we concatenated all the posts in the order of posts’ date and time. The
concatenated posts were then tokenized using the spaCy tokenizer, so that a limited number
of tokens can be extracted for training purposes. Due to the resource intensiveness of the
transformer-based models, we selected only 512 tokens from each user. It is important to note
that these 512 tokens do not reflect the number of tokens that will get generated when using the
BigBirdTokenizer [12] to tokenize the input data. We prepared our input, per user, per question,
in the following format, which was then tokenized using the BigBirdTokenizer.

$$ \Big[ \big[\,[\mathrm{CLS}]\ T_1\, T_2\, T_3 \ldots T_{510}\, T_{511}\, T_{512}\ [\mathrm{SEP}]\ Q_1\ C_1^{(q_1)}\ [\mathrm{SEP}]\,\big] \ \ldots\ \big[\,[\mathrm{CLS}]\ T_1\, T_2\, T_3 \ldots T_{510}\, T_{511}\, T_{512}\ [\mathrm{SEP}]\ Q_1\ C_4^{(q_1)}\ [\mathrm{SEP}]\,\big] \Big] $$

   Here $T_1$ to $T_{512}$ denote the sequence of tokens (512 in our experiments) extracted from the
concatenated posts of a user. $Q_1$ denotes the question, numbered from 1 to 21, which also
corresponds to the number of classifiers trained. $C_1^{(q_1)}$ denotes the first choice of the first
question, where the number of choices can vary between questions. According to the BDI
questionnaire, except
for questions 16 and 18, the questions had four choices, while questions 16 and 18 had seven
choices. Based on the number of choices, we changed the number of class labels accordingly.
We used the huggingface implementation of the BigBird model [12] to train the classifiers.
Given the “content”, “question” and the “choice” as mentioned above, we used the pre-trained
BigBird model and fine-tuned it on our task as a multi-class classifier predicting one out of four
or seven choices, based on the question. Unlike the BERT model, which uses full attention,
the BigBird model uses random, window, and global attention to reduce the quadratic impact
on training time and computational memory. When training, we used the Adam optimizer
and trained the model for ten epochs with early stopping. We created five stratified shuffle
splits [13] by allocating 80% for training and 20% for validation. For each stratified split, we
created separate models and saved the ones that produced the best results on the validation
data. During inference, the saved models were used on the test data to generate predictions,
and the softmax outputs were aggregated to create an ensembled output. The stratification was
based on the level of depression (i.e., minimal, mild, moderate, and severe depression) calculated
based on the answer provided for each question.
   It is important to note that even though the level of depression was used as the stratification
strategy, the answers provided by the participants within the same depression group were not
consistent. Due to resource intensiveness and training time, we trained the model with a batch
size of 1. Given the time constraints, we fine-tuned only a few of the hyperparameters of the
BigBird model. We set the block size of the model to be 64 and the number of random blocks
to be 5. The block size specifies the block size to be used with random, global and window
attention, and the number of random blocks specifies how many random blocks to be used
with the given block size. We could not use a sequence length larger than or equal to 1024 due
to computational limitations, although using sparse attention could be more effective with
sequences longer than 1024 tokens [12]. Given these constraints, we will conduct further
research in future work to identify more optimal hyperparameters to obtain better results.
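The sketch below shows how such a multiple-choice setup can be expressed with the Hugging Face BigBird implementation [12]. The block size, number of random blocks, and Adam optimizer follow the description above, while the model name (google/bigbird-roberta-base), the maximum length, and the toy training example are assumptions and the training loop is deliberately simplified.

```python
import torch
from transformers import AutoTokenizer, BigBirdForMultipleChoice

tokenizer = AutoTokenizer.from_pretrained("google/bigbird-roberta-base")
model = BigBirdForMultipleChoice.from_pretrained(
    "google/bigbird-roberta-base",
    attention_type="block_sparse",   # random + window + global attention
    block_size=64,
    num_random_blocks=5)

def encode_example(content, question, choices, max_length=1024):
    # One (content, question + choice) pair per choice -> shape (1, num_choices, seq_len).
    enc = tokenizer([content] * len(choices),
                    [f"{question} {choice}" for choice in choices],
                    truncation=True, padding="max_length",
                    max_length=max_length, return_tensors="pt")
    return {k: v.unsqueeze(0) for k, v in enc.items()}

# Toy placeholder; in practice the examples come from the 90 training users.
train_examples = [
    ("concatenated user posts ...", "Sadness",
     ["i do not feel sad", "i feel sad much of the time",
      "i am sad all the time", "i am so sad or unhappy that i can't stand it"], 1),
]

optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
model.train()
for content, question, choices, label in train_examples:   # batch size of 1
    inputs = encode_example(content, question, choices)
    outputs = model(**inputs, labels=torch.tensor([label]))
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```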
3.3.3. Universal Sentence Encoder QA
Each user has unique Reddit posts which may be on a variety of different topics. In order to
train a model to predict the answer of a user for a given BDI question, we would ideally only
use the posts that are relevant to answering the question. However, given the limited size of the
dataset, it would be difficult to train a model to learn which posts are relevant to a question and
to the user’s response. There is the additional challenge that some users may have a very large
number of posts, which cannot be processed all at once by deep learning models due to
computational constraints.
   In order to overcome these challenges, we leverage transfer learning by using the universal
sentence encoder [14] optimized for question answering (USE-QA). We use it to find the most
semantically-relevant posts of a user for each BDI question. We accomplish this by using the
response and question encoders of the USE-QA. We encode each of a user’s posts with the
response encoder. Then, for each BDI question, we formulate a query. For example, for question 1,
which is about sadness, we create "How sad do I feel?"; for question 2, which is about pessimism,
we create "How discouraged do I feel?". We then embed each one of these queries with the
USE-QA question encoder. The question and response embeddings can then be compared using
cosine similarity. In order to select the most relevant posts of a user to each question, we took
their top 10 most similar posts to each question. Once we have identified the most relevant
posts of a user to each BDI question we concatenate them together as they will be used to train
a neural network for each BDI question which predicts the user’s responses.
   The neural architecture we use is motivated by the deep averaging network [15] and the
need for transfer learning due to the small data size. For the embedding layer, we use USE to
embed the concatenated top 10 most relevant posts for a BDI question. This is followed by three
fully-connected layers with dropout and a final dense layer of size equivalent to the number of
responses for the given BDI question. We use a 90%/10% training and validation split on the
provided training data (90 users) and train the models for 10 epochs. At prediction time for the
test data (80 users) the USE-QA method was used to find the top 10 most relevant posts for each
BDI question for each user, then those were concatenated and put through the trained neural
network to predict user responses.
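A compact sketch of the retrieval step with the TensorFlow Hub USE-QA module is shown below. The module URL and the question/response encoder signatures follow the public TF Hub documentation as we understand it, and the query text mirrors the example above; treat the exact signature names as assumptions.

```python
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

use_qa = hub.load("https://tfhub.dev/google/universal-sentence-encoder-qa/3")

def top_k_posts(posts, query, k=10):
    """Return the k posts most relevant to a BDI-question query, concatenated."""
    q_emb = use_qa.signatures["question_encoder"](tf.constant([query]))["outputs"]
    r_emb = use_qa.signatures["response_encoder"](
        input=tf.constant(posts), context=tf.constant(posts))["outputs"]
    scores = np.inner(q_emb.numpy(), r_emb.numpy())[0]   # dot-product relevance scores
    best = np.argsort(-scores)[:k]
    return " ".join(posts[i] for i in best)

# Query for BDI question 1 (sadness), as in the example above.
relevant_text = top_k_posts(user_posts, "How sad do I feel?")
```

The concatenated relevant posts are then embedded with USE and fed to the small feed-forward classifier described above.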

3.3.4. Hierarchical Attention Network
For this method, we also trained 21 classifiers, one for each of the BDI questions, but we adopted a
hierarchical attention network (HAN) for document classification inspired by [16]. It employs
two levels of attention mechanisms, at the word and sentence levels, as described in Appendix A.
A word attention mechanism is utilized to identify keywords, which are then aggregated to create a
sentence vector. Then a sentence attention mechanism is used to emphasize the importance of
each sentence.


4. Results and Discussion
Table 2 shows the results of the 5 runs we submitted to the shared task, also available in the task
overview paper [1]. Here is a description of the methods we used to obtain the predictions
Table 2
Results for the submitted runs
               Run                                   AHR        ACR     ADODL          DCHR
               uOttawa1_sim_BERT_base+              28.39%     65.73%      78.91%      25.00%
               uOttawa2_top2vec_USE+                28.04%     63.00%      77.32%      27.50%
               uOttawa3_sim_BERT_large+             25.83%     59.68%      71.23%      27.50%
               uOttawa4_Ensemble_BERT_QA            27.68%     62.08%      76.92%      20.00%
               uOttawa5_sim_RoBERTa+                26.31%     62.60%      76.45%      30.00%


for the test users.

uOttawa1_sim_BERT_base+
This run used SBERT, which is pre-trained on a natural language inference (NLI) dataset in
addition to BERT’s Wikipedia pre-training. As described in section 3.3.1, after calculating the
cosine-similarity of each post against the BDI questionnaire answers, we filtered the unrelated
and general posts based on 𝜃1 and 𝜃2 . In this run, we kept all the posts, thus we set 𝜃1 = 0, and
we removed general posts by setting 𝜃2 = 0.1. Then, each post was assigned to the BDI answer
with the maximum similarity value, as illustrated in Table 3. Finally, the answers for each
user were aggregated using voting for the most frequent answer, for each question of the BDI
questionnaire. This is our best submitted run in terms of performance based on the first three
evaluation measures.

Table 3
Cosine-similarity with the post "this is so sad i cry every time"
                       0    i do not feel sad                                 0.0902
                       1    i feel sad much of the time                       0.8822
                       2    i am sad all the time                            0.9353
                       3    i am so sad or unhappy that i can’t stand it      0.9119



uOttawa2_top2vec_USE+
This run used the method described in section 3.3.3, based on the Universal Sentence Encoder with
a QA training architecture. Note that top2vec was not used in this method (we included top2vec
in the run’s name by mistake).

uOttawa3_sim_BERT_large+
This run is based on zero-shot learning using the model ’bert-large-nli-stsb-mean-tokens’. This
model is considered suitable for semantic textual similarity as it was fine-tuned on the NLI
dataset, then on the sentence similarity STS benchmark train set. In this run, we kept all the
posts, as in the uOttawa1_sim_BERT_base+ run, thus we set 𝜃1 = 0, but we changed 𝜃2 to 0.5 to
eliminate all relatively ambiguous posts.
Table 4
Results for unofficial runs
            Run                                   AHR       ACR     ADODL      DCHR
            uOttawa6                             29.46%    63.04%    78.31%    25.00%
            uOttawa7_Sim_HAN_cce_top_20_20       32.62%    65.99%    77.62%    22.50%
            uOttawa8_Sim_HAN_cce_top_10          30.48%    63.63%    72.88%    22.50%
            uOttawa9_Sim_HAN_cce                 31.73%    64.62%    75.22%    25.00%


uOttawa4_Ensemble_BERT_QA
This run is based on a BigBird model using 512 tokens from the end of the concatenated and
tokenized Reddit posts using the spaCy tokenizer (i.e., before tokenizing the sequence using the
BigBirdTokenizer). The training uses the QA models described in section 3.3.2.

uOttawa5_sim_RoBERTa+
This run is based on zero-shot learning (section 3.3.1) with the use of pre-trained ’roberta-base-
nli-mean-tokens’ model. In this run, we set 𝜃1 = 0.25 and 𝜃2 = 0, then we performed extra filtering
of the posts by removing any post whose maximum similarity to a BDI answer is less
than 0.6. This is our best submitted run in terms of the fourth evaluation measure, based on the
depression category.
   Table 4 shows results for four more runs, submitted unofficially since only 5 runs were
allowed for the official submission. They were kindly evaluated by the task organizers. Here is
the description of the methods used to produce these results.

uOttawa6
Then run "uOttawa6" is based on the architecture described in section 3.3.2.

uOttawa(7,8,9)_Sim_HAN_cce+
These three unofficial runs employed post filtering (based on similarity or on top2vec) and
supervised deep learning of the questionnaire answers based on a hierarchical attention network.
The post filtering was done either by setting 𝜃1 to 0.5 and, if no posts represented the category
of the questionnaire, adding the posts filtered by top2vec; or by selecting the top n
most-similar posts for each topic and then combining all the posts of the user into one document. We
set n to 10 or 20. Table 5 shows the number of posts selected for each run.
    We note that adding the supervised deep learning step (HAN), as described in section 3.3.4,
helped improve the results (especially for the uOttawa7_Sim_HAN_cce_top_20_20 run). Table
7 from Appendix B shows the results for each question for three runs. Questions 16 and 18,
about changes in sleeping and eating patterns, respectively, were the most difficult to answer.
    Our unofficially submitted runs obtained better performance for the first measure (see Tables 2
and 4), the correctness of the predicted answers (AHR 32.62% for uOttawa7_Sim_HAN_cce_top_20_20
Table 5
Number of posts included based on the selection criteria
   run                                    n/𝜃1    posts    train     test     post selection method
   uOttawa7_Sim_HAN_cce_top_20_20         20     14727     9043      5776     RoBERTa Sim
   uOttawa8_Sim_HAN_cce_top_10            10      9515      575      3798     RoBERTa Sim
   uOttawa9_Sim_HAN_cce                   0.5     6003     3570      2441     RoBERTa Sim & top2vec


versus 28.39% for uOttawa1_sim_BERT_base+), but not for the other measures. Table 6 compares
our results with the best results from the shared task, for the four measures.

Table 6
Our results compared to the best results from the shared task
                     Run                                     Rank      Our best      Best result
         AHR         uOttawa7_Sim_HAN_cce_top_20_20             12          32.62%       35.36%
         ACR         uOttawa7_Sim_HAN_cce_top_20_20             15          65.99%       73.17%
         ADODL       uOttawa1_sim_BERT_base+                    7           78.91%       83.59%
         DCHR        uOttawa5_sim_RoBERTa+                      5           30.00%       41.25%




5. Conclusion and Future Work
This paper presented several methods for the task of filling in the BDI questionnaire. We showed
that filtering posts by their relevance is beneficial for training classifiers to answer the
different questions. We also showed that zero-shot learning with pre-trained models can
be utilized for predicting the answers, with performance similar to that of the QA sequence-to-sequence
learning. In addition, deep learning models such as HAN on top of the post filtering led to our
best results.
   In future work, we plan to investigate possible ways to improve the performance. One
direction is to further pre-train generic sentence similarity measures on large amounts of
postings about mental health issues. Another direction is to investigate more ways to use
top2vec to detect the most relevant posts for each question and to test better linguistic analysis
methods for "change" detection to chose the correct answer to each question. Better ways to
implement zero-shot learning can also be investigated.


Acknowledgments
We thank the Natural Sciences and Engineering Research Council of Canada (NSERC) for supporting
our research.
References
 [1] J. Parapar, P. Martin-Rodilla, D. Losada, F. Crestani, Overview of eRisk 2021: Early Risk
     Prediction on the Internet, in: Experimental IR Meets Multilinguality, Multimodality, and
     Interaction. Proceedings of the Twelfth International Conference of the CLEF Association
     (CLEF 2021), Springer, 2021.
 [2] D. E. Losada, F. Crestani, A Test Collection for Research on Depression and Language Use,
     Springer, 2016, pp. 28–39. doi:10.1007/978-3-319-44564-9_3.
 [3] D. Angelov, Top2Vec: Distributed Representations of Topics, 2020. arXiv:2008.09470.
 [4] W. Wang, V. W. Zheng, H. Yu, C. Miao, A survey of zero-shot learning: Settings, methods,
     and applications, ACM Trans. Intell. Syst. Technol. 10 (2019).
 [5] N. Reimers, I. Gurevych, Sentence-BERT: Sentence Embeddings using Siamese BERT-
     Networks, in: Proceedings of the 2019 Conference on Empirical Methods in Natural
     Language Processing, Association for Computational Linguistics, 2019.
 [6] F. Schroff, D. Kalenichenko, J. Philbin, Facenet: A unified embedding for face recognition
     and clustering, in: 2015 IEEE Conference on Computer Vision and Pattern Recognition
     (CVPR), 2015, pp. 815–823.
 [7] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional
     Transformers for Language Understanding, in: Proceedings of the 2019 Conference of
     the North American Chapter of the Association for Computational Linguistics: Human
     Language Technologies, Volume 1 (Long and Short Papers), Association for Computational
     Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186.
 [8] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polo-
     sukhin, Attention is all you need, in: Advances in Neural Information Processing Systems,
     2017, pp. 5998–6008.
 [9] I. Beltagy, M. E. Peters, A. Cohan, Longformer: The Long-Document Transformer,
     arXiv:2004.05150 (2020).
[10] N. Kitaev, L. Kaiser, A. Levskaya, Reformer: The Efficient Transformer, in: International
     Conference on Learning Representations, 2020. URL: https://openreview.net/forum?id=
     rkgNKkHtvB.
[11] M. Zaheer, G. Guruganesh, K. A. Dubey, J. Ainslie, C. Alberti, S. Ontanon, P. Pham,
     A. Ravula, Q. Wang, L. Yang, et al., BigBird: Transformers for longer sequences, Advances
     in Neural Information Processing Systems 33 (2020).
[12] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf,
     M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao,
     S. Gugger, M. Drame, Q. Lhoest, A. Rush, Transformers: State-of-the-Art Natural Language
     Processing, in: Proceedings of the 2020 Conference on Empirical Methods in Natural
     Language Processing: System Demonstrations, Association for Computational Linguistics,
     Online, 2020, pp. 38–45.
[13] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel,
     P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher,
     M. Perrot, E. Duchesnay, Scikit-learn: Machine Learning in Python, Journal of Machine
     Learning Research 12 (2011) 2825–2830.
[14] Y. Yang, D. Cer, A. Ahmad, M. Guo, J. Law, N. Constant, G. H. Abrego, S. Yuan, C. Tar, Y.-H.
     Sung, et al., Multilingual universal sentence encoder for semantic retrieval, arXiv preprint
     arXiv:1907.04307 (2019).
[15] M. Iyyer, V. Manjunatha, J. Boyd-Graber, H. Daumé III, Deep Unordered Composition
     Rivals Syntactic Methods for Text Classification, in: Proceedings of the 53rd annual
     meeting of the association for computational linguistics and the 7th international joint
     conference on natural language processing (volume 1: Long papers), 2015, pp. 1681–1691.
[16] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, E. Hovy, Hierarchical attention networks for
     document classification, in: Proceedings of the 2016 Conference of the North American
     Chapter of the Association for Computational Linguistics: Human Language Technologies,
     Association for Computational Linguistics, San Diego, California, 2016, pp. 1480–1489.



A. Hierarchical Attention Network
This section describes the Hierarchical Attention Network (HAN) architecture we used for
the multi-class classification task for each category of the BDI questionnaire. The architecture is
shown in Figure 3. We trained 21 HAN classifiers using the top related posts for each category,
based on the parameters described earlier. HAN employs a bi-directional GRU at the word
level, followed by an attention model, to extract the most informative words, which are then
aggregated to generate a sentence vector, as shown in Figure 4. Similarly, a bi-directional LSTM
at the sentence level is used with an attention mechanism to aggregate the most essential
sentences to form the user-category vector, which is then passed on to a dense layer for text
classification using softmax activation, as shown in Figure 5. For training, we use a batch size of 128,
the Adam optimizer, and categorical cross-entropy as the loss function. We added a dropout layer
to avoid over-fitting.
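For completeness, below is a condensed Keras sketch of such a HAN (word-level BiGRU with attention, sentence-level BiLSTM with attention, dropout, and a softmax output trained with categorical cross-entropy and Adam). The dimensions, vocabulary size, and the attention-layer implementation are illustrative assumptions, not our exact training code.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

class AttentionPool(layers.Layer):
    """Additive attention pooling over the time axis, in the spirit of Yang et al. [16]."""
    def build(self, input_shape):
        d = int(input_shape[-1])
        self.W = self.add_weight(name="W", shape=(d, d), initializer="glorot_uniform")
        self.b = self.add_weight(name="b", shape=(d,), initializer="zeros")
        self.u = self.add_weight(name="u", shape=(d, 1), initializer="glorot_uniform")
    def call(self, h):
        v = tf.tanh(tf.tensordot(h, self.W, axes=1) + self.b)            # (batch, T, d)
        scores = tf.nn.softmax(tf.tensordot(v, self.u, axes=1), axis=1)  # (batch, T, 1)
        return tf.reduce_sum(scores * h, axis=1)                          # (batch, d)

# Illustrative dimensions (sentences per user-category document, words per sentence, ...).
MAX_SENTS, MAX_WORDS, VOCAB, EMB, NUM_CLASSES = 20, 50, 20000, 100, 4

# Word-level encoder: BiGRU + attention over the words of one sentence.
word_in = layers.Input(shape=(MAX_WORDS,), dtype="int32")
x = layers.Embedding(VOCAB, EMB)(word_in)
x = layers.Bidirectional(layers.GRU(64, return_sequences=True))(x)
sent_vec = AttentionPool()(x)
word_encoder = models.Model(word_in, sent_vec)

# Sentence-level encoder: BiLSTM + attention over the sentence vectors.
doc_in = layers.Input(shape=(MAX_SENTS, MAX_WORDS), dtype="int32")
s = layers.TimeDistributed(word_encoder)(doc_in)
s = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(s)
doc_vec = layers.Dropout(0.5)(AttentionPool()(s))
out = layers.Dense(NUM_CLASSES, activation="softmax")(doc_vec)

han = models.Model(doc_in, out)
han.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```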
Figure 3: Hierarchical Attention Network [16].




Figure 4: Word Encoder for Question 2 (Sim_HAN_cce_top_20_20).




Figure 5: Attention model summary for Question 2 (Sim_HAN_cce_top_20_20).
B. Results per Question
Table 7 shows the results for each question for three runs.

Table 7
AHR results per category for our best runs
   Question    BDI Question                  uOttawa1 AHR     uOttawa5 AHR    uOttawa7 AHR
   1           Sadness                               31.25            22.50           53.75
   2           Pessimism                             36.25            37.50           28.75
   3           Past failure                          32.50            36.25           38.75
   4           Loss of pleasure                      32.50            12.50           38.75
   5           Guilty feelings                       36.25            35.00           41.25
   6           Punishment feelings                   35.00            28.75           42.50
   7           Self-dislike                          17.50            32.50           25.00
   8           Self-criticalness                     22.50            21.25           28.75
   9           Suicidal thoughts or wishes           46.25            38.75           36.25
   10          Crying                                32.50            31.25           32.50
   11          Agitation                             23.75            27.50           41.25
   12          Loss of interest                      32.50            26.25           27.50
   13          Indecisiveness                        31.25            22.50           23.75
   14          Worthlessness                         27.50            26.25           31.25
   15          Loss of energy                        23.75            25.00           20.00
   16          Changes in sleeping pattern           12.50            17.50           23.75
   17          Irritability                          31.25            28.75           30.00
   18          Changes in appetite                   12.50            15.00           23.75
   19          Concentration difficulty              27.50            32.50           22.50
   20          Tiredness or fatigue                  25.00            22.50           36.25
   21          Loss of interest in sex               26.25            12.50           38.75