Early risk detection of mental illnesses using various types of textual features

Rodrigo Ferreira, Alina Trifan and José Luís Oliveira
DETI/IEETA, University of Aveiro, Portugal

Abstract
This paper documents the participation of our team, BioInfo@UAVR, in the first and second tasks of the 2022 edition of CLEF eRisk. The goal of both tasks is the early detection of subjects at risk of specific mental illnesses, pathological gambling and depression for tasks 1 and 2 respectively, using data sourced from social networks, more specifically Reddit. To this end, we trained several machine learning models, dividing our experiments into three feature engineering approaches of increasing complexity, corresponding to vectorization methods very commonly used in natural language processing: bag-of-words, distributional semantics word embeddings and contextualized language models. Additionally, we evaluated the impact of the inclusion of sentiment analysis features. Despite having subpar results in the official evaluation, we managed to improve them considerably post submission with a few tweaks, showing that these solutions can work if properly fine-tuned.

Keywords
Natural language processing, Machine learning, Mental health, Social mining

1. Introduction
Mental health is a state of well-being related to an individual's cognitive, behavioral and emotional state, and it plays an important role in self-esteem throughout our lives. Mental health disorders, given their less physical nature when compared to muscular, skeletal or organ issues (which can be examined directly or seen through imaging), can be harder to diagnose properly. Sadly, the lack of proper support surrounding these disorders, as well as the stigma attached to them, often prevents individuals from seeking out help, which may lead to the worsening of their conditions and to the potential risk of self-harm or even suicidal behaviours. According to statistics published by the World Health Organization (WHO)1, over 700 000 people die by suicide each year, with many more attempting it. Many of these deaths could be prevented with the appropriate care.

The traditional process of diagnosing a mental illness is far from optimal, since it may fail due to various factors. First, the requirement of the patient's physical presence may be hard to fulfil due to lack of access, the secluding nature of many mental illnesses, or the social stigma surrounding them. For the same reasons, the subject might not feel comfortable being transparent when describing their situation to the healthcare professional. Patient aside, the process may also fail due to the lack of experience of the healthcare professional, who may sometimes be only a general health practitioner from a health center or emergency room with somewhat limited expertise. Naturally, it might also fail due to the screening tool used, although those tend to be sturdy, with large amounts of research backing them up.

CLEF 2022: Conference and Labs of the Evaluation Forum, September 5–8, 2022, Bologna, Italy
rodrigommf@ua.pt (R. Ferreira); alina.trifan@ua.pt (A. Trifan); jlo@ua.pt (J. L. Oliveira)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)
1 https://www.who.int/teams/mental-health-and-substance-use/data-research/suicide-data
It should come as no surprise that, with all these possible failure points, the traditional methods can lead to suboptimal results. This is where the idea of using social data comes in, since it addresses many of these points. Social data corresponds to the data that social media users publicly share online (Reddit2, Twitter3, etc.). Its public and easily accessible nature makes it a big facilitator for many big data applications, by substituting or complementing traditional data, which is harder to acquire. Naturally, it also brings a fair share of concerns: despite the data being public, users might not be comfortable having it used for purposes other than the typical use of social media.

The process of analyzing social data with the end goal of understanding trends, opinions or behaviours within a population is called social monitoring. It has been used across many fields [1], with goals such as:

• measuring consumer sentiment;
• measuring political sentiment;
• forecasting sales;
• estimating traffic congestion;
• forecasting elections;
• contagious disease surveillance;
• monitoring gun violence cases.

Given that the traditional screening solutions in the mental health field can fail at various points, a social monitoring approach may be useful, since it may be able to tackle some of the issues found in the traditional methods. Imagining, for example, a depression screening tool deployed on social media, one can easily see how the requirement of being physically present with a health specialist is solved. And, since there is no direct interaction with a health specialist, users might also feel more comfortable accurately portraying their thoughts, as they already tend to do on social media. Naturally, this is not a perfect solution, as there are still some concerns, namely regarding the effectiveness of such a screening tool and the public's acceptance of being subjected to it, since in the past, tools of this type have been met with backlash due to ethical concerns over the use of potentially sensitive personal data in ways not intended by the author at the time of publication [1]. An example of such backlash was a Twitter plug-in called Samaritan's Radar4, which monitored users' social media data without their consent with the hope of notifying its user when the accounts they followed showed signs of struggling with suicidal thoughts. It launched in October 2014, was suspended the following month, and by March 2015 it was permanently closed, due to the public's concern about being monitored without consent and the potential dangers of this tool if used with bad intentions (as it makes it easier to target suicidal people).

2 https://www.reddit.com
3 https://twitter.com
4 https://www.samaritans.org/about-samaritans/research-policy/internet-suicide/samaritans-radar/

That said, the early risk5 (eRisk) workshop, part of the Conference and Labs of the Evaluation Forum6 (CLEF), incentivizes the development of solutions capable of capturing traits of individuals at risk on the internet, in the context of shared tasks, by providing test collections [2]. Besides the classic options, it employs novel evaluation metrics such as ERDE [2] and the latency-weighted F score [3], which evaluate the participating models not only on the correctness of their classifications, but also on their timeliness, since faster classifications allow for better interventions.
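For reference, a minimal sketch of how these two metrics can be computed, assuming the standard formulations from [2] and [3]; the cost values and the latency parameter p below are illustrative placeholders, not the official eRisk settings:

```python
import numpy as np

def erde(decision: int, truth: int, k: int, o: int,
         c_fp: float = 0.1, c_fn: float = 1.0, c_tp: float = 1.0) -> float:
    """Early Risk Detection Error for one subject, decided after k writings [2]."""
    if decision == 1 and truth == 0:          # false positive
        return c_fp
    if decision == 0 and truth == 1:          # false negative
        return c_fn
    if decision == 1 and truth == 1:          # true positive, penalized by its delay
        latency_cost = 1.0 - 1.0 / (1.0 + np.exp(k - o))
        return latency_cost * c_tp
    return 0.0                                # true negative

def latency_weighted_f1(f1: float, tp_delays: list, p: float = 0.0078) -> float:
    """F1 multiplied by a speed factor based on the median delay of true positives [3]."""
    penalty = -1.0 + 2.0 / (1.0 + np.exp(-p * (np.median(tp_delays) - 1)))
    return f1 * (1.0 - penalty)
```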
The focus is put on common issues in the mental health area, having tackled the detection of conditions such as depression, self-harm, eating disorders and pathological gambling. Here, we document the development of models potentially capable of identifying users showing signs of mental illness, namely pathological gambling and depression, using writings they posted on the Reddit social network, corresponding to tasks 1 and 2 of the 2022 edition of eRisk [4].

2. Data
This section explains which datasets were used to accomplish each task, the preprocessing procedures, and how the data was split to allow for our machine learning experiments.

2.1. Task 1: Pathological gambling
In order to address this challenge, the official dataset provided by the CLEF organizers was used; it consisted of last year's test dataset for the same task [5]. The dataset was made available on a dedicated server and consists of a golden truth text file, which maps subjects to their corresponding label, 0 or 1, for control subjects and pathological gamblers respectively, and a folder containing one XML file per subject with information regarding their writings (posts/comments). In its original state, there were records of writings for 2348 subjects, 164 of which were labeled 1 and 2184 labeled 0.

In order to achieve a more user-friendly dataset format, after preprocessing, all writings were aggregated into a single CSV file. Plenty of writings had artifacts of the extraction process, mostly in the form of HTML tags. At this stage, a Jupyter notebook was used which iterated through the golden truth text file, saved the subject IDs and corresponding labels, and then iterated through each of the subject's XML files. For each file, the writings were subjected to the following sequence of transformations (done mostly with the regular expressions7 and contractions8 libraries):

1. alphabetical characters were converted to lowercase;
2. HTML and decimal Unicode artifacts were converted to their corresponding symbols;
3. various types of emojis such as ":)" and ":(" were converted to "smiling emoji" and "negative emoji" respectively;
4. mentions of website links, Reddit users and subreddits were converted to the tokens "url", "user" and "subreddit" respectively;
5. numbers beginning or ending with "€" (Euro) or "$" (Dollar) were substituted by the token "money";
6. remaining isolated tokens comprised only of numbers were converted to the token "number";
7. contractions of common words were transformed into their original form (for example, "you're" and "I'm" turn into "you are" and "i am");
8. a dictionary of common internet slang terms was constructed to convert words into their original form (for example, "omg" and "ty" turn into "oh my god" and "thank you");
9. punctuation characters were discarded except for "!.?,", as these can be important for some of the feature extraction methods used.

5 https://erisk.irlab.org
6 https://clef2022.clef-initiative.eu
7 https://docs.python.org/3/library/re.html
8 https://github.com/kootenpv/contractions

The resulting output was put through a fastText language detection model9 [6, 7]; posts classified as non-English were discarded, while English writings were printed to a CSV file. Regarding the language detection, this model was chosen due to its speed and decent accuracy. With some manual checking early on, we found a significant amount of misclassifications, especially with shorter texts and texts containing lots of abbreviations and acronyms.
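A minimal sketch of this cleaning and language filtering step is shown below; the regular expressions, the slang dictionary and the model filename are hypothetical stand-ins for the ones actually used in our notebook:

```python
import re
import contractions
import fasttext  # language identification model, see footnote 9

# Hypothetical slang dictionary; the real one was built manually.
SLANG = {"omg": "oh my god", "ty": "thank you"}

def clean_writing(text: str) -> str:
    text = text.lower()
    text = contractions.fix(text)                                  # "you're" -> "you are"
    text = re.sub(r"https?://\S+|www\.\S+", " url ", text)         # website links
    text = re.sub(r"/?u/\w+", " user ", text)                      # reddit users
    text = re.sub(r"/?r/\w+", " subreddit ", text)                 # subreddits
    text = re.sub(r"[:;]-?\)", " smiling emoji ", text)            # ":)" style emojis
    text = re.sub(r"[:;]-?\(", " negative emoji ", text)
    text = re.sub(r"[€$]\s?\d+([.,]\d+)?|\d+([.,]\d+)?\s?[€$]", " money ", text)
    text = re.sub(r"\b\d+\b", " number ", text)                    # isolated numbers
    text = " ".join(SLANG.get(tok, tok) for tok in text.split())   # slang expansion
    text = re.sub(r"[^\w\s!.?,]", " ", text)                       # keep only "!.?,"
    return re.sub(r"\s+", " ", text).strip()

# Language filtering with the fastText lid.176 model.
lid = fasttext.load_model("lid.176.bin")

def is_english(text: str, k: int = 1) -> bool:
    labels, _ = lid.predict(text.replace("\n", " "), k=k)  # k=2 in the final version (see below)
    return "__label__en" in labels
```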
In order to improve this, the script was changed to return the top 2 most likely languages for each writing; if English was present in the top 2, the writing was kept, and otherwise discarded. With this change, upon some manual checking, the outputs seemed much better, and of 1129560 total writings, 57534 were filtered out. The remaining preprocessed posts were then aggregated into a CSV file in original and lemmatized forms.

The resulting dataset was significantly imbalanced: the control group represented 93% (2184) of the total subjects vs 7% (164) of pathological gamblers, and 95% (1016947) of the total writings vs 5% (54915) for pathological gamblers. The method chosen to handle this imbalance was random undersampling in terms of users (not writings), which was repeatedly executed to bring the number of control group subjects down to match the number of positive subjects, in a way that results in a somewhat balanced number of writings for each class (since subjects do not have a fixed amount of writings). With the undersampling approach we end up with a dataset balanced in terms of subjects of each class (164 each), and nearly balanced in terms of posts from subjects of each class (65266 vs 54915).

At this point, the data was split 80%/20% into training and validation sets respectively. The split was performed on subjects, ensuring that any given subject's writings may only appear in either the training or the validation set. This was done to make sure that there is no information leak and that we can evaluate the trained models on data belonging to completely new subjects. The official testing set was only made available at the time of submission, so it wasn't used in the initial training stage but only afterwards when evaluating; it comprised 81 positive and 1998 control subjects, with 14627 and 1014122 writings respectively. More information on all mentioned sets of data can be found in Table 1.

9 https://fasttext.cc/docs/en/language-identification.html

Table 1
Composition of the various sets of data for task 1 (subjects / writings).

Set            | Positive                 | Negative                    | Total
Original       | 164 (7%) / 55677 (5%)    | 2184 (93%) / 1073883 (95%)  | 2348 / 1129560
Preprocessed   | 164 (7%) / 54915 (5%)    | 2184 (93%) / 1016947 (95%)  | 2348 / 1071862
Undersampled   | 164 (50%) / 54915 (46%)  | 164 (50%) / 65266 (54%)     | 328 / 120181
Training set   | 131 (50%) / 44805 (46%)  | 131 (50%) / 53053 (54%)     | 262 / 97858
Validation set | 33 (50%) / 10110 (45%)   | 33 (50%) / 12213 (55%)      | 66 / 22323
Testing set    | 81 (4%) / 14627 (1%)     | 1998 (96%) / 1014122 (99%)  | 2079 / 1028749

Table 2
Composition of the various sets of data for task 2 (subjects / writings).

Set            | Positive                 | Negative                   | Total
Original       | 214 (13%) / 90222 (8%)   | 1493 (87%) / 986360 (92%)  | 1707 / 1076582
Preprocessed   | 214 (13%) / 89010 (8%)   | 1493 (87%) / 970622 (92%)  | 1707 / 1059632
Undersampled   | 214 (50%) / 89010 (42%)  | 214 (50%) / 121917 (58%)   | 428 / 210927
Training set   | 171 (50%) / 73670 (42%)  | 171 (50%) / 100503 (58%)   | 342 / 174173
Validation set | 43 (50%) / 15340 (42%)   | 43 (50%) / 21414 (58%)     | 86 / 36754
Testing set    | 98 (7%) / 35332 (5%)     | 1302 (93%) / 687228 (95%)  | 1400 / 722560

2.2. Task 2: Depression
In accordance with the choice made for task 1, for task 2 only the official data provided by the organizers was used, which consisted of the training and testing datasets from 2017 [8] and the testing dataset from the 2018 edition of eRisk [9].
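Both tasks share the user-level undersampling and the subject-wise 80%/20% split described above, so a minimal sketch of that procedure is given here before detailing the task 2 data; the DataFrame layout (columns 'subject', 'label', 'text') is a hypothetical simplification of our actual CSV files:

```python
import numpy as np
import pandas as pd

def undersample_and_split(df: pd.DataFrame, seed: int = 42):
    """User-level random undersampling followed by a subject-wise 80%/20% split.

    df holds one preprocessed writing per row, with columns 'subject' (user id)
    and 'label' (0 = control, 1 = positive class).
    """
    rng = np.random.default_rng(seed)

    pos_subjects = df.loc[df.label == 1, "subject"].unique()
    neg_subjects = df.loc[df.label == 0, "subject"].unique()

    # Keep as many control users as positive ones; in practice this draw was
    # repeated until the per-class writing counts were also roughly balanced.
    kept_neg = rng.choice(neg_subjects, size=len(pos_subjects), replace=False)
    balanced = df[df.subject.isin(np.concatenate([pos_subjects, kept_neg]))]

    # 80/20 split on subjects (per class), so no user appears in both sets.
    train_subjects, valid_subjects = [], []
    for label in (0, 1):
        subjects = balanced.loc[balanced.label == label, "subject"].unique()
        rng.shuffle(subjects)
        cut = int(0.8 * len(subjects))
        train_subjects.extend(subjects[:cut])
        valid_subjects.extend(subjects[cut:])

    return (balanced[balanced.subject.isin(train_subjects)],
            balanced[balanced.subject.isin(valid_subjects)])
```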
This dataset's structure was slightly different from that of the previous task: instead of using a golden truth text file, the XML files containing each user's writings were grouped into positive and negative folders, corresponding to the 1 and 0 labels respectively, and divided by year, since, as mentioned, the organizers provided data from 2017 and 2018. Regarding the individual files, the format was the same as in the pathological gambling task. Originally, the data from 2017 contained 752 negative and 135 positive subjects. Likewise, the 2018 folder contained 741 negative and 79 positive subjects. This comes to a total of 1707 subjects, 214 of which are positive (depressed).

Like before, the writings were aggregated into a single CSV file. The preprocessing was the same as in the previous task, following the same transformations and language detection methods. Out of 1076582 total writings, 16950 were discarded. Though not as imbalanced as in the pathological gambling case, this dataset was still heavily skewed, with 87% of the users labeled as negative and 13% as positive, and 92% of the writings corresponding to the negative class vs 8% to the positive one. Naturally, a random undersampling algorithm was employed to reduce the number of control group subjects from 1493 to 214 in order to match the number of depressed subjects. As expected, the number of writings also became more balanced, going from 970622 negative writings to 121917, much closer to the 89010 positive writings. Again, this resulting dataset was split into an 80%/20% train/validation split, divided by users, leading to a training set containing 171 positive and 171 negative subjects, with 73670 positive and 100503 negative writings. The validation set contained 43 positive and 43 negative subjects, with 15340 positive and 21414 negative writings. As was the case previously, we only obtained the testing dataset at the time of submission; it comprised 35332 writings belonging to 98 positive subjects and 687228 belonging to 1302 control subjects. A brief summary of the mentioned sets of data can be seen in Table 2.

3. Feature engineering techniques
When it comes to feature engineering, many different approaches have been used in the past. In this work we rely mostly on textual features, though experiments with sentiment analysis features were also integrated.

First, we must decide on what constitutes a single sample. Is it a single writing from a subject, multiple writings grouped by some criterion (length, time of posting, etc.), or even all of their writings? Initially, user writings were used individually as samples, but this quickly led to subpar results, likely due to the fact that some writings are extremely short and it is difficult to assess which class a writing's author belongs to given only 2 or 3 words. To overcome this, inspired by NLP-UNED's run [10] in the 2021 edition of CLEF eRisk, a k-sliding-window method was implemented, where each sample consists of the last k writings at a given point in time (a sketch of this windowing is shown below). The choice of k is also important here: in the initial experiments we were essentially employing a sliding window of k=1; we also performed experiments with k values of 3 and 5, which resulted in immediate improvements across the board.
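A minimal sketch of this k-sliding-window sample construction, assuming writings already sorted chronologically in a pandas DataFrame with the same hypothetical columns as above:

```python
import pandas as pd

def build_windows(df: pd.DataFrame, k: int = 5) -> pd.DataFrame:
    """Turn individual writings into samples made of the last k writings seen so far."""
    samples = []
    for subject, group in df.groupby("subject", sort=False):
        texts = group["text"].tolist()
        label = group["label"].iloc[0]
        for i in range(len(texts)):
            window = texts[max(0, i - k + 1): i + 1]   # the last k writings at this point
            samples.append({"subject": subject, "label": label,
                            "text": " ".join(window)})
    return pd.DataFrame(samples)

# e.g. windows_df = build_windows(writings_df, k=5)
```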
Larger values would likely produce better results for the first 2 feature engineering approaches, which we discuss next, but given the sequence length limitation of the language model used in approach 3 (since the windows are constructed prior to the feature extraction process), to keep things fair and simple, we decided to ensure that the models submitted for each task all ran on writing windows of the same length, resulting in a maximum k size of 5.

The textual features were essentially split into 3 categories: we implemented models using Bag-of-Words (BoW) features with Tf-Idf in approach 1, GloVe [11] word embeddings in approach 2, and contextualized language model representations in approach 3. The previously mentioned sentiment analysis features are those produced by the sentiment analysis tools Vader10 and TextBlob11. TextBlob's tool provided us with 2 scores for each text analyzed, regarding its subjectivity (opinionated or fact based) and polarity (how positive or negative the text is). Vader's features consisted of 4 values: scores regarding a text's negativity, neutrality and positivity, and a compound value which acts as a single overall measure of the previous three. The set of features generated from these two tools will be referred to from now on as SA features (for Sentiment Analysis). Experiments were run with textual features alone and with textual and SA features combined, to evaluate their effectiveness in this scenario.

10 https://github.com/nltk/nltk/blob/develop/nltk/sentiment/vader.py
11 https://github.com/sloria/textblob

3.1. Approach 1: Bag-of-Words
A Bag-of-Words (BoW) feature extraction method treats documents exactly as the name suggests: documents are defined by the presence/amount of each word they contain. This method works but is usually too simplistic, and it is often improved by employing a Term-frequency Inverse-document-frequency (Tf-Idf) weighting methodology. This methodology tries to weight how important a certain word is in a given document relative to the whole corpus, assigning higher weights to words that occur more often within fewer documents. For a given term $t$ and document $d$ found in corpus $D$:

$$Tf(t, d) = \frac{\text{occurrences of } t \text{ in } d}{\text{total occurrences in } d} \quad (1)$$

$$Idf(t, D) = \log \frac{|D|}{|\text{documents in } D \text{ with } t|} \quad (2)$$

$$TfIdf(t, d, D) = Tf(t, d) \times Idf(t, D) \quad (3)$$

The end result of this feature extraction method is the learned vocabulary and the Idf weights. When vectorizing a document, we can calculate the Tf-Idf weights of its tokens by performing the previously mentioned calculations with the learned Idf weights for the tokens present in the learned vocabulary (with unknown tokens usually being ignored), providing a simple yet effective way to represent text as number vectors. This process can be altered by changing some parameters, such as:

• max features: maximum number of tokens in the learned vocabulary;
• n-gram range: how many tokens to group as a unit in the vocabulary (1-grams being groups of 1 token, 2-grams of 2 tokens, etc.);
• stopwords list: tokens that are not valid candidates for the learned vocabulary and are discarded;
• max/min document frequency: filter out tokens that occur in more/less than x documents, for max and min respectively, with x being an absolute value or a fraction.

Thanks to the library sklearn12, this is all implemented in the form of the class TfidfVectorizer, which allows us to create a vectorizer that, given the desired parameters and a training corpus, can later transform samples of text into number vectors.
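A minimal sketch of this vectorization with scikit-learn; the parameter values shown are illustrative, the ones actually used were selected by the randomized search described next:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    max_features=20000,    # cap on the size of the learned vocabulary
    ngram_range=(1, 2),    # unigrams and bigrams
    stop_words="english",  # built-in English stopword list
    max_df=0.95,           # drop tokens present in more than 95% of documents
    min_df=5,              # drop tokens present in fewer than 5 documents
)

X_train = vectorizer.fit_transform(train_windows)  # learns vocabulary and Idf weights
X_valid = vectorizer.transform(valid_windows)      # reuses them on unseen windows
```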
To find the optimal Tf-Idf parameter values (of those previously mentioned) to use in both tasks, a randomized search was used to experiment with different TfidfVectorizer parameter combinations, feeding the transformed data into a Naive Bayes classifier to evaluate their effectiveness. This classifier was chosen due to its speed and good out-of-the-box results, to facilitate these experiments.

12 https://github.com/scikit-learn/scikit-learn

3.2. Approach 2: Distributional semantics word embeddings
In this approach, as previously stated, we use pre-trained distributional semantics word embeddings (which we will refer to as WE) to represent the windows of writings. Word embeddings consist of models that map tokens to numerical vector representations based on their distributions in the corpus at the learning stage. These tend to perform better than BoW methods, since the models can capture the meaning of the tokens (their distribution with regard to other tokens, hence the distributional semantics part), whereas BoW only takes into account their distribution across the corpus and documents. Interesting properties are achieved as a result of this: for instance, many types of relations can be encoded in the resulting vector space and analogy operations are made possible. Despite this improvement, word embeddings like these still fail to take context into account; for instance, the word "bank" in "by the river bank" and "got a loan from the bank" would have the same vector representation, even though it refers to 2 different entities.

There were lots of possibilities for pre-trained models, but 2 were selected. The first was a 200-dimensional GloVe13 [11] model trained on Twitter data, chosen due to the reliability of GloVe models and the similarity between the Twitter and Reddit social networks. The other was a 300-dimensional FastText14 model trained on data from Common Crawl15, selected due to its versatility in handling out-of-vocabulary tokens. Windows of writings were represented by the mean of their tokens' embeddings.

3.3. Approach 3: Contextualized language models
Contextualized language models (LMs) leverage the power of deep learning by training large neural networks to achieve great performance in a variety of Natural Language Processing (NLP) tasks. Given the large computational costs that come with building language models from scratch, we decided to use a pre-trained model in this case. There was a large variety of models we could use, but we selected all-MiniLM-L6-v216 as the model for extracting the sequence embeddings, using the SentenceTransformers17 library [12].

This MiniLM model originated from the pre-trained 6-layer version of Microsoft's MiniLM-L12-H384-uncased (obtained by keeping every second layer) [13], fine-tuned on a 1B sentence-pair dataset, where, given a sentence from a pair, the model had to select its correct match from a set of random samples. MiniLM-L12-H384-uncased, in turn, was obtained through knowledge distillation from BERT. Knowledge distillation is the process of compressing large models (also known as teachers in this context) by training smaller models (known as students) to imitate their behaviour. Instead of learning by computing the loss from the golden truth labels, the student model uses the output of the teacher's layers as the true label (not necessarily the final layer, as was the case with MiniLM). The resulting model maps sentences to a 384-dimensional vector space.
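A minimal sketch of how a window of writings can be vectorized under approaches 2 and 3, assuming the gensim model name "glove-twitter-200" for the Twitter GloVe embeddings and the SentenceTransformers checkpoint named above:

```python
import numpy as np
import gensim.downloader
from sentence_transformers import SentenceTransformer

# Approach 2: mean of the GloVe embeddings of the tokens in the window.
glove = gensim.downloader.load("glove-twitter-200")   # 200-dimensional vectors

def glove_window_vector(window: str) -> np.ndarray:
    vectors = [glove[tok] for tok in window.split() if tok in glove]
    return np.mean(vectors, axis=0) if vectors else np.zeros(glove.vector_size)

# Approach 3: sentence embedding of the whole window with all-MiniLM-L6-v2.
minilm = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def minilm_window_vector(window: str) -> np.ndarray:
    return minilm.encode(window)   # 384-dimensional vector; long windows are truncated
```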
This choice resulted from the fact that, despite being a very lightweight model, according to the comparison table18 in the SentenceTransformers documentation, all-MiniLM-L6-v2 performed rather well across 14 sentence embedding tasks; another relevant factor was that a large subset of its training data derived from Reddit comments.

13 https://nlp.stanford.edu/projects/glove/
14 https://fasttext.cc/docs/en/english-vectors.html
15 https://commoncrawl.org/2017/06/
16 https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
17 https://www.sbert.net
18 https://www.sbert.net/docs/pretrained_models.html

Table 3
Top performing models on task 1 training data for each window size and textual feature type, classifying individual writing windows.

Features | Window size | Model | F1   | Sentiment analysis
TF-IDF   | 1           | NB    | 0.67 | No
TF-IDF   | 3           | LR    | 0.72 | No
TF-IDF   | 5           | LR    | 0.75 | No
GloVe    | 1           | LR    | 0.67 | No
GloVe    | 3           | LR    | 0.74 | Yes
GloVe    | 5           | SVM   | 0.77 | Yes
MiniLM   | 1           | SVM   | 0.68 | No
MiniLM   | 3           | LR    | 0.74 | No
MiniLM   | 5           | LR    | 0.76 | Yes

4. Models
We divided each task into 2 parts: one regarding the classification of windows of writings, to assess whether a certain number of consecutive user writings (a window) is likely to come from a positive subject, and another which turns the results of the window classifications of a given subject into a subject classification, which is our end goal. It is worth mentioning that we evaluated the window classification models with 5-fold cross-validation on the training set, and dedicated the validation set to finding confidence thresholds for the user classification.

4.1. Writing window classification
In order to make the performance of the different feature types somewhat comparable, the models used remained mostly the same across experiments, consisting of sklearn's implementations of Logistic Regression (LR), Naive Bayes (NB), linear Support Vector Machine (SVM), ExtraTrees and Perceptron models. It is worth noting that Naive Bayes was only used in approach 1 due to its inability to deal with the negative feature values present in approaches 2 and 3. Also, due to the large feature space and number of samples, both the LR and SVM models were trained using stochastic gradient descent with sklearn's SGDClassifier.

Here, for each feature extraction approach we performed 5-fold cross-validation while ensuring that no subject's writings were present in the training folds and the testing fold simultaneously, to better imitate the end goal of the models being developed and to avoid information leakage. Due to the sheer number of experiments, we show only the best performing models for each feature type and window size in Tables 3 and 4. One can make some observations from these results: for instance, larger window sizes lead to a better F1, GloVe embeddings consistently outperformed those of FastText, the inclusion of SA features was not consistently good or bad, and windows of depressed subjects seem to be harder to classify. The best performing models in terms of F1 score for each approach with window size 5 were selected.

Table 4
Top performing models on task 2 training data for each window size and textual feature type, classifying individual writing windows.

Features | Window size | Model | F1   | Sentiment analysis
TF-IDF   | 1           | NB    | 0.54 | No
TF-IDF   | 3           | NB    | 0.65 | No
TF-IDF   | 5           | NB    | 0.66 | No
GloVe    | 1           | SVM   | 0.53 | No
GloVe    | 3           | LR    | 0.66 | Yes
GloVe    | 5           | LR    | 0.67 | Yes
MiniLM   | 1           | LR    | 0.53 | No
MiniLM   | 3           | SVM   | 0.63 | Yes
MiniLM   | 5           | SVM   | 0.66 | Yes

4.2. User Classification
At this stage, we essentially selected the best models for each approach and applied some criterion to classify the authors of the writings, since until now we were classifying windows of writings and not the authors themselves. Various choices can be made here: the NLP-UNED team [10], from which we drew inspiration for the rolling window method, used a parameter k of consecutive positive windows as a signal that the subject should be classified as positive, but other criteria, such as a percentage or an absolute number of positive window classifications up to a point (since writings are retrieved chronologically), can also be used. We followed the "absolute value" approach with a threshold of 1, meaning that a single positive window is enough to classify a user as positive.

Additionally, since our models were trained with a balanced dataset (as a result of the undersampling process), and it is unlikely that a real world scenario such as the one simulated by the official testing set will have a similar distribution, leading to skewed decisions, we imposed a threshold on the confidence of the classifiers: if a model is really confident that a window of writings belongs to a positive subject, we can classify the subject as positive. To put it simply, one of three situations may occur:

1. the window of writings was classified as positive and the confidence in the decision meets the required threshold;
2. the window of writings was classified as positive but the confidence in the decision does not meet the required threshold;
3. the window of writings was classified as negative.

The subject is only classified as positive in case 1, while cases 2 and 3 lead to a negative classification. The thresholds were selected for each model by measuring the F1-score achieved with different random threshold values at round 100 (i.e., when 100 writings have been processed), on our previously unseen validation set.

Table 5
Performance of the chosen writing window models when classifying users in task 1, using the best thresholds found.

Run | Model      | Features | Threshold | Precision | Recall | F1
1   | LR         | BoW      | 0.80      | 0.91      | 0.97   | 0.94
2   | SVM        | WE       | 0.85      | 0.86      | 0.97   | 0.91
3   | LR         | LM       | 0.90      | 0.73      | 1.00   | 0.85
4   | Ensemble 1 | All      | None      | 0.61      | 1.00   | 0.76
5   | Ensemble 2 | All      | 0.80      | 0.97      | 1.00   | 0.99

Table 6
Performance of the chosen writing window models when classifying users in task 2, using the best thresholds found.

Run | Model      | Features | Threshold | Precision | Recall | F1
1   | NB         | BoW      | 0.90      | 0.86      | 0.84   | 0.85
2   | LR         | WE       | 0.98      | 0.66      | 0.98   | 0.79
3   | SVM        | LM       | 0.98      | 0.81      | 0.91   | 0.86
4   | Ensemble 1 | All      | None      | 0.52      | 1.00   | 0.68
5   | Ensemble 2 | All      | 0.92      | 0.78      | 0.93   | 0.85

The event allowed up to 5 runs for each task. We decided to take advantage of this by running the best window classifier for each approach (BoW, WE and LM) as the models of the first 3 runs, and ensemble methods relying on the same models for runs 4 and 5. The condition the model for run 4 needs to meet to classify a subject as positive is to have any of the first 3 runs classify the same user as positive. The model for run 5 sums the decision confidence of the original 3 models with a positive output (contributing 0 if their classification was negative) and divides the summed value by 3 to get an average confidence; if it passes a custom threshold, the final output is positive. This logic is easier to understand with the illustration in Figure 1.

Figure 1: Illustration of the decision process followed in run 5 for a given window of writings.

To put it simply, in order to output a value of 1, the conditions for each run's decisions are:
1. the BoW model classifies a window as 1 with a confidence higher than a threshold value;
2. the WE model classifies a window as 1 with a confidence higher than a threshold value;
3. the LM model classifies a window as 1 with a confidence higher than a threshold value;
4. any of the previous models provides a final decision of 1;
5. the confidences of the first 3 runs that had a positive result are summed and divided by 3; this averaged confidence must be higher than a threshold value.

The event's server also expects to receive a score showing how confident each model is in its decision. For this we simply send the model's estimated probability of the predicted output class at each inference. Tables 5 and 6 show the thresholds and corresponding metrics on the held-out validation set, following the evaluation logic that a single positive classification at any time is enough to lead to a final positive classification.

5. Results and discussion
Despite showing good results in our experiments, our approach proved ineffective in this evaluation context. When the number of writings rises dramatically compared to the one used when tuning the thresholds (roughly 1000 or 500 vs 100), it is only natural that some positive classifications will be made where they shouldn't, resulting in a high number of false positives and a high recall at the expense of a low precision and F1 score, since we are following a single-positive-window user classification protocol. This is especially obvious in the results of task 2, where half as many writings were processed and the results were significantly better than those of task 1, despite the opposite being observed consistently throughout the training process.

Table 7
Comparison of the performance on task 1 upon official evaluation (processing roughly 1000 posts) vs when mimicking the tuning scenario of only analyzing the first 100 writings.

Run | P    | R    | F1   | F-Latency | P@100 | R@100 | F1@100 | F-Latency@100
1   | 0.09 | 0.99 | 0.17 | 0.17      | 0.15  | 0.95  | 0.27   | 0.26
2   | 0.07 | 1.00 | 0.13 | 0.12      | 0.10  | 0.99  | 0.19   | 0.19
3   | 0.05 | 1.00 | 0.10 | 0.10      | 0.06  | 1.00  | 0.12   | 0.12
4   | 0.05 | 1.00 | 0.10 | 0.09      | 0.06  | 1.00  | 0.11   | 0.11
5   | 0.19 | 0.99 | 0.32 | 0.32      | 0.31  | 0.91  | 0.47   | 0.46

5.1. Task 1: Pathological gambling
The official evaluation of the server submissions showed results significantly different from those observed during the training and testing process. The left half of Table 7 displays the final precision, recall, F1 and latency-weighted F1 obtained during the event for task 1, drastically different from those previously seen in our testing, displayed in Table 5. We suspected that this discrepancy may have been caused by multiple factors:

1. the selected confidence thresholds were tuned for the first 100 windows of writings of the users in our test set, a big difference from the number of windows received from the server (roughly 1000);
2. the user classification protocol's confidence thresholds were overfitting on the limited validation data;
3. the user classification protocol chosen might just be ineffective.

Upon further inspection, using the golden truth file received post-submission, we see a clear improvement across the board in F1 scores by simply reducing the number of writings evaluated to 100, mimicking the scenario used in the tuning process. These results, shown in the right half of Table 7, while definitely better, were still lacking, showing signs that the main issue lay elsewhere. Since we had already established that item number 1 on our list of concerns, while easily fixable, had an impact on the results, we decided to address concern number 2.
In order to do so, we changed the threshold tuning process. While before we were tuning the thresholds using a validation subset of the training data (since the rest was used to train the writing window classifiers), we changed the procedure to a 5-fold cross-validation (over the training + validation subsets) that also includes the data left out during the undersampling process (retraining the full models for every fold to avoid information leakage), to better simulate a real life distribution of positive to negative samples. When tested with the official testing set, we noticed significantly better results, shown in Table 8, while still only analyzing 100 posts per subject. This improved tuning process results in higher threshold values, significantly improving the precision of our models at the cost of a worse recall, but leading to improvements in the F1 and F-latency scores across the board, with the top performer, run 1, achieving a remarkable 0.87 F1.

Table 8
Performance on the first 100 writings of the official testing set of task 1 using the same user classification protocol but with properly tuned confidence thresholds.

Run | P@100 | R@100 | F1@100 | F-Latency@100
1   | 0.94  | 0.81  | 0.87   | 0.86
2   | 0.25  | 0.67  | 0.36   | 0.35
3   | 0.32  | 0.89  | 0.47   | 0.46
4   | 0.21  | 0.91  | 0.34   | 0.34
5   | 0.96  | 0.56  | 0.70   | 0.67

Table 9
Comparison of the performance on task 2 upon official evaluation (processing roughly 500 posts) vs when mimicking the tuning scenario of only analyzing the first 100 writings.

Run | P    | R    | F1   | F-Latency | P@100 | R@100 | F1@100 | F-Latency@100
1   | 0.22 | 0.95 | 0.36 | 0.35      | 0.31  | 0.90  | 0.46   | 0.45
2   | 0.09 | 0.97 | 0.17 | 0.16      | 0.10  | 0.95  | 0.19   | 0.18
3   | 0.17 | 0.97 | 0.29 | 0.28      | 0.23  | 0.89  | 0.37   | 0.35
4   | 0.09 | 0.99 | 0.17 | 0.16      | 0.10  | 0.97  | 0.18   | 0.18
5   | 0.38 | 0.86 | 0.53 | 0.49      | 0.47  | 0.76  | 0.58   | 0.55

Table 10
Performance on the first 100 writings of the official testing set of task 2 using the same user classification protocol but with properly tuned confidence thresholds.

Run | P@100 | R@100 | F1@100 | F-Latency@100
1   | 0.61  | 0.63  | 0.62   | 0.60
2   | 0.13  | 0.81  | 0.22   | 0.21
3   | 0.54  | 0.67  | 0.60   | 0.56
4   | 0.14  | 0.91  | 0.24   | 0.23
5   | 0.74  | 0.29  | 0.41   | 0.38

5.2. Task 2: Depression
Since we followed a similar approach in task 2, the disparity between results in testing and in the official evaluation was also present in this case (Table 9), although, despite the worse performance in training, the combination of already stricter thresholds and the reduced number of writings analyzed brought better results when compared to those of task 1. Naturally, after seeing the improvements made in task 1's results by addressing concerns number 1 and 2, we followed the same improvement procedures here and also noticed significant boosts in performance (Table 10). Despite achieving lower recall values, leading to undesirable false negatives, we achieve better precision values, with run 1 being the best performer yet again, run 3 a close second, and run 2 having the worst performance.

Regardless of the approach, it seems that classifying depressed subjects through their social media writings is inherently harder than classifying pathological gamblers. The contrast in performance between the BoW and LM methods (a 0.40 F1 difference in task 1 and a 0.02 difference in task 2) suggests that pathological gamblers provide cues to their condition that are more explicit, while depressed subjects tend to be more subtle, explaining why the performance of methods relying on a specific vocabulary (BoW) and more complex models (LM) had closer results in task 2.

6. Conclusions and future work
Several points regarding the window classifiers can be drawn from our experiments. First, despite the huge variation in the number of textual features and in the simplicity or complexity of their extraction process, results did not vary as much as expected across approaches in training. However, they did vary unexpectedly when using the official testing data, where the top performers were at either end of the complexity spectrum, with BoW models performing the best, followed by LM models. The models relying on distributional semantics word embeddings seemed to falter even with the properly tuned confidence thresholds, indicating that perhaps our vectorization approach of averaging the token embeddings in each window was not the best. Perhaps the choice of a larger language model could also have resulted in better F1 values in the case of the LM approach. The inclusion of sentiment analysis features in general did not have a great effect on classification, showing little to no improvement or even degrading performance in some cases.

The results seem to indicate that this single-window confidence protocol can work, but the choice of the threshold value is crucial and it needs to be tuned specifically to the number of writings that will be analyzed per subject. In the future, we wish to experiment with other protocols (concern number 3), as we feel that there is a lot of room for exploration at this stage of the classification, ranging from simple conditions such as fixed absolute amounts or ratio thresholds of positive classifications, to more advanced deep learning based sequence classification models that find their own criteria during training. We would also like to revisit approach 2 (WE), to determine the cause of its ineffectiveness in our last experiments.

References
[1] M. J. Paul, M. Dredze, Social monitoring for public health, Synthesis Lectures on Information Concepts, Retrieval, and Services 9 (2017) 1–183. URL: https://doi.org/10.2200/S00791ED1V01Y201707ICR060. doi:10.2200/S00791ED1V01Y201707ICR060.
[2] D. E. Losada, F. Crestani, A test collection for research on depression and language use, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 9822 LNCS (2016) 28–39. URL: https://link.springer.com/chapter/10.1007/978-3-319-44564-9_3. doi:10.1007/978-3-319-44564-9_3.
[3] F. Sadeque, D. Xu, S. Bethard, Measuring the latency of depression detection in social media, WSDM 2018 - Proceedings of the 11th ACM International Conference on Web Search and Data Mining (2018) 495–503. doi:10.1145/3159652.3159725.
[4] J. Parapar, P. Martín-Rodilla, D. E. Losada, F. Crestani, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction. 13th International Conference of the CLEF Association, CLEF 2022, Springer International Publishing, Bologna, Italy.
[5] J. Parapar, P. Martín-Rodilla, D. E. Losada, F. Crestani, Overview of eRisk 2021: Early risk prediction on the internet, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 12880 LNCS (2021) 324–344. doi:10.1007/978-3-030-85251-1_22.
[6] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of tricks for efficient text classification, 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017 - Proceedings of Conference 2 (2016) 427–431. URL: https://arxiv.org/abs/1607.01759v3. doi:10.48550/arxiv.1607.01759.
[7] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, T. Mikolov, FastText.zip: Compressing text classification models (2016). URL: https://arxiv.org/abs/1612.03651v1. doi:10.48550/arxiv.1612.03651.
[8] D. E. Losada, F. Crestani, J. Parapar, eRisk 2017: CLEF lab on early risk prediction on the internet: Experimental foundations, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 10456 LNCS (2017) 346–360. doi:10.1007/978-3-319-65813-1_30.
[9] D. E. Losada, F. Crestani, J. Parapar, Overview of eRisk: Early risk prediction on the internet, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 11018 LNCS (2018) 343–361. doi:10.1007/978-3-319-98932-7_30.
[10] E. Campillo-Ageitos, H. Fabregat, L. Araujo, J. Martinez-Romo, NLP-UNED at eRisk 2021: self-harm early risk detection with TF-IDF and linguistic features, in: Working Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum, 2021.
[11] J. Pennington, R. Socher, C. D. Manning, GloVe: Global vectors for word representation, in: Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543. URL: http://www.aclweb.org/anthology/D14-1162.
[12] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using siamese BERT-networks, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2019. URL: https://arxiv.org/abs/1908.10084.
[13] W. Wang, F. Wei, L. Dong, H. Bao, N. Yang, M. Zhou, MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers, 2020. arXiv:2002.10957.