Word Embeddings and Linguistic Metadata at the CLEF 2018 Tasks for Early Detection of Depression and Anorexia

FHDO Biomedical Computer Science Group (BCSG)

Marcel Trotzek (1), Sven Koitka (1,2,3), and Christoph M. Friedrich (1,4)

(1) University of Applied Sciences and Arts Dortmund (FHDO), Department of Computer Science, Emil-Figge-Str. 42, 44227 Dortmund, Germany
    mtrotzek@stud.fh-dortmund.de, sven.koitka@fh-dortmund.de, christoph.friedrich@fh-dortmund.de
(2) TU Dortmund University, Department of Computer Science, Germany
(3) Department of Diagnostic and Interventional Radiology and Neuroradiology, University Hospital Essen, Germany
(4) Institute for Medical Informatics, Biometry and Epidemiology (IMIBE), University Hospital Essen, Germany

Abstract. Developing methods for the early detection of mental disorders like depression and anorexia based on written text has become an important aspect with the rise of social media platforms. The CLEF 2018 eRisk shared task consists of two subtasks focused on the detection of these two disorders, and FHDO Biomedical Computer Science Group (BCSG) has submitted results obtained from four machine learning models as well as from a final late fusion ensemble. This paper describes these models based on user-level linguistic metadata, Bags of Words (BoW), neural word embeddings, and Convolutional Neural Networks (CNN). BCSG has achieved top performance according to ERDE_50 and F1 score in both subtasks.

Keywords: depression, early detection, linguistic metadata, convolutional neural networks, word embeddings

1 Introduction

This paper describes the participation of FHDO Biomedical Computer Science Group (BCSG) in the Conference and Labs of the Evaluation Forum (CLEF) 2018 eRisk task for early detection of depression and anorexia [11, 13]. BCSG submitted results obtained from four different models and a late fusion ensemble of three of these models. These models as well as the findings concerning the dataset are described in this paper, and an outlook on possible improvements and future research is given. The work described in this paper builds on this team's previous participation in the eRisk 2017 pilot task for early detection of depression [27] and on further research based on the same dataset [28].

2 Related Work

Studies concerning the effect of mental state on the language used by a person have already shown various connections, beginning with observations of more frequent use of first person singular pronouns in the spoken language of depression patients [4, 29]. More recent studies found, for example, an elevated use of the word "I" in particular and more negative emotion words in essays by depressed college students [19], more verbs in past tense and pronouns in general in the speech of Russian depression patients [25], and a more frequent use of absolutist words (e.g. absolutely, completely, every, nothing) in forums related to depression, anxiety, or suicidal ideation than in unrelated forums or forums about asthma, diabetes, and cancer [2]. Results like these have led to the development of tools that allow researchers and therapists to evaluate written texts with a focus on the author's mental state. One such tool is the Linguistic Inquiry and Word Count (LIWC) software [26], which calculates a total of 93 features for any given text document based on a dictionary. Similarly, the Differential Language Analysis Toolkit (DLATK) [22] was published as an open-source Python library for text analysis with a focus on psychological, health, and social aspects.
First results in the area of early detection of depression based on written social media texts have been reported as part of the eRisk 2017 pilot task [12]. Similar research without the early detection aspect has previously been done, for example, at the CLPsych shared task for the detection of depression and PTSD on Twitter [5]. In the same domain as this task, data from reddit.com has recently been utilized to successfully detect messages concerning anxiety [24].

3 Datasets and Tasks

Similar to the task in 2017, the datasets of both subtasks consist of messages obtained from the social media platform reddit.com. The training data of the depression subtask is equivalent to the full training and test data of this previous task, while the anorexia subtask is based on completely new messages. An especially interesting aspect of reddit is that it allows users to create communities with specific topics, called subreddits. There exists a wide variety of these communities, including some that are very active and relevant from a depression detection perspective, like /r/depression (http://www.reddit.com/r/depression, accessed on 2018-04-02), which is mainly used by people struggling with depression.

The messages contained in both datasets can consist of a separate title and text field depending on the type of message: users can post content in the form of links or images (only title; the link or image itself is not included), text content (title and optional text), or comments on another message (only text). Some messages in both datasets contain neither text nor title and can therefore be discarded.

The number of documents per user ranges between 10 and 2,000. In every week of the test phase, a chunk of 10% of each user's messages is supplied to the participants in chronological order, resulting in 1 to 200 documents per user each week. In both subtasks there are exceptions to this general rule because one anorexia training user (subject2167 of the control group), three depression test users (subject5161, subject5301, and subject8719), and two anorexia test users (subject4169 and subject7483) do not have any messages in the final week, so only nine chunks of messages exist for these users.

Table 1 displays the main characteristics of the two datasets. The average number of characters and unigrams per document was calculated based on a concatenation of the text and title field. To calculate the number of unigrams, the same preprocessing and tokenization as described in sections 4.3 and 4.4 was utilized, retaining only words that occur in the writings of at least two users (a sketch of this filtering step follows Table 1).

Table 1. Characteristics of the training and test datasets for both subtasks.

                               Depression             Anorexia
                           Training      Test    Training      Test
Users                           887       820         152       320
Positive/Negative           135/752    79/741      20/132    41/279
Documents                   531,394   544,447      84,834   168,507
Comments (empty title)      367,439   366,845      61,201   130,631
One-liners (empty text)     141,849   147,197      15,768    27,228
Empty documents                  91       219          39        90
Avg. documents per user      599.09    663.96      558.12    526.58
Avg. characters per doc.     174.54    197.47      178.11    171.36
Avg. unigrams per doc.        30.83     34.53       31.59     30.95
Unique unigrams              85,558    94,569      31,128    45,727
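The following is a minimal sketch of this vocabulary filtering step; function and variable names are illustrative and not taken from the original implementation, and the actual tokenizer additionally preserves emoticons, punctuation, and words with special characters (see section 4.3).

```python
from collections import Counter

def vocabulary(user_token_lists, min_users=2):
    # Count in how many users' writings each unigram appears and keep
    # only tokens used by at least `min_users` distinct users, as done
    # for the unigram statistics in Table 1.
    user_counts = Counter()
    for tokens in user_token_lists:      # one token list per user
        user_counts.update(set(tokens))  # count each user at most once
    return {t for t, n in user_counts.items() if n >= min_users}
```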
3.1 Hand-crafted User Features

The participation of this team in the eRisk 2017 pilot task was based on a set of user-level linguistic metadata features that were used as additional input for every model. In this second eRisk shared task, only one submitted model (see section 4.1) and the final late fusion ensemble (see section 4.5) use metadata features.

All text-based features have again been calculated based on a concatenation of the text and title field of each message. The set consists of the same features described in the previous working notes paper [27] and an additional set of ten features obtained from the Linguistic Inquiry and Word Count (LIWC) software [26]. These LIWC features have been chosen based on their correlation with the class label in the depression subtask training data. Another addition to the original feature set is the average length of the title field, which was also not used in 2017.

Figure 1 illustrates the correlation matrix of the complete metadata feature set and includes the class label information to indicate the relevance of each feature. Although some features (especially the pronoun counts) seem redundant at first sight, all of the original features are preserved: they are based on Part of Speech (POS) tagging using the Python NLTK framework (http://www.nltk.org/book/ch05.html, accessed on 2018-04-02), while the LIWC features are based on a lexicon that also includes abbreviations and common misspellings. Most of the described features are averaged over all documents per user to obtain the final metadata feature vector, except for the counts of specific phrases like medication names or mentioned diagnoses, which are summed. Finally, all averaged features are standardized to zero mean and unit variance, and the summed features are converted to flags with a value of 1 for users who have used such a phrase in any document and -1 otherwise, as sketched below.

Fig. 1. Correlation matrix of all user features including the class information (non-depressed/depressed) based on the depression subtask training data. This plot is best viewed in electronic form. (Feature groups shown in the matrix: parts of speech, eRisk 2017 features, readability scores, phrase counts, and LIWC features.)
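The aggregation just described can be summarized in a short sketch; the names are illustrative, and the actual feature extraction is described in [27].

```python
import numpy as np

def finalize_features(avg_features, summed_counts):
    # avg_features: (users x features) matrix of per-user averages.
    # summed_counts: (users x phrases) matrix of summed phrase counts.
    # Averaged features are standardized to zero mean and unit variance
    # (a zero-variance feature would need guarding in practice), while
    # summed phrase counts become +1/-1 flags.
    standardized = (avg_features - avg_features.mean(axis=0)) / avg_features.std(axis=0)
    flags = np.where(summed_counts > 0, 1.0, -1.0)
    return np.hstack([standardized, flags])
```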
4 Chosen Models

This section describes the five models that have been used to classify the test users of both subtasks. The models for both tasks are completely identical, use the same set of metadata features, and only vary slightly in their prediction thresholds as described below. In comparison to this team's participation in the eRisk 2017 pilot task, the prediction thresholds were simplified: for each model, only a single prediction threshold value was chosen based on cross-validation on the training data to indicate whether a subject is classified as depressed. The number of documents already processed for a user is not used anymore, as the new models are less prone to predicting many false positives after processing only a few documents. In addition, non-depressed predictions are now only submitted in the final week because an early prediction of these cases has no effect on the score and later writings might still identify them as depressed. Selecting viable prediction thresholds is difficult, as a balanced result according to both ERDE_o and F1 is often hard to achieve. The goal for this participation was to use rather low thresholds to find depressed cases as early as possible without generating too many false positives.

In contrast to the previous participation of this team, only the first model and the final ensemble utilize the updated set of user metadata features described in section 3.1. The bag of words model, which achieved the best overall F1 as well as the second-best ERDE_5 and ERDE_50 scores in the previous task [12], is reused with and without metadata features. The Recurrent Neural Network (RNN) using a Long Short-Term Memory (LSTM) layer was not evaluated again and was instead replaced with a Convolutional Neural Network (CNN). This decision was based on further research using the eRisk 2017 dataset [28], which showed that the CNN model was able to outperform the LSTM models while being easier to configure and less prone to overfitting.
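The decision protocol shared by all models can be stated compactly; this is a sketch with illustrative function and parameter names, not code from the submitted system.

```python
def weekly_decision(probability, week, threshold, final_week=10):
    # Emit a positive prediction as soon as the model's single threshold
    # is reached; negative decisions are postponed until the final week,
    # since an early negative cannot improve the score and a user might
    # still turn out to be positive in later chunks.
    if probability >= threshold:
        return 1        # at risk (depression/anorexia)
    if week == final_week:
        return 0        # control, decided only at the end
    return None         # no decision yet, wait for the next chunk
```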
4.1 Bag of Words Metadata Ensemble - BCSGA

The first model is mostly equivalent to the first model used in this team's participation in eRisk 2017, except for the extended set of metadata features. It utilizes an ensemble of Bag of Words (BoW) classifiers with different term weightings and n-grams that are calculated on a user basis by first concatenating all documents (text and title) of a user. The term weighting for bags of words can generally be split into three components: a term frequency component or local weight, a document frequency component or global weight, and a normalization component [21]. A general term weighting scheme can therefore be given as [30]:

    t_{t,d} = l_{t,d} \cdot g_t \cdot n_d ,    (1)

where t_{t,d} is the calculated weight for term t in document d, l_{t,d} is the local weight of term t in document d, g_t is the global weight of term t for all documents, and n_d is the normalization factor for document d. A common example would be using the term frequency (tf) as local weight and the inverse document frequency (idf) as global weight, resulting in tf-idf weighting [21]. All ensemble models use the l2-norm for n_d but varying local and global weights.

The first one uses a combination of uni-, bi-, tri-, and 4-grams obtained from the training data. To build this first BoW, the 200,000 {1, 2, 3, 4}-grams with the highest Information Gain (IG) are selected, given by [14, p. 272]:

    I(U, C) = \sum_{e_t \in \{0,1\}} \sum_{e_c \in \{0,1\}} P(U = e_t, C = e_c) \cdot \log_2 \frac{P(U = e_t, C = e_c)}{P(U = e_t) \cdot P(C = e_c)} ,    (2)

with the random variable U taking values e_t = 1 (the document contains term t) and e_t = 0 (the document does not contain term t), and the random variable C taking values e_c = 1 (the document is in class c) and e_c = 0 (the document is not in class c). The raw term frequency of the resulting n-grams is used as local weight, while their IG score is used as global weight.

The second BoW utilizes a modified version of tf, namely the augmented term frequency (atf) [30], multiplied by idf:

    atf\text{-}idf(t, d) = \left( a + (1 - a) \cdot \frac{tf_t}{\max(tf)} \right) \cdot \log \frac{n_d}{df(d, t)} ,    (3)

with max(tf) being the maximum frequency of any term in the document, n_d the total number of documents, and the smoothing parameter a, which is set to 0.3 for this model. This BoW, as well as the third one, contains all unigrams of the training corpus. The local weight of the third model consists of the logarithmic term frequency (logtf) [16], and the global weight is given by the relevance frequency (rf) [9], which can be combined as:

    logtf\text{-}rf(t, d) = (1 + \log(tf)) \cdot \log_2 \left( 2 + \frac{df_{t,+}}{\max(1, df_{t,-})} \right) ,    (4)

where df_{t,+} and df_{t,-} are the numbers of documents in the depressed and non-depressed class that contain term t.

The final model of this ensemble uses the hand-crafted user features described in section 3.1. All three bags of words and the hand-crafted features were each used as input for a separate logistic regression classifier. Due to the imbalanced class distribution, a modified class weight was used for these classifiers, similar to the original task paper [11], to increase the cost of false negatives. It was calculated for the non-depressed class as 1/(1 + w) and for the depressed class as w/(1 + w), with w = 2 for all four models. The final output probabilities were calculated as the unweighted mean of all four logistic regression probabilities.
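A condensed sketch of this ensemble follows. It illustrates the structure (four logistic regression classifiers with the modified class weights, averaged), not the exact weightings: scikit-learn's standard tf-idf stands in for the IG-, atf-idf-, and logtf-rf-weighted bags of words, which would require custom transformers, and the in-sample fitting is for illustration only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

w = 2  # class weight parameter from section 4.1
CLASS_WEIGHT = {0: 1 / (1 + w), 1: w / (1 + w)}

# Stand-ins for the three custom-weighted bags of words.
VECTORIZERS = [
    TfidfVectorizer(ngram_range=(1, 4), max_features=200000),
    TfidfVectorizer(ngram_range=(1, 1)),
    TfidfVectorizer(ngram_range=(1, 1), sublinear_tf=True),  # 1+log(tf), akin to logtf
]

def ensemble_probabilities(user_texts, labels, metadata):
    # user_texts: one concatenated string per user; metadata: user feature matrix.
    probs = []
    for vec in VECTORIZERS:
        X = vec.fit_transform(user_texts)
        clf = LogisticRegression(class_weight=CLASS_WEIGHT).fit(X, labels)
        probs.append(clf.predict_proba(X)[:, 1])
    meta_clf = LogisticRegression(class_weight=CLASS_WEIGHT).fit(metadata, labels)
    probs.append(meta_clf.predict_proba(metadata)[:, 1])
    # Final score: unweighted mean of the four classifier probabilities.
    return np.mean(probs, axis=0)
```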
Each week and for both tasks, this ensemble predicted any user with a probability above or equal to 0.4 as depressed, while in the final week all users with a probability below 0.4 were predicted as non-depressed.

4.2 Bag of Words Ensemble - BCSGB

The second model is similar to the first one, but it only includes the three bags of words in the ensemble and disregards the metadata features. Again, for the depression subtask any test subject with a probability of at least 0.4 was predicted as depressed, while users with a probability below 0.4 were predicted as non-depressed in the final week. The prediction threshold for the anorexia subtask was set to 0.3 in this case.

4.3 CNN with GloVe Embeddings - BCSGC

The third model consists of a Convolutional Neural Network (CNN) [10]. CNNs have been utilized by many recent studies to achieve outstanding results, especially in the area of image classification, and are generally viable for data with a grid-like structure [6]. The implementation is based on TensorFlow [1], and the input of this CNN is based on GloVe [18] word embeddings: a set of 50-dimensional word embeddings pre-trained on Wikipedia and news data (http://nlp.stanford.edu/projects/glove, accessed on 2018-03-30) is used to produce a matrix of word vectors for the first 100 words of each document in the dataset. Prior to this vectorization, the documents are preprocessed and tokenized in a way that preserves, for example, emoticons, punctuation, words including special characters, and generally all tokens that occur in the documents of at least two users. Zero-padding is used for documents with fewer than 100 words. Each document is therefore represented by a 100 × 50 matrix and is classified independently. Since the number of words per document in the training data ranges between 1 (when ignoring the empty documents) and 6,487 but has a mean of 34.58 according to the tokenization done for this work, the limitation to 100 words (or even fewer to minimize the necessary zero-padding) is viable.

Fig. 2. Architecture of the convolutional neural network used for the models BCSGC and BCSGD (with 300 instead of 50 dimensional word vectors) [28]. (Recoverable details from the figure: 100 filters of size 2 x embedding dimension producing 99 x 1 feature maps, two feature maps per filter due to CReLU, 1-max pooling, dropout of 0.4, and fully connected layers of sizes 200, 100, 50, and 2.)

The text classification network architecture used for this work is displayed in Figure 2, which shows the use of 300-dimensional word vectors (and therefore 100 × 300 document matrices) as used for the next model BCSGD. It is similar to the one-layer CNN for sentence classification described by Zhang and Wallace [31] and consists of only a single convolutional layer with 100 filters of height 2 and a width corresponding to the word embedding dimension, and uses 1-max pooling to extract a single value from each filter. Due to the usage of Concatenated Rectified Linear Units (CReLU) [23] as activation, this results in a 200-dimensional vector per document that is propagated through four fully connected layers, of which the first applies dropout to its output and the final one applies softmax. The training of this and the following CNN model utilized Adam [8] to minimize the cross-entropy loss. Both models were trained using a learning rate of 1e-4 and a batch size of 10,000 documents. BCSGC was trained for 30 epochs.
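The original implementation used TensorFlow directly [1]; the following Keras sketch only illustrates the described architecture. The activations of the hidden fully connected layers are not specified in the text, so ReLU is assumed here.

```python
import tensorflow as tf

def crelu(x):
    # Concatenated ReLU [23]: concatenating ReLU(x) and ReLU(-x)
    # doubles the 100 feature maps to 200.
    return tf.concat([tf.nn.relu(x), tf.nn.relu(-x)], axis=-1)

def build_cnn(seq_len=100, emb_dim=50):
    inputs = tf.keras.Input(shape=(seq_len, emb_dim))
    # Single convolution: 100 filters of height 2 spanning the full
    # embedding width (Conv1D over the word axis achieves exactly this).
    conv = tf.keras.layers.Conv1D(filters=100, kernel_size=2)(inputs)
    features = tf.keras.layers.Lambda(crelu)(conv)
    # 1-max pooling keeps one value per feature map: a 200-dim vector.
    pooled = tf.keras.layers.GlobalMaxPooling1D()(features)
    x = tf.keras.layers.Dense(200, activation="relu")(pooled)  # activation assumed
    x = tf.keras.layers.Dropout(0.4)(x)
    x = tf.keras.layers.Dense(100, activation="relu")(x)
    x = tf.keras.layers.Dense(50, activation="relu")(x)
    outputs = tf.keras.layers.Dense(2, activation="softmax")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
                  loss="sparse_categorical_crossentropy")
    return model
```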
To obtain a final prediction per user, the 98th percentile of the outputs from all of the user's documents is calculated. This ensures that even depressed users who have only very few documents with a high probability can be correctly predicted. For both subtasks, any subject with a final probability of at least 0.4 was predicted as depressed in each week, while probabilities below 0.4 again resulted in a non-depressed prediction in the final week.

4.4 CNN with fastText Embeddings - BCSGD

The second CNN model is based on the same architecture as the previous one but utilizes 300-dimensional fastText [7, 3, 15] word embeddings. To evaluate word vectors that are more related to the domain of reddit messages, or social media in general, a new fastText model was trained specifically for this task. A dataset of all 1.7 billion reddit comments written between October 2007 and May 2015 (https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment/, accessed on 2018-03-30) was used as training corpus for this model and preprocessed similarly to the description in section 4.3, but without removing infrequent words yet. In addition, any references to reddit users (in the form of /u/) were replaced by the generic phrase "ref_user" to prevent any connections to actual users in the resulting word embeddings. Similarly, any reference to a subreddit (in the form of /r/) was replaced by the phrase "ref_subreddit_" to be able to learn a vector representation of subreddits as well, which can be regarded as their topic. No stemming or stopword removal of any kind was done, and messages in languages other than English were removed based on stopword counts. The final corpus of 1.37 billion reddit comments was used to train 6 million word vectors for words that occur at least five times in the corpus. Additional details about this model and the utilized CNNs can be found in the corresponding paper [28].

Similar to the previous CNN model, the resulting 100 × 300 matrix of word embeddings obtained for each document was classified separately, and the 98th percentile of the outputs was used as the output for the corresponding user. This model was trained for 25 epochs using the same parameters as BCSGC. The prediction threshold for depressed predictions was set to 0.7 for both tasks, leading to a non-depressed prediction for probabilities below 0.7 in the final week.
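The user-level aggregation shared by both CNN models (sections 4.3 and 4.4) is simple enough to state directly; this is a sketch with illustrative names.

```python
import numpy as np

def user_probability(doc_probs):
    # Aggregate per-document CNN outputs into a single user-level score.
    # The 98th percentile ignores the bulk of unremarkable documents but
    # still responds if a user has even a few high-probability documents.
    return np.percentile(doc_probs, 98)
```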
4.5 CNN and Bag of Words Metadata Ensemble - BCSGE

The final model consists of a simple late fusion ensemble that has been calculated as the unweighted mean of the outputs obtained from models BCSGA, BCSGC, and BCSGD: the bag of words including metadata and the two CNN models. Although these outputs have not been calibrated (e.g. by using Platt scaling [17]) and can therefore not be seen as directly comparable probabilities, previous experiments [28] have shown that such an ensemble was able to improve the results of the separate models. Again, a prediction threshold of at least 0.4 was used for the depression detection subtask, while a threshold of 0.5 was utilized for the anorexia subtask.

5 Results

Before examining the results of the described models in the two subtasks, it is necessary to analyze the utilized ERDE_o metric for early detection systems. Since this metric is based on the absolute number of documents read per user before a true positive prediction, but these documents have to be read in ten equally sized chunks, the score is highly dependent on the number of documents available per user. Because at least 10% of each user's documents have to be read by all participants, it is impossible to predict some users early enough, depending on the parameter o that describes after how many documents the penalty for late predictions grows. This fact has already been described in more detail in another paper [28].

Table 2 displays the best ERDE_5 and ERDE_50 scores that are possible for the test data of the depression and anorexia subtasks. These results are based on a perfect prediction in the first week of the tasks. As described in the above-mentioned paper, only test users with fewer than 100 documents (fewer than 10 per chunk) have any effect on the ERDE_5 score. This means that correctly predicting only 26 of the 79 depressed test users in the first week and ignoring all others still leads to an ERDE_5 score of 7.78 (F1 = 0.50), while predicting only 12 of the 41 anorexia users in the first week also leads to an ERDE_5 score of 10.23 (F1 = 0.45). ERDE_5 alone, without the additional F1 score, is therefore hard to interpret.

Table 2. Best possible ERDE_o scores of both subtasks based on a perfect prediction in the first week.

           Depression   Anorexia
ERDE_5           7.78      10.23
ERDE_50          3.79       4.05

To examine the weekly predictions obtained from the described models, Figures 3 and 4 show the cumulative number of positive predictions for the two subtasks and also visualize the proportion of true positives. For the depression subtask, this shows that the ensemble indeed led to the most true positives but also many false positives. BCSGD seems to perform worse at first sight but actually achieved a good balance between true and false positives because of its higher prediction threshold. As the comparison of both figures shows, the anorexia subtask was much easier using the same models, and it was possible to detect nearly all positive samples without too many false positives. Both examinations show a steady progression over the ten weeks for all models.

Fig. 3. Cumulative number of depressed predictions (blue plus gray bars) and proportion of true positives (blue bars only) per model after each week of the depression subtask. A horizontal line marks the 79 depressed samples in the test data.

Tables 3 and 4 show the official results [13] of BCSG's models for both subtasks and also include the alternative early detection scores F_latency [20] and ERDE_o% [28]. According to the suggestion in the paper, F_latency was calculated using a value for the parameter p that fits the true positive cost function P_latency to return a cost of 0.5 for the median number of documents of the positive test users. This results in a value of p = 0.0051 for the depression subtask (median of 216 documents per depressed test user) and p = 0.0042 for the anorexia subtask (median of 260 documents per anorexia test user). In contrast to the standard ERDE_o score, ERDE_o% is calculated based on the percentage of read documents per user and is therefore easier to interpret in a chunk-based task. Additional results by other teams have been added to these tables to include at least the two best results obtained for each score.
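For reference, the fitted p values can be reproduced under the assumption that P_latency has the logistic form proposed by Sadeque et al. [20]; the following is a sketch based on that assumed form, not code from the official evaluation.

```python
import numpy as np

def latency_cost(k, p):
    # Assumed true-positive latency cost from Sadeque et al. [20]:
    # close to 0 for an immediate decision, approaching 1 for late ones.
    return -1.0 + 2.0 / (1.0 + np.exp(-p * (k - 1)))

def fit_p(median_docs):
    # Solving latency_cost(k, p) = 0.5 at the median k gives p = ln(3)/(k - 1).
    return np.log(3) / (median_docs - 1)

print(fit_p(216))  # ~0.0051, depression subtask (median 216 documents)
print(fit_p(260))  # ~0.0042, anorexia subtask (median 260 documents)
```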
While the direct comparison of BCSGA (bags of words with linguistic metadata) and BCSGB (bags of words only) shows that the metadata features result in more positive predictions, the actual number of true positives was only better for the depression subtask and resulted in a better ERDE_5 score but worse ERDE_50 and F1. Similar to the task in 2017, the bag of words ensemble again obtained the best results in the depression subtask, while the CNN based on the self-trained fastText embeddings (BCSGD) and the ensemble using both the bags of words and the CNNs (BCSGE) achieved the best scores in the anorexia subtask. Overall, the models of BCSG achieved the second-best results in ERDE_5 and the best results in all other scores, except for another second-best result according to ERDE_50% in the depression subtask.

Fig. 4. Cumulative number of anorexia predictions (blue plus gray bars) and proportion of true positives (blue bars only) per model after each week of the anorexia subtask. A horizontal line marks the 41 anorexia samples in the test data.

Table 3. Best results of the depression subtask based on the official evaluation and the alternative metrics F_latency and ERDE_o%. The models have been chosen to show at least the two best results achieved in each score.

Model     ERDE_5   ERDE_50     F1   F_latency   ERDE_20%   ERDE_50%
BCSGA       9.21      6.68   0.61        0.47       7.08       5.31
BCSGB       9.50      6.44   0.64        0.52       7.17       5.04
BCSGC       9.58      6.96   0.51        0.41       6.89       4.82
BCSGD       9.46      7.08   0.54        0.41       7.32       6.34
BCSGE       9.52      6.49   0.53        0.43       6.16       4.57
LIIRB      10.03      7.09   0.48        0.39      10.66       5.05
UNSLA       8.78      7.39   0.38        0.25       7.45       6.96
UNSLD      10.68      7.84   0.45        0.37       6.23       4.52
UNSLE       9.86      7.60   0.60        0.45       7.76       5.50

Table 4. Best results of the anorexia subtask based on the official evaluation and the alternative metrics F_latency and ERDE_o%. The models have been chosen to show at least the two best results achieved in each score.

Model       ERDE_5   ERDE_50     F1   F_latency   ERDE_20%   ERDE_50%
BCSGA        12.17      7.98   0.71        0.64       6.54       4.82
BCSGB        11.75      6.84   0.81        0.74       6.02       4.46
BCSGC        13.63      9.64   0.55        0.47       9.48       6.83
BCSGD        12.15      5.96   0.81        0.75       5.48       3.14
BCSGE        11.98      6.61   0.85        0.78       6.45       3.64
LIIRA        12.78     10.47   0.71        0.57      13.05       5.55
PEIMEXB      12.41      7.79   0.64        0.57       6.86       5.61
RKMVERIA     12.17      8.63   0.67        0.59       6.76       6.76
UNSLB        11.40      7.82   0.61        0.54       6.84       6.53
UNSLD        12.93      9.85   0.79        0.63       9.03       6.68

As already described, the ERDE_o score and especially ERDE_5 should be discussed in more detail, because optimizing them can often amount to simply minimizing false positives by predicting very few users at all. A detailed look at the results and the achieved ERDE_5 scores shows that, for example, in the first week of the depression subtask both UNSLA and BCSGA predicted 45 users as depressed, of which 20 were indeed true positives. Still, the resulting ERDE_5 scores differ drastically because the predicted users vary in their total number of documents, even though UNSLA only gained five more true positives over the following nine weeks, while BCSGA already had ten more in the second week and a total of 53 in the end. Similarly, the leading model in ERDE_5 of the anorexia subtask, UNSLB, had 19 true positives in the first week, while BCSGD already had 22, BCSGB had 21, and BCSGA had 20. In summary, ERDE_5 produces highly misleading results because of the varying number of documents per user.
6 Conclusions

Again, the eRisk competition has been a challenging task concerning the early detection of mental health issues based on sequences of social media texts. The depression subtask resulted in similar F1 scores but much better ERDE_o scores than in 2017, based on a test set that was nearly as large as last year's training and test set combined. The results of the anorexia subtask were surprisingly good, which is probably due to the nature of this dataset. Generally, the promising results obtained on the eRisk 2017 test data based only on linguistic metadata [28] could not yet be confirmed in this year's tasks. As already concluded in the same paper, finding a way to successfully integrate the metadata features into the neural network models is an interesting task for future research.

The examination of the task results again shows that a discussion about a meaningful metric should be a priority in the future. Both F_latency and ERDE_o% include interesting ideas to improve the evaluation of early prediction models. F_latency contains a cost function that grows less rapidly and already incorporates the F1 score, which makes it more meaningful when viewed alone. ERDE_o% is more viable for chunk-based shared tasks because it is calculated based on the proportion of read documents per user instead of the absolute number, which leads to results that are easier to interpret than the standard ERDE_o. A combination of these two ideas could be a promising basis for discussions about future early detection tasks.

7 Acknowledgment

The work of Sven Koitka was partially funded by a PhD grant from University of Applied Sciences and Arts Dortmund, Germany.

References

1. Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D.G., Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., Zheng, X.: TensorFlow: A System for Large-Scale Machine Learning. 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI'16), pp. 265–283, Savannah, Georgia, USA (2016)
2. Al-Mosaiwi, M., Johnstone, T.: In an Absolute State: Elevated Use of Absolutist Words is a Marker Specific to Anxiety, Depression, and Suicidal Ideation. Clinical Psychological Science, Prepublished January 5, 2018, DOI: 10.1177/2167702617747074 (2018)
3. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics, Vol. 5, pp. 135–146 (2017)
4. Bucci, W., Freedman, N.: The Language of Depression. Bulletin of the Menninger Clinic, Vol. 45(4), pp. 334–358 (1981)
5. Coppersmith, G., Dredze, M., Harman, C., Hollingshead, K., Mitchell, M.: CLPsych 2015 Shared Task: Depression and PTSD on Twitter. Proceedings of the 2nd Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality (CLPsych'15), pp. 31–39, Denver, Colorado, USA (2015)
6. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press (2016)
7. Joulin, A., Grave, E., Bojanowski, P., Mikolov, T.: Bag of Tricks for Efficient Text Classification. Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, Vol. 2, pp. 427–431 (2016)
8. Kingma, D.P., Ba, J.: Adam: A Method for Stochastic Optimization. Proceedings of the 3rd International Conference on Learning Representations (ICLR), San Diego, California, USA, arXiv preprint arXiv:1412.6980 (2015)
9. Lan, M., Tan, C.L., Low, H.-B.: Proposing a New Term Weighting Scheme for Text Categorization. Proceedings of the 21st National Conference on Artificial Intelligence (AAAI-06), Vol. 6, pp. 763–768, Boston, Massachusetts, USA (2006)
10. LeCun, Y.: Generalization and Network Design Strategies. Technical Report CRG-TR-89-4, University of Toronto (1989)
11. Losada, D.E., Crestani, F.: A Test Collection for Research on Depression and Language Use. Experimental IR Meets Multilinguality, Multimodality, and Interaction: 7th International Conference of the CLEF Association, pp. 28–39. CLEF 2016, Évora, Portugal (2016)
12. Losada, D.E., Crestani, F., Parapar, J.: eRISK 2017: CLEF Lab on Early Risk Prediction on the Internet: Experimental Foundations. Proceedings Conference and Labs of the Evaluation Forum CLEF 2017, Dublin, Ireland (2017)
13. Losada, D.E., Crestani, F., Parapar, J.: Overview of eRisk: Early Risk Prediction on the Internet. Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Ninth International Conference of the CLEF Association (CLEF 2018), Avignon, France (2018)
14. Manning, C.D., Raghavan, P., Schütze, H.: An Introduction to Information Retrieval. Online Edition. Cambridge University Press (2009) Available from: https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf - Accessed on 2018-04-02
15. Mikolov, T., Grave, E., Bojanowski, P., Puhrsch, C., Joulin, A.: Advances in Pre-Training Distributed Word Representations. Proceedings of the International Conference on Language Resources and Evaluation (LREC'18), Miyazaki, Japan (2018)
16. Paltoglou, G., Thelwall, M.: A Study of Information Retrieval Weighting Schemes for Sentiment Analysis. Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pp. 1386–1395. Association for Computational Linguistics (2010)
17. Platt, J.C.: Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods. Advances in Large Margin Classifiers, Vol. 10(3), pp. 61–74 (1999)
18. Pennington, J., Socher, R., Manning, C.D.: GloVe: Global Vectors for Word Representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP'14), ACL, pp. 1532–1543, Doha, Qatar (2014)
19. Rude, S., Gortner, E.-M., Pennebaker, J.: Language Use of Depressed and Depression-Vulnerable College Students. Cognition & Emotion, Vol. 18(8), pp. 1121–1133 (2004)
20. Sadeque, F., Xu, D., Bethard, S.: Measuring the Latency of Depression Detection in Social Media. Proceedings of the 11th ACM International Conference on Web Search and Data Mining (WSDM'18), pp. 495–503, Los Angeles, California, USA (2018)
21. Salton, G., Buckley, C.: Term-Weighting Approaches in Automatic Text Retrieval. Information Processing & Management, Vol. 24(5), pp. 513–523 (1988)
22. Schwartz, H.A., Giorgi, S., Sap, M., Crutchley, P., Ungar, L., Eichstaedt, J.: DLATK: Differential Language Analysis Toolkit. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (EMNLP'17), ACL, pp. 55–60, Copenhagen, Denmark (2017)
23. Shang, W., Sohn, K., Almeida, D., Lee, H.: Understanding and Improving Convolutional Neural Networks via Concatenated Rectified Linear Units. Proceedings of The 33rd International Conference on Machine Learning, Vol. 48, pp. 2217–2225, New York City, New York, USA (2016)
24. Shen, J.H., Rudzicz, F.: Detecting Anxiety through Reddit. Proceedings of the Fourth Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality (CLPsych'17), pp. 58–65, Vancouver, Canada (2017)
25. Smirnova, D., Sloeva, E., Kuvshinova, N., Krasnov, A., Romanov, D., Nosachev, G.: Language Changes as an Important Psychopathological Phenomenon of Mild Depression. European Psychiatry, Vol. 28 (2013)
26. Tausczik, Y.R., Pennebaker, J.W.: The Psychological Meaning of Words: LIWC and Computerized Text Analysis Methods. Journal of Language and Social Psychology, Vol. 29(1), pp. 24–54 (2010)
27. Trotzek, M., Koitka, S., Friedrich, C.M.: Linguistic Metadata Augmented Classifiers at the CLEF 2017 Task for Early Detection of Depression. Working Notes Conference and Labs of the Evaluation Forum CLEF 2017, Dublin, Ireland (2017) Available from: http://ceur-ws.org/Vol-1866/paper_54.pdf - Accessed on 2018-03-29
28. Trotzek, M., Koitka, S., Friedrich, C.M.: Utilizing Neural Networks and Linguistic Metadata for Early Detection of Depression Indications in Text Sequences. arXiv preprint arXiv:1804.07000 [cs.CL] (2018)
29. Weintraub, W.: Verbal Behavior: Adaptation and Psychopathology. Springer Publishing Company (1981)
30. Wu, H., Gu, X.: Reducing Over-Weighting in Supervised Term Weighting for Sentiment Analysis. The 25th International Conference on Computational Linguistics (COLING 2014), pp. 1322–1330, Dublin, Ireland (2014)
31. Zhang, Y., Wallace, B.: A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification. Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Asian Federation of Natural Language Processing, pp. 253–263, Taipei, Taiwan (2017)