Using Surface and Semantic Features for Detecting Early Signs of Self-Harm in Social Media Postings

Linda Achilles, Max Kisselew, Johannes Schäfer, and Ralph Koelle

Institute for Information Science and Natural Language Processing, University of Hildesheim, Hildesheim, Germany
{achilles,kisselew,johannes.schaefer,koelle}@uni-hildesheim.de

Abstract. This paper describes the University of Hildesheim submission to the CLEF eRisk 2020 shared task on detecting early signs of self-harm in social media posts. We introduce four systems that apply different methods to address this task and a fifth ensemble system that combines the four other systems. The first four systems make use of features of different types, such as time intervals between posts, the sentiment of the writings and their semantics, the latter captured by bag-of-words vectors and by contextualized word embeddings in a neural network approach. The results show that while all our systems achieve a high recall, the focus of future work should be further improvement of the precision. All systems and the ensemble model achieve a comparable performance, with Flatency values in the range of 0.367 to 0.424.

Keywords: Self-harm detection · Risk detection · Natural language processing · Word vectors · Contextualized word embeddings · Deep learning · Ensemble system

1 Introduction

Since their advent, Social Media networks have revolutionized the way people interact with each other and have thus attracted academic and industry researchers from different fields of expertise [3]. The content of these networks has also become popular in the field of natural language processing (NLP). Text and meta information of posts have been used, for instance, to predict the Big Five personality traits of microblog users [1]. Later, different kinds of mental disorders came into the focus of the research community.
Facebook language was analyzed to predict depression documented in medical records [7] as well as postpartum depression [4], and Twitter data were evaluated to detect depressed users. The significance of this research led to the emergence of initiatives such as the CLEF eRisk group.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece.

The Early Risk Prediction On The Internet (eRisk) lab1, as part of the CLEF initiative2, was first organized in 2017. The lab's main objective is to evaluate metrics, systems and data collections regarding Social Media users' safety and health on the internet. During the first workshop there was a pilot task on depression detection, and a corresponding data collection was released [12]. During the following years more tasks, such as the early prediction of signs of self-harm or anorexia, have been conducted. In 2020 there were two tasks:

1. Early detection of signs of self-harm
2. Measuring the severity of the signs of depression

The data for the first task is released in rounds. In each round participating teams receive one social media post for each social media author in the collection. The task is to analyze these posts and return a binary decision (0 or 1) for each author indicating whether he or she is at risk of self-harm, in addition to estimating the level of self-harm with a score in the range [0, 10]. A round ends when a team submits its results, which starts the next round. In the second task the data is released as an entire collection for each social media author at once. After processing these writings, the team's system has to fill in a standard depression questionnaire, the Beck Depression Inventory (BDI) [2], for each social media author. This paper describes the approach of the hildesheim team in the first task of CLEF eRisk 2020, the early detection of signs of self-harm in social media posts.
Sections 2 and 3 give more insight into the data collection and the necessary pre-processing as well as an overview of the evaluation metrics in eRisk. We deployed five different systems, two of them making use of surface-based features (see Sections 4.1 and 4.2), two making use of semantic features (see Sections 4.3 and 4.4) and the fifth being an ensemble system (see Section 4.5) that combines the decisions and scores of the other four systems. Each system is evaluated separately in the shared task as runs 0 to 4 of our team. Since the shared task limited the maximum number of allowed runs per team to 5, we were not able to test different configurations of these systems. In Section 4 we present each system in more detail.

2 Data and Pre-processing

In 2019 the shared task's chunk-based data processing approach was changed to an item-by-item option to analyse the data. In 2017 and 2018 participating teams received a chunk of data, a sub-collection of the writings limited by a specific time period, for each social media author. From 2019 on they would receive the data post by post. This second way of processing the social media data gives the participating teams the possibility to make a decision about the state of a specific social media author at any point in time, and it was continued in 2020 [14].

1 Website of the eRisk lab: https://erisk.irlab.org/
2 Website of the CLEF initiative: http://www.clef-initiative.eu/

The data collection process is described in [12], the training data set in [13] and the test data set in [15]. For 2020 the training data set is a collection of XML files, one for each subject (social media author). Each individual XML file comprises a collection of an author's writings in chronological order. Each writing consists of title, text, info and date elements. The title and text elements represent the title and the content of each social media post, the date element the timestamp.
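As an illustration, such per-subject files can be read with Python's standard library. The element names below (writing, title, text, info, date) follow the description in this section; the actual eRisk files may differ in casing and nesting, so this is a minimal sketch rather than the official reader.

```python
import xml.etree.ElementTree as ET

# Toy subject file following the structure described above; the real
# eRisk XML may use different tag names or casing.
SAMPLE = """<individual>
  <id>subject42</id>
  <writing>
    <title>first post</title>
    <date>2019-01-01 10:00:00</date>
    <info>reddit</info>
    <text>hello world</text>
  </writing>
  <writing>
    <title></title>
    <date>2019-01-04 09:30:00</date>
    <info>reddit</info>
    <text>another post</text>
  </writing>
</individual>"""

def parse_subject(xml_string):
    """Return (subject_id, writings) with writings in document order."""
    root = ET.fromstring(xml_string)
    subject_id = root.findtext("id")
    writings = []
    for w in root.iter("writing"):
        writings.append({
            "title": (w.findtext("title") or "").strip(),
            "text": (w.findtext("text") or "").strip(),
            "date": w.findtext("date"),
            "info": w.findtext("info"),
        })
    return subject_id, writings

if __name__ == "__main__":
    sid, posts = parse_subject(SAMPLE)
    print(sid, len(posts))  # subject42 2
```

Note that either title or text may be empty, so both fields are kept and normalized to strings.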
The info tag gives information on the social network the data was retrieved from: the social network Reddit3. Both the title and text elements can be empty if the other element contains text; sometimes both are filled with content. The training data provided XML files for 340 subjects, 41 of which belong to the self-harm group and 299 to the control group. The total number of writings in the self-harm group is 6,927 posts in contrast to 163,506 in the control group. The difference between the groups is also very pronounced in the average number of writings per subject: 169.0 in the self-harm group and 546.8 in the control group. According to the shared task organizers the average number of words per writing is slightly higher in the self-harm group (24.8) than in the control group (18.8). The self-harm test data of 2019 was reused for the 2020 challenge as training data [14]. Table 1 summarizes the details about the training data.

Table 1. Summary of the training data set for eRisk 2020 task 1 (Early detection of signs of self-harm).

                                            Self-harm    Control
Number of subjects                                 41        299
Number of submissions                           6,927    163,506
Average number of submissions per subject       169.0      546.8
Average number of words per submission           24.8       18.8

During manual inspection of the training data, a few problems were identified. For example, some posts contained many URLs or hashtags while others consisted only of HTML escape codes. The latter were partially identified as posts written in Cyrillic script, thus probably in a language different from English. These inconsistencies, typical for social media, required some pre-processing before the training data could be processed with the systems. In a first step the posts were tokenized4, splitting them into sentences and tokens.
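Our pipeline relies on the NLTK tokenizer and the langdetect library; as a self-contained illustration, the noise filtering described above (URLs, hashtags, Cyrillic sentences, HTML escapes) can be approximated with standard-library regular expressions. This sketch only mimics the behavior of the actual tools.

```python
import re

# Stdlib approximation of the pre-processing described above; the real
# pipeline uses the NLTK sentence tokenizer and langdetect instead of
# these hand-written patterns.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
HASHTAG_RE = re.compile(r"#\w+")
CYRILLIC_RE = re.compile(r"[\u0400-\u04FF]")
HTML_ESCAPE_RE = re.compile(r"^(?:&\w+;|\s)+$")

def clean_post(text):
    """Split a post into sentences and drop noisy ones."""
    text = URL_RE.sub(" ", text)
    text = HASHTAG_RE.sub(" ", text)
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    kept = []
    for s in sentences:
        if CYRILLIC_RE.search(s):    # likely not English
            continue
        if HTML_ESCAPE_RE.match(s):  # only HTML escape characters
            continue
        kept.append(s)
    return kept

if __name__ == "__main__":
    post = "I feel better today. Check https://example.com #selfcare Привет мир!"
    print(clean_post(post))  # ['I feel better today.']
```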
At the same time a post ID was generated from the subject ID and timestamp to keep the post structure also on sentence level. In a second step language detection5 was applied to the sentences. We removed sentences for which the language detection algorithm raised an error; this way URLs were sorted out of the data. Additionally, hashtags and posts consisting only of HTML escape characters were also removed from the training data set. The remaining sentences were used for training the semantic features. Some sentences contained urban slang words (e.g. dawg) and were consequently misclassified during language detection. Therefore, we decided to include all posts for which the language detection did not fail, even though not all sentences were classified as English.

3 Website of the social media network Reddit: https://www.reddit.com/
4 Using the standard NLTK tokenizer: https://www.nltk.org/api/nltk.tokenize.html

An additional pre-processing step of the training data was necessary for the time analysis system (see Section 4.1). The posts were sorted by subject and timestamp and the time difference between two posts was calculated. For both groups (self-harm and control group) mean and standard deviation were calculated. For the self-harm group the mean was approximately 71 hours (2 days 23 hours), meaning that each subject posted roughly every three days on average. The standard deviation was approximately 563 hours (23 days 11 hours). For the control group the mean was approximately 35 hours (1 day 11 hours) and the standard deviation was approximately 333 hours (13 days 22 hours).

3 Metrics

Precision, recall and the F-measure, calculated with respect to the self-harm positive group of social media authors [14], have been used as evaluation metrics for eRisk since its advent in 2017.
Besides these standard Information Retrieval metrics the organizers of eRisk also introduced a new metric, ERDE (Early Risk Detection Error), that incorporates the time a system took to make a decision, based on the number of writings that were needed to come to this decision. ERDE penalizes late decisions made by the system, i.e. decisions that required processing more writings than an early alert. The metric's range is [0, 1], where a low value corresponds to a good result. However, ERDE comes with several limitations [13], so Flatency, as proposed by Sadeque and colleagues [16], was added to the set of evaluation metrics for eRisk. An optimal system achieves a value of 1.

In addition to the decision-based evaluation metrics described above, the organizers also introduced ranking-based evaluation methodologies. The standard Information Retrieval measures P@10, NDCG@10 and NDCG@100 are also applied in eRisk.

5 Python language-detection library langdetect: https://pypi.org/project/langdetect/

4 Systems

This section describes the four individual systems and the fifth ensemble system which we developed for our submissions to the eRisk 2020 shared task. Each system (1-5) was submitted as a separate run (0-4). Our systems had a run time of 72d 20h, measured from the first to the last response to the server providing new Social Media posts. Technical problems within our systems led to this high number; however, no manual offline processing was involved. The systems work entirely automatically. Systems 1, 2 and 4 were originally implemented to return decision = 1 when there were no more posts of an author in the current round, interpreting the last score and decision as final for this user. After the test stage of the shared task this understanding of the evaluation metrics turned out to not fully comply with the actual application of these measures in the shared task.
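To make the penalization of late decisions concrete, the following sketch follows the ERDE definition introduced by Losada and Crestani [12]: false positives and false negatives receive fixed costs, true negatives cost nothing, and a correct alarm after k writings is weighted by a sigmoid latency cost centered at the parameter o. The cost constants below are illustrative placeholders, not the official shared-task values.

```python
import math

# Hedged sketch of ERDE_o; c_fp, c_fn and c_tp are placeholder costs.
def latency_cost(k, o):
    """Monotonic latency factor in (0, 1) for an alarm after k writings."""
    return 1.0 - 1.0 / (1.0 + math.exp(k - o))

def erde(decision, truth, k, o, c_fp=0.1, c_fn=1.0, c_tp=1.0):
    if decision == 1 and truth == 0:
        return c_fp                      # false positive: fixed cost
    if decision == 0 and truth == 1:
        return c_fn                      # false negative: fixed cost
    if decision == 1 and truth == 1:
        return latency_cost(k, o) * c_tp # true positive: cost grows with delay
    return 0.0                           # true negative: no cost

if __name__ == "__main__":
    # An early correct alarm (k << o) costs almost nothing;
    # a very late one approaches the full true-positive cost.
    print(round(erde(1, 1, k=1, o=50), 3), round(erde(1, 1, k=200, o=50), 3))  # 0.0 1.0
```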
4.1 System 1 - Time Analysis

One of our hypotheses is that social media users at risk of self-harm post less regularly than those in the control group. For the first run (run 0) a rather simple time analysis algorithm was deployed. The first step was to retrieve the timestamp from the round before the current round to calculate the time that passed between the two posts for each subject. For the calculation of the score that represents the estimation of the level of self-harm, the standard deviation calculated during the training phase was used. The range from <0 days to 13 days 21:44:23 was mapped to the score range of 0 to 5 (the lower half of the at-risk score range). The second half was mapped from the first half's upper border to 23 days 10:51:57. Everything above this value was set to the highest possible score (10). The mapping was done with the following functions for the two groups:

    f_cg(x) = min(x · 5/b_cg, 5)      (1)

    f_sh(x) = min(x · 10/b_sh, 10)    (2)

Here b_cg refers to the control group's upper time border (13 days 21:44:23) and b_sh to the self-harm group's value (23 days 10:51:57), both converted to seconds. The variable x refers to the time difference in seconds between two posts. Thereby, f expresses the score of the level of self-harm risk in the range of 0 to 10. For the calculation of the decision, the current round's score f was compared to the mean score of the previous rounds. If the difference between the current score and the mean score is greater than ±3, the system returns decision = 1. There were cases where the algorithm caused an error: during the first round, since we need at least two writings to calculate a time difference between two posts, and at least one more case, where the timestamps of two successive posts of the same author were identical. To prevent the system from causing an error we caught these cases by setting the default time difference to ten seconds.
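The mapping of Eqs. (1) and (2) and the decision rule can be sketched directly, with the borders b_cg and b_sh converted to seconds:

```python
# Sketch of Eqs. (1) and (2); the borders are the standard deviations
# from the training data, converted to seconds.
B_CG = 13 * 86400 + 21 * 3600 + 44 * 60 + 23   # 13 days 21:44:23
B_SH = 23 * 86400 + 10 * 3600 + 51 * 60 + 57   # 23 days 10:51:57

def f_cg(x):
    """Map a time gap x (seconds) to the lower half [0, 5] of the score range."""
    return min(x * 5 / B_CG, 5)

def f_sh(x):
    """Map a time gap x (seconds) to the full score range [0, 10]."""
    return min(x * 10 / B_SH, 10)

def decision(score, previous_scores):
    """Alarm when the current score deviates from the running mean by more than 3."""
    if not previous_scores:
        return 0
    mean = sum(previous_scores) / len(previous_scores)
    return 1 if abs(score - mean) > 3 else 0
```

A gap equal to the control-group border maps exactly to 5, and any gap above the self-harm border saturates at 10.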
4.2 System 2 - Sentiment Analysis

For the second run (run 1) we developed a method based on sentiment analysis. We used VADER (Valence Aware Dictionary and sEntiment Reasoner) [10], a lexicon-based sentiment algorithm, due to its ability to process elements that are specific to Social Media posts, such as emoticons as an expression of mood or words written in capital letters as a sign of emphasis. VADER provides a compound score which is described by the authors as a normalized and weighted composite score6. The compound score ranges between -1 (extremely negative sentiment) and +1 (extremely positive sentiment). Posts with sentiment values in the range from -0.05 to +0.05 are considered to be of neutral sentiment. The sentiment algorithm was applied on post level to the pre-processed training data. A histogram of the distribution of posts with specific sentiment values is shown in Figure 1. Both groups (self-harm and control group) are visualized separately. At first glance both groups have a similar distribution of sentiment compound values; for instance, most posts are located in the neutral segment. However, the shapes of the histograms differ at the margins: the control group has fewer extremely positive and extremely negative posts than the self-harm group. Therefore, we hypothesize a correlation between extreme sentiment values and an author's susceptibility to self-harm. For the test stage, the sentiment method therefore functions as follows: VADER is applied on post level to calculate the compound sentiment value. In a next step, the compound sentiment value is mapped to the score for the estimation of the level of self-harm by multiplying the compound value by 10. A negative compound value is first transformed into a positive number by multiplying it by -1. For the binary decision a mean score of the previous rounds is calculated.
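The score mapping just described can be sketched as follows; in the actual system the compound value would come from VADER's polarity scores, which we do not reproduce here.

```python
# Sketch of the System 2 scoring; the compound argument stands in for
# the VADER compound sentiment value in [-1, 1].
def sentiment_score(compound):
    """Map a compound value in [-1, 1] to a self-harm score in [0, 10]."""
    return abs(compound) * 10

def decision(score, previous_scores):
    """Alarm when the score deviates from the mean of previous rounds by more than 3."""
    if not previous_scores:
        return 0
    mean = sum(previous_scores) / len(previous_scores)
    return 1 if abs(score - mean) > 3 else 0
```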
The algorithm then compares the mean score of the previous rounds with the score of the current round. If the difference between the current score and the mean score is greater than ±3, the system returns decision = 1, because we assume the author to be prone to self-harm, which becomes apparent through a less stable sentiment.

4.3 System 3 - Distributional Semantics

For our third system (run 2) we implemented a system based on distributional semantics to decide as early as possible whether a post indicates signs of self-harm. To this end, our method computes the semantic similarity between vector representations of previously unseen Reddit posts and an abstract vector representation exhibiting the semantics of a typical post showing signs of self-harm behavior.

6 VADER on Pypi.org: https://pypi.org/project/vader-sentiment/

Fig. 1. Distribution of posts in conjunction with VADER sentiment compound value.

If the semantic similarity exceeds a predefined threshold, the system raises an alarm for the subject who wrote the corresponding post. Our system employs an approach based on the framework of distributional semantics [19]. This framework is based on the distributional hypothesis which states that words occurring in similar contexts tend to have similar meanings [9,8,5]. This hypothesis is operationalized by representing the meaning of target words as multidimensional vectors, also referred to as word vectors or semantic vectors. In this work we use the bag-of-words model, where each vector dimension stands for a particular context word and the numerical value in that dimension signifies how often that context word co-occurs within a window of n words around the target word in a specific text corpus.

Data. As described in Section 2 the training data contains a history of posts for 41 subjects for whom a self-harm alarm has been raised and 299 healthy subjects.
To develop our model on data balanced in terms of the distribution of self-harm and healthy subjects' histories of writings, we randomly selected 41 subjects from the set of healthy subjects. Together with the 41 subjects prone to self-harm this results in a total of 82 histories of posts.

Vector representation of Reddit posts. Our word vectors are built on a concatenation of the lemmatized British National Corpus (BNC)7 and ukWaC corpora8. These corpora are used in many related works to build vector space models as they contain a wide range of text genres from different domains. The combined corpus contains around 2.36 billion tokens in total. We create 10,000-dimensional bag-of-words vectors for the 28,000 most frequent content words in the corpus and set the window size for counting co-occurring context words to 5 words to either side of the target word. The vector dimensions are the 10,000 most frequent content words from the corpus.

7 British National Corpus (BNC): http://www.natcorp.ox.ac.uk
8 UKWaC corpora: http://wacky.sslmit.unibo.it

For each Reddit post that is used for the development of our system and during the test stage of the shared task, we compute an average semantic vector as follows: assuming that Wp is the set of all words appearing in post p, we compute the vector →p representing the meaning of post p as the centroid of the word vectors →w for all words w from Wp:

    →p = (1/|Wp|) Σ_{w∈Wp} →w    (3)

Vector representation of susceptibility to self-harm. A manual inspection of writings from the 41 subjects who are prone to self-harm according to the golden truth revealed that in many cases the last post written by these subjects is particularly characteristic for the expression of self-harm behavior. Thus, we compute a special self-harm vector →s indicating actions or thoughts of self-harm by averaging the post vectors representing the last post of each of the 41 subjects prone to self-harm.
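The centroid of Eq. (3) and the cosine comparison against the self-harm vector →s can be sketched with NumPy; the toy 3-dimensional vectors below stand in for the 10,000-dimensional bag-of-words vectors, and the vocabulary is purely illustrative.

```python
import numpy as np

# Toy stand-ins for the 10,000-dimensional bag-of-words word vectors.
WORD_VECTORS = {
    "hurt":  np.array([0.9, 0.1, 0.0]),
    "pain":  np.array([0.8, 0.2, 0.0]),
    "sunny": np.array([0.0, 0.1, 0.9]),
}

def post_vector(words):
    """Centroid of the word vectors of all known words in a post (Eq. 3)."""
    vectors = [WORD_VECTORS[w] for w in words if w in WORD_VECTORS]
    return np.mean(vectors, axis=0)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```

In the full system, →s is the average of the last-post vectors of the 41 at-risk subjects, and the similarity cos(→p, →s) drives both the alarm threshold and the score.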
To tune our system, we computed the cosine similarities between the first five posts of each subject in our balanced data set consisting of 82 histories of posts and the self-harm vector →s. We found 0.985 to be a good threshold to distinguish posts from persons prone to self-harm and those who are not. Therefore, we use this threshold for the test stage in the CLEF eRisk 2020 shared task. During the test stage we compute a post vector for each incoming post as we did during the training phase. Then we compute the cosine similarity between each of these post vectors and the pre-computed self-harm vector →s. If the cosine similarity exceeds the threshold of 0.985, an alarm for that post is raised and the label 1 is assigned to the subject who wrote that post. The score expressing the susceptibility to self-harm expressed in post p is computed using the following linear function:

    score(p) = cos(→p, →s) · 10    (4)

where →p is the post vector and →s the pre-computed self-harm vector.

4.4 System 4 - Neural Network

Automatically learning to estimate a person's risk of self-harm solely based on their social media postings is a complex task. Therefore, an automatic system has to be able to consider a variety of textual features. Manually defining a feature set which incorporates a comprehensive amount of indicators proved to be a tedious task (cf. Systems 1-3 utilizing diverse feature categories). Thus, we decided to use a neural network to automatically learn such features. In this section we describe the structure of our deep learning model, how we train it on the available data set and how we predict scores for new instances. The overall architecture of our neural network system is shown in Figure 2. In the context of the shared task, we formulate the classification task as follows: an instance (input data point) consists of a sequence of posts of a single

Fig. 2. Architecture of our neural network system.
Each box corresponds to a neural network layer, while the values on the right side of each box display the dimensions of the input (values at the top) and output (values at the bottom) tensor of each layer ('?' is a placeholder for the variable batch size).

author (where each post is represented as a string) and should be classified as a binary decision, expressing if the author tends toward self-harm (yes/no).

Model architecture. As numerical representation of the posts' text, we used BERT [6] embeddings, which have in recent years proven to be an efficient method of representing meaning in contextualized word embeddings. We used the pre-trained embedding model BERT-Base ('BERT uncased L-12 H-768 A-12')9 [18]. We selected 50 features as the dimensionality of the embeddings. To keep the size of our instances computationally feasible, we limited the post history to the 100 most recent postings and the length of each post to the first 50 words (using pre-padding in both cases when fewer postings/words are available). Thus, an instance consists of 5,000 BERT word embeddings of size 50. These values are also shown in the input layer of the network in Figure 2. Subsequently, we apply a convolutional neural network (CNN) to automatically select features which can be indicative of self-harm from the post history. Our CNN sub-network consists of four parallel branches, each being a sequence of a convolutional layer, a max pooling layer and a dropout layer. CNNs are efficient feature selectors which can be applied to learn complex features based on semantic representations, cf. [17]. We decided to use four different convolutions (see the branches in Figure 2) to detect 8 features (by using 8 filters in the layer) from word n-grams of different sizes n from 1 to 4.
Thus, the first convolution learns unigram features from the embedding of the post history of the subject, the second convolution learns bigram features, etc. In each branch, the subsequent max pooling layer is applied to retain only the maximum feature values for each post, resulting in a 100x8 matrix (8 features for each of the 100 most recent postings) in each branch. A dropout layer (with a probability of 0.25) then introduces regularization to the model during training, which improves its ability to generalize to new data instances. Finally, the output of the branches is concatenated, resulting in 32 features for each of the 100 most recent postings, as shown in the layer named Concatenate in Figure 2. To compute an assessment of the overall self-harm risk of a subject based on these high-level features of their most recent history, we process the output of the Concatenate layer using a recurrent neural network (RNN) with long short-term memory (LSTM) cells (see LSTM layer in Figure 2). This neural network component reads the features of these posts as a sequence while updating an internal state (which here should model the risk of self-harm) after each step (post). Finally, after processing the entire sequence, the RNN returns an overall assessment for the given post history; here, we set the dimensionality of the output space to 8. As final layer we use a densely connected layer with a sigmoid activation function to predict an output between 0 and 1 which can be interpreted as the probability of the subject being prone to self-harm. With this configuration, the model has a total of 5,353 trainable parameters.

9 The used BERT model is available on https://github.com/google-research/bert

Training. We used the data available in the eRisk 2019 Text Research Collection [12] to construct training instances as follows.
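Before turning to the training details, the stated total of 5,353 trainable parameters can be cross-checked arithmetically, assuming standard 1D convolutions over the 50-dimensional embeddings, an LSTM with 8 units on the 32-dimensional concatenation, and a single dense output unit (our reading of the architecture; the exact layer implementations are inferred from the description).

```python
# Arithmetic cross-check of the reported parameter count.
EMB_DIM, FILTERS, LSTM_UNITS = 50, 8, 8

# One 1D-convolution branch with kernel size n:
# n * in_channels * filters weights + filters biases
conv_params = sum(n * EMB_DIM * FILTERS + FILTERS for n in range(1, 5))

# LSTM: 4 gates, each with (input_dim + units) * units weights + units biases
concat_dim = 4 * FILTERS  # 32 features after concatenating the branches
lstm_params = 4 * ((concat_dim + LSTM_UNITS) * LSTM_UNITS + LSTM_UNITS)

# Dense sigmoid output: weights + bias
dense_params = LSTM_UNITS * 1 + 1

total = conv_params + lstm_params + dense_params
print(conv_params, lstm_params, dense_params, total)  # 4032 1312 9 5353
```

The sum matches the reported 5,353 parameters, which supports this reading of the architecture.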
As our model expects a chronological list of posts per subject as input and the data set consists of the entire post history of only 340 subjects, we decided to use multiple random samples of subsets of this data. We constrain the samples to be between 10 and 100 posts in length, while we consider both titles and text comments as individual posts. With this method we extract 3,705 samples from the data set and select 90% of those as our training data and the remaining 10% as validation data. To cope with the class imbalance (in the original data set, only 41 of 340 subjects are prone to self-harm), we use class weights during training based on their inverse frequency in our training set. Thus, errors on classifying positive instances are valued approximately 21 times higher than errors on classifying negative instances. Additionally, to avoid over-fitting, we apply early stopping using the validation set, which stops training when the monitored binary cross-entropy score for the validation set does not improve for several epochs. Finally, we optimize the network with a training batch size of 64 and a binary cross-entropy loss using the Adam optimizer [11] in 12 epochs over the training data.

Prediction. Our neural network model returns a probabilistic prediction p(s) for each unseen instance expressing the likelihood of the subject s being prone to self-harm. To conform to the format of the shared task, we reformat this score for each subject into a decision using Equation 5 and a self-harm score sh(s) = 10 · p(s).

    decision(s) = 0 if 0.1 < p(s) < 0.8, 1 otherwise    (5)

Using this method, we set the decision value to 1 only in cases when the model is extremely confident for a negative prediction (values below 0.1)10 or highly confident for a positive prediction (values above 0.8)11.

4.5 System 5 - Ensemble System

Our ensemble system uses the individual predictions of the above-mentioned four systems (Systems 1-4) in each run to compute a single combined decision.
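The prediction rule of Eq. (5) and the accompanying score can be written directly; note that, as described above, a decision of 1 is emitted both for highly confident positive and for extremely confident negative predictions.

```python
# Sketch of Eq. (5): only very confident predictions (either direction)
# trigger a final decision of 1.
def decision(p):
    """p is the network's probability that the subject is prone to self-harm."""
    return 0 if 0.1 < p < 0.8 else 1

def self_harm_score(p):
    """Self-harm score sh(s) = 10 * p(s)."""
    return 10 * p
```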
For each subject this system predicts a decision value of 1 if at least two of the individual systems assigned a 1. If only one or none of the systems predicted 1, the ensemble system returns 0. The self-harm risk score for each subject is always computed as the average of the scores of all four systems in each run.

5 Results

Table 2 shows the overall decision-based results of team hildesheim in all five runs on the eRisk 2020 test data. The run ID corresponds to our systems as described in Section 4. Our systems rank high in terms of recall: two systems achieve R = 1, and the other three runs are also highly ranked. Correspondingly, the precision of all five runs is comparatively low. Our best system achieves a precision of P = 0.297, which means that approximately every third decision is correct. In such a sensitive domain we consider this to be an acceptable result, but we aim for a higher precision in future eRisk participations of our team. For ERDE5 one of our systems (System 3) performs well with a low value of ERDE5 = 0.237. After processing 50 posts all our systems improve, e.g. our best system achieves an ERDE50 result of 0.185. Regarding the latency-weighted F-measure Flatency our systems show average performance. Our third system performs best with a value of Flatency = 0.424, close to the results of our other proposed systems. The ranking-based evaluation results are listed in Table 3. We did not consider the results after the first writing to be meaningful for our approaches, because some systems are based on a history of writings and especially the time algorithm requires at least two writings to make a decision. For that reason, we think that the results after 100 writings and 500 writings are more interpretable. However, the ranking-based results remain rather constant for 100 and 500 writings, with only slight changes.
10 We chose this very strict threshold to only exclude subjects which have shown several clear signs of not being prone to self-harm.
11 Here we chose a weaker threshold (in comparison to the threshold for negative cases) to also catch subjects which showed clear signs of being prone to self-harm only at some point.

Table 2. Our decision-based results of task 1: Early detection of signs of self-harm, in comparison to the best results achieved in the shared task for each metric.

System      P      R     F1   ERDE5  ERDE50  Flatency
1       0.248  1.000  0.397   0.292   0.196     0.397
2       0.246  1.000  0.395   0.304   0.185     0.389
3       0.297  0.740  0.424   0.237   0.226     0.424
4       0.270  0.942  0.420   0.400   0.251     0.367
5       0.256  0.990  0.406   0.409   0.210     0.389
Best    0.913  1.000  0.754   0.134   0.071     0.658

Table 3. Our ranking-based results of task 1: Early detection of signs of self-harm, in comparison to the best results achieved in the shared task for each metric.

               100 writings                 500 writings
System  P@10  NDCG@10  NDCG@100     P@10  NDCG@10  NDCG@100
1        0.4     0.43      0.42      0.5     0.53      0.42
2        0.5     0.48      0.49      0.5     0.54      0.57
3        1.0     1.00      0.69      1.0     1.00      0.68
4        0.1     0.07      0.13      0.1     0.06      0.11
5        1.0     1.00      0.62      1.0     1.00      0.69
Best     1.0     1.00      0.83      1.0     1.00      0.84

The complete list of the evaluation results of all participating teams is published in [15].

6 Conclusion and Future Work

This paper describes the University of Hildesheim submission to the CLEF eRisk 2020 shared task on detecting early signs of self-harm in social media posts. We presented four systems that apply different methods to solve this task and a fifth ensemble system that combines the decisions of the four former systems. The first four systems analyze social media posts taking into account different types of features, such as time intervals between posts, sentiment values and semantic representations. While all our systems achieve a high recall, they struggle to yield a comparable precision.
We expected our ensemble system to perform better than the other systems since it raises an alarm only if several other systems do so as well. However, as the results show, the ensemble system is more likely to balance out the predictions of the other four systems, which becomes apparent from its ranking among the other four systems. Therefore, as future work we intend to implement a more sophisticated ensemble model that is able to incorporate the strengths of the other four systems, in particular when those are very confident in their decisions. This could be accomplished by weighting the contributions of the single systems in the ensemble system. Investigating different settings for the ensemble system is a promising avenue towards a solution to the early detection of self-harm risk in social media postings.

References

1. Bai, S., Hao, B., Li, A., Yuan, S., Gao, R., Zhu, T.: Predicting Big Five Personality Traits of Microblog Users. In: V. Raghavan, X. Hu, C.L.J.T. (ed.) Proceedings of the 2013 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT). vol. 1, pp. 501–508. IEEE Computer Society, Washington D.C., USA (November 2013)
2. Beck, A.T., Ward, C.H., Mendelson, M., Mock, J., Erbaugh, J.: An Inventory for Measuring Depression. Archives of General Psychiatry 4(6), 561–571 (1961)
3. Boyd, D.M., Ellison, N.B.: Social Network Sites: Definition, History, and Scholarship. Journal of Computer-Mediated Communication 13(1), 210–230 (2007)
4. De Choudhury, M., Counts, S., Horvitz, E.J., Hoff, A.: Characterizing and Predicting Postpartum Depression from Shared Facebook Data. In: S. Fussell, W. Lutters, M.R.M.M.R. (ed.) Proceedings of the 17th ACM Conference on Computer Supported Cooperative Work & Social Computing. pp. 626–638. CSCW 2014, Association for Computing Machinery, New York, NY, USA (February 2014)
5.
Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., Harshman, R.: Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science 41(6), 391–407 (1990)
6. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Association for Computational Linguistics, Minneapolis, Minnesota (June 2019). https://doi.org/10.18653/v1/N19-1423
7. Eichstaedt, J.C., Smith, R.J., Merchant, R.M., Ungar, L.H., Crutchley, P., Preoţiuc-Pietro, D., Asch, D.A., Schwartz, H.A.: Facebook Language Predicts Depression in Medical Records. Proceedings of the National Academy of Sciences (PNAS) 115(44), 11203–11208 (2018)
8. Firth, J.: A Synopsis of Linguistic Theory 1930–1955. Studies in Linguistic Analysis, pp. 1–32 (1957)
9. Harris, Z.: Distributional Structure. Word 10(2-3), 146–162 (1954)
10. Hutto, C.J., Gilbert, E.: VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. In: Adar, E., Resnick, P., Choudhury, M.D., Hogan, B., Oh, A.H. (eds.) Proceedings of the 8th International AAAI Conference on Weblogs and Social Media. AAAI 2014, AAAI Press, Palo Alto, CA, USA (June 1-4 2014)
11. Kingma, D.P., Ba, J.: Adam: A Method for Stochastic Optimization. In: Bengio, Y., LeCun, Y. (eds.) Proceedings of the 3rd International Conference on Learning Representations. ICLR 2015 (May 7-9 2015), http://arxiv.org/abs/1412.6980
12. Losada, D., Crestani, F.: A Test Collection for Research on Depression and Language Use. In: K. Balog, L. Cappellato, N.F.C.M. (ed.) Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the 7th International Conference of the CLEF Association, pp. 28–39.
CLEF 2016, Springer International Publishing, Cham, Switzerland (September 5-8 2016)
13. Losada, D.E., Crestani, F., Parapar, J.: Overview of eRisk at CLEF 2019: Early Risk Prediction on the Internet (extended overview). In: Cappellato, L., Ferro, N., Losada, D.E., Müller, H. (eds.) Information Access Evaluation meets Multilinguality, Multimodality, and Visualization. Proceedings of the 10th Conference and Labs of the Evaluation Forum. No. 2380 in CEUR Workshop Proceedings (September 2019), http://ceur-ws.org/Vol-2380/paper_248.pdf
14. Losada, D.E., Crestani, F., Parapar, J.: eRisk 2020: Self-harm and Depression Challenges. In: Jose, J.M., Yilmaz, E., Magalhães, J., Castells, P., Ferro, N., Silva, M.J., Martins, F. (eds.) Advances in Information Retrieval. Proceedings of the 42nd European Conference on IR Research, pp. 557–563. ECIR 2020, Springer International Publishing, Cham, Switzerland (April 14-17 2020)
15. Losada, D.E., Crestani, F., Parapar, J.: Overview of eRisk 2020: Early Risk Prediction on the Internet. In: Arampatzis, A., Kanoulas, E., Tsikrika, T., Vrochidis, S., Joho, H., Lioma, C., Eickhoff, C., Névéol, A., Cappellato, L., Ferro, N. (eds.) Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the 11th International Conference of the CLEF Association. CLEF 2020, Springer International Publishing, Cham, Switzerland (September 22-25 2020)
16. Sadeque, F., Xu, D., Bethard, S.: Measuring the Latency of Depression Detection in Social Media. In: Proceedings of the 11th ACM International Conference on Web Search and Data Mining, pp. 495–503. WSDM 2018, Association for Computing Machinery, New York, NY, USA (February 2018)
17. Schäfer, J., Burtenshaw, B.: Offence in Dialogues: A Corpus-Based Study. In: R. Mitkov, G.A. (ed.) Proceedings of the International Conference Recent Advances in Natural Language Processing: Natural Language Processing in a Deep Learning World, pp. 1085–1093.
RANLP 2019, INCOMA Ltd., Varna, Bulgaria (September 2-4 2019). https://doi.org/10.26615/978-954-452-056-4_125
18. Turc, I., Chang, M.W., Lee, K., Toutanova, K.: Well-Read Students Learn Better: On the Importance of Pre-training Compact Models. arXiv preprint arXiv:1908.08962v2 (August 2019)
19. Turney, P.D., Pantel, P.: From Frequency to Meaning: Vector Space Models of Semantics. Journal of Artificial Intelligence Research 37(1), 141–188 (2010)