<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>CLEF 2021 Working Notes, CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>UNSL at eRisk 2021: A Comparison of Three Early Alert Policies for Early Risk Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Juan Martín Loyola</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sergio Burdisso</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Horacio Thompson</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Leticia Cagnina</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marcelo Errecalde</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET)</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Instituto de Matemática Aplicada San Luis (IMASL), CONICET-UNSL</institution>
          ,
          <addr-line>Av. Italia 1556, San Luis, C.P. 5700</addr-line>
          ,
          <country country="AR">Argentina</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Universidad Nacional de San Luis (UNSL)</institution>
          ,
          <addr-line>Ejército de Los Andes 950, San Luis, C.P. 5700</addr-line>
          ,
          <country country="AR">Argentina</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <abstract>
        <p>Early risk detection (ERD) can be considered as a multi-objective problem in which the challenge is to find an adequate trade-off between two different and related aspects: 1) the accuracy in identifying risky users and, 2) the minimum time that a risky user detection requires to be reliable. The first aspect is usually addressed as a typical classification problem and evaluated with standard classification metrics like precision, recall, and F1. The second one involves a policy to decide when the information from a user classified as risky is enough to raise an alarm/alert, and is usually evaluated by penalizing the delay in making that decision. In fact, temporal evaluation metrics used in ERD like ERDE and latency combine both aspects in different ways. In that context, unlike our previous participations at eRisk Labs, we focus this year on the second aspect of ERD tasks, that is to say, the early alert policies that decide if a user classified as risky should effectively be reported as such. In this paper, we describe three different early alert policies that our research group from the Universidad Nacional de San Luis (UNSL) used at the CLEF eRisk 2021 Lab. Those policies were evaluated on the two ERD tasks proposed for this year: early risk detection of pathological gambling and early risk detection of self-harm. The first approach uses standard classification models to identify risky users and a simple (manual) rule-based early alert policy. The second approach is a deep learning model trained end-to-end that simultaneously learns to identify risky users and the early alert policy through a Reinforcement Learning approach. Finally, the last approach consists of a simple and interpretable model that identifies risky users, integrated with a global early alert policy. That policy, based on the (global) estimated risk level for all processed users, decides which users should be reported as risky.
Regarding the achieved results, our models obtained the best performance in terms of decision-based performance metrics (F1, ERDE50, latency) as well as in terms of the ranking-based performance measures, for both tasks. Furthermore, in terms of the latency measure, the performance obtained in the first task was twice as good as that of the second-best team.</p>
      </abstract>
      <kwd-group>
        <kwd>Early Risk Detection</kwd>
        <kwd>Early Classification</kwd>
        <kwd>End-to-end Early Classification</kwd>
        <kwd>SS3</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The early risk prediction on the Internet (eRisk) lab is concerned with the exploration of new
models for early risk detection and evaluation methodologies with a direct impact on social
and health-related issues [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The lab started in 2017 tackling the problem of early detection of
depression in users from an online forum [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In 2018, the early detection of signs of anorexia
was added as a new challenge for the lab, alongside an expanded version of the previous year’s
task [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The test data were organized into 10 chunks and provided to each team chunk by
chunk. The participants’ models were evaluated using the ERDE evaluation metric introduced
by Losada et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] to consider both the correctness of the classification and the delay taken
by the system to make the decision. In 2019, the early detection of depression track was
removed and two new challenges were presented: the early detection of signs of self-harm
and measuring the severity of the signs of depression [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. For that edition of the lab, new
performance measures were considered. First, the performance measure latency proposed by
Sadeque et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] was incorporated as a complementary measure to ERDE. On the other hand,
ranking-based evaluation metrics were added to help professionals in real life make decisions.
That year also marked the end of the chunk-based processing of the data. From that year on, a
post-by-post approach was used for the challenges, resembling a real-life scenario where users
write posts one at a time. In 2020, the early detection of signs of anorexia task was removed,
but the other tasks were kept [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Finally, in 2021, the early detection of signs of pathological
gambling task was introduced. Below is a brief description of the two tasks our research lab
participated in.
      </p>
      <sec id="sec-1-1">
        <title>Task 1: Early Detection of Signs of Pathological Gambling</title>
        <p>For this task, the goal was to detect, as soon as possible, the users that were compulsive gamblers or that had early traces
of pathological gambling. The task’s data consisted of a series of user’s writings from social
media collected in chronological order. No training data were provided, thus each team had to
build a corpus to train their models.</p>
      </sec>
      <sec id="sec-1-2">
        <title>Task 2: Early Detection of Signs of Self-Harm</title>
        <p>For this task, the goal was the same as in the 2019 and 2020 eRisk editions, that is, to sequentially process pieces of evidence and detect
early traces of self-harm as soon as possible. This year, training data was the combination of
the 2020 edition training and testing data.</p>
        <p>The performance on both tasks was assessed using standard classification measures (precision,
recall, and F1 score), measures that penalize delay in the response (ERDE and latency), and
ranking-based evaluation metrics. The F1 and latency scores were computed with respect to the
positive class. To calculate these measures, for every post of every user, participating models
were required to provide a decision, which signaled if the user was at-risk (indicated with a one)
or not (indicated with a zero), and a score, that represented the user’s level of risk (estimated
from the evidence seen so far). Note that if a user was classified as being at risk, posterior
decisions were not considered.</p>
        <p>The present work describes the different approaches used by our research group to address
Tasks 1 and 2 mentioned above. Furthermore, it compares the models’ behavior on the
tasks and evaluates their performance. More precisely, the remainder of this paper is organized
as follows. Section 2 provides a general introduction and overview of the corpus generation
procedure, the data pre-processing steps used for classification, and the different models applied
for the early risk detection tasks. Sections 3 and 4 describe the corpus, the parameters of the
models, and their results for Task 1 and Task 2, respectively. Section 5 analyzes the results
obtained in both tasks. Finally, Section 6 presents conclusions and discusses possible future
work directions.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Approaches</title>
      <p>Early risk detection (ERD) can be conceptualized as a multi-objective problem in which the
challenge is to find an adequate trade-off between two different and related aspects. On the
one hand, the accuracy in identifying risky users. On the other hand, the minimum time that a
risky user detection requires to be reliable. The first aspect is usually addressed as a typical
classification problem with two classes: risky and non-risky. That task is evaluated with standard
classification metrics like precision, recall, and F1. The second one involves a policy to decide
when the information from a user classified as risky is enough to raise an alarm/alert. That is,
the decision-making policy returns yes (or true) to alert/confirm that the user is effectively at
risk or no (or false) otherwise. When this policy is evaluated, it is usually penalized according
to the delay incurred in raising an alert/alarm for a risky user.</p>
      <p>
        The aspects described above were explicitly modelled in an article presented by Loyola et
al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] where an early classification framework was introduced. The focus of early classification
is on the development of predictive models that determine the category of a document as soon
as possible. This framework divides the task into two separated problems: classification with
partial information and deciding the moment of classification. The task of classification with
partial information (CPI) consists in obtaining an effective model that predicts the class of a
document using only the information read up to a certain point in time. On the other
hand, the task of deciding the moment of classification (DMC) involves determining the point
at which the reading of the document can be stopped with some certainty that the prediction
made is correct. Trying to decide when to stop reading a document using only the class the CPI
model returns is difficult. For this reason, the data that the DMC model receives is augmented with
contextual information, that is, data from the body of the document that could be helpful for
deciding the moment of classification.
      </p>
      <p>The interesting point about that early classification framework is that it can be used as a
reference in ERD tasks. This is feasible by simply using the CPI component to identify risky
users and replacing the early-stop reading policy implemented by DMC with an equivalent
early alert policy for ERD. Thus, from now on we may refer to the component in charge of
identifying risky users as CPI and the one in charge of implementing the early alert policy as
DMC.</p>
      <p>
        An issue not considered in Loyola and collaborators’ work [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and observed during the eRisk
challenge, is that multiple documents (users) were processed in parallel. Thus, the context
information could also consider information from other documents being processed at the same
time. For that reason, the original early classification framework was minimally modified to
take this situation into account resulting in the framework shown in Figure 1.
      </p>
      <p>Summarizing, we will address ERD as a special case of early classification in which we are
only concerned with predicting, as soon as possible, a subset of the categories. Only the classes
representing a risk for people are considered. If the current partial input is classified as the
non-risky class, the model keeps accumulating information in case, in the future, the user
starts showing patterns of risk. Note that, in ERD, it is essential to retrieve as many of the users at
risk as possible since their lives could be in danger. Thus, it is important to develop models
that have high recall for the risk classes.</p>
      <p>In order to adapt the early classification framework to the ERD problem, an alarm was raised
to indicate a user at-risk only when the class predicted by the CPI was positive or at-risk and the
DMC decided we should stop reading the input. Raising an alarm involves sending a decision
equal to one to the challenge. In any other case, the model sent a decision equal to zero. Recall
that, for the eRisk tasks, it was necessary to keep processing the input to score the level of risk
of every user, even if it was already flagged, thus the model should not stop reading until the
input ends.</p>
      <p>
        In this study, three kinds of early risk detection approaches were analyzed with different CPI
and DMC components. However, and beyond the key role that the CPI component plays in
identifying risky users, our focus in this participation is on the early alert policy implemented
by the DMC component. In that context, three decision-making policies were considered. First,
a simple decision tree with information from a regular text classifier; second, a deep learning
model trained end-to-end using Reinforcement Learning; lastly, a global criterion based on
information of the whole ranking of users given by the Sequential S3 (Smoothness, Significance,
and Sanction) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], SS3 from now on, model. Except for the SS3 models, a data pre-processing
stage was applied to all the other models to ease the learning procedure. The details of this
pre-processing will be given in Section 2.2.
      </p>
      <sec id="sec-2-1">
        <title>2.1. Corpus Generation Procedure</title>
        <p>Since Task T1 had no training data, and to improve the performance of the models trained for
Task T2, a couple of datasets for each task were generated. The data for each corpus was
obtained from Reddit through its API. Note that most of the content of Reddit can be retrieved
as a JSON file if the string “.json” is appended to the original URL —for instance, the current top
posts and their content can be fetched with https://www.reddit.com/top.json. The structure and
meaning of each part of the JSON file can be found in the Reddit API documentation. 1 Thus, to
build each corpus, different pages of Reddit were consulted.</p>
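The “.json” trick described above can be sketched as follows; the function names and the user-agent string are illustrative, not the authors’ actual crawler, and a real crawler should respect Reddit’s rate limits.

```python
import json
import urllib.request

def reddit_json_url(page_url: str) -> str:
    # Reddit serves most pages as JSON when ".json" is appended to the URL.
    return page_url.rstrip("/") + ".json"

def fetch_reddit_json(page_url: str) -> dict:
    """Fetch a Reddit page as JSON (sketch; hypothetical user agent)."""
    request = urllib.request.Request(
        reddit_json_url(page_url),
        headers={"User-Agent": "corpus-builder-sketch"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, `reddit_json_url("https://www.reddit.com/top")` yields the URL used in the text above.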
        <p>The main goal of the corpus generation procedure was to get two disjoint sets of users,
one with the users at-risk and the other with random users. All available submissions and
comments from both groups were extracted. The most popular subreddit related to each task
was consulted to get the at-risk users; from now on, this subreddit will be referred to as the “main subreddit”.
For the detection of pathological gambling the subreddit “problemgambling”2 was used; while
for the detection of self-harm, the subreddit “selfharm”. 3 On the other hand, to get the random
users, the subreddits “sports”, “jokes”, “gaming”, “politics”, “news”, and “LifeProTips” were used.
Henceforth, these subreddits are going to be referred to as “random subreddits”.</p>
        <p>In order to collect the at-risk users, first the last 1000 submissions to the main subreddit
were evaluated. Every user that posted or commented in those submissions was considered as
a user at-risk —and accordingly added to the set of at-risk users. All the posts and comments
were saved for later cleaning. Then, similarly, the all-time top 1000 submissions to the main
subreddit were fetched to obtain more users at risk and their posts and comments. Finally, for
the users at risk, all their available posts were retrieved. Each of the submissions and comments
from users at risk, together with the comments of other users, was saved, even if it was
published in another subreddit. If a post belonged to the main subreddit, all the users that had
commented on it were added to the set of at-risk users.</p>
        <p>To gather the random users, initially, the last 100 submissions to each random subreddit
were evaluated. For every one of the authors of the submissions, all their available posts
were retrieved. Both the posts and all their comments were saved. Note that the number of
submissions retrieved in this case was much lower than for the main subreddit; this is because
the random subreddits are more popular and have more comments per post.</p>
        <p>While retrieving posts and comments, not all of them were saved. The posts and comments
belonging to bots, moderators of subreddits, or deleted accounts were mostly not considered.
Since there is not enough information in the JSON to know whether a user is a bot, moderator, or
person, regular expressions were used to identify most of them based on their user name or the
content of the post or comment. After a manual examination of the posts, it was determined
that if the user name matched one of “LoansBot”, “GoodBot_BadBot”, or “B0tRank”, the account
was flagged as a bot. Note that this set of user names depends on the time and the subreddits
consulted. Additionally, when the content of the post or comment contained the text “this is
a bot” or any variation with the same meaning, the particular submission was automatically
flagged as a bot and not considered. The drawback of this step is that it is possible to flag a real
user’s submission as belonging to a bot simply because those words appear in a post or
comment. Nevertheless, the instances where this happened were very few, affecting less than 5
of the users’ posts. With respect to the moderators of subreddits, only the automatic moderator,
whose account name was “AutoModerator”, was filtered. Later, the posts and comments from
accounts that had been deleted at the time of retrieval were also filtered —once deleted,
those accounts were renamed “[deleted]”. Additionally, comments or posts that matched the
text “[deleted]”, “[removed]”, or “removed by moderator” were ignored. Finally, if the post or
comment belonged to the subreddit “copypasta”, it was not considered, since its submissions
have no meaning and contain a large number of words, skewing the whole corpus.
1https://www.reddit.com/dev/api/
2https://www.reddit.com/r/problemgambling
3https://www.reddit.com/r/selfharm</p>
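The filtering rules above can be sketched as a small helper. The function and set names are illustrative, and the regular expression only covers the literal phrase “this is a bot” rather than all the variations the authors matched:

```python
import re

# Account names identified as bots or moderators in the text above;
# "[deleted]" marks accounts removed at retrieval time.
FILTERED_ACCOUNTS = {"LoansBot", "GoodBot_BadBot", "B0tRank",
                     "AutoModerator", "[deleted]"}
# Placeholder contents that mark removed submissions.
REMOVED_MARKERS = {"[deleted]", "[removed]", "removed by moderator"}
# Rough stand-in for "this is a bot" and its variations.
BOT_TEXT = re.compile(r"\bthis is a bot\b", re.IGNORECASE)

def should_discard(author: str, text: str, subreddit: str) -> bool:
    """Return True when a post or comment should be dropped from the corpus."""
    if author in FILTERED_ACCOUNTS:
        return True
    if text in REMOVED_MARKERS:
        return True
    if subreddit == "copypasta":
        return True
    return bool(BOT_TEXT.search(text))
```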
        <p>On the other hand, it was observed that posts and comments contained references to other
users. In particular, there were some references to users at risk. Given these, the models could
learn to classify a user using the references to other users. Since this was not desirable, and to
ensure anonymity, the references to other users were replaced with a token.</p>
        <p>Once all the posts with their comments were collected, they were grouped by user. All the
users with 30 or fewer writings (posts or comments), or with an average of fewer than 15 words per
writing, were discarded. Later, any user that had a writing in the main subreddit
was flagged as at-risk, while the rest were flagged as random users. Finally, all the writings for each user
were ordered by their publication time.</p>
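As a sketch, the user-level filter above amounts to the following (hypothetical function name):

```python
def keep_user(writings: list[str]) -> bool:
    """Filter described above: discard users with 30 or fewer writings,
    or with an average of fewer than 15 words per writing."""
    if len(writings) <= 30:
        return False
    avg_words = sum(len(w.split()) for w in writings) / len(writings)
    return avg_words >= 15
```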
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Data Pre-processing</title>
        <p>Every user’s post provided by the challenge was part of a raw JSON file that held its content and
some metadata information. For this work, only the title and post’s content were considered
when processing the input. Due to the nature of social networks and Internet forums, the input
data was highly heterogeneous. Users often use different languages, weblinks, emoticons, and
format strings (newlines, tabs, and blanks). This noise could cause the representation space for
the input to grow bigger, which could ultimately affect the performance of the models. Also,
some HTML and Unicode characters were not correctly saved and were replaced by a numeric
value that represented them. Therefore, the input was pre-processed as follows:
1. Convert text to lower case.
2. Replace the decimal code for a Unicode character with the character itself. For
example, the raw input has “it #8217;s not much [...]” instead of “it’s not much [...]”,
where the number 8217 is the decimal code for the right single quotation mark (’). Note
that every code is surrounded by a hashtag symbol and a semicolon, and is preceded by an empty
space that should be removed.
3. Replace HTML codes with their symbols. For example, the raw input has “[...] red for ir
amp; green for Thermal [...]” instead of “[...] red for ir &amp; green for Thermal [...]”, where
amp; is the HTML character entity code for the symbol &amp;. Note that every code is also
preceded by an extra white space that should be deleted. The only HTML symbols that
needed to be converted were &amp;, &lt;, and &gt;.
4. Replace links to the web with the token weblink.
5. Replace internal links to subreddits with the name of the subreddits. For example, the
text “[...] x-post from /r/funny” gets processed to “[...] x-post from funny”.
6. Delete any character that is not a number or letter. Note that if the Unicode and HTML
codes were not replaced beforehand, these will appear later as numbers or words.
7. Replace numbers with the token number.
8. Delete new lines, tab, and multiple consecutive white spaces.</p>
        <p>These steps were rigorously checked to ensure that no relevant information from the input
was lost.</p>
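The eight steps above can be chained into a single function. This is a minimal sketch under the assumption of simple regular expressions; the authors’ exact patterns are not specified in the text:

```python
import re

def preprocess(text: str) -> str:
    """Apply the eight pre-processing steps described above (sketch)."""
    # 1. Convert text to lower case.
    text = text.lower()
    # 2. Decimal Unicode codes such as " #8217;" -> the actual character
    #    (the leading space is removed, as noted above).
    text = re.sub(r"\s#(\d+);", lambda m: chr(int(m.group(1))), text)
    # 3. HTML entity codes; only &, < and > needed converting.
    text = re.sub(r"\s(amp|lt|gt);",
                  lambda m: {"amp": "&", "lt": "<", "gt": ">"}[m.group(1)],
                  text)
    # 4. Web links -> the token "weblink".
    text = re.sub(r"https?://\S+", "weblink", text)
    # 5. Internal subreddit links -> the subreddit name.
    text = re.sub(r"/r/(\w+)", r"\1", text)
    # 6. Drop anything that is not a letter, digit, or whitespace.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # 7. Numbers -> the token "number".
    text = re.sub(r"\b\d+\b", "number", text)
    # 8. Collapse newlines, tabs, and repeated blanks.
    return re.sub(r"\s+", " ", text).strip()
```

For instance, under this sketch `preprocess("x-post from /r/funny")` returns `"x post from funny"`, consistent with the example in step 5.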
      </sec>
      <sec id="sec-2-3">
        <title>2.3. ERD Frameworks</title>
        <p>As explained before, our approaches to the ERD problem require a description of the
ERD framework that explicitly identifies how the CPI component (the risky-user classification
model) is implemented, and how the DMC component makes its decisions (the early alert
policy). In fact, it is interesting to note that the DMC component also constitutes a model that
could be learned as is usual with the CPI component. Thus, in the following subsections, the
ERD frameworks used by our group are presented in a comprehensive way, describing both
components (models) of the ERD framework.</p>
      </sec>
      <sec id="sec-2-4">
        <title>2.3.1. Text Classifiers with a Simple Rule-based Early Alert Policy</title>
        <p>For this approach, different kinds of text representations and text classifiers were trained to
solve the CPI task at hand. The performance was evaluated using the F1 score for the positive
class, and the best models were chosen. Finally, to tackle the DMC task, that is, to decide
when to send an alarm for a user at risk, a simple policy was proposed that checks the current
user context information. This policy takes different parameters that allow it to control the
earliness of the decision. The optimal policy parameters were selected considering the latency
score.</p>
        <p>
          Among the document representations that were evaluated are bag of words (BoW) [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ],
linguistic inquiry and word count (LIWC) [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], doc2vec [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], latent Dirichlet allocation (LDA) [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ],
and latent semantic analysis (LSA) [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. These were implemented using the Python packages
scikit-learn [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] and gensim [15], except for LIWC which has its own implementation. For every
one of them, a large set of parameters was explored.
        </p>
        <p>
          Regarding the models used, decision trees, k-nearest neighbors, support vector machines (SVM),
logistic regressions, multi-layer perceptrons, random forests [16], recurrent neural networks with
long short-term memory (LSTM) cells [17], and bidirectional encoder representations from
transformers (BERT) [18] were chosen to classify with partial information. The Python packages
scikit-learn [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], Transformers [19], and PyTorch [20] were used to implement these models.
Similar to what was done with the representations, every model was trained using a large range
of parameters.
        </p>
        <p>Each valid combination of representation and model was compared to obtain the best
combinations for solving the CPI task. To determine the performance of each model, the F1 score for
the positive class was employed.</p>
        <p>Once the best combinations were selected, it was necessary to augment these with a
decision-making policy able to determine when to raise an alarm for a user at-risk. A decision tree was
proposed to tackle the DMC task. The tree evaluated the current user context information to
make a decision. In particular, the predicted class, the current delay in the classification, and
the predicted positive class probability (class associated with risk) were evaluated to decide
when to send an alarm for a user at risk. If a user was predicted as positive, the probability
of belonging to the positive class was greater than a given threshold, and more than a minimum
number of posts had been processed, then an alarm was issued. Thus, the decision tree has two
hyper-parameters: a positive class probability threshold and a minimum amount of processed posts. The
structure of the decision tree can be seen in Figure 2. To determine the value of the threshold and the
minimum amount of processed posts, different combinations were tested, choosing the ones with the
best performance for latency.</p>
        <p>Finally, the scores and decisions outputted by this model were obtained by reporting the
probability of the positive class and the results of the decision tree, respectively. If the result of
the decision tree for a given input was “Keep reading”, the decision was 0; on the other hand, if
the result was “Issue alarm”, the decision was 1. Henceforth, this kind of model will be referred
to as “EarlyModel”.</p>
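The rule-based policy above can be sketched as follows. The class, method, and parameter names are illustrative; the two hyper-parameters correspond to the probability threshold and minimum number of processed posts described in the text:

```python
from dataclasses import dataclass

@dataclass
class RuleBasedAlertPolicy:
    """Simple early-alert rule of the EarlyModel (sketch)."""
    prob_threshold: float   # minimum positive-class probability
    min_posts: int          # minimum number of processed posts

    def decide(self, predicted_class: int, positive_prob: float,
               num_posts: int) -> int:
        # Issue an alarm (1) only when all three conditions hold;
        # otherwise keep reading (0).
        if (predicted_class == 1
                and positive_prob > self.prob_threshold
                and num_posts > self.min_posts):
            return 1
        return 0
```

With, say, `RuleBasedAlertPolicy(prob_threshold=0.7, min_posts=10)` (hypothetical values), a user predicted positive with probability 0.9 after 15 posts triggers an alarm, while the same prediction after 5 posts does not.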
      </sec>
      <sec id="sec-2-5">
        <title>2.3.2. End-to-end Deep Learning ERD Framework</title>
        <p>This method was an adaptation of the model proposed by Hartvigsen et al. [21] for time series
to early risk detection with text. In their paper, Hartvigsen and collaborators proposed a model
to tackle the problem of early classification of time series called Early and Adaptive Recurrent
Label ESTimator, or EARLIEST for short. The model is composed of a recurrent neural network
that captures the current state of the input, a neural network that tackles the CPI task, called the
discriminator, and a stochastic policy network responsible for the DMC task, called the controller.
During classification, the recurrent model generates step-by-step time series representations,
capturing complex temporal dependencies. The controller interprets these in sequence, learning
to parameterize a distribution from which decisions are sampled at each time step, choosing
whether to stop and predict a label or wait and request more data. Once the controller decides
to halt, the discriminator interprets the sequential representation to classify the time series. By
rewarding the controller based on the success of the discriminator and tuning the penalization
of the controller for late predictions, the controller learns a halting policy that guides the
online halting-point selection. This results in a learned balance between earliness and accuracy
depending on how much the controller is penalized. The size of the penalty is controlled by a parameter
chosen by a decision-maker according to the requirements of the task [21]. It is important to
emphasize that this is an end-to-end learning model optimizing accuracy and earliness at the
same time.</p>
        <p>Since this model was originally proposed for the early classification of time series, it was
necessary to adapt it to the early risk detection problem using text.</p>
        <p>
          First, instead of processing raw input, as it was the case with time series data, the input text
was represented using doc2vec [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] trained on the corpus. This representation allowed the
model to efficiently process the input, since users’ posts were the input unit. Note that if,
rather than doc2vec, the word2vec [22] representation had been used, the model would
have had to process every token of the input, deciding at each step whether to halt. But since the eRisk
tasks required a decision for every post, all the decisions taken for every token except the
last would have been discarded, wasting computation.
        </p>
        <p>Second, a recurrent neural network with Long Short-Term Memory (LSTM) cells was used
as the state of the model since it allowed it to preserve information over longer sequences in
comparison to other recurrent neural networks architectures [17]. In the model proposed by
Hartvigsen and collaborators, the discriminator consisted of adding a fully connected layer after
the recurrent neural network to allow it to make predictions for the input. Since the early risk
detection problems tackled in this work had two classes, that is, user at risk or user not at risk,
the classification task can be seen as a one-class problem with a single output (the probability of being
at risk), or as a multi-class classification with two classes. In this work, both approaches were
tested; the loss function for the discriminator depended on the approach used.</p>
        <p>Ultimately, the model was modified to raise an alarm only if the class predicted by the
discriminator was positive and the controller indicated to stop reading. It should be noted that
the model processed the whole input in order to output the scores needed by the challenge.</p>
        <p>Neither the original implementation of the EARLIEST model by the authors nor our
implementation considered the input as a stream of data. This was a considerable drawback since,
for every new post, the whole history of posts of that user needed to be processed again by the
recurrent neural network to update the hidden state. Nevertheless, each post representation was
calculated only once. Thus, to improve the time performance of the model, the sequence length
for the input to the recurrent neural network was restricted to 200 posts. If a new post arrived
and the input representation was full, the oldest post was removed from the representation,
making room for the new one. Note that it is possible to implement EARLIEST for stream data,
but time constraints prevented us from doing so.</p>
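The sliding-window workaround described above can be sketched as follows; `PostWindow` and `toy_embed` are hypothetical names, and the stand-in embedding replaces an actual doc2vec inference step:

```python
from collections import deque

class PostWindow:
    """Keeps at most `maxlen` cached post vectors; each post is embedded once.
    (Illustrative sketch: `embed` stands in for a doc2vec inference step.)"""
    def __init__(self, embed, maxlen=200):
        self.embed = embed
        self.window = deque(maxlen=maxlen)  # oldest vector dropped automatically

    def add_post(self, post):
        self.window.append(self.embed(post))  # computed only once per post
        return list(self.window)              # sequence fed to the LSTM

toy_embed = lambda text: [float(len(text))]  # made-up stand-in embedding
w = PostWindow(toy_embed, maxlen=3)
for post in ["a", "bb", "ccc", "dddd"]:
    seq = w.add_post(post)
assert seq == [[2.0], [3.0], [4.0]]  # oldest post ("a") was evicted
```

Caching the per-post vectors is what keeps the cost per new post constant even though the LSTM itself still re-reads the whole window.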
        <p>To get the final scores and decisions for this model, the probability of the positive class given
by the discriminator and the decision made by the controller were used, respectively. In the
end, for the challenge, different values for the λ parameter were tested to control the earliness
of the model. The parameters that yielded the best latency score were selected.</p>
      </sec>
      <sec id="sec-2-6">
        <title>2.3.3. The SS3 Text Classifier with a User-Global Early Alert Policy</title>
        <p>
          The SS3 text classifier [
          <xref ref-type="bibr" rid="ref8">8</xref>
] is a simple and interpretable classification model specially created to
tackle early risk detection scenarios integrally. First, during the training phase, the model builds a dictionary of words for
each category, in which the frequency of each word is stored. Then,
during the classification stage, using those word frequencies, it calculates a value for each
word using a special function, called gv(w, c), to value words in relation to categories. This
function has three hyper-parameters, σ, λ, and ρ, that allow controlling different “subjective”
aspects of how words are valued. More precisely, the equations to compute gv(w, c) were
designed to value words in an interpretable manner since, given the sensitive nature of risk
detection problems, transparency and interpretability were two of the key design goals for
this model. To achieve this, the authors first defined what constituted interpretability by
considering how people explain to each other the reasoning processes behind a typical
text classification task,4 and then the gv function was designed to value words by trying to
mimic that behavior, i.e. having gv value words “the way people would naturally do it”. For
instance, suppose that the target classes are food, health, and sports; then, after training, SS3
would learn to assign values like:
        </p>
        <p>gv(“sushi”, food) = 0.85; gv(“sushi”, health) = 0.70; gv(“sushi”, sports) = 0.02;</p>
        <p>gv(“the”, food) = 0; gv(“the”, health) = 0; gv(“the”, sports) = 0;</p>
        <p>gv(“all”, food) = 0; gv(“all”, health) = 0; gv(“all”, sports) = 0;
The classification process is carried out by combining, sequentially, the gv(w, c) values of
all words as they are processed from the input stream. The authors originally proposed a
hierarchical process to perform this through different operators, called “summary operators”,
that combine and reduce these gv values at different levels, such as words, sentences, or
paragraphs.</p>
        <p>
          In the present work, and driven by previously published competitive results [
          <xref ref-type="bibr" rid="ref8">8, 23, 24</xref>
          ], it
was decided to simply use a summation of all seen gv values to perform the classification of
users. More precisely, given the positive (i.e. “at-risk”) and negative classes of the addressed ERD
tasks, a score value was calculated for each user u, where WH_u denotes u’s writing history,
as follows:
        </p>
        <p>score_u = ∑_{w ∈ WH_u} gv(w, positive) − gv(w, negative).   (1)</p>
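This scoring scheme can be sketched in a few lines; the gv values below are made up for illustration, not learned ones:

```python
# Toy gv values for illustration only (not learned from data).
gv = {
    ("cut", "positive"): 0.8, ("cut", "negative"): 0.1,
    ("movie", "positive"): 0.0, ("movie", "negative"): 0.3,
}

def score(writing_history):
    # Equation (1): sum of gv(w, positive) - gv(w, negative)
    # over all words seen so far in the user's writing history.
    return sum(gv.get((w, "positive"), 0.0) - gv.get((w, "negative"), 0.0)
               for w in writing_history)

assert abs(score(["cut", "movie"]) - 0.4) < 1e-12  # (0.8-0.1) + (0.0-0.3)
```

Because the score is a plain sum, it can be updated incrementally as each new word arrives, which is what makes the model suitable for processing a stream.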
        <p>Finally, for each user u, a classification decision is made simply by using its score_u since it
represents the overall estimated risk level of the user, given by the model. For instance, in the
eRisk 2019 challenge, the best ERDE values, as well as the best ranking-based results, were
obtained for the two early risk detection tasks using a simple policy that classified each user
as soon as its score became positive, i.e. when the model’s positive confidence surpassed the
negative one [23]. However, in the present work, we opted to use a user-global early alert
policy. That is, the policy used to raise an alarm for a particular user takes into account its
score value, globally, in regard to the current scores of all the other users. More precisely, let
scores = {score_u | u ∈ Users} be the set of all current scores; then a decision was made for each user
u, where MAD stands for Median Absolute Deviation, as follows:
decision_u = 1, if score_u &gt; median(scores) + γ · MAD(scores); 0, otherwise.   (2)
4For instance, for text classification, people would normally direct their attention only to certain “keywords”
(filtering out all the rest) and explain why these words were important in their reasoning process.</p>
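A minimal sketch of this policy using only the standard library (the `gamma` argument plays the role of the policy hyper-parameter; user names and scores are made up):

```python
import statistics

def mad(xs):
    # Median Absolute Deviation: median of absolute deviations from the median.
    m = statistics.median(xs)
    return statistics.median(abs(x - m) for x in xs)

def decisions(scores, gamma):
    # Equation (2): flag a user as soon as its score exceeds
    # median(scores) + gamma * MAD(scores).
    values = list(scores.values())
    threshold = statistics.median(values) + gamma * mad(values)
    return {u: int(s > threshold) for u, s in scores.items()}

scores = {"u1": 0.1, "u2": 0.2, "u3": 0.15, "u4": 2.0}
assert decisions(scores, gamma=2.0) == {"u1": 0, "u2": 0, "u3": 0, "u4": 1}
```

Note that the threshold is recomputed from the whole population at every step, which is precisely what makes the policy "user-global".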
        <p>Thus, a user u was classified as “at-risk” as soon as its decision became 1. This policy is
based on three metrics: the median, which is a robust measure of central tendency; the MAD, a
robust measure of statistical dispersion; and the score, which represents the estimated risk level.
Hence, the interval median(scores) ± γ · MAD(scores) represents a “region of doubt” containing
all users for which the model is not fully sure whether they are at risk or not, i.e. whether
the estimated risk level is “high enough or low enough”. We designed this policy driven by
the goal of optimizing the performance of the model in terms of the F1 measure. Note that
γ ∈ R is a hyper-parameter that allows controlling how far from the median the user’s current
score must move before being considered at-risk. Thus, the greater the γ, the lower the recall
and the higher the precision our model should have, since only those users whose score is high
enough will be considered. Conversely, the lower the γ, the higher the recall and the lower
the precision. Therefore, this policy should allow maximizing the performance of the model in
terms of the F1 measure, since, at least a priori, there always exists an intermediate γ value that
allows obtaining an optimal balance between recall and precision.</p>
        <p>
We used this policy for the first time in the self-harm detection task of the eRisk 2020,
obtaining competitive results in terms of the F1-related measures. For instance, we obtained the
second-best Flatency (0.609) training the SS3 model only with the small training set provided
by the eRisk organizers [
          <xref ref-type="bibr" rid="ref6">6</xref>
] The best value (0.658) was obtained by the iLab team using a
BERT-based model trained with a large dataset created manually by that research
team [25]. We later downloaded that dataset and trained the SS3 model again, greatly improving
on the previously obtained results in this task, for instance, obtaining an
Flatency value of 0.711. To achieve this, the same procedure described by the iLab team in their
eRisk paper was carried out [25]. That is, the training set provided for the task was used as a
validation set to perform hyperparameter optimization, from which the values 0.32, 0.45, and
0 were selected as the best configuration for the three gv hyper-parameters. Therefore, as will be described in
more detail in Section 4, in the present eRisk edition, we participated in the self-harm detection
task again, this time using this SS3 model trained with the iLab dataset, which achieved the
best results in terms of the F1-related measures.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Task T1: Early Detection of Signs of Pathological Gambling</title>
      <p>In this section, the details of our participation addressing the eRisk’s early detection of
pathological gambling task are given. Namely, the details of the datasets and the five models submitted
to this challenge are introduced. Finally, the results obtained after the evaluation stage are
shown.</p>
      <sec id="sec-3-1">
        <title>3.1. Datasets</title>
        <p>As already stated, for this task, it was necessary to build a corpus in order to train models
for early detection of signs of pathological gambling. The steps described in Section 2.1 were
followed to build a corpus using data from Reddit. The final corpus was split into a training and
a validation set, each containing half of the users. Table 1 shows the details of each generated
corpus compared to the test dataset provided for this task. In this table, “T1_test” refers to
the test set used to evaluate all participating models, while “T1_train” and “T1_valid” refer to
the generated corpora using Reddit. Note that the corpus provided during the challenge was
much bigger in terms of the number of users, number of posts, and number of posts per user,
compared to the generated ones. On the other hand, T1_test had a lower number of words per
post compared to T1_train and T1_valid. T1_test also contained posts with no words in them,
that is, empty posts. This could be caused by a user who edited her/his submission after posting,
deleting its content.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Models</title>
        <p>This section describes the details of the models used by our team to tackle this task. Namely,
from the results obtained after the model selection and the hyperparameter optimization stage,
described in Section 2.3, the following five models were selected for participating:
UNSL#0. An EarlyModel with a bag of words (BoW) representation and a logistic
regression classifier. Word unigrams were used for the BoW representation with term
frequency times inverse document frequency (commonly known as tf-idf ) as the
weighting scheme. For the logistic regression, a balanced weighting for the classes was used,
that is, each input was weighted inversely proportional to its class frequency in the
input data. Finally, for the decision-making policy, a threshold of 0.7 and a minimum
number of posts of 10 were used.
UNSL#1. An EarlyModel with a doc2vec representation and a logistic regression classifier.
Each submission was represented as a vector of dimension 100. The representation was
learned using the generated training corpus, T1_train. For the logistic regression, both
classes were weighted the same. Finally, for the decision-making policy, a threshold of
0.85 and a minimum number of posts of 3 were used.</p>
        <p>UNSL#2. An EarlyModel with a BoW representation and an SVM classifier. For the
BoW representation, character 4-grams were used with tf-idf as the weighting scheme.
The support vector machine was parameterized with a radial basis function kernel with
γ = 0.125 and regularization parameter C = 512, with class weights inversely proportional
to the class frequencies in the input data. Finally, for the decision-making policy, a threshold
of 0.75 and a minimum number of posts of 10 were used.</p>
        <p>UNSL#3. An EARLIEST model with a doc2vec representation for user posts. The base
recurrent neural network chosen was a one-layer LSTM with an input feature dimension
of 200 and 256 hidden units. The discriminator of the EARLIEST model reduced the
hidden state of the LSTM to one dimension, representing the positive class probability.
Finally, the value used to train was λ = 0.000001.</p>
        <p>UNSL#4. The same model as UNSL#3 but with the discriminator reducing the hidden
state of the LSTM to two dimensions, representing the probabilities of both the positive
and negative classes. In this case, the value used to train was λ = 0.00001.</p>
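The rule-based early alert policy shared by the three EarlyModels above (UNSL#0, #1, and #2) can be sketched as a single predicate; the function below is our own illustration, using UNSL#0's threshold (0.7) and minimum number of posts (10) as defaults:

```python
def early_alert(prob_positive, n_posts_read, threshold=0.7, min_posts=10):
    # EarlyModel-style rule: alert only when the classifier is confident
    # enough AND a minimum number of posts has been read.
    # (Defaults mirror UNSL#0's configuration for Task T1.)
    return prob_positive > threshold and n_posts_read >= min_posts

assert early_alert(0.9, 12) is True
assert early_alert(0.9, 3) is False   # confident, but too few posts read
assert early_alert(0.5, 12) is False  # enough posts, but not confident
```

The minimum-posts condition is what causes the ERDE5 penalty discussed in the results below: no alarm can be raised before that many posts, regardless of the classifier's confidence.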
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Results</title>
        <p>Early classification decision-based performance: Table 2 shows the results obtained
for the decision-based performance metrics. As can be observed, our team achieved the
best and second-best performance in terms of the F1, ERDE50, and Flatency measures with
two EarlyModels (UNSL#0 and #2). Moreover, in terms of the Flatency, the value obtained
with UNSL#2 (0.693) was roughly twice as good as EFE#2’s (0.342), the model with the
best value among the other teams’ models. However, regarding the ERDE5 measure, the
obtained results were close to the average. In the case of the two EARLIEST models, this
was due to a poor classification performance, whereas in the case of the EarlyModels, it was
due to having to read at least 3 posts before being able to make a decision, since the
selected minimum numbers of posts were 3 and 10. Among our three EarlyModels (UNSL#0,
#1, and #2), the model that performed the worst was UNSL#1, a logistic regression with a
doc2vec representation, which had a performance approximately equal to the average.
Interestingly, UNSL#0 performed roughly twice as well as UNSL#1, despite using the same
classifier and a simpler representation, namely, a standard BoW representation. In
fact, this model was only outperformed by UNSL#2, an SVM using a character 4-gram
BoW representation, which, as mentioned above, obtained the best values among all
26 participating models. Regarding the two EARLIEST models (UNSL#3 and #4), the
obtained performance was below the average, and therefore, they performed the worst
among our five models. This was mostly due to the EARLIEST models blindly classifying
the vast majority of the users as at-risk, leading to exceptionally low precision values.
Finally, mean and median values indicate that, overall, this task was hard to deal with.
In particular, the low precision and high recall of all participating models suggest that
models had trouble accurately detecting true-positive cases since the vast majority of the
detected users were false-positive cases.</p>
        <p>Performance in terms of execution time: Table 3 shows, for each team, details on the
total time taken to complete the task. As can be seen, the time taken to complete the
task differs from team to team, varying from a few hours to a large number of hours. However,
to have a more precise view of how efficient the models of each team were, not only the
total time taken to complete the task must be considered, but also the total number of
posts processed in that time and the number of models used to carry it out. For example,
in terms of processing speed, CeDRI does not seem as efficient as UNSL, since although
the former completed the task in roughly 1 day, it only processed the first 271 posts from
each user using only 2 models, while the latter, although completing the task in roughly 5
days, processed all 2000 posts from each user using 5 models.5 For this reason, this table
also includes, as a guide, an estimate of the time taken by each team’s model to process
each post, which was obtained by normalizing the total time relative to the number of
models used and the total number of posts processed. It can be observed that our team
did not achieve the best performance in terms of execution time, processing each post
in 43 seconds, whereas the fastest team (UPV-Symanto) did it in 16 seconds. To gain
better insight into a possible cause for having taken 5 days to complete the task, as shown
in Figure 3a, information stored in our logs was used to disaggregate this total time into
five different stages: pre-processing, input features computation, classifier prediction,
server timeouts, and network delay. It can be seen that, as will be discussed in more detail
in Section 5, the two stages taking most of the time are the computation of the feature
vector and network time: roughly 39% of the total time is spent computing the feature
vector, and 57% in network communication delays. Therefore, as will be discussed in
more detail in Section 6, the optimization of the feature vector stage will be taken into
account for future work.</p>
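This normalization can be sketched as follows (the function name is ours; the example numbers follow the 5 models, 2000 posts per user, and roughly 5 days mentioned above, which yields approximately the reported 43 seconds per post):

```python
def seconds_per_post(total_hours, n_models, posts_per_model):
    # Rough per-post time estimate: normalize the total elapsed time by the
    # number of models and the number of posts each of them processed.
    total_seconds = total_hours * 3600
    return total_seconds / (n_models * posts_per_model)

# Approximate figures for UNSL in Task T1: 5 models, 2000 posts each,
# roughly 5 days in total.
est = seconds_per_post(total_hours=5 * 24, n_models=5, posts_per_model=2000)
assert abs(est - 43.2) < 1e-9  # close to the 43 seconds reported in Table 3
```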
        <p>Ranking-based performance: Table 4 shows the results obtained for the ranking-based
performance metrics. In addition, plots of the four complete rankings created, respectively,
by each model after processing 1, 100, 500, and 1000 posts, are shown in Figure 4. As
can be seen, our team achieved the best performance in terms of the three metrics
(P@10, NDCG@10, and NDCG@100) along the four rankings used for the evaluation.
Moreover, the values obtained with two of the EarlyModels (UNSL#0 and #2) were the best
possible ones (i.e. 1) for the three metrics and the four rankings, except for NDCG@100
in the ranking obtained after reading only 1 post. As with the decision-based results,
the logistic regression with the doc2vec representation (UNSL#1) obtained the lowest
values among the three EarlyModels (UNSL#0, #1, and #2). Regarding the two EARLIEST
models (UNSL#3 and #4), their performance was also the lowest among our five models.
However, unlike the decision-based results, UNSL#3 performed considerably better than
UNSL#4. We will leave for future work the study of why the explicit inclusion of the negative
class probability in the discriminator impaired UNSL#4’s ability to estimate users’ risk.
Finally, the obtained results show that the two EarlyModels using standard tf-idf -weighted
BoW representations, despite their relative simplicity, were capable of estimating the
risk level of the users with considerable effectiveness, even when only a few posts were
processed.
5Note that, for each of the user’s 2000 posts, not only was it necessary to send a request to the server to
obtain the post, but also 5 more requests to send the response of each model. Therefore, UNSL needed a total
of 2000 + 2000 * 5 = 12000 requests to the server to complete the task.</p>
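For reference, the ranking metrics used here can be sketched in a few lines with binary relevance (1 = user truly at risk, listed in ranked order); this is a generic textbook formulation, not necessarily the exact variant used by the eRisk organizers:

```python
import math

def precision_at_k(relevance, k=10):
    # Fraction of the top-k ranked users that are truly at risk.
    top = relevance[:k]
    return sum(top) / len(top)

def ndcg_at_k(relevance, k=10):
    # Normalized Discounted Cumulative Gain with binary relevance:
    # hits near the top of the ranking are worth more.
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(relevance[:k]))
    ideal = sorted(relevance, reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Perfect ranking (all at-risk users on top) scores 1.0 on both metrics.
assert precision_at_k([1, 1, 1, 0, 0], k=3) == 1.0
assert ndcg_at_k([1, 1, 0, 1, 0], k=5) < 1.0  # one at-risk user misplaced
```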
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Task T2: Early Detection of Signs of Self-Harm</title>
      <p>In this section, the details of our participation addressing the eRisk’s early detection of self-harm
task are given. Namely, the details of the datasets and the five models submitted to this challenge
are introduced. finally, the results obtained after the evaluation stage are shown.</p>
      <sec id="sec-4-1">
        <title>4.1. Datasets</title>
        <p>For this task, and unlike Task T1, the eRisk’s organizers did provide the datasets to train and
validate the participating models. Each dataset was made available as a set of XML files, one
for each user. However, to improve the performance of our models, the steps described in
Section 2.1 were followed to build a complementary corpus using data from Reddit. Then, the
corpus was split into a training and a validation set, each containing half of the users.</p>
        <p>Finally,
these complementary datasets were combined with the ones provided for this challenge. These
extended training and validation sets were then used to train and tune the EarlyModels and
the EARLIEST models. On the other hand, as explained at the end of Section 2.3.3, one of
the corpora created by the iLab research team [25] for the eRisk 2020’s edition of this task
was used to train the SS3 models —namely, the dataset called “users-submissions-200k”.6 The
datasets created by the iLab were also created collecting data from Reddit, but obtained from
the Pushshift Reddit Dataset [28] through its public API.7</p>
        <p>Table 5 shows the details of each complementary corpus along with the training, validation,
and test datasets provided for this task. In this table, “T2_test” refers to the test set used to
evaluate all participating models, “T2_train” and “T2_valid” to the training and validation sets
provided by the organizers, “redd_train” and “redd_valid” to the training and validation sets
built using Reddit, “comb_train” and “comb_valid” to the combined datasets, and “ilab_train”
to the iLab’s corpus. Note that the corpus used to evaluate the participating models had four
times more users and posts than the corpus provided for training, but a similar number of
posts per user and words per post. On the other hand, comb_train and comb_valid had almost
the same number of users as T2_test but a much lower number of total posts and posts per
user. Also, ilab_train had a considerably greater number of users than the rest of the datasets,
but fewer posts per user. As with T1_test, T2_test contained posts with no words in them,
i.e. empty posts. This could be caused by a user who edited her/his submission after posting,
deleting its content.</p>
        <p>6The iLab’s datasets can be downloaded from https://github.com/brunneis/ilab-erisk-2020. 7https://pushshift.io/api-parameters/</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Models</title>
        <p>This section describes the details of the models used by our team to tackle this task. Namely,
from the results obtained after the model selection and the hyperparameter optimization stage,
described in Section 2.3, the following five models were selected for participating:
UNSL#0. An EarlyModel with a doc2vec representation and a multi-layer perceptron
optimized using Adam. Each post was represented as a 200-dimensional vector. To learn
this representation, the combined training corpus, comb_train, was used. The multi-layer
perceptron consisted of one hidden layer with 100 units and ReLU as the activation
function. Finally, for the early detection policy, a threshold of 0.7 and a minimum
number of posts of 10 were used.</p>
        <p>UNSL#1. An EARLIEST model with a doc2vec representation. Each post was represented
as a 200-dimensional vector. To learn this representation, the combined training corpus,
comb_train, was used. The base recurrent neural network chosen was a one-layer LSTM
with an input feature dimension of 200 and 256 hidden units. The discriminator of the
EARLIEST model reduced the hidden state of the LSTM to one dimension, representing
the positive class probability. Finally, the value used to train was λ = 0.000001.
UNSL#2. The same model as UNSL#1 but with the discriminator reducing the hidden
state of the LSTM to two dimensions, representing the probabilities of both the positive
and negative classes. In this case, the value used to train was λ = 0.00001.
UNSL#3. An SS3 model8 with a policy value of γ = 2 trained using the iLab corpus,
ilab_train. As mentioned in Section 2.3.3, to select the γ value, the eRisk 2020’s training
set for this task was used as the validation set; the value γ = 2 achieved an optimal
balance between recall and precision, maximizing the F1 value.</p>
        <p>UNSL#4. The same model as UNSL#3 but with a policy value of γ = 2.5. Given that this
γ value is greater than the previous one, this model was meant to have a higher precision
than UNSL#3, since the user’s current score must be 2.5 MADs greater than the median
score to be considered at-risk.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Results</title>
        <p>The main results obtained with our five models are described below, grouped according to the
type of metric used to measure performance:</p>
        <p>Early classification decision-based performance: Table 6 shows the results obtained
for the decision-based performance metrics. As can be observed, our team achieved the
best performance in terms of the F1 and Flatency measures and the second-best ERDE5
with one of the SS3 models (UNSL#4). Moreover, we also obtained the best performance
in terms of the ERDE50 measure with the EarlyModel (UNSL#0). Regarding our
five models, as in Task T1, here the EarlyModel performed better than the two EARLIEST
models (UNSL#1 and #2), which again obtained a performance roughly below the average,
classifying the vast majority of the users as at-risk and thus obtaining exceptionally low
precision values. In addition, the two SS3 models (UNSL#3 and #4) achieved a better
balance between recall and precision than the EarlyModel and the two EARLIEST models,
as evidenced by better F1 values. As expected, UNSL#4 achieved a higher precision than
UNSL#3 by using γ = 2.5 and, unlike the results obtained with the validation set, the former
also achieved a better balance than the latter. This suggests that models had a harder time
distinguishing between true and false positive cases in the test set used to evaluate them
when compared to the validation set, i.e. compared with the test set used in last year’s
edition of this task. Note that this aspect is also suggested by the low average precision
and high average recall values (mean and median). Therefore, a model with a greater γ would
probably have obtained better F1 values since it would have given more importance to
precision over recall, i.e. it would have been more cautious when detecting true positive
cases. Finally, overall results suggest this task was hard to tackle since, as mentioned
above, all participating models had trouble distinguishing between true and false positive
cases.
8SS3 models were coded in Python using the “PySS3” package [29] (https://github.com/sergioburdisso/pyss3).</p>
        <p>Performance in terms of execution time: Table 7 shows details on the total time taken
to complete the task for each team. As can be seen, our team, although not being the
fastest, was among the few teams that processed each post in a matter of seconds, taking
32 seconds per post. As shown in Figure 3b, for this task we also used the information
stored in the execution logs to disaggregate the total time into five different stages. As
in Task T1, the two stages taking most of the time were again the computation of feature
vectors and network time: roughly 36% of the total time was spent computing the feature
vector, and 55% in network communication delays.</p>
        <p>Ranking-based performance: Table 8 shows the results obtained for the ranking-based
performance metrics. In addition, plots of the four complete rankings created, respectively,
by each model after processing 1, 100, 500, and 1000 posts, are shown in Figure 5. As
can be seen, the results obtained in this task were not as competitive as those obtained in
the first task. Nevertheless, some of our models achieved the best performance in terms
of the NDCG@100, P@10, and NDCG@10 measures. For instance, the EarlyModel (UNSL#0)
obtained the best NDCG@100 in the four rankings, whereas the SS3 models (UNSL#3
and #4) obtained some of the best P@10 and NDCG@10 values. Regarding the two
EARLIEST models, the variant that explicitly incorporates the probability of the negative
class in the discriminator, UNSL#2, performed poorly, as in Task T1. However, the other
EARLIEST variant, UNSL#1, performed slightly better than the EarlyModel (UNSL#0) in
terms of most of the P@10 and NDCG@10 metrics, even obtaining the best values in
the last ranking. Nevertheless, as shown in Figure 5, the EARLIEST models were the least
effective considering the entire user ranking and not just the top-10 and top-100 users
used to calculate the reported metrics. Note that, in the plots for UNSL#1 and UNSL#2,
the users at risk (dark blue lines) are scattered throughout the entire ranking, consistently,
across all four rankings. Instead, the other three models tend to accumulate those users
on the right end, i.e. tend to accurately move users at risk towards the highest positions
in the ranking. Among our five models, the EarlyModel (UNSL#0) performed the best
in terms of NDCG@100, whereas SS3 did so in terms of P@10 and NDCG@10 (UNSL#3 and
#4). However, as shown in Figure 5, the rankings generated by the SS3 models (UNSL#3
and #4) seem to slightly lose quality as more posts are processed, as can be seen
in the transition from 100 posts to 500 posts. Note that, in the plots for UNSL#3 and
</p>
        <p>UNSL#4, the users at risk (dark blue lines) are slightly “more compressed” towards the
right end in subfigure (b) than in subfigures (c) and (d). This phenomenon is probably
due to the fact that the score calculated by SS3 is not a normalized value (see Equation
1), making it sensitive to the number of words processed for each user. As future work, we
believe that normalizing this score could help improve the overall performance of the
model, for instance, by dividing it by the total number of words processed for each
user. Finally, the obtained results show that the EarlyModel and the SS3 model could both be
competitive when it comes to estimating the risk level of the users, even when only a few
posts were processed.</p>
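The proposed normalization is straightforward to sketch (the function name is ours, and the numbers are illustrative):

```python
def normalized_score(raw_score, n_words):
    # Proposed normalization: divide the accumulated score by the number of
    # words processed, so users with long writing histories become
    # comparable to users with short ones.
    return raw_score / n_words if n_words else 0.0

# Same average risk per word, very different history lengths:
assert abs(normalized_score(5.0, 100) - normalized_score(50.0, 1000)) < 1e-12
```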
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <p>In this section, the three approaches are compared in terms of a range of
different aspects, such as performance, simplicity, and adaptability. More precisely, Table 9
shows an overview of the comparison, containing all the key aspects.</p>
      <p>Figure 6 panels: (a) EarlyModel; (b) EARLIEST; (c) SS3.</p>
      <p>Execution Time Performance. Among the three approaches, the EarlyModel is the
least efficient in terms of execution time, since its cost heavily depends on the
representation used and the method to compute its feature vectors. If the representation being used
allows computing and updating the feature vector incrementally, as posts are processed,
then the input stream will be processed efficiently. Otherwise, for every new post
available, the whole history of posts will be processed again in order to compute the updated
feature vector. In the case of EARLIEST, the representation is modelled sequentially, only
needing to be calculated for each individual post as it becomes available. On the other
hand, SS3 does not even need a feature representation since it processes the raw input,
sequentially, word by word. These differences affected the time taken for each approach
to address each task; for instance, Figure 6 shows the time taken for each approach to
build the feature vectors in Task T2. Although only one model is shown for each approach,
all the other variations of the same model showed the same pattern of elapsed times.
Note that, since SS3 does not need a feature representation, its elapsed time presented in
Figure 6c is always zero. By contrast, the EarlyModel consumes a lot of time since it has to
rebuild the doc2vec representation of the whole sequence of posts each time a new
post arrives. That is, the time complexity of the feature representation stage is tied to the
length and number of posts. Thus, as new posts come in, the amount of time invested in
the feature representation stage grows. This would imply that the graph of the
elapsed time for the EarlyModel increases monotonically. However, this is not the case
since, as time passed, fewer users kept posting, decreasing the total number of posts to
be processed. Finally, EARLIEST, shown in Figure 6b, required much less time to
build the feature vectors compared to the EarlyModel since only the current post needed to be
processed, i.e. the time complexity of the feature representation stage is not tied to the
number of posts processed.</p>
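The difference between an incrementally updatable representation and one that must be recomputed can be sketched as follows; `IncrementalBoW` is a hypothetical illustration based on raw term counts, not our actual tf-idf pipeline:

```python
from collections import Counter

class IncrementalBoW:
    """Sketch of a representation that can be updated per post: term counts
    are accumulated, so each new post costs O(len(post)) instead of
    reprocessing the whole history (as a doc2vec re-inference would)."""
    def __init__(self):
        self.counts = Counter()
        self.n_posts = 0

    def update(self, post_tokens):
        self.counts.update(post_tokens)  # touches only the new post
        self.n_posts += 1
        return self.counts

def full_recompute(history):
    # Equivalent result, but the cost grows with the whole history length.
    return Counter(tok for post in history for tok in post)

history = [["i", "feel", "fine"], ["i", "feel", "bad"]]
inc = IncrementalBoW()
for post in history:
    vec = inc.update(post)
assert vec == full_recompute(history)
```

Both paths produce the same counts; the incremental one simply avoids revisiting old posts, which is the property that separates the efficient and inefficient EarlyModel configurations discussed above.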
      <p>Decision-based Performance. Among the three approaches, EARLIEST was the least
effective in terms of decision-based performance since the obtained results, in both tasks,
were below the average among all participating models. On the other hand, the EarlyModel
and SS3 were the most effective in these terms. For instance, the EarlyModel approach
achieved the best and second-best performance in terms of the F1, ERDE50, and Flatency
measures in Task T1. Likewise, SS3 achieved the best performance in terms of the F1
and Flatency measures and the second-best ERDE5 in Task T2. Overall, the obtained results
showed that, despite their relative simplicity, the EarlyModel and SS3 were able to detect
users at risk with competitive effectiveness.</p>
      <p>Simplicity. The simplest among the three approaches is SS3 since it only consists of
a summation of word values (see Equation 1). On the other hand, the simplicity of
EarlyModel depends on the classifier and the representation being used. For instance, one
of the best-performing EarlyModel instances was a logistic regression classifier that used
a standard tf-idf-weighted BoW representation, which is much simpler than a recurrent
neural network with a word2vec representation. In contrast, EARLIEST is more complex
since its architecture consists of three neural models, namely, an LSTM, a feedforward
neural network for the controller, and another for the discriminator.</p>
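      <p>As a rough illustration of the summation referenced in Equation 1, an SS3-style score for the risk class can be sketched as the sum of per-word values learned during training. The word values below are invented for illustration only, not taken from the trained model.</p>

```python
# Hypothetical learned word values (illustrative only).
word_value = {"sad": 0.8, "alone": 0.6, "help": 0.4, "happy": 0.0}

def ss3_score(text):
    # Score = summation of the learned values of the words in the text;
    # unseen words contribute zero.
    return sum(word_value.get(w, 0.0) for w in text.lower().split())

print(round(ss3_score("I feel sad and alone"), 2))  # 1.4
```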
      <p>Interpretability. Among the three approaches, SS3 is the most interpretable since it was
designed to learn to value words in an interpretable manner. For instance, the values
learned for each word can be directly used to create visual explanations to present to
the system users, as illustrated in Figure 7. A live demo is provided at http://tworld.io/ss3,
where interested readers can try out the model; along with the classification result, the
demo provides an interactive visual explanation like the one illustrated here [last access
date: May 2021]. On the other hand, the interpretability level of the EarlyModel approach
depends on the classifier and the representation being used.</p>
      <p>For instance, using simple linear classifiers with standard BoW representations would
be more interpretable than using neural models with deep representations. Finally,
EARLIEST is the least interpretable model since, as mentioned above, its architecture
consists of three neural models that are not easily interpretable, and the decision policy
is not directly observable.</p>
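      <p>A minimal sketch of why a linear classifier over a BoW representation is interpretable: each word's contribution to the decision is simply its learned weight times its feature value, so a per-word explanation falls out directly. The weights below are invented for illustration.</p>

```python
# Invented weights of a hypothetical linear risk classifier over raw counts.
weights = {"hopeless": 1.2, "tired": 0.5, "great": -0.9}

def contributions(text):
    # Per-word contribution = weight * term count; unknown words contribute 0.
    words = text.lower().split()
    return {w: weights.get(w, 0.0) * words.count(w) for w in set(words)}

total = sum(contributions("so tired and hopeless").values())
print(round(total, 2))  # 1.7
```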
      <p>Policy Adaptability. Concerning the approaches presented, EarlyModel is the most
rigid of the three. EarlyModel has a simple rule-based early alert policy that complements
standard classification models to identify risky users. This policy is implemented using a
decision tree with three decision nodes, as shown in Figure 2. An alarm is issued for an
input only if the predicted class is positive, the probability of the positive class is greater
than a given threshold, and the number of posts read is greater than a given minimum.
This approach corresponds to a static policy since the hyper-parameters for the decision
nodes, i.e., the probability threshold and the minimum number of posts, are determined
in the training phase. In the testing phase, as new posts arrive, these hyper-parameters
cannot change; the decision boundary remains constant. On the other hand, SS3
implements a global early alert policy based on the estimated risk level of all processed
users. An alarm is issued for a user only if the model score surpasses a global boundary
that depends on the scores of all the users at that point in time. The decision boundary
is calculated using the median score over all users and the Median Absolute Deviation
(MAD) of the current scores at the current time: an alarm is issued for a user if the model
score for that user is greater than the median score over all users plus a constant factor
times the MAD of the scores. Note that, although the decision boundary depends on this
factor, it is not constant in time since it also depends on the scores of all the users.
Finally, the EARLIEST model simultaneously learns to predict risky users and the early
alert policy through a Reinforcement Learning approach. EARLIEST raises an alarm for a
user only if the class predicted by the discriminator is positive and the controller indicates
to stop reading. Thus, the controller is responsible for deciding the moment to issue an
alarm. The only hyper-parameter that needs to be set to control the decision policy is the
coefficient that penalizes the model based on its earliness: as its value grows, the loss for
late predictions grows, forcing the model to make decisions earlier. Here, the decision
policy is learned; therefore, the model can adapt dynamically to different problems or
different data distributions. This is what makes this model the most adaptable among
the three: the problem being tackled or the distribution of the data could change, but the
architecture does not need to change.</p>
      <p>
Storage per User. In a real-world scenario, early detection approaches may help to
identify at-risk users through large-scale passive monitoring of social media. However,
in such large-scale systems, these approaches must not only efficiently process user posts
as they are created but also be efficient in terms of the information the model needs to
make predictions. This information could be attached to each user in the system, for
instance, by storing it along with other user-related information. The EARLIEST approach
only needs the current post and the last hidden state produced by the LSTM to make a
prediction; therefore, storing only that last hidden state for a given user would be enough
to make a future prediction when the user writes a new post, as part of the passive
monitoring. Likewise, in the case of the SS3 approach, storing only the last computed
score for the user would be enough to make a future prediction; when the user creates
a new post, his or her last stored score is retrieved and updated using only the values
of the words in the new post. On the other hand, in the case of EarlyModel, the
information that needs to be stored depends on the classifier and the representation being
used. That is, if the feature vector of the representation being used can be computed
and updated sequentially, as new posts are created, all the information required to carry
out this update must be stored for each user. Otherwise, the complete sequence of posts
needs to be kept stored.</p>
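      <p>The static and global alert policies described above can be sketched as follows (a minimal sketch; the threshold names and values are illustrative assumptions, not the paper's exact hyper-parameters):</p>

```python
import statistics

def earlymodel_alarm(pred_positive, prob_positive, n_posts_read,
                     prob_threshold=0.7, min_posts=3):
    # Static EarlyModel-style rule: all three decision nodes must fire.
    return (pred_positive
            and prob_positive > prob_threshold
            and n_posts_read > min_posts)

def ss3_alarms(scores, factor=2.0):
    # Global SS3-style rule: alarm for users whose score exceeds the median
    # of all current scores plus a factor times their MAD.
    med = statistics.median(scores.values())
    mad = statistics.median(abs(s - med) for s in scores.values())
    return {user for user, s in scores.items() if s > med + factor * mad}

print(earlymodel_alarm(True, 0.9, 5))  # True
print(ss3_alarms({"u1": 0.1, "u2": 0.15, "u3": 0.2, "u4": 3.0}))  # {'u4'}
```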
      <p>Streaming Support. Among the models presented, EarlyModel was the only one not
able to support posts coming from streaming data. This is a drawback implicitly embedded
in the model since, each time a new post arrives, EarlyModel has to rebuild the entire
representation of the user's post history. The only way to alleviate this is through an
intermediate representation that supports streaming data. For example, for the
bag-of-words representation using tf-idf, the term frequency of every word for every user
could be stored and updated as new posts come in; the final representation for each user
could then be built by normalizing the frequencies. On the other hand, the SS3 and
EARLIEST models can handle streaming input naturally since they work with sequence data.</p>
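      <p>The streaming work-around suggested above for the tf-idf BoW representation can be sketched as follows (idf weighting omitted for brevity; names are illustrative):</p>

```python
from collections import Counter

user_counts = {}  # user id -> running term counts over all posts seen so far

def add_post(user, text):
    # Update the stored raw term counts as each new post arrives.
    user_counts.setdefault(user, Counter()).update(text.lower().split())

def term_frequencies(user):
    # Build the normalized representation only when a prediction is needed.
    counts = user_counts[user]
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}

add_post("u1", "feeling down today")
add_post("u1", "down again")
print(term_frequencies("u1")["down"])  # 0.4
```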
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions and Future Work</title>
      <p>This paper described three different early alert policies to tackle the early risk detection
problem. Furthermore, the three approaches were compared in terms of different
characteristics, such as performance, simplicity, and adaptability. As shown in Sections 3 and 4,
the models introduced in this work obtained the best performance for the F1, ERDE50,
and latency measures in both tasks. Also, for most of the ranking-based evaluation metrics,
these models achieved the best results.</p>
      <p>Nevertheless, further research should focus on:
• Reducing the time spent building the features of EarlyModel.
• Defining different ways of normalizing the scores for SS3.
• Stabilizing the learning phase of EARLIEST so its decisions are more robust.
• Determining the reasons why the explicit inclusion of the negative class in the discriminator
of EARLIEST impaired the model’s ability to estimate the risk level of users.</p>
      <p>Finally, this article has been one of the first attempts to thoroughly examine the role of the
alert policy in the early risk detection problem for the CLEF eRisk Lab. In general, other articles
have focused only on classification with partial information, relegating the decision of when
to classify. We consider this component of the problem almost as important as classification
with partial information, and we hope that more research groups start tackling it. We believe
EARLIEST could be a first step towards a model that learns both parts of the problem.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Acknowledgements</title>
      <p>This work was supported by the CONICET P-UE 22920160100037.</p>
    </sec>
  </body>
  <back>
  </back>
</article>