=Paper=
{{Paper
|id=Vol-2936/paper-81
|storemode=property
|title=UNSL at eRisk 2021: A Comparison of Three Early Alert Policies for Early Risk Detection
|pdfUrl=https://ceur-ws.org/Vol-2936/paper-81.pdf
|volume=Vol-2936
|authors=Juan Martín Loyola,Sergio Burdisso,Horacio Thompson,Leticia Cagnina,Marcelo Errecalde
|dblpUrl=https://dblp.org/rec/conf/clef/LoyolaBTCE21
}}
==UNSL at eRisk 2021: A Comparison of Three Early Alert Policies for Early Risk Detection==
UNSL at eRisk 2021: A Comparison of Three Early Alert Policies for Early Risk Detection

Juan Martín Loyola1,3, Sergio Burdisso1,2, Horacio Thompson1,2, Leticia Cagnina1,2 and Marcelo Errecalde1
1 Universidad Nacional de San Luis (UNSL), Ejército de Los Andes 950, San Luis, C.P. 5700, Argentina
2 Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET)
3 Instituto de Matemática Aplicada San Luis (IMASL), CONICET-UNSL, Av. Italia 1556, San Luis, C.P. 5700, Argentina

Abstract
Early risk detection (ERD) can be considered as a multi-objective problem in which the challenge is to find an adequate trade-off between two different and related aspects: 1) the accuracy in identifying risky users and 2) the minimum time that a risky user detection requires to be reliable. The first aspect is usually addressed as a typical classification problem and evaluated with standard classification metrics like precision, recall, and 𝐹1. The second one involves a policy to decide when the information from a user classified as risky is enough to raise an alarm/alert, and it is usually evaluated by penalizing the delay in making that decision. In fact, temporal evaluation metrics used in ERD like ERDE𝜃 and 𝐹latency combine both aspects in different ways. In that context, unlike our previous participations at eRisk Labs, this year we focus on the second aspect of ERD tasks, that is to say, the early alert policies that decide if a user classified as risky should effectively be reported as such.
In this paper, we describe three different early alert policies that our research group from the Universidad Nacional de San Luis (UNSL) used at the CLEF eRisk 2021 Lab. Those policies were evaluated on the two ERD tasks proposed for this year: early risk detection of pathological gambling and early risk detection of self-harm. The first approach uses standard classification models to identify risky users and a simple (manual) rule-based early alert policy. The second approach is a deep learning model trained end-to-end that simultaneously learns to identify risky users and the early alert policy through a Reinforcement Learning approach. Finally, the last approach consists of a simple and interpretable model that identifies risky users, integrated with a global early alert policy. That policy, based on the (global) estimated risk level for all processed users, decides which users should be reported as risky.
Regarding the achieved results, our models obtained the best performance in terms of decision-based performance metrics (𝐹1, ERDE50, 𝐹latency) as well as in terms of the ranking-based performance measures, for both tasks. Furthermore, in terms of the 𝐹latency measure, the performance obtained in the first task was twice as good as that of the second-best team.

Keywords: Early Risk Detection, Early Classification, End-to-end Early Classification, SS3

CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania
Code: https://github.com/jmloyola/unsl_erisk_2021
Contact: jmloyola@unsl.edu.ar (J. M. Loyola); sburdisso@unsl.edu.ar (S. Burdisso); hjthompson@unsl.edu.ar (H. Thompson); lcagnina@unsl.edu.ar (L. Cagnina); merreca@unsl.edu.ar (M. Errecalde)
ORCID: 0000-0002-9510-6785 (J. M. Loyola)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
1. Introduction

The early risk prediction on the Internet (eRisk) lab is concerned with the exploration of new models for early risk detection and evaluation methodologies with a direct impact on social and health-related issues [1]. The lab started in 2017 tackling the problem of early detection of depression in users from an online forum [2]. In 2018, the early detection of signs of anorexia was added as a new challenge for the lab, alongside an expanded version of the previous year's task [3]. The test data were organized into 10 chunks and provided to each team chunk by chunk. The participants' models were evaluated using the ERDE evaluation metric introduced by Losada et al. [1] to consider both the correctness of the classification and the delay taken by the system to make the decision. In 2019, the early detection of depression track was removed and two new challenges were presented: the early detection of signs of self-harm and measuring the severity of the signs of depression [4]. For that edition of the lab, new performance measures were considered. First, the performance measure 𝐹latency proposed by Sadeque et al. [5] was incorporated as a complementary measure to ERDE. In addition, ranking-based evaluation metrics were added to help professionals in real life make decisions. That year also marked the end of the chunk-based processing of the data. From that year on, a post-by-post approach was used for the challenges, resembling a real-life scenario where users write posts one at a time. In 2020, the early detection of signs of anorexia task was taken out but the other tasks were kept [6]. Finally, in 2021, the early detection of signs of pathological gambling task was introduced. Below is a brief description of the two tasks our research lab participated in.

Task 1: Early Detection of Signs of Pathological Gambling. For this task, the goal was to detect, as soon as possible, the users that were compulsive gamblers or that showed early traces of pathological gambling. The task's data consisted of a series of users' writings from social media collected in chronological order. No training data were provided, thus each team had to build a corpus to train their models.

Task 2: Early Detection of Signs of Self-Harm. For this task, the goal was the same as in the 2019 and 2020 eRisk editions, that is, to sequentially process pieces of evidence and detect early traces of self-harm as soon as possible. This year, the training data were the combination of the 2020 edition's training and testing data.

The performance on both tasks was assessed using standard classification measures (precision, recall, and 𝐹1 score), measures that penalize delay in the response (ERDE and 𝐹latency), and ranking-based evaluation metrics. The 𝐹1 and 𝐹latency scores were computed with respect to the positive class. To calculate these measures, for every post of every user, participating models were required to provide a decision, which signaled whether the user was at risk (indicated with a one) or not (indicated with a zero), and a score, which represented the user's level of risk (estimated from the evidence seen so far). Note that once a user was classified as being at risk, subsequent decisions for that user were not considered.
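To make the required output concrete, the following minimal sketch shows the kind of response a participating system has to produce for each round of posts. The field names, the `risk_probability` method, and the 0.5 threshold are illustrative placeholders only, not the lab's actual schema or our submitted code.

```python
from typing import Dict, List

def build_round_responses(round_posts: List[dict],
                          already_alerted: set,
                          classifier) -> List[Dict]:
    """For each user's new post, emit a (decision, score) pair.

    `round_posts` items are assumed to look like {"nick": str, "content": str};
    `classifier` is any model exposing a `risk_probability(nick, text)` method.
    Both are illustrative assumptions, not the eRisk server's actual interface.
    """
    responses = []
    for post in round_posts:
        nick = post["nick"]
        score = classifier.risk_probability(nick, post["content"])
        # Once a user has been flagged, later decisions are ignored by the
        # evaluation, so we simply keep sending 1 for already-alerted users.
        decision = 1 if (nick in already_alerted or score >= 0.5) else 0
        if decision == 1:
            already_alerted.add(nick)
        responses.append({"nick": nick, "decision": decision, "score": float(score)})
    return responses
```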
The present work describes the different approaches used by our research group to address Tasks 1 and 2 introduced above. Furthermore, it compares the models' behavior on both tasks and evaluates their performance. More precisely, the remainder of this paper is organized as follows. Section 2 provides a general introduction and overview of the corpus generation procedure, the data pre-processing steps used for classification, and the different models applied to the early risk detection tasks. Sections 3 and 4 describe the corpus, the parameters of the models, and their results for Task 1 and Task 2, respectively. Section 5 analyzes the results obtained in both tasks. Finally, Section 6 presents conclusions and discusses possible future work directions.

2. Approaches

Early risk detection (ERD) can be conceptualized as a multi-objective problem in which the challenge is to find an adequate trade-off between two different and related aspects. On the one hand, the accuracy in identifying risky users. On the other hand, the minimum time that a risky user detection requires to be reliable. The first aspect is usually addressed as a typical classification problem with two classes: risky and non-risky. That task is evaluated with standard classification metrics like precision, recall, and 𝐹1. The second one involves a policy to decide when the information from a user classified as risky is enough to raise an alarm/alert. That is, the decision-making policy returns yes (or true) to alert/confirm that the user is effectively at risk, or no (or false) otherwise. When this policy is evaluated, it is usually penalized according to the delay incurred in raising an alert/alarm for a risky user.

The aspects described above were explicitly modelled in an article presented by Loyola et al. [7], where an early classification framework was introduced. The focus of early classification is on the development of predictive models that determine the category of a document as soon as possible. This framework divides the task into two separate problems: classification with partial information and deciding the moment of classification. The task of classification with partial information (CPI) consists in obtaining an effective model that predicts the class of a document using only the information read up to a certain point in time. On the other hand, the task of deciding the moment of classification (DMC) involves determining the point at which the reading of the document can be stopped with some certainty that the prediction made is correct. Trying to decide when to stop reading a document using only the class the CPI model returns is difficult. For this reason, the data that the DMC model receives is augmented with contextual information, that is, data from the body of the document that could be helpful for deciding the moment of classification.

The interesting point about that early classification framework is that it can be used as a reference in ERD tasks. This is feasible by simply using the CPI component to identify risky users and replacing the early-stop reading policy implemented by DMC with an equivalent early alert policy for ERD. Thus, from now on we may refer to the component in charge of identifying risky users as CPI and the one in charge of implementing the early alert policy as DMC. An issue not considered in Loyola and collaborators' work [7], but observed during the eRisk challenge, is that multiple documents (users) are processed in parallel. Thus, the context information could also consider information from other documents being processed at the same time. For that reason, the original early classification framework was minimally modified to take this situation into account, resulting in the framework shown in Figure 1.

Figure 1: Overview of the early classification framework. The CPI model predicts the category of a document from the input segment and augments it with context information for the DMC model, which decides whether to continue or stop the analysis. Several documents could be processed in parallel and, for each of them, a decision to issue an alarm or not needs to be made. Thus, the context information provided to the DMC could also consider information from other documents being processed at the same time. Adapted from [7] to emphasize the decision-making policy.
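To make the interplay between the two components concrete, the sketch below shows one possible processing loop over incoming posts. The `cpi` and `dmc` objects and their method names are assumptions for illustration, not the framework's actual interface.

```python
def process_stream(users_posts, cpi, dmc):
    """Toy driver loop for the CPI/DMC framework described above.

    `users_posts` maps a user id to the list of posts seen so far.
    `cpi.predict` is assumed to return (label, probability) and
    `dmc.should_alert` to return True/False given that prediction plus
    context; both are illustrative placeholders.
    """
    alerts = {}
    for user, posts in users_posts.items():
        label, prob = cpi.predict(posts)          # classification with partial information
        context = {"label": label, "prob": prob,
                   "num_posts": len(posts)}       # per-user context for the policy
        # The DMC (early alert policy) may also look at the other users'
        # current context, as in Figure 1.
        alerts[user] = dmc.should_alert(context, all_users=users_posts)
    return alerts
```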
In summary, we will address ERD as a special case of early classification in which we are only concerned with predicting, as soon as possible, a subset of the categories: only the classes representing a risk for people are considered. If the current partial input is classified as non-risky, the model keeps accumulating information in case, in the future, the user starts showing patterns of risk. Note that, in ERD, it is essential to retrieve as many of the users at risk as possible, since their lives could be in danger. Thus, it is important to develop models that have a high recall for risk classes. In order to adapt the early classification framework to the ERD problem, an alarm was raised to indicate a user at risk only when the class predicted by the CPI was positive (at-risk) and the DMC decided we should stop reading the input. Raising an alarm involves sending a decision equal to one to the challenge. In any other case, the model sent a decision equal to zero. Recall that, for the eRisk tasks, it was necessary to keep processing the input to score the level of risk of every user, even if the user was already flagged; thus the model should not stop reading until the input ends.

In this study, three kinds of early risk detection approaches were analyzed, with different CPI and DMC components. However, and beyond the key role that the CPI component plays in identifying risky users, our focus in this participation is on the early alert policy implemented by the DMC component. In that context, three decision-making policies were considered: first, a simple decision tree with information from a regular text classifier; second, a deep learning model trained end-to-end using Reinforcement Learning; lastly, a global criterion based on information from the whole ranking of users given by the Sequential S3 (Smoothness, Significance, and Sanction) [8] model, SS3 from now on. Except for the SS3 models, a data pre-processing stage was applied to all the other models to ease the learning procedure. The details of this pre-processing will be given in Section 2.2.

2.1. Corpus Generation Procedure

Since Task T1 had no training data, and in order to improve the performance of the models trained for Task T2, a couple of datasets were generated for each task. The data for each corpus were obtained from Reddit through its API. Note that most of the content of Reddit can be retrieved as a JSON file if the string ".json" is appended to the original URL —for instance, the current top posts and their content can be fetched from https://www.reddit.com/top.json. The structure and meaning of each part of the JSON file can be found in the Reddit API documentation.1 Thus, to build each corpus, different pages of Reddit were consulted. The main goal of the corpus generation procedure was to get two disjoint sets of users, one with the users at risk and the other with random users.
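As an illustration of the ".json" retrieval mechanism mentioned above, the following sketch fetches the current top posts. The endpoint is the public one cited in the text, while the User-Agent string is only a placeholder; the actual corpus-building scripts are not reproduced here.

```python
import requests

def fetch_reddit_json(url="https://www.reddit.com/top.json", limit=25):
    """Fetch a Reddit listing as JSON by appending '.json' to a regular URL.

    Reddit requires a descriptive User-Agent; the one below is a placeholder.
    """
    headers = {"User-Agent": "corpus-builder-example/0.1"}
    response = requests.get(url, headers=headers, params={"limit": limit}, timeout=30)
    response.raise_for_status()
    listing = response.json()
    # Each child holds one submission; author and title are enough for a quick look.
    return [(c["data"].get("author"), c["data"].get("title"))
            for c in listing["data"]["children"]]
```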
All available submissions and comments from both groups were extracted. To get the at-risk users, the most popular subreddit related to each task was consulted; from now on it will be referred to as the "main subreddit". For the detection of pathological gambling the subreddit "problemgambling"2 was used, while for the detection of self-harm, the subreddit "selfharm"3 was used. On the other hand, to get the random users, the subreddits "sports", "jokes", "gaming", "politics", "news", and "LifeProTips" were used. Henceforth, these subreddits are going to be referred to as "random subreddits".

1 https://www.reddit.com/dev/api/
2 https://www.reddit.com/r/problemgambling
3 https://www.reddit.com/r/selfharm

In order to collect the at-risk users, first the last 1000 submissions to the main subreddit were evaluated. Every user that posted or commented in those submissions was considered a user at risk —and accordingly added to the set of at-risk users. All the posts and comments were saved for later cleaning. Then, similarly, the all-time top 1000 submissions to the main subreddit were fetched to obtain more users at risk and their posts and comments. Finally, for the users at risk, all their available posts were retrieved. Each of the submissions and comments from users at risk, together with the comments of other users, was saved, even if it was published in another subreddit. If a post belonged to the main subreddit, all the users that had commented on it were added to the set of at-risk users.

To gather the random users, initially, the last 100 submissions to each random subreddit were evaluated. For every one of the authors of those submissions, all their available posts were retrieved. Both the posts and all their comments were saved. Note that the number of submissions retrieved in this case was much lower than for the main subreddit; this is due to the random subreddits being more popular and having more comments per post.

While retrieving posts and comments, not all of them were saved. The posts and comments belonging to bots, moderators of subreddits, or deleted accounts were mostly not considered. Since there is not enough information in the JSON to know whether a user is a bot, a moderator, or a person, regular expressions were used to identify most of them based on their user name or the content of the post or comment. After a manual examination of the posts, it was determined that if the user name matched one of "LoansBot", "GoodBot_BadBot", or "B0tRank", the account was flagged as a bot. Note that this set of user names depends on the time and the subreddits consulted. Additionally, when the content of the post or comment contained the text "this is a bot" or any variation with the same meaning, that particular submission was automatically flagged as coming from a bot and not considered. The drawback of this step is that it is possible to flag a real user's submission as belonging to a bot simply because he/she wrote those words in a post or comment. Nevertheless, the instances where this happened were very few, affecting fewer than 5 of the users' posts. With respect to the moderators of subreddits, only the automatic moderator, whose account name was "AutoModerator", was filtered. Later, the posts and comments from accounts that had been deleted at the time of retrieval were also filtered —once deleted, those accounts were named "[deleted]". Additionally, comments or posts that matched the text "[deleted]", "[removed]", or "removed by moderator" were ignored.
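A minimal sketch of the kind of filtering just described is shown below. The bot account names and markers come from the text, while the function itself (including the "i am a bot" variation) is only an illustrative reconstruction of the cleaning scripts.

```python
import re

BOT_ACCOUNTS = {"LoansBot", "GoodBot_BadBot", "B0tRank", "AutoModerator"}
DELETED_MARKERS = {"[deleted]", "[removed]", "removed by moderator"}
BOT_PHRASE = re.compile(r"\bthis is a bot\b|\bi am a bot\b", re.IGNORECASE)

def keep_submission(author: str, body: str) -> bool:
    """Return True if a post/comment should be kept in the corpus (illustrative)."""
    if author in BOT_ACCOUNTS or author == "[deleted]":
        return False                # known bots, AutoModerator, deleted accounts
    if body.strip().lower() in DELETED_MARKERS:
        return False                # placeholder text left after deletion/removal
    if BOT_PHRASE.search(body):
        return False                # self-identified bot messages (or variations)
    return True
```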
Finally, if the post or comment belonged to the subreddit "copypasta", it was not considered, since its submissions carry no meaning and contain a large number of words, skewing the whole corpus. On the other hand, it was observed that posts and comments contained references to other users; in particular, there were some references to users at risk. Given this, the models could learn to classify a user using the references to other users. Since this was not desirable, and to ensure anonymity, the references to other users were replaced with a token.

Once all the posts with their comments were collected, they were grouped by user. All the users with 30 or fewer writings (posts or comments), or with an average number of words per writing lower than 15, were discarded. Later, any user that had a writing in the main subreddit was flagged as at-risk, while the rest were flagged as random users. Finally, all the writings of each user were ordered by their publication time.

2.2. Data Pre-processing

Every user's post provided by the challenge was part of a raw JSON file that held its content and some metadata. For this work, only the title and the post's content were considered when processing the input. Due to the nature of social networks and Internet forums, the input data was highly heterogeneous. Users often use different languages, weblinks, emoticons, and format strings (newlines, tabs, and blanks). This noise could cause the representation space for the input to grow bigger, which could ultimately affect the performance of the models. Also, some HTML and Unicode characters were not correctly saved and were replaced by a numeric value that represented them. Therefore, the input was pre-processed as follows:

1. Convert text to lower case.
2. Replace the decimal code for Unicode characters with its corresponding character. For example, instead of "it's not much [...]" the input has "it #8217;s not much [...]", where the number 8217 is the decimal code for the right single quotation mark ('). Note that every code is surrounded by a hashtag symbol and a semicolon, and is preceded by an extra space that should be removed.
3. Replace HTML codes with their symbols. For example, instead of "[...] red for ir & green for Thermal [...]" the input has "[...] red for ir amp; green for Thermal [...]", where amp; is the HTML character entity code for the symbol &. Note that every code is also preceded by an extra white space that should be deleted. The only HTML symbols that needed to be converted were &, < and >.
4. Replace links to the web with the token weblink.
5. Replace internal links to subreddits with the name of the subreddit. For example, the text "[...] x-post from /r/funny" gets processed to "[...] x-post from funny".
6. Delete any character that is not a number or a letter. Note that if the Unicode and HTML codes were not replaced beforehand, they would show up at this point as spurious numbers or words.
7. Replace numbers with the token number.
8. Delete new lines, tabs, and multiple consecutive white spaces.

These steps were rigorously checked to ensure that no relevant information from the input was lost.
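A compact sketch of these steps, in the order listed above, might look as follows. The exact regular expressions are our own reconstruction of the described rules, not the code actually used in the submissions.

```python
import re

def preprocess(text: str) -> str:
    """Apply the pre-processing steps described in Section 2.2 (illustrative)."""
    text = text.lower()                                                   # 1. lower case
    text = re.sub(r"\s?#(\d+);", lambda m: chr(int(m.group(1))), text)    # 2. ' #8217;' -> character
    text = re.sub(r"\s?amp;", "&", text)                                  # 3. HTML entities (&, <, >)
    text = re.sub(r"\s?lt;", "<", text)
    text = re.sub(r"\s?gt;", ">", text)
    text = re.sub(r"https?://\S+|www\.\S+", " weblink ", text)            # 4. web links
    text = re.sub(r"/r/(\w+)", r"\1", text)                               # 5. subreddit links
    text = re.sub(r"[^a-z0-9\s]", " ", text)                              # 6. keep only letters/digits
    text = re.sub(r"\d+", " number ", text)                               # 7. numbers
    text = re.sub(r"\s+", " ", text).strip()                              # 8. collapse whitespace
    return text
```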
2.3. ERD Frameworks

As explained before, our approaches to the ERD problem require a description of the ERD framework that explicitly identifies how the CPI component (the risky-user classification model) is implemented and how the DMC component makes its decisions (the early alert policy). In fact, it is interesting to note that the DMC component also constitutes a model that could be learned, as is usual with the CPI component. Thus, in the following subsections, the ERD frameworks used by our group are presented in a comprehensive way, describing both components (models) of the ERD framework.

2.3.1. Text Classifiers with a Simple Rule-based Early Alert Policy

For this approach, different kinds of text representations and text classifiers were trained to solve the CPI task at hand. Their performance was evaluated using the 𝐹1 score for the positive class, and the best models were chosen. Finally, to tackle the DMC task, that is, to decide when to send an alarm for a user at risk, a simple policy was proposed that checks the current user context information. This policy takes different parameters that allow it to control the earliness of the decision. The optimal policy parameters were selected considering the 𝐹latency score.

Among the document representations that were evaluated are bag of words (BoW) [9], linguistic inquiry and word count (LIWC) [10], doc2vec [11], latent Dirichlet allocation (LDA) [12], and latent semantic analysis (LSA) [13]. These were implemented using the Python packages scikit-learn [14] and gensim [15], except for LIWC, which has its own implementation. For every one of them, a large set of parameters was explored. Regarding the models used, decision trees, 𝑘-nearest neighbors, support vector machines (SVM), logistic regressions, multi-layer perceptrons, random forests [16], recurrent neural networks with long short-term memory (LSTM) cells [17], and bidirectional encoder representations from transformers (BERT) [18] were chosen to classify with partial information. The Python packages scikit-learn [14], Transformers [19], and PyTorch [20] were used to implement these models. Similar to what was done with the representations, every model was trained using a large range of parameters.

Each valid representation and model combination was compared to obtain the best combinations that solved the CPI task. To determine the performance of each model, the 𝐹1 score for the positive class was employed. Once the best combinations were selected, it was necessary to augment them with a decision-making policy able to determine when to raise an alarm for a user at risk. A decision tree was proposed to tackle the DMC task. The tree evaluated the current user context information to make a decision. In particular, the predicted class, the current delay in the classification, and the predicted positive class probability (the class associated with risk) were evaluated to decide when to send an alarm for a user at risk. If a user was predicted as positive, the probability of belonging to the positive class was greater than 𝛿, and more than 𝑛 posts had been processed, then an alarm was issued. Thus, the decision tree has two hyper-parameters, a positive class probability threshold 𝛿 and a minimum number of processed posts 𝑛. The structure of the decision tree can be seen in Figure 2. To determine the value of the threshold 𝛿 and the minimum number of processed posts 𝑛, different combinations were tested, choosing the ones with the best performance for 𝐹latency. Finally, the scores and decisions output by this model were obtained by reporting the probability of the positive class and the result of the decision tree, respectively.
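The rule just described is simple enough to be written in a few lines; a minimal sketch follows. The default 𝛿 and 𝑛 correspond to one of the submitted runs (UNSL#0 for Task T1); other runs used different values.

```python
def rule_based_alert(predicted_label: int,
                     positive_probability: float,
                     num_posts_read: int,
                     delta: float = 0.7,
                     n: int = 10) -> str:
    """Simple early alert policy used by the EarlyModel (illustrative sketch)."""
    if (predicted_label == 1
            and positive_probability > delta
            and num_posts_read > n):
        return "Issue alarm"
    return "Keep reading"
```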
If the result of the decision tree for a given input was "Keep reading", the decision was 0; if, instead, the result was "Issue alarm", the decision was 1. Henceforth, this kind of model will be referred to as "EarlyModel".

Figure 2: Decision-making policy to determine when to raise an alarm. If the current document is predicted as positive, its probability of belonging to the positive class is greater than 𝛿, and the number of posts read is greater than 𝑛, then the decision tree issues an alarm. Otherwise, it indicates to keep reading.

2.3.2. End-to-end Deep Learning ERD Framework

This method is an adaptation, to early risk detection with text, of the model proposed by Hartvigsen et al. [21] for time series. In their paper, Hartvigsen and collaborators proposed a model to tackle the problem of early classification of time series called Early and Adaptive Recurrent Label ESTimator or, for short, EARLIEST. The model is composed of a recurrent neural network that captures the current state of the input, a neural network that tackles the CPI task, called the discriminator, and a stochastic policy network responsible for the DMC task, called the controller. During classification, the recurrent model generates step-by-step time series representations, capturing complex temporal dependencies. The controller interprets these in sequence, learning to parameterize a distribution from which decisions are sampled at each time step, choosing whether to stop and predict a label or to wait and request more data. Once the controller decides to halt, the discriminator interprets the sequential representation to classify the time series. By rewarding the controller based on the success of the discriminator and tuning the penalization of the controller for late predictions, the controller learns a halting policy that guides the online halting-point selection. This results in a learned balance between earliness and accuracy that depends on how much the controller is penalized. The size of the penalty is a parameter 𝜆 chosen by a decision-maker according to the requirements of the task [21]. It is important to emphasize that this is an end-to-end learning model optimizing accuracy and earliness at the same time.

Since this model was originally proposed for the early classification of time series, it was necessary to adapt it to the early risk detection problem using text. First, instead of processing raw input, as was the case with time series data, the input text was represented using doc2vec [11] trained on the corpus. This representation allowed the model to process the input efficiently, since the users' posts were the input unit. Note that if, rather than doc2vec, the word2vec [22] representation had been used, the model would have had to process every token of the input, deciding each time whether it should halt or not. But since the eRisk tasks required a decision for every post, all the decisions taken for every token except the last would have been discarded, wasting computation. Second, a recurrent neural network with Long Short-Term Memory (LSTM) cells was used as the state of the model, since it allows preserving information over longer sequences in comparison to other recurrent neural network architectures [17]. In the model proposed by Hartvigsen and collaborators, the discriminator consisted of a fully connected layer added after the recurrent neural network to allow it to make predictions for the input.
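The following is a schematic, heavily simplified sketch of a single EARLIEST-style step as described above (LSTM state, controller, discriminator), assuming doc2vec post vectors as input. It is neither the original authors' implementation nor the exact architecture that was submitted.

```python
import torch
import torch.nn as nn
from torch.distributions import Bernoulli

class EarliestStep(nn.Module):
    """One post-level step of an EARLIEST-style model (illustrative sketch)."""

    def __init__(self, input_dim=200, hidden_dim=256):
        super().__init__()
        self.rnn = nn.LSTMCell(input_dim, hidden_dim)     # state over the post sequence
        self.controller = nn.Linear(hidden_dim, 1)        # halting policy (DMC)
        self.discriminator = nn.Linear(hidden_dim, 1)     # risk classifier (CPI)

    def forward(self, post_vector, state):
        h, c = self.rnn(post_vector, state)
        halt_prob = torch.sigmoid(self.controller(h))     # probability of halting now
        halt = Bernoulli(probs=halt_prob).sample()        # stochastic halting decision
        risk_prob = torch.sigmoid(self.discriminator(h))  # probability of the positive class
        return risk_prob, halt, (h, c)

# Usage sketch: feed one doc2vec vector per post and raise an alarm only when
# the discriminator predicts the positive class and the controller halts.
model = EarliestStep()
state = (torch.zeros(1, 256), torch.zeros(1, 256))
post = torch.randn(1, 200)                                # placeholder doc2vec vector
risk, halt, state = model(post, state)
alarm = bool((risk > 0.5).item() and halt.item() == 1.0)
```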
Since the early risk detection problems tackled in this work have two classes, that is, user at risk and user not at risk, the classification task can be seen either as a one-class problem with a single output (the probability of being at risk) or as a multi-class classification with two classes. In this work, both approaches were tested, and the loss function for the discriminator depended on the approach used. Ultimately, the model was modified to raise an alarm only if the class predicted by the discriminator was positive and the controller indicated to stop reading. It should be noted that the model processed the whole input in order to output the scores needed by the challenge.

Neither the original implementation of the EARLIEST model by the authors nor our implementation considered the input as a stream of data. This was a considerable drawback since, for every new post, the whole history of posts of that user needed to be processed again in the recurrent neural network to update the hidden layer. Nevertheless, each post representation was calculated only once. Thus, to improve the time performance of the model, the sequence length of the input to the recurrent neural network was restricted to 200 posts. If a new post arrived and the input representation was full, the oldest post was removed from the representation, giving space to the new one. Note that it is possible to implement EARLIEST for stream data, but time limitations did not allow it. To get the final scores and decisions for this model, the probability of the positive class given by the discriminator and the decision made by the controller were used, respectively. In the end, for the challenge, different values of the 𝜆 parameter were tested to control the earliness of the model. The parameters that yielded the best 𝐹latency score were selected.

2.3.3. The SS3 Text Classifier with a User-Global Early Alert Policy

The SS3 text classifier [8] is a simple and interpretable classification model specially created to tackle early risk detection scenarios integrally. First, during the training phase, the model builds a dictionary of words for each category, in which the frequency of each word is stored. Then, during the classification stage, using those word frequencies, it calculates a value for each word with a special function, called 𝑔𝑣(𝑤, 𝑐), that values words in relation to categories. This function has three hyper-parameters, 𝜎, 𝜌, and 𝜆, that allow controlling different "subjective" aspects of how words are valued. More precisely, the equations to compute 𝑔𝑣(𝑤, 𝑐) were designed to value words in an interpretable manner since, given the sensitive nature of risk detection problems, transparency and interpretability were two of the key design goals for this model. To achieve this, the authors first defined what constituted interpretability by considering how people could explain to each other the reasoning process behind a typical text classification task,4 and then the 𝑔𝑣 function was designed to value words by trying to mimic that behavior, i.e. having 𝑔𝑣 value words "the way people would naturally do it".

4 For instance, for text classification, people would normally direct their attention only to certain "keywords" (filtering out all the rest) and explain why these words were important in their reasoning process.
For instance, suppose that the target classes are food, health, and sports; then, after training, SS3 would learn to assign values like:

𝑔𝑣("sushi", food) = 0.85; 𝑔𝑣("the", food) = 0; 𝑔𝑣("all", food) = 0;
𝑔𝑣("sushi", health) = 0.70; 𝑔𝑣("the", health) = 0; 𝑔𝑣("all", health) = 0;
𝑔𝑣("sushi", sports) = 0.02; 𝑔𝑣("the", sports) = 0; 𝑔𝑣("all", sports) = 0.

The classification process is carried out by sequentially combining the 𝑔𝑣(𝑤, 𝑐) values of all words as they are processed from the input stream. The authors originally proposed a hierarchical process to perform this through different operators, called "summary operators", that combine and reduce these 𝑔𝑣 values at different levels, such as words, sentences, or paragraphs. In the present work, and driven by previously published competitive results [8, 23, 24], it was decided to simply use a summation of all seen 𝑔𝑣 values to perform the classification of users. More precisely, given the positive (i.e. "at-risk") and negative classes of the addressed ERD tasks, a score𝑢 value was calculated for each user 𝑢, where WH𝑢 denotes 𝑢's writing history, as follows:

score𝑢 = Σ_{𝑤 ∈ WH𝑢} [ 𝑔𝑣(𝑤, positive) − 𝑔𝑣(𝑤, negative) ].   (1)

Finally, for each user 𝑢, a classification decision is made simply by using its score𝑢, since it represents the overall estimated risk level of the user given by the model. For instance, in the eRisk 2019 challenge, the best ERDE values, as well as the best ranking-based results, were obtained for the two early risk detection tasks using a simple policy that classified each user as soon as its score became positive, i.e. when the model's positive confidence surpassed the negative one [23]. However, in the present work, we opted to use a user-global early alert policy. That is, the policy used to raise an alarm for a particular user takes into account its score value globally, in relation to the current scores of all the other users. More precisely, let scores = {score𝑢 | 𝑢 ∈ Users} be the set of all current scores; a decision𝑢 was made for each user 𝑢, where MAD stands for Median Absolute Deviation, as follows:

decision𝑢 = 1, if score𝑢 > median(scores) + 𝛾 · MAD(scores); 0, otherwise.   (2)

Thus, a user 𝑢 was classified as "at-risk" as soon as its decision𝑢 became 1. This policy is based on three quantities: the median, which is a robust measure of central tendency; the MAD, a robust measure of statistical dispersion; and the score, which represents the estimated risk level. Hence, the interval median(scores) ± 𝛾 · MAD(scores) represents a "region of doubt" containing all users for which the model is not fully sure whether they are at risk or not —i.e. whether the estimated risk level is "high enough or low enough". We designed this policy driven by the goal of optimizing the performance of the model in terms of the 𝐹 measure. Note that 𝛾 ∈ ℝ is a hyper-parameter that controls how far from the median the user's current score must move before the user is considered at-risk. Thus, the greater the 𝛾, the lower the recall and the higher the precision our model should have, since only those users whose score is high enough will be considered. Conversely, the lower the 𝛾, the higher the recall and the lower the precision. Therefore, this policy should allow maximizing the performance of the model in terms of the 𝐹 measure since, at least a priori, there always exists an intermediate 𝛾 value that yields an optimal balance between recall and precision.
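A minimal sketch of this user-global policy (Equation 2), using NumPy and assuming the per-user SS3 scores have already been computed, is shown below.

```python
import numpy as np

def global_alert_decisions(scores: dict, gamma: float = 2.0) -> dict:
    """Apply the user-global early alert policy of Equation (2).

    `scores` maps each user to its current SS3 score; `gamma` controls how far
    above the median a score must be (gamma = 2 and 2.5 were the submitted values).
    """
    values = np.array(list(scores.values()), dtype=float)
    med = np.median(values)
    mad = np.median(np.abs(values - med))          # Median Absolute Deviation
    threshold = med + gamma * mad
    return {user: int(score > threshold) for user, score in scores.items()}
```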
We used this policy for the first time in the self-harm detection task of eRisk 2020, obtaining competitive results in terms of the 𝐹-related measures. For instance, we obtained the second-best 𝐹latency (0.609) training the SS3 model only with the small training set provided by the eRisk organizers [6]. The best value (0.658) was obtained by the iLab team using a BERT-based model that was trained with a large dataset created manually by that research team [25]. We later downloaded that dataset and trained the SS3 model again, greatly improving and outperforming the previously obtained results in this task —for instance, obtaining an 𝐹latency value of 0.711. To achieve this, the same procedure described by the iLab team in their eRisk paper was carried out [25]. That is, the training set provided for the task was used as a validation set to perform hyperparameter optimization, from which 𝜎 = 0.32, 𝜆 = 0.45, and 𝜌 = 0 were selected as the best hyperparameter configuration. Therefore, as will be described in more detail in Section 4, in the present eRisk edition, we participated in the self-harm detection task again, this time using this SS3 model trained with the iLab dataset —which achieved the best results in terms of the 𝐹-related measures.

3. Task T1: Early Detection of Signs of Pathological Gambling

In this section, the details of our participation in eRisk's early detection of pathological gambling task are given. Namely, the details of the datasets and the five models submitted to this challenge are introduced. Finally, the results obtained after the evaluation stage are shown.

Table 1: Details of the corpora for Task T1: the training (T1_train) and validation (T1_valid) sets, and the test set (T1_test) used by the eRisk organizers to evaluate the participating models. The number of users (total, positives, and negatives) and the number of posts of each corpus are reported. The median, minimum, and maximum number of posts per user and words per post in each corpus are also detailed.

Corpus   | #users (Total/Pos/Neg) | #posts    | #posts per user (Med/Min/Max) | #words per post (Med/Min/Max)
T1_test  | 2,348 / 164 / 2,184    | 1,130,792 | 244 / 10 / 2,001              | 12 / 0 / 10,175
T1_train |   726 / 176 / 550      |    71,187 |  54 / 31 / 740                | 20 / 1 / 4,516
T1_valid |   726 / 176 / 550      |    74,507 |  55 / 31 / 1,234              | 19 / 1 / 7,479

3.1. Datasets

As already stated, for this task it was necessary to build a corpus in order to train models for the early detection of signs of pathological gambling. The steps described in Section 2.1 were followed to build a corpus using data from Reddit. The final corpus was split into a training and a validation set, each containing half of the users. Table 1 shows the details of each generated corpus compared to the test dataset provided for this task. In this table, "T1_test" refers to the test set used to evaluate all participating models, while "T1_train" and "T1_valid" refer to the corpora generated using Reddit. Note that the corpus provided during the challenge was much bigger in terms of the number of users, number of posts, and number of posts per user compared to the generated ones. On the other hand, T1_test had a lower number of words per post compared to T1_train and T1_valid. In T1_test there were posts with no words in them, that is, empty posts. This could be caused by a user who edited his/her submission after posting, deleting its content.
3.2. Models

This section describes the details of the models used by our team to tackle this task. Namely, from the results obtained after the model selection and hyperparameter optimization stage described in Section 2.3, the following five models were selected for participation:

UNSL#0. An EarlyModel with a bag of words (BoW) representation and a logistic regression classifier. Word unigrams were used for the BoW representation, with term frequency times inverse document frequency (commonly known as tf-idf) as the weighting scheme. For the logistic regression, a balanced weighting of the classes was used, that is, each input was weighted inversely proportional to its class frequency in the input data. Finally, for the decision-making policy, a threshold 𝛿 = 0.7 and a minimum number of posts 𝑛 = 10 were used.

Table 2: Decision-based evaluation results for Task T1. For comparison, besides the median and mean values, results from the RELAI and EFE teams are also shown. These were selected according to the results obtained for the metrics ERDE5, ERDE50, and 𝐹latency, where only the best and second-best models were selected. The best values obtained for the 𝐹1, ERDE5, ERDE50 and 𝐹latency scores for this task, among all participating models, are shown in bold.

Model   | P    | R    | 𝐹1   | ERDE5 | ERDE50 | latencyTP | speed | 𝐹latency
UNSL#0  | .326 | .957 | .487 | .079  | .023   | 11        | .961  | .468
UNSL#1  | .137 | .982 | .241 | .060  | .035   | 4         | .988  | .238
UNSL#2  | .586 | .939 | .721 | .073  | .020   | 11        | .961  | .693
UNSL#3  | .084 | .963 | .155 | .066  | .060   | 1         | 1     | .155
UNSL#4  | .086 | .933 | .157 | .067  | .060   | 1         | 1     | .157
RELAI#0 | .138 | .988 | .243 | .048  | .036   | 1         | 1     | .243
EFE#2   | .233 | .750 | .356 | .082  | .033   | 11        | .961  | .342
Mean    | .141 | .807 | .220 | .073  | .055   | 7.3       | .983  | .213
Median  | .101 | .973 | .184 | .070  | .050   | 1.5       | .998  | .183

UNSL#1. An EarlyModel with a doc2vec representation and a logistic regression classifier. Each submission was represented as a vector of dimension 100. The representation was learned using the generated training corpus, T1_train. For the logistic regression, both classes were weighted equally. Finally, for the decision-making policy, a threshold 𝛿 = 0.85 and a minimum number of posts 𝑛 = 3 were used.

UNSL#2. An EarlyModel with a BoW representation and an SVM classifier. For the BoW representation, character 4-grams were used with tf-idf as the weighting scheme. The support vector machine was parameterized with a radial basis function kernel with 𝛾 = 0.125 and regularization parameter 𝐶 = 512, weighted inversely proportional to the class frequencies in the input data. Finally, for the decision-making policy, a threshold 𝛿 = 0.75 and a minimum number of posts 𝑛 = 10 were used.

UNSL#3. An EARLIEST model with a doc2vec representation for user posts. The base recurrent neural network chosen was a one-layer LSTM with an input feature dimension of 200 and 256 hidden units. The discriminator of the EARLIEST model reduced the hidden state of the LSTM to one dimension representing the positive class probability. Finally, the value of 𝜆 used to train was 𝜆 = 0.000001.

UNSL#4. The same model as UNSL#3 but with the discriminator reducing the hidden state of the LSTM to two dimensions, representing the probabilities of both the positive and negative classes. Besides, the value of 𝜆 used to train was 𝜆 = 0.00001.
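As an illustration of how a representation–classifier combination such as UNSL#2 could be assembled with scikit-learn, consider the sketch below. The parameters shown are the ones reported above, but the team's actual training scripts may differ in other respects.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Character 4-gram BoW with tf-idf weighting + RBF-kernel SVM (UNSL#2-style).
unsl2_like = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char", ngram_range=(4, 4))),
    ("svm", SVC(kernel="rbf", C=512, gamma=0.125,
                class_weight="balanced", probability=True)),
])

# Usage sketch: `docs` holds one concatenated writing history per user and
# `labels` the at-risk flags; both are placeholders.
# unsl2_like.fit(docs, labels)
# positive_prob = unsl2_like.predict_proba(["i can't stop betting ..."])[0, 1]
```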
3.3. Results

The main results obtained with our five models are described below, grouped according to the type of metric used to measure performance.

Table 3: Details of the participating teams for Task T1: team name, number of models (#models), number of user posts processed (#posts), time taken to complete the task (Total), and time taken to process each post (Per post = Total / (#posts × #models)).

Team        | #models | #posts | Total        | Per post
UNSL        | 5       | 2000   | 5 days + 1h  | 43s
RELAI       | 5       | 1231   | 9 days + 6h  | 2m + 9s
UPV-Symanto | 5       | 801    | 19h          | 16s
BLUE        | 5       | 1828   | 2 days       | 18s
CeDRI       | 2       | 271    | 1 day + 6h   | 3m + 17s
EFE         | 4       | 2000   | 3 days + 3h  | 33s

Early classification decision-based performance: Table 2 shows the results obtained for the decision-based performance metrics. As can be observed, our team achieved the best and second-best performance in terms of the 𝐹1, ERDE50, and 𝐹latency measures with two EarlyModels (UNSL#0 and #2). Moreover, in terms of 𝐹latency, the value obtained with UNSL#2 (0.693) was roughly twice as good as EFE#2's (0.342) —the model with the best value among the other teams' models. However, regarding the ERDE5 measure, the obtained results were close to the average. In the case of the two EARLIEST models, this was due to a poor classification performance, whereas in the case of the EarlyModels it was due to having to read at least 3 posts before being able to make a decision —i.e. the selected values for 𝑛 were 𝑛 = 3 and 𝑛 = 10. Among our three EarlyModels (UNSL#0, #1, and #2), the model that performed the worst was UNSL#1, a logistic regression with a doc2vec representation, which had a performance approximately equal to the average. Interestingly, UNSL#0 performed roughly twice as well as UNSL#1, despite using the same classifier and a simpler representation, namely, a standard BoW representation. In fact, this model was only outperformed by UNSL#2, an SVM using a character 4-gram BoW representation, which, as mentioned above, obtained the best values among all 26 participating models. Regarding the two EARLIEST models (UNSL#3 and #4), the obtained performance was below the average, and therefore they performed the worst among our five models. This was mostly due to the EARLIEST models blindly classifying the vast majority of the users as at-risk, leading to exceptionally low precision values. Finally, the mean and median values indicate that, overall, this task was hard to deal with. In particular, the low precision and high recall of all participating models suggest that models had trouble accurately detecting true-positive cases, since the vast majority of the detected users were false-positive cases.

Performance in terms of execution time: Table 3 shows, for each team, details on the total time taken to complete the task. As can be seen, the time taken to complete the task differs from team to team, varying from a few hours to a large number of hours. However, to have a more precise view of how efficient the models of each team were, not only the total time taken to complete the task must be considered, but also the total number of posts processed in that time and the number of models used to carry it out.

Figure 3: Disaggregated total time for both tasks ((a) Task T1, (b) Task T2). The time elapsed in each stage of the input processing is shown on a base-10 log scale (Y-axis). Time is reported in seconds.
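To make the per-post normalization used in Table 3 concrete, the following snippet reproduces the calculation for two of its rows; the figures come from the table itself and are necessarily approximate, since the reported totals are rounded to the hour.

```python
def seconds_per_post(total_hours: float, num_posts: int, num_models: int) -> float:
    """Per-post processing time: Total / (#posts x #models)."""
    return total_hours * 3600 / (num_posts * num_models)

# UNSL: 5 days + 1h = 121 h, 2000 posts, 5 models  ->  ~43 s per post
print(int(seconds_per_post(121, 2000, 5)))   # 43
# CeDRI: 1 day + 6h = 30 h, 271 posts, 2 models    ->  ~199 s (about 3m 19s)
print(int(seconds_per_post(30, 271, 2)))     # 199
```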
For example, in terms of processing speed, CeDRI does not seem as efficient as UNSL: although the former completed the task in roughly 1 day, it only processed the first 271 posts from each user using only 2 models, while the latter, although completing the task in roughly 5 days, processed all 2000 posts from each user using 5 models.5 For this reason, this table also includes, as a guide, an estimate of the time taken by each team's models to process each post, which was obtained by normalizing the total time by the number of models used and the total number of posts processed. It can be observed that our team did not achieve the best performance in terms of execution time, processing each post in 43 seconds, whereas the fastest team (UPV-Symanto) did it in 16 seconds. To gain better insight into a possible cause for having taken 5 days to complete the task, as shown in Figure 3a, information stored in our logs was used to disaggregate this total time into five different stages: pre-processing, input feature computation, classifier prediction, server timeouts, and network delay. It can be seen that, as will be discussed in more detail in Section 5, the two stages taking most of the time are the computation of the feature vector and network time —roughly 39% of the total time is spent computing the feature vector, and 57% in network communication delays. Therefore, as will be discussed in more detail in Section 6, the optimization of the feature vector stage will be taken into account for future work.

5 Note that, for each of the users' 2000 posts, not only was it necessary to send a request to the server to obtain the post, but also 5 more requests to send the response of each model. Therefore, UNSL needed a total of 2000 + 2000 * 5 = 12000 requests to the server to complete the task.

Ranking-based performance: Table 4 shows the results obtained for the ranking-based performance metrics. In addition, plots of the four complete rankings created by each model after processing 1, 100, 500, and 1000 posts are shown in Figure 4.

Table 4: Ranking-based evaluation results for Task T1. The values obtained for each metric are shown for the four reported rankings, respectively, the rankings obtained after processing 1, 100, 500, and 1000 posts. The best values obtained for this task, among all participating models, are shown in bold.

Ranking    | Metric   | UNSL#0 | UNSL#1 | UNSL#2 | UNSL#3 | UNSL#4
1 post     | P@10     | 1      | 1      | 1      | .9     | 1
           | NDCG@10  | 1      | 1      | 1      | .92    | 1
           | NDCG@100 | .81    | .79    | .85    | .74    | .69
100 posts  | P@10     | 1      | .8     | 1      | 1      | 0
           | NDCG@10  | 1      | .73    | 1      | 1      | 0
           | NDCG@100 | 1      | .87    | 1      | .76    | .25
500 posts  | P@10     | 1      | .8     | 1      | 1      | 0
           | NDCG@10  | 1      | .69    | 1      | 1      | 0
           | NDCG@100 | 1      | .86    | 1      | .72    | .11
1000 posts | P@10     | 1      | .8     | 1      | 1      | 0
           | NDCG@10  | 1      | .62    | 1      | 1      | 0
           | NDCG@100 | 1      | .84    | 1      | .72    | .13

As can be seen, our team achieved the best performance in terms of the three metrics (P@10, NDCG@10, and NDCG@100) across the four rankings used for the evaluation. Moreover, the values obtained with two of the EarlyModels (UNSL#0 and #2) were the best possible ones (i.e. 1) for the three metrics and the four rankings —except for NDCG@100 in the ranking obtained after reading only 1 post. As with the decision-based results, the logistic regression with the doc2vec representation (UNSL#1) obtained the lowest values among the three EarlyModels (UNSL#0, #1, and #2). Regarding the two EARLIEST models (UNSL#3 and #4), their performance was also the lowest among our five models. However, unlike the decision-based results, UNSL#3 performed considerably better than UNSL#4. We will leave it for future work to study why the explicit inclusion of the negative class probability in the discriminator impaired UNSL#4's ability to estimate users' risk. Finally, the obtained results show that the two EarlyModels using standard tf-idf-weighted BoW representations, despite their relative simplicity, were capable of estimating the risk level of the users with considerable effectiveness, even when only a few posts were processed.
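For reference, P@10 and NDCG@k can be computed from a model's scores and the gold labels as sketched below; these are the standard definitions with binary relevance, not code from the submitted systems.

```python
import math

def precision_at_k(scores, labels, k=10):
    """Fraction of at-risk users among the k highest-scored users."""
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    return sum(label for _, label in ranked[:k]) / k

def ndcg_at_k(scores, labels, k=10):
    """NDCG@k with binary relevance: DCG of the ranking over the ideal DCG."""
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    dcg = sum(label / math.log2(i + 2) for i, (_, label) in enumerate(ranked[:k]))
    ideal = sorted(labels, reverse=True)
    idcg = sum(label / math.log2(i + 2) for i, label in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0
```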
Figure 4: Separation plots [26] for Task T1, showing the rankings after (a) 1 post, (b) 100 posts, (c) 500 posts, and (d) 1000 posts. A separation plot is shown for each of the four rankings used to evaluate each model, respectively. The ordinate corresponds to the model's score and the abscissa to users (ordered increasingly by score). Dark blue lines correspond to users at risk. The red dotted line indicates the top-100 users region. This small region of top-100 users was the (biggest) portion of the entire ranking actually used to evaluate the participating models. The ArviZ implementation [27] of the separation plot was used in this work.

4. Task T2: Early Detection of Signs of Self-Harm

In this section, the details of our participation in eRisk's early detection of self-harm task are given. Namely, the details of the datasets and the five models submitted to this challenge are introduced. Finally, the results obtained after the evaluation stage are shown.

4.1. Datasets

For this task, and unlike Task T1, the eRisk organizers did provide datasets to train and validate the participating models. Each dataset was made available as a set of XML files, one for each user. However, to improve the performance of our models, the steps described in Section 2.1 were followed to build a complementary corpus using data from Reddit. Then, this corpus was split into a training and a validation set, each containing half of the users. Finally, these complementary datasets were combined with the ones provided for this challenge. These extended training and validation sets were then used to train and tune the EarlyModels and the EARLIEST models. On the other hand, as explained at the end of Section 2.3.3, one of the corpora created by the iLab research team [25] for the eRisk 2020 edition of this task was used to train the SS3 models —namely, the dataset called "users-submissions-200k".6 The datasets created by the iLab were also built by collecting data from Reddit, but obtained from the Pushshift Reddit Dataset [28] through its public API.7 Table 5 shows the details of each complementary corpus along with the training, validation, and test datasets provided for this task. In this table, "T2_test" refers to the test set used to evaluate all participating models, "T2_train" and "T2_valid" to the training and validation sets provided by the organizers, "redd_train" and "redd_valid" to the training and validation sets built using Reddit, "comb_train" and "comb_valid" to the combined datasets, and "ilab_train" to the iLab's corpus.

6 The iLab's datasets can be downloaded from https://github.com/brunneis/ilab-erisk-2020.
7 https://pushshift.io/api-parameters/
Table 5: Details of the corpora used for Task T2: the different training and validation sets, as well as the test set used by the eRisk organizers to evaluate the participating models. The number of users (total, positives, and negatives) and the number of posts of each corpus are reported. The median, minimum, and maximum number of posts per user and words per post in each corpus are also detailed.

Corpus     | #users (Total/Pos/Neg)   | #posts  | #posts per user (Med/Min/Max) | #words per post (Med/Min/Max)
T2_test    | 1,448 / 152 / 1,296      | 746,098 | 275.5 / 10 / 1,999            | 12 / 0 / 18,064
T2_train   |   340 / 41 / 299         | 170,698 | 282.0 / 8 / 1,992             | 10 / 1 / 6,700
T2_valid   |   423 / 104 / 319        | 103,837 |  95.0 / 9 / 1,990             |  7 / 1 / 2,663
redd_train | 1,051 / 494 / 557        | 118,452 |  61.0 / 31 / 1,466            | 18 / 1 / 5,971
redd_valid | 1,051 / 494 / 557        | 119,651 |  59.0 / 31 / 1,781            | 18 / 1 / 4,382
comb_train | 1,391 / 535 / 856        | 289,150 |  73.0 / 8 / 1,992             | 13 / 1 / 6,700
comb_valid | 1,474 / 598 / 876        | 223,488 |  63.0 / 9 / 1,990             | 11 / 1 / 4,382
ilab_train | 26,256 / 10,319 / 15,937 | 259,297 |   5.0 / 1 / 1,825             | 19 / 1 / 11,933

Note that the corpus used to evaluate the participating models had four times more users and posts than the corpus provided for training, but a similar number of posts per user and words per post. On the other hand, comb_train and comb_valid had almost the same number of users as T2_test but a much lower total number of posts and posts per user. Also, ilab_train had a considerably greater number of users than the rest of the datasets, but fewer posts per user. In the same way as with T1_test, T2_test contained posts with no words in them, i.e. empty posts. This could be caused by a user who edited his/her submission after posting, deleting its content.

4.2. Models

This section describes the details of the models used by our team to tackle this task. Namely, from the results obtained after the model selection and hyperparameter optimization stage described in Section 2.3, the following five models were selected for participation:

UNSL#0. An EarlyModel with a doc2vec representation and a multi-layer perceptron optimized using Adam. Each post was represented as a 200-dimensional vector. To learn this representation, the combined training corpus, comb_train, was used. The multi-layer perceptron consisted of one hidden layer with 100 units and ReLU as the activation function. Finally, for the early detection policy, a threshold 𝛿 = 0.7 and a minimum number of posts 𝑛 = 10 were used.

UNSL#1. An EARLIEST model with a doc2vec representation. Each post was represented as a 200-dimensional vector. To learn this representation, the combined training corpus, comb_train, was used. The base recurrent neural network chosen was a one-layer LSTM with an input feature dimension of 200 and 256 hidden units. The discriminator of the EARLIEST model reduced the hidden state of the LSTM to one dimension representing the positive class probability. Finally, the value of 𝜆 used to train was 𝜆 = 0.000001.
Table 6: Decision-based evaluation results for Task T2. For comparison, besides the median and mean values, results from the UPV-Symanto and BLUE teams are also shown. These were selected according to the results obtained for the metrics ERDE5, ERDE50, and 𝐹latency, where only the best and second-best models were selected. The best values obtained for the 𝐹1, ERDE5, ERDE50 and 𝐹latency scores for this task, among all participating models, are shown in bold.

Model         | P    | R    | 𝐹1   | ERDE5 | ERDE50 | latencyTP | speed | 𝐹latency
UNSL#0        | .336 | .914 | .491 | .125  | .034   | 11        | .961  | .472
UNSL#1        | .11  | .987 | .198 | .093  | .092   | 1         | 1     | .198
UNSL#2        | .129 | .934 | .226 | .098  | .085   | 1         | 1     | .226
UNSL#3        | .464 | .803 | .588 | .064  | .038   | 3         | .992  | .583
UNSL#4        | .532 | .763 | .627 | .064  | .038   | 3         | .992  | .622
UPV-Symanto#1 | .276 | .638 | .385 | .059  | .056   | 1         | 1     | .385
BLUE#2        | .454 | .849 | .592 | .079  | .037   | 7         | .977  | .578
BLUE#3        | .394 | .868 | .542 | .075  | .035   | 5         | .984  | .534
Mean          | .278 | .764 | .359 | .107  | .075   | 19.5      | .888  | .332
Median        | .239 | .810 | .344 | .101  | .069   | 4.5       | .984  | .336

UNSL#2. The same model as UNSL#1 but with the discriminator reducing the hidden state of the LSTM to two dimensions, representing the probabilities of both the positive and negative classes. Besides, the value of 𝜆 used to train was 𝜆 = 0.00001.

UNSL#3. An SS3 model8 with a policy value of 𝛾 = 2 trained using the iLab corpus, ilab_train. As mentioned in Section 2.3.3, to select the 𝛾 values, the eRisk 2020 training set for this task was used as the validation set; the value 𝛾 = 2 achieved an optimal balance between recall and precision, maximizing the 𝐹 value.

UNSL#4. The same model as UNSL#3 but with a policy value of 𝛾 = 2.5. Given that this 𝛾 value is greater than the previous one, this model was meant to have a higher precision than UNSL#3, since a user's current score must be 2.5 MADs greater than the median score to be considered at-risk.

8 The SS3 models were coded in Python using the "PySS3" package [29] (https://github.com/sergioburdisso/pyss3).

4.3. Results

The main results obtained with our five models are described below, grouped according to the type of metric used to measure performance.

Early classification decision-based performance: Table 6 shows the results obtained for the decision-based performance metrics. As can be observed, our team achieved the best performance in terms of the 𝐹1 and 𝐹latency measures, and the second-best ERDE5, with one of the SS3 models (UNSL#4). Moreover, we also obtained the best performance in terms of the ERDE50 measure with the EarlyModel (UNSL#0).

Table 7: Details of the participating teams for Task T2: team name, number of models (#models), number of user posts processed (#posts), time taken to complete the task (Total), and time taken to process each post (Per post = Total / (#posts × #models)).

Team         | #models | #posts | Total         | Per post
UNSL         | 5       | 1999   | 3 days + 17h  | 32s
NLP-UNED     | 5       | 472    | 7h            | 11s
AvocadoToast | 3       | 379    | 10 days + 13h | 13m + 23s
Birmingham   | 4       | 11     | 2 days + 8h   | 76m + 23s
NuFAST       | 3       | 6      | 17h           | 57m + 6s
NaCTeM       | 5       | 1999   | 5 days + 20h  | 50s
EFE          | 4       | 1999   | 1 day + 15h   | 18s
BioInfo@UAVR | 2       | 91     | 1 day + 02h   | 8m + 41s
NUS-IDS      | 5       | 46     | 3 days + 08h  | 20m + 55s
RELAI        | 5       | 1561   | 11 days       | 2m + 2s
CeDRI        | 3       | 369    | 1 day + 9h    | 1m + 50s
BLUE         | 5       | 156    | 1 day + 5h    | 2m + 13s
UPV-Symanto  | 5       | 538    | 12h           | 16s

Regarding our five models, as in Task T1, the EarlyModel performed better than the two EARLIEST models (UNSL#1 and #2), which again obtained a performance roughly below the average, classifying the vast majority of the users as at-risk and thus obtaining exceptionally low precision values. In addition, the two SS3 models (UNSL#3 and #4) achieved a better balance between recall and precision than the EarlyModel and the two EARLIEST models, as evidenced by better 𝐹 values. As expected, UNSL#4 achieved a higher precision than UNSL#3 by using 𝛾 = 2.5 but, unlike the results obtained on the validation set, the former achieved a better balance than the latter. This suggests that models had a harder time distinguishing between true and false positive cases in the test set used to evaluate them when compared to the validation set —i.e.
Performance in terms of execution time: Table 7 shows details on the total time taken to complete the task by each team. As can be seen, our team, although not the fastest, was among the few teams that processed each post in a matter of seconds rather than minutes, taking 32 seconds per post. As shown in Figure 3b, for this task we also used the information stored in the execution logs to disaggregate the total time into five different stages. As in Task T1, the two stages taking most of the time were again the computation of feature vectors and network time: roughly 36% of the total time was spent computing the feature vectors and 55% on network communication delays.

Table 7
Details of the participating teams for Task T2: team name, number of models (#models), number of user posts processed (#posts), time taken to complete the task (Total), and time taken to process each post (Per post = Total / (#posts × #models)).

Team           #models   #posts   Total            Per post
UNSL           5         1999     3 days + 17h     32s
NLP-UNED       5         472      7h               11s
AvocadoToast   3         379      10 days + 13h    13m + 23s
Birmingham     4         11       2 days + 8h      76m + 23s
NuFAST         3         6        17h              57m + 6s
NaCTeM         5         1999     5 days + 20h     50s
EFE            4         1999     1 day + 15h      18s
BioInfo@UAVR   2         91       1 day + 02h      8m + 41s
NUS-IDS        5         46       3 days + 08h     20m + 55s
RELAI          5         1561     11 days          2m + 2s
CeDRI          3         369      1 day + 9h       1m + 50s
BLUE           5         156      1 day + 5h       2m + 13s
UPV-Symanto    5         538      12h              16s

Table 8
Ranking-based evaluation for Task T2. The values obtained for each metric are shown for the four reported rankings, i.e., the rankings obtained after processing 1, 100, 500, and 1000 posts. The best values obtained for this task, among all participating models, are shown in bold.

Ranking       Metric      UNSL#0   UNSL#1   UNSL#2   UNSL#3   UNSL#4
1 post        P@10        1        .8       .3       1        1
              NDCG@10     1        .82      .27      1        1
              NDCG@100    .7       .61      .28      .63      .63
100 posts     P@10        .7       .8       0        .9       .9
              NDCG@10     .74      .73      0        .81      .81
              NDCG@100    .82      .59      0        .76      .76
500 posts     P@10        .8       .9       0        .9       .9
              NDCG@10     .81      .94      0        .81      .81
              NDCG@100    .8       .58      0        .71      .71
1000 posts    P@10        .8       1        0        .8       .8
              NDCG@10     .81      1        0        .73      .73
              NDCG@100    .8       .61      0        .69      .69

Ranking-based performance: Table 8 shows the results obtained for the ranking-based performance metrics. In addition, plots of the four complete rankings created by each model after processing 1, 100, 500, and 1000 posts are shown in Figure 5. As can be seen, the results obtained in this task were not as competitive as those obtained in the first task. Nevertheless, some of our models achieved the best performance in terms of NDCG@100, P@10, and NDCG@10. For instance, the EarlyModel (UNSL#0) obtained the best NDCG@100 in the four rankings, whereas the SS3 models (UNSL#3 and #4) obtained some of the best P@10 and NDCG@10 values. Regarding the two EARLIEST models, the variant that explicitly incorporates the probability of the negative class in the discriminator, UNSL#2, performed poorly, as in Task T1. However, the other EARLIEST variant, UNSL#1, performed slightly better than the EarlyModel (UNSL#0) in terms of most of the P@10 and NDCG@10 metrics, and even obtained the best values in the last ranking. Nevertheless, as shown in Figure 5, the EARLIEST models were the least effective considering the entire user ranking and not just the top-10 and top-100 users used to calculate the reported metrics. Note that, in the plots for UNSL#1 and UNSL#2, the users at risk (dark blue lines) are scattered throughout the entire ranking, consistently, across all four rankings. In contrast, the other three models tend to accumulate those users on the right end, i.e., they tend to accurately move users at risk towards the highest positions in the ranking.
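For reference, the ranking-based metrics reported in Table 8 can be computed with the standard binary-relevance formulation sketched below. This is only meant to clarify how the reported values should be read; it is not taken from the lab's evaluation scripts, and the example ranking is made up.

```python
# Standard binary-relevance formulation of the ranking metrics in Table 8
# (P@k and NDCG@k). `ranking` is a list of 0/1 relevance labels (1 = user at
# risk), ordered from the highest- to the lowest-scored user.
import math

def precision_at_k(ranking, k):
    return sum(ranking[:k]) / k

def dcg_at_k(ranking, k):
    # Rank 1 gets weight 1/log2(2), rank 2 gets 1/log2(3), and so on.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(ranking[:k]))

def ndcg_at_k(ranking, k):
    ideal = sorted(ranking, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranking, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Example: a model that places 8 at-risk users in its top 10 gets P@10 = 0.8.
print(precision_at_k([1] * 8 + [0] * 2 + [0] * 90, 10))
```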
Figure 5: Separation plots for Task T2, one for each of the four rankings used to evaluate the models (panels: (a) 1 post, (b) 100 posts, (c) 500 posts, (d) 1000 posts). The ordinate corresponds to the model’s score and the abscissa to users (ordered increasingly by score). Dark blue lines correspond to users at risk. The red dotted line indicates the top-100 users region; this small region was the (biggest) portion of the entire ranking actually used to evaluate the participating models. Since SS3 produces unbounded scores, they were scaled using min-max normalization to be shown in the separation plot.

Among our five models, the EarlyModel (UNSL#0) performed the best in terms of NDCG@100, whereas SS3 performed the best in terms of P@10 and NDCG@10 (UNSL#3 and #4). However, as shown in Figure 5, the rankings generated by the SS3 models (UNSL#3 and #4) seem to slightly lose quality as more posts are processed, as can be seen in the transition from 100 posts to 500 posts. Note that, in the plots for UNSL#3 and UNSL#4, the users at risk (dark blue lines) are slightly “more compressed” towards the right end in subfigure (b) than in subfigures (c) and (d). This phenomenon is probably due to the fact that the score calculated by SS3 is not a normalized value (see Equation 1) and is therefore sensitive to the number of words processed for each user. As future work, we believe that normalizing this score could help improve the overall performance of the model, for instance, by dividing it by the total number of words processed for each user. Finally, the obtained results show that the EarlyModel and the SS3 model could both be competitive when it comes to estimating the risk level of the users, even when only a few posts were processed.

5. Discussion
In this section, the three approaches are compared in terms of a range of different aspects, such as performance, simplicity, and adaptability. More precisely, Table 9 shows an overview of the comparison containing all the key aspects.

Table 9
Comparison of the different approaches in terms of different aspects. (*) Depends on the classifier and the representation being used.

Aspects / Models              EARLIEST            EarlyModel                   SS3
Execution Time Performance    ✓                   ×*                           ✓
Decision-based Performance    ×                   ✓                            ✓
Simplicity                    ×                   ✓*                           ✓
Interpretability              ×                   ✓*                           ✓✓
Policy Adaptability           ✓✓                  ×                            ✓
Storage per User              LSTM hidden state   complete sequence of posts   user score
Supports Streaming?           ✓                   ×                            ✓

Figure 6: Time spent during the feature-building stage for each kind of architecture in Task T2 (panels: (a) EarlyModel, (b) EARLIEST, (c) SS3). For each set of posts for every user, the time invested in building the feature input is shown. The EarlyModel corresponds to UNSL#0, EARLIEST to UNSL#1, and SS3 to UNSL#3.

Execution Time Performance. Among the three approaches, the EarlyModel is the least efficient in terms of execution time, since its cost heavily depends on the representation used and the method to compute its feature vectors. If the representation being used allows computing and updating the feature vector incrementally, as posts are processed, then the input stream will be processed efficiently; otherwise, for every new post available, the whole history of posts has to be processed again in order to compute the updated feature vector. In the case of EARLIEST, the representation is modelled sequentially, so it only needs to be calculated for each individual post as it becomes available. On the other hand, SS3 does not even need a feature representation since it processes the raw input sequentially, word by word. These differences affected the time taken by each approach to address each task; for instance, Figure 6 shows the time taken by each approach to build the feature vectors in Task T2. Although only one model is shown for each approach, all the other variations of the same model showed the same pattern of elapsed times. Note that, since SS3 does not need a feature representation, its elapsed time, presented in Figure 6c, is always zero. By contrast, the EarlyModel consumes a lot of time since it has to rebuild the doc2vec representation of the whole sequence of posts each time a new post arrives. That is, the time complexity of the feature representation stage is tied to the length and number of posts, so, as new posts arrive, the amount of time invested in this stage grows. This would imply that the elapsed-time curve for the EarlyModel increases monotonically; however, this is not the case since, as time passed, fewer users kept posting, decreasing the total number of posts to be processed. Finally, EARLIEST, shown in Figure 6b, required much less time to build the feature vector compared to the EarlyModel since only the current post needed to be processed, i.e., the time complexity of the feature representation stage is not tied to the number of posts processed.
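The incremental-update idea mentioned above can be illustrated with a small sketch: if per-user term frequencies are accumulated as each new post arrives, the cost of processing a post depends only on its own length, not on the whole posting history. This illustrates the kind of representation that supports such updates (a plain bag of words), not the doc2vec representation our EarlyModel actually used; all names in the sketch are ours.

```python
# Illustrative sketch of an incrementally updatable representation: per-user
# term counts are accumulated post by post, so each new post is processed in
# time proportional to its own length, not to the whole posting history.
# (Our EarlyModel used doc2vec, which does not allow this kind of update.)
from collections import Counter, defaultdict

class IncrementalBow:
    def __init__(self):
        self.user_counts = defaultdict(Counter)  # user id -> term frequencies
        self.user_total = defaultdict(int)       # user id -> number of tokens seen

    def add_post(self, user_id, tokens):
        # O(len(tokens)): only the new post is touched.
        self.user_counts[user_id].update(tokens)
        self.user_total[user_id] += len(tokens)

    def vector(self, user_id, vocabulary):
        # Normalized term frequencies over a fixed vocabulary, built on demand.
        total = self.user_total[user_id] or 1
        counts = self.user_counts[user_id]
        return [counts[w] / total for w in vocabulary]

bow = IncrementalBow()
bow.add_post("subject2700", ["i", "feel", "fine", "today"])
bow.add_post("subject2700", ["not", "feeling", "fine"])
print(bow.vector("subject2700", ["fine", "feel", "sad"]))  # [0.2857..., 0.1428..., 0.0]
```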
Decision-based Performance. Among the three approaches, EARLIEST was the least effective in terms of decision-based performance, since the results obtained, in both tasks, were below the average among all participating models. On the other hand, the EarlyModel and SS3 were the most effective in these terms. For instance, the EarlyModel approach achieved the best and second-best performance in terms of the 𝐹1, ERDE50, and 𝐹latency measures in Task T1. Likewise, SS3 achieved the best performance in terms of the 𝐹1 and 𝐹latency measures and the second-best ERDE5 in Task T2. Overall, the obtained results showed that, despite their relative simplicity, the EarlyModel and SS3 were able to detect at-risk users with competitive effectiveness.

Simplicity. The simplest among the three approaches is SS3, since it only consists of a summation of word values (see Equation 1). On the other hand, the simplicity of the EarlyModel depends on the classifier and the representation being used. For instance, one of the best-performing EarlyModel configurations was a logistic regression classifier that used a standard tf-idf-weighted BoW representation, which is much simpler than a recurrent neural network with a word2vec representation. In contrast, EARLIEST is more complex since its architecture consists of three neural models, namely, an LSTM, a feedforward neural network for the controller, and another for the discriminator.

Interpretability. Among the three approaches, SS3 is the most interpretable since it was designed to learn to value words in an interpretable manner. For instance, the learned 𝑔𝑣 values of each word can be directly used to create visual explanations to present to the system users, as illustrated in Figure 7 (a live demo is provided at http://tworld.io/ss3, where interested readers can try out the model; along with the classification result, the demo provides an interactive visual explanation like the one illustrated here; last accessed May 2021). On the other hand, the interpretability of the EarlyModel approach depends on the classifier and the representation being used. For instance, using simple linear classifiers with standard BoW representations would be more interpretable than using neural models with deep representations. Finally, EARLIEST is the least interpretable model since, as mentioned above, its architecture consists of three neural models which are not easily interpretable, and its decision policy is not directly observable.

Figure 7: Example of a visual explanation using the SS3 approach. Post number 66 of the user “subject2700” from Task T2 is shown. Words have been colored proportionally to the 𝑔𝑣 values learned by the SS3 model for the at-risk class. In addition, sentences were also colored using the average of their word values.
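As a concrete example of the “simple linear classifier with a standard BoW representation” mentioned above, the sketch below builds a tf-idf plus logistic regression pipeline with scikit-learn and inspects its per-word coefficients, which is what makes this kind of model easy to interpret. The training documents are toy placeholders, not data from the task, and the pipeline is not the exact EarlyModel configuration we submitted.

```python
# Sketch of a simple linear classifier with a standard BoW representation:
# a tf-idf + logistic regression pipeline whose learned per-word weights can
# be inspected directly, giving a rough word-level explanation.
# Training data is a toy placeholder; scikit-learn >= 1.0 is assumed.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = ["i bet everything again last night", "watched a movie with friends"]
labels = [1, 0]  # 1 = at-risk (toy example)

pipe = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipe.fit(docs, labels)

# Words with the largest positive coefficients push the prediction
# towards the at-risk class.
vec = pipe.named_steps["tfidfvectorizer"]
clf = pipe.named_steps["logisticregression"]
words = np.array(vec.get_feature_names_out())
top = np.argsort(clf.coef_[0])[::-1][:5]
print(list(zip(words[top], clf.coef_[0][top].round(3))))
```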
Policy Adaptability. Concerning the approaches presented, the EarlyModel is the most rigid of the three. The EarlyModel has a simple rule-based early alert policy that complements the standard classification models used to identify risky users. This policy is implemented as a decision tree with three decision nodes, as shown in Figure 2: an alarm is issued for an input only if the predicted class is positive, the probability of the positive class is greater than 𝛿, and the number of posts read is greater than 𝑛. This approach corresponds to a static policy, since the hyper-parameters of the decision nodes, i.e., 𝛿 and 𝑛, are determined in the training phase; in the testing phase, as new posts arrive, they cannot change, so the decision boundary remains constant. On the other hand, SS3 implements a global early alert policy based on the estimated risk level of all processed users. An alarm is issued for a user only if the model score surpasses a global boundary that depends on the scores of all the users at that point in time. The decision boundary is calculated using the median of the scores of all users and the Median Absolute Deviation (MAD) of those scores at the current time: an alarm is issued for a user if the model score for that user is greater than the median score of all users plus 𝛾 times the MAD of the scores. Note that, although the decision boundary depends on 𝛾, it is not constant in time since it also depends on the scores of all the users. Finally, the EARLIEST model simultaneously learns to identify risky users and the early alert policy through a Reinforcement Learning approach. EARLIEST raises an alarm for a user only if the class predicted by the discriminator is positive and the controller indicates to stop reading. Thus, the controller is responsible for deciding the moment to issue an alarm. The only hyper-parameter that needs to be set to control the decision policy is 𝜆, which controls how strongly late decisions are penalized: as the value of 𝜆 grows, the loss for late predictions grows, forcing the model to make decisions earlier. Here, the decision policy is learned; therefore, the model can adapt dynamically to different problems or different data distributions. This is what makes this model the most adaptable among the three: the problem being tackled or the distribution of the data could change, but the architecture does not need to change.
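To summarize the two hand-crafted policies just described, the sketch below restates the static EarlyModel rule (fixed 𝛿 and 𝑛) and the global SS3 rule (median plus 𝛾 times the MAD of all users’ current scores) in code. Function and variable names are ours and the snippets are illustrative; the actual implementations are in our repository.

```python
# Sketches of the two hand-crafted early-alert policies described above.
# The static EarlyModel rule uses fixed hyper-parameters (delta, n), while the
# SS3 rule compares each user's score against a global, time-varying boundary
# (median + gamma * MAD over all users' current scores). Names are ours.
import numpy as np

def early_model_alert(predicted_positive, positive_prob, num_posts_read,
                      delta=0.7, n=10):
    # Static policy: all three conditions must hold (values as in UNSL#0).
    return predicted_positive and positive_prob > delta and num_posts_read > n

def ss3_global_alerts(current_scores, gamma=2.0):
    # Global policy: flag every user whose score exceeds median + gamma * MAD,
    # where the boundary is recomputed at every step from all users' scores.
    scores = np.asarray(list(current_scores.values()), dtype=float)
    med = np.median(scores)
    mad = np.median(np.abs(scores - med))
    boundary = med + gamma * mad
    return {user for user, s in current_scores.items() if s > boundary}

scores_now = {"u1": 0.4, "u2": 0.5, "u3": 0.45, "u4": 3.2}
print(ss3_global_alerts(scores_now, gamma=2.0))  # {'u4'}
```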
Figure 8: Evolution of the scores for each architecture in Task T2 for the user “subject2700” (panels: (a) EarlyModel, (b) EARLIEST, (c) SS3). The decision policy of the EarlyModel and SS3 models is shown in grey, and the points at which the models decided to issue an alarm are indicated by the red dotted line. The EarlyModel corresponds to UNSL#0, EARLIEST to UNSL#1, and SS3 to UNSL#3.

Storage per User. In a real-world scenario, early detection approaches may help to identify at-risk users through large-scale passive monitoring of social media. However, in such large-scale systems, these approaches must not only process user posts efficiently as they are created, but must also be efficient in terms of the information the model needs to store in order to make predictions. This information could be attached to each user, for instance, by storing it along with other user-related information inside the system. For instance, the EARLIEST approach only needs the current post and the last hidden state produced by the LSTM to make a prediction. Therefore, storing only the last hidden state produced by the LSTM for a given user would be enough to make a future prediction when that user writes a new post, as part of the passive monitoring. Likewise, in the case of the SS3 approach, storing only the last computed score for the user would be enough to make a future prediction; when the user creates a new post, his/her last stored score is retrieved and updated using only the 𝑔𝑣 values of the words in the new post. On the other hand, in the case of the EarlyModel, the information that needs to be stored depends on the classifier and the representation being used. That is, if the feature vector of the representation being used can be computed and updated sequentially, as new posts are created, all the information required to carry out this update must be stored for each user. Otherwise, the complete sequence of posts needs to be kept stored.

Streaming Support. Among the models presented, the EarlyModel was the only one not able to handle posts coming from streaming data. This is a drawback implicitly embedded in the model, since each time a new post arrives the EarlyModel has to rebuild the representation of the entire post history. The only way to alleviate this is through an intermediate representation that supports streaming data. For example, for the bag-of-words representation using tf-idf, the term frequency of every word for every user could be stored and updated as new posts come in; the final representation for each user could then be built by normalizing the frequencies. On the other hand, SS3 and EARLIEST are able to handle streaming input naturally since they work with sequence data.

6. Conclusions and Future Work
This paper described three different early alert policies to tackle the early risk detection problem. Furthermore, the three approaches were compared in terms of different characteristics, such as performance, simplicity, and adaptability. As shown in Sections 3 and 4, the models introduced in this work obtained the best performance for the measures 𝐹1, ERDE50, and 𝐹latency for both tasks. Also, for most of the ranking-based evaluation metrics, these models achieved the best results. Nevertheless, further research should focus on:

• Reducing the time spent building the features of the EarlyModel.
• Defining different ways of normalizing the scores for SS3.
• Stabilizing the learning phase of EARLIEST so its decisions are more robust.
• Determining the reasons why the explicit inclusion of the negative class in the discriminator of EARLIEST impaired the model’s ability to estimate the risk level of users.

Finally, this article has been one of the first attempts to thoroughly examine the role of the alert policy in the early risk detection problem for the CLEF eRisk Lab.
In general, other articles have focused only on classification with partial information, leaving the decision about the moment of classification in the background. We consider this component of the problem to be almost as important as classification with partial information, and we hope that more research groups start tackling it. We believe EARLIEST could be a first step towards a model that learns both parts of the problem.

7. Acknowledgements
This work was supported by the CONICET P-UE 22920160100037.

References
[1] D. E. Losada, F. Crestani, A test collection for research on depression and language use, in: Proc. of Conference and Labs of the Evaluation Forum (CLEF 2016), Evora, Portugal, 2016, pp. 28–39.
[2] D. E. Losada, F. Crestani, J. Parapar, eRisk 2017: CLEF lab on early risk prediction on the internet: experimental foundations, in: International Conference of the Cross-Language Evaluation Forum for European Languages, Springer, 2017, pp. 346–360.
[3] D. E. Losada, F. Crestani, J. Parapar, Overview of eRisk: early risk prediction on the internet, in: International Conference of the Cross-Language Evaluation Forum for European Languages, Springer, 2018, pp. 343–361.
[4] D. E. Losada, F. Crestani, J. Parapar, Overview of eRisk 2019 early risk prediction on the internet, in: International Conference of the Cross-Language Evaluation Forum for European Languages, Springer, 2019, pp. 340–357.
[5] F. Sadeque, D. Xu, S. Bethard, Measuring the latency of depression detection in social media, in: Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, 2018, pp. 495–503.
[6] D. E. Losada, F. Crestani, J. Parapar, Overview of eRisk at CLEF 2020: Early risk prediction on the internet (extended overview) (2020).
[7] J. M. Loyola, M. L. Errecalde, H. J. Escalante, M. M. y Gomez, Learning when to classify for early text classification, in: Argentine Congress of Computer Science, Springer, 2017, pp. 24–34.
[8] S. G. Burdisso, M. Errecalde, M. Montes-y Gómez, A text classification framework for simple and effective early depression detection over social media streams, Expert Systems with Applications 133 (2019) 182–197. doi:10.1016/j.eswa.2019.05.023.
[9] R. Feldman, J. Sanger, et al., The text mining handbook: advanced approaches in analyzing unstructured data, Cambridge University Press, 2007.
[10] J. W. Pennebaker, R. L. Boyd, K. Jordan, K. Blackburn, The development and psychometric properties of LIWC2015, Technical Report, 2015.
[11] Q. Le, T. Mikolov, Distributed representations of sentences and documents, in: International Conference on Machine Learning, 2014, pp. 1188–1196.
[12] D. M. Blei, A. Y. Ng, M. I. Jordan, Latent Dirichlet allocation, Journal of Machine Learning Research 3 (2003) 993–1022.
[13] T. K. Landauer, S. T. Dumais, A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge, Psychological Review 104 (1997) 211.
[14] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.
[15] R. Rehurek, P. Sojka, Software framework for topic modelling with large corpora, in: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, Citeseer, 2010.
[16] L. Breiman, Random forests, Machine Learning 45 (2001) 5–32.
[17] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation 9 (1997) 1735–1780.
[18] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[19] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, A. M. Rush, Transformers: State-of-the-art natural language processing, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics, Online, 2020, pp. 38–45. URL: https://www.aclweb.org/anthology/2020.emnlp-demos.6.
[20] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, S. Chintala, PyTorch: An imperative style, high-performance deep learning library, in: H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, R. Garnett (Eds.), Advances in Neural Information Processing Systems 32, Curran Associates, Inc., 2019, pp. 8024–8035. URL: http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.
[21] T. Hartvigsen, C. Sen, X. Kong, E. Rundensteiner, Adaptive-halting policy network for early classification, in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019, pp. 101–110.
[22] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, arXiv preprint arXiv:1301.3781 (2013).
[23] S. G. Burdisso, M. Errecalde, M. Montes-y Gómez, UNSL at eRisk 2019: a unified approach for anorexia, self-harm and depression detection in social media, in: Working Notes of CLEF 2019, CEUR Workshop Proceedings, Lugano, Switzerland, 2019.
[24] S. G. Burdisso, M. Errecalde, M. Montes-y Gómez, 𝜏-SS3: A text classifier with dynamic n-grams for early risk detection over text streams, Pattern Recognition Letters 138 (2020) 130–137. doi:10.1016/j.patrec.2020.07.001.
[25] R. Martínez-Castaño, A. Htait, L. Azzopardi, Y. Moshfeghi, Early risk detection of self-harm and depression severity using BERT-based transformers: iLab at CLEF eRisk 2020, Early Risk Prediction on the Internet (2020).
[26] B. Greenhill, M. D. Ward, A. Sacks, The separation plot: A new visual method for evaluating the fit of binary models, American Journal of Political Science 55 (2011) 991–1002.
[27] R. Kumar, C. Carroll, A. Hartikainen, O. Martin, ArviZ a unified library for exploratory analysis of Bayesian models in Python, Journal of Open Source Software 4 (2019) 1143. URL: https://doi.org/10.21105/joss.01143. doi:10.21105/joss.01143.
[28] J. Baumgartner, S. Zannettou, B. Keegan, M. Squire, J. Blackburn, The Pushshift Reddit dataset, in: Proceedings of the International AAAI Conference on Web and Social Media, volume 14, 2020, pp. 830–839.
[29] S. G. Burdisso, M. Errecalde, M. Montes-y Gómez, PySS3: A Python package implementing a novel text classifier with visualization tools for explainable AI, arXiv preprint arXiv:1912.09322 (2019).