UNSL’s participation at eRisk 2018 Lab

Dario G. Funez1, Ma. José Garciarena Ucelay1, Ma. Paula Villegas1, Sergio G. Burdisso1,2, Leticia C. Cagnina1,2, Manuel Montes-y-Gómez3, and Marcelo L. Errecalde1

1 LIDIC Research Group, Universidad Nacional de San Luis, Argentina
2 Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET)
3 Instituto Nacional de Astrofísica, Óptica y Electrónica (INAOE)
{funezdario,mjgarciarenaucelay,villegasmariapaula74}@gmail.com,
{sergio.burdisso,lcagnina}@gmail.com, mmontesg@inaoep.mx, merrecalde@gmail.com

Abstract. In this paper we describe the participation of the LIDIC Research Group of Universidad Nacional de San Luis (UNSL), Argentina, at the CLEF eRisk 2018 Lab. The main goal of this Lab is to consider early risk detection scenarios in which getting timely predictions with a reasonable confidence level becomes critical. We used two completely different approaches, which we will refer to as flexible temporal variation of terms (FTVT) and sequential incremental classification (SIC). FTVT is a semantic representation of documents that explicitly considers the partial information made available to early risk detection systems in the different “chunks” over time. FTVT is an improvement on the TVT method [1] that allows varying the number of chunks used in the representation according to the “level of urgency” required in the classification. SIC is a novel approach to text categorization that incrementally estimates the degree to which a piece of text belongs to the different categories through an accumulative process of evidence. In the test stage, FTVT obtained the lowest ERDE_5 error in both pilot tasks and SIC achieved the highest precision in the anorexia detection task, providing strong evidence that both approaches used by our team are interesting alternatives for dealing with early risk detection tasks.

Keywords: Early Risk Detection, Early Depression Detection, Early Anorexia Detection, Semantic Analysis Techniques, Flexible Temporal Variation of Terms, Incremental Classification.

1 Introduction

The increasing use of the Internet, social networks and other computer technologies allows the extraction of valuable information for the early prevention of certain risks. In this context, early risk detection (ERD) on the Internet is an important research area due to the impact it might have in areas like health, when people suffer from depression, anorexia or other disorders that can threaten life, and safety, when criminals and sex offenders attack using web technologies.

Like other predictive tasks, ERD methods have mainly been based on supervised machine learning approaches. In those cases, the task is generally addressed as a standard binary classification problem with two unbalanced classes: a minority (risky) positive class and a majority (control) negative class. However, beyond the difficulty that unbalanced classes present to learning algorithms, ERD introduces an added problem that is not usually present in other classification tasks: the incremental classification of sequential data (ICSD). To effectively support ICSD, two important aspects need to be considered. First, we must provide an adequate way to “remember” or “summarize” the historical information read up to specific points in time. The informativeness of these partial models will be critical to the effectiveness of the classifier in charge of detecting risky cases.
Second, these models also need to support a very important aspect of ERD: the decision of when (how soon) the system should stop reading from the input stream and classify it with an acceptable level of accuracy. This aspect, which we will refer to as support for early classification, is basically a multi-objective decision problem that attempts to balance accurate and timely classifications [7]. In fact, common evaluation measures of supervised classification like precision, recall and F-measure are no longer adequate in those cases because they do not take “time” into account. Thus, new “temporal” measures that penalize the system’s delay in detecting risky cases are required. This is the case of the ERDE_o error introduced in [4] and used in the 2017 eRisk pilot task [5], which allows specifying a threshold (the o value) beyond which the penalty rapidly grows to 1.

The eRisk 2018 Lab presented two challenging ERD tasks: early detection of signs of depression (task 1), and early detection of signs of anorexia (task 2). We participated in both tasks with two different approaches to deal with the ICSD issue: one that we will refer to as flexible temporal variation of terms (FTVT) and the other named sequential incremental classification (SIC). FTVT is a document representation that deals with the ICSD problem by keeping sequential information about the variation of terms occurring in the different chunks. The hypothesis behind this approach is that these variations can be informative for detecting a risky case. SIC is a sequential approach that incrementally reads and estimates the evidence that words provide for both the positive and the negative class. SIC classifies a subject as risky as soon as the accumulated evidence for the risky (positive) class surpasses the evidence for the negative one.

The experiments carried out on the training sets for both tasks were mainly aimed at determining adequate parameters for training the models (classifiers) for the test stage. Preliminary results reported by the Lab’s organizers showed that our systems obtained the best (lowest) ERDE_5 error in both pilot tasks and that SIC achieved the highest precision in the anorexia detection task, providing strong evidence that the approaches used are interesting alternatives for dealing with early risk detection tasks.

The rest of the article is organized as follows: Section 2 gives general information about the data sets used in both pilot tasks and the methods used in our ERD systems. Next, Section 3 describes the activities carried out in the training stage and presents the rationale behind the main design decisions made in our ERD systems. Section 4 shows the performance of our methods on the eRisk 2018 data sets released in the test stage. Finally, Section 5 presents our conclusions and potential future work.

2 Data sets and methods

2.1 Data Sets

The data sets supplied for the eRisk 2018 tasks (http://early.irlab.org/task.html) are described in Losada et al. [6]. Both collections of writings (posts or comments), one for task 1 and one for task 2, were extracted from social media. For each user (a document in the data sets), the collections contain a sequence of writings (in chronological order) which has been partitioned into 10 chunks. The first chunk contains the oldest 10% of the messages, the second chunk contains the second oldest 10%, and so forth.
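As an illustration of this setup, the following minimal Python sketch partitions a user’s chronologically ordered writings into ten chunks in the way just described. The function name and data layout are our own hypothetical choices, not code released by the Lab:

    def split_into_chunks(writings, n_chunks=10):
        """Partition a chronologically ordered list of writings into
        n_chunks consecutive chunks: the first chunk holds the oldest
        ~10% of the messages, the second the next oldest 10%, etc."""
        n = len(writings)
        return [writings[(i * n) // n_chunks:((i + 1) * n) // n_chunks]
                for i in range(n_chunks)]

    # Hypothetical usage: posts are assumed sorted from oldest to newest.
    chunks = split_into_chunks(["first post", "second post", "third post"])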
The corpus of task 1 is related to depression and that of task 2 to anorexia. In the first one there are two categories of users, “depressed” and “non-depressed”, while in the anorexia corpus the users are “anorexic” and “non-anorexic”. The depression collection was split into a training and a test set that we will refer to as TR_DS and TE_DS, respectively. The TR_DS set contains 887 users (135 positive, 752 negative) and the TE_DS set contains 820 users (79 positive, 741 negative). The users labeled as positive are those who have explicitly mentioned that they have been diagnosed with depression. The anorexia corpus was split into a training and a test set that we will refer to as TR_AX and TE_AX, respectively. The TR_AX set contains 152 users (20 positive, 132 negative) and the TE_AX set contains 320 users (41 positive, 279 negative). In this case, the users labeled as positive are those who have been diagnosed with anorexia.

Each task was divided by the organizers into a training stage and a test stage. In the first one, the participating teams had access to all ten chunks of all training users and could therefore tune their systems with the training data. Then, in the test stage, the ten chunks of the test set were gradually released by the organizers, one by one, until all the chunks corresponding to the complete writings of the considered individuals had been released. Each time a chunk ch_i was released, participants in the pilot tasks were asked to give their predictions on the users contained in the test set, based on the partial information read from chunks ch_1 to ch_i. Once the class of an incoming stream is predicted, that decision is irreversible (it cannot be undone).

2.2 Methods

To deal with the problems posed in both pilot tasks we used the two methods previously referred to as FTVT and SIC, which we describe below. An interesting aspect of these methods is that they are completely domain-independent. Thus, they do not require costly adaptation processes for each task, beyond the tuning of parameters that could depend on the data set used. In fact, due to time limitations in carrying out the experimental study, both methods were only evaluated on the data set of task 1 (early depression detection) and the same parameters were used for task 2 (early anorexia detection).

Space constraints prevent us from giving detailed explanations of FTVT and SIC. However, the interested reader can find in [1] more implementation details of the TVT method on which FTVT is based. SIC is only introduced from an intuitive point of view because the method is currently under review in a scientific journal; readers interested in deeper technical details of both methods can find more information at https://sites.google.com/site/lcagnina/technicalreport-ftvt and https://sites.google.com/site/lcagnina/technicalreport-sic.

Flexible Temporal Variation of Terms (FTVT)

The flexible temporal variation of terms (FTVT) method is an improvement of the temporal variation of terms (TVT) method [1], an approach for early risk detection that uses the temporal variation of terms between chunks as the concept space of a concise semantic analysis (CSA) approach [2]. The main characteristic of the original TVT is that it addresses the imbalance of the minority class with information from the first 4 “chunks” of the users (a number that was determined empirically). FTVT provides a more flexible approach than TVT by allowing the specification of a different number of chunks n for the distinct systems. This small extension of TVT is not a minor aspect: several studies with FTVT showed that, depending on the urgency level required for the ERD task (determined by the threshold o), the number n used in FTVT produces very different ERDE_o values. However, beyond this small difference there are no conceptual differences between the two approaches and, therefore, we will only give a short description of the original TVT approach.

As we said previously, TVT is based on the concise semantic analysis (CSA) technique proposed in [2] and later extended in [3] for author profiling tasks. CSA is a semantic analysis technique that interprets words and text fragments in a space of concepts that are close (or equal) to the category labels. For instance, if the documents in the data set are labeled with q different category labels (usually no more than 100), words and documents will be represented in a q-dimensional space. That space is usually much smaller than standard BoW representations, whose size directly depends on the vocabulary size (more than 10,000 or 20,000 elements in general). CSA has been used in general text categorization tasks [2] and has been adapted to author profiling tasks under the name of Second Order Attributes (SOA) [3].
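To make the idea of a concept space concrete, the sketch below builds a CSA-like representation in which terms, and then documents, live in a q-dimensional space indexed by the category labels. The log-dampened weighting is our own minimal choice; the actual CSA and SOA formulations in [2] and [3] are more elaborate, and TVT/FTVT additionally extend this concept space with chunk-based concepts, as described next:

    import math
    from collections import defaultdict

    def build_term_concept_vectors(docs, labels):
        """Map each term to a vector over the category 'concepts'.
        Here the weight of term t for category c is a log-dampened
        count of t in the documents of c (a simplification of [2])."""
        term_vectors = defaultdict(lambda: defaultdict(float))
        for doc, label in zip(docs, labels):
            for term in doc.split():
                term_vectors[term][label] += 1.0
        for vec in term_vectors.values():
            for c in vec:
                vec[c] = 1.0 + math.log(vec[c])
        return term_vectors

    def represent_document(doc, term_vectors, categories):
        """A document is the sum of the concept vectors of its terms,
        i.e., a point in the q-dimensional concept space."""
        rep = [0.0] * len(categories)
        for term in doc.split():
            for i, c in enumerate(categories):
                rep[i] += term_vectors.get(term, {}).get(c, 0.0)
        return rep

    # Hypothetical usage with q = 2 concepts:
    tv = build_term_concept_vectors(["pasta sushi rice", "guitar drums sushi"],
                                    ["food", "music"])
    print(represent_document("sushi guitar", tv, ["food", "music"]))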
In this context, the underlying idea of TVT is that variations in the terms used at different sequential stages of the documents may carry relevant information for the classification task. With this idea in mind, the method enriches the documents of the minority class with the partial documents read in the first 4 chunks; these chunks correspond to the minority (depressed or positive) class. TVT also uses the complete documents (chunk 10). All this information is taken as a new concept space for a CSA method. TVT naturally copes with the sequential characteristics of ERD problems and also provides a tool for dealing with unbalanced data sets. Preliminary results of this method in comparison with CSA and BoW representations [1] showed its potential to deal with ERD problems. FTVT, the variant of TVT used in the present work, arose from our observation that, by varying the number n of initial chunks, different performance can be achieved depending on the ERDE_o measure used to evaluate the results.

Sequential Incremental Classification (SIC)

Sequential incremental classification (SIC) is a very simple method. During the training phase a dictionary of words is built for each category, in which the frequency of each word is stored. Then, during the classification stage, those word frequencies are used to compute a value for each word with a function gv(w, c) that values words in relation to categories. gv takes a word w and a category c and outputs a number in the interval [0, 1] representing the degree of confidence with which w is believed to belong exclusively to c. For instance, given the categories C = {food, music, health, sports}, we could have:

gv(‘sushi’, food) = 0.85;   gv(‘the’, food) = 0;
gv(‘sushi’, music) = 0.09;  gv(‘the’, music) = 0;
gv(‘sushi’, health) = 0.50; gv(‘the’, health) = 0;
gv(‘sushi’, sports) = 0.02; gv(‘the’, sports) = 0;

Additionally, the vector gv(w) = (gv(w, c_0), gv(w, c_1), ..., gv(w, c_k)) is defined, where c_i ∈ C (the set of all the categories). That is, gv applied to a word alone outputs a vector in which each component is the gv of that word for each category c_i. For instance, following the example above, we have:

gv(‘sushi’) = (0.85, 0.09, 0.5, 0.02);
gv(‘the’) = (0, 0, 0, 0);

We call the vector gv(w) the “confidence vector of w”. Note that each category c_i is assigned a fixed position i in gv(w); for instance, in the example above, (0, 0, 0, 0) is the confidence vector of “the”, where the first position corresponds to food, the second to music, and so on.

Classification is finally carried out, for each subject, by means of the cumulative sum of the confidence vectors of all the words; in symbols:

d = Σ_{w ∈ S} gv(w)

where S is the subject’s writing history. Note that d is a vector with two components, one for the positive class (depressed or anorexic) and one for the negative (control) class. The policy for classifying a subject as positive was based on analyzing how d changed over time (i.e., over “chunks”), as shown with an example in Figure 1 for a depression case. Subjects were classified as depressed when the cumulated positive value exceeded the negative one; for instance, the subject in the figure was classified as depressed after reading the 5th chunk.

Fig. 1. Subject-9579’s cumulated positive and negative confidence values over time (chunks).

It is worth mentioning that, to compute gv, we used two other functions, lv and weight, as follows:

gv(w, c) = lv_σ(w, c) × weight_λ(w, c)

– lv_σ(w, c) values a word based on the local frequency of w in c. As part of this process, the word distribution curve is smoothed by a factor controlled by the hyperparameter σ.
– weight_λ(w, c) decreases lv in relation to the lv values of w for the other categories: the more categories c_i for which lv_σ(w, c_i) is high, the smaller the weight_λ(w, c) value. The λ hyperparameter controls how sensitive this sanction is.
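The following minimal sketch illustrates the SIC decision process under our own simplifying assumptions: the confidence vectors are taken as given (in SIC they come from gv(w, c) = lv_σ(w, c) × weight_λ(w, c)), and the GV table, word values and chunk texts are hypothetical:

    # Hypothetical (positive, negative) confidence vectors per word.
    GV = {"sad": (0.7, 0.0), "happy": (0.0, 0.4), "the": (0.0, 0.0)}

    def sic_classify(chunks):
        """Accumulate word-level evidence chunk by chunk; return
        ('positive', i) as soon as the cumulated positive value
        exceeds the negative one after chunk i."""
        pos = neg = 0.0
        for i, chunk in enumerate(chunks, start=1):
            for word in chunk.split():
                p, n = GV.get(word, (0.0, 0.0))
                pos += p
                neg += n
            if pos > neg:
                return "positive", i  # early, irreversible decision
        return "negative", None

    label, decided_at = sic_classify(["the happy day", "sad sad news"])
    # -> ('positive', 2): the subject is flagged after the 2nd chunk.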
3 Experimental Setting

As mentioned above, this year there were two tasks: one for the early detection of depression cases (task 1) and the other for the early detection of people with anorexia (task 2). We only used the TR_DS data set for setting the parameters of our methods because it is the largest one and, therefore, it seemed more appropriate for obtaining reliable statistics.

In order to find the best values for the parameters of our methods we performed a five-fold cross validation on the depression training set. Hence, we divided the TR_DS set into five folds (see Table 1). The folds maintain the same proportions of both kinds of users and were randomly selected. Each fold also kept the division into 10 chunks as provided by the organizers. We trained the classifiers with four folds and tested on the fifth one. This process was repeated four more times, always choosing different folds, and the results were then averaged.

Table 1. Distribution of the TR_DS set in folds.

Fold   Positive  Negative  Total
1      27        150       177
2      27        151       178
3      27        151       178
4      27        150       178
5      27        150       177
Total  135       752       887

We used the flexible temporal variation of terms (FTVT) representation described previously to represent the documents. For this representation, a decision must be made about the number n of chunks that will enrich the minority (positive) class. We considered different values for n; in particular, we selected n from 0 to 5 initial chunks. FTVT was evaluated with different learning algorithms such as Logistic Regression (LR), Support Vector Machines (SVM) and Naïve Bayes (NB), among others. We used the implementations provided in the scikit-learn package for Python 2.7 with the default parameters, that is, penalty = ‘l2’ and C = 1 for both SVM and LR.

We used the probability p assigned by the classifier to decide when to stop reading a document and give its classification. Thus, our approach considers that when the probability p assigned to the positive class exceeds some particular threshold θ (p ≥ θ), the instance/document is classified as positive. We used different thresholds θ: 0.9, 0.8, 0.7 and 0.6.
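A minimal sketch of this probability-threshold decision rule, assuming a scikit-learn classifier that exposes predict_proba (for an SVM this requires enabling probability estimates, e.g., SVC(probability=True)); the toy feature matrix below merely stands in for the actual FTVT document vectors:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    THETA = 0.8  # confidence threshold for the positive class

    # Toy stand-ins for FTVT vectors; 1 = positive (risky), 0 = control.
    X_train = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.1, 0.9]])
    y_train = np.array([1, 0, 1, 0])
    clf = LogisticRegression(penalty="l2", C=1.0).fit(X_train, y_train)

    def decide(chunk_vectors):
        """Emit a positive decision at the first chunk whose positive-class
        probability reaches THETA; otherwise keep reading (None means:
        wait for more chunks)."""
        for i, x in enumerate(chunk_vectors, start=1):
            p = clf.predict_proba(x.reshape(1, -1))[0, 1]  # P(positive)
            if p >= THETA:
                return "positive", i
        return None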
We evaluated the performance of our approaches with the early risk detection error (ERDE) measure proposed in [4]. This measure takes into account not only the correctness of the decision made by the system but also the delay in making that decision. ERDE uses specific costs to penalize false positives and false negatives. However, ERDE treats the two possible successful predictions (true negatives and true positives) differently. True negatives have no cost (cost = 0), but ERDE associates with true positives a cost for the delay in detection that monotonically increases with the number k of textual items seen before giving the answer. In a nutshell, that cost is low when k is lower than a threshold value o but rapidly approaches 1 when k > o. In that way, o represents some type of “urgency” in detecting positive cases: the lower the o value, the higher the urgency in detecting the positive cases. A more detailed description of ERDE can be found in [4]. We considered the two values of o employed in both editions (2017 and 2018) of this pilot task: o = 5 (ERDE_5) and o = 50 (ERDE_50).
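For reference, the per-subject cost can be sketched as follows, based on our reading of the definition in [4]: c_fn and c_tp are set to 1 there, while c_fp is tied to the proportion of positive cases in the collection (the default below is only a placeholder):

    import math

    def latency_cost(k, o):
        """lc_o(k) = 1 - 1/(1 + e^(k - o)): low while k < o,
        rapidly approaching 1 once k exceeds o."""
        if k - o > 700:  # avoid math.exp overflow for very long delays
            return 1.0
        return 1.0 - 1.0 / (1.0 + math.exp(k - o))

    def erde(decision, truth, k, o, c_fp=0.1, c_fn=1.0, c_tp=1.0):
        """Per-subject ERDE_o, where k is the number of textual items
        read before deciding; see [4] for the exact cost settings."""
        if decision == "positive" and truth == "negative":
            return c_fp                       # false positive
        if decision == "negative" and truth == "positive":
            return c_fn                       # false negative
        if decision == "positive" and truth == "positive":
            return latency_cost(k, o) * c_tp  # delayed true positive
        return 0.0                            # true negative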
Due to space constraints, only the combinations of n (the FTVT parameter), θ (the probability threshold to classify an instance as positive) and the classifier used that obtained the best values of the ERDE_5 and ERDE_50 metrics are shown. These results are presented in Tables 2 and 3, respectively.

Table 2. Best performance of FTVT for the ERDE_5 metric.

n  Classifier   θ    ERDE_5
0  SVM          0.8  13.58
1  LR           0.9  13.75
2  LR           0.8  13.74
3  LR           0.9  13.68
4  SVM          0.9  13.84
5  SVM          0.9  13.87

Table 3. Best performance of FTVT for the ERDE_50 metric.

n  Classifier    θ    ERDE_50
0  SVM           0.6  10.25
1  SVM (or LR)   0.6  9.91
2  LR            0.6  9.61
3  SVM           0.6  9.77
4  SVM           0.7  9.59
5  SVM (or LR)   0.7  9.69

If we analyze Table 2, it can be seen that small values of n and a high (more restrictive) threshold θ generate a lower ERDE_5. In particular, the best configuration for ERDE_5 is n = 0 and p ≥ 0.8 with the SVM algorithm, obtaining 13.58. A higher threshold means that more confidence is necessary to classify a user as positive: since the urgency level to decide is also high, whatever is classified as positive has to be precise, otherwise the penalty is higher. On the other hand, with regard to the ERDE_50 metric, we can see in Table 3 that the best thresholds are a little lower than in Table 2 (θ = 0.6 and θ = 0.7). In particular, the best result is 9.59, obtained with n = 4, p ≥ 0.7 and SVM as classifier. However, the performance achieved with n = 2 is also good enough.

From these results we can conclude that FTVT with n = 0, in combination with the SVM classifier and a probability threshold θ = 0.8, seems adequate for ERDE_5; hereafter, this configuration will be referred to as UNSLA. FTVT with n = 2, in combination with Logistic Regression and θ = 0.6, seems to be an adequate balance between ERDE_5 and ERDE_50 (UNSLB). Finally, FTVT with n = 4, in combination with SVM and θ = 0.7, looks like a reasonable alternative for the ERDE_50 metric (UNSLC).

Regarding SIC, no hyperparameter optimization was done and the same hyperparameter values were used for both tasks (anorexia and depression). The hyperparameter values were arbitrarily set to σ = 0.5 for both UNSLD and UNSLE, and to λ = 3 and λ = 7 for UNSLD and UNSLE, respectively. UNSLD was meant to be less sensitive in penalizing words, thus considering more words as “important” than UNSLE, hence favoring UNSLD to have a higher recall, but at the risk of a worse precision.

In summary, from the above study we selected the settings shown in Table 4 to participate in the 2018 pilot tasks.

Table 4. Settings of the submitted approaches.

Submitted approach  Method  Parameters        Learning algorithm
UNSLA               FTVT    n = 0, θ = 0.8    SVM
UNSLB               FTVT    n = 2, θ = 0.6    LR
UNSLC               FTVT    n = 4, θ = 0.7    SVM
UNSLD               SIC     σ = 0.5, λ = 3    SIC
UNSLE               SIC     σ = 0.5, λ = 7    SIC

4 Evaluation Stage

Our five systems, three variants of FTVT (UNSLA, UNSLB and UNSLC) and two variants of SIC (UNSLD and UNSLE), were trained with the full training set of pilot task 1 (TR_DS) and tested on the corresponding TE_DS set (see Table 5). In the same way, for task 2 the methods were trained with TR_AX and tested on the corresponding TE_AX set (see Table 6). Both test sets were incrementally released during the testing phase of the pilot tasks.

Table 5. Best in the ranking of the depression pilot task (TE_DS set); π denotes precision and ρ recall.

             ERDE_5  ERDE_50  F1     π      ρ
UNSLA        8.78*   7.39     0.38   0.48   0.32
UNSLB        8.94    7.24     0.40   0.35   0.46
UNSLC        8.82    6.95     0.43   0.38   0.49
UNSLD        10.68   7.84     0.45   0.31   0.85
UNSLE        9.86    7.60     0.60   0.53   0.70
FHDO-BCSGB   9.50    6.44*    0.64*  0.64   0.65
RKMVERIC     9.81    9.08     0.48   0.67*  0.38
UDCB         15.79   11.95    0.18   0.10   0.95*

In Table 5 we show the results of our 5 submissions together with the results of the systems that obtained the best ERDE_5, ERDE_50, F1, precision and recall in the eRisk depression pilot task, as reported in [6]. The best value in each column is marked with an asterisk (*). We can observe that our UNSLA obtained the best ERDE_5 value. On the other hand, FHDO-BCSGB achieved the best ERDE_50 and F-measure, although our UNSLC obtained a quite similar (slightly worse) ERDE_50 value. UNSLE obtained the 3rd best F1 (0.60) (the 1st and 2nd belonged to the FHDO-BCSG team) and UNSLD obtained the 2nd best recall (0.85), although the system with the best recall (UDCB) had a very low precision (0.10).

Table 6 shows similar results for the anorexia pilot task. As we can see, our system (UNSLB in this case) again obtained the best ERDE_5, and UNSLD obtained the best precision value. At this point it is important to note that, as stated in the previous section, we did not perform a parameter optimization of our methods for the anorexia task. It is therefore not a minor result that our systems performed well in a domain different from the one used for setting the parameters; such domain independence is a really important property of classifier systems for real tasks. With these results, we can conclude that our proposals are very reasonable and competitive alternatives for ERD tasks.

Table 6. Best in the ranking of the anorexia pilot task (TE_AX set); π denotes precision and ρ recall.
             ERDE_5  ERDE_50  F1     π      ρ
UNSLA        12.48   12.00    0.17   0.57   0.10
UNSLB        11.40*  7.82     0.61   0.75   0.51
UNSLC        11.61   7.82     0.61   0.75   0.51
UNSLD        12.93   9.85     0.79   0.91*  0.71
UNSLE        12.93   10.13    0.74   0.90   0.63
FHDO-BCSGD   12.15   5.96*    0.81   0.75   0.88*
FHDO-BCSGE   11.98   6.61     0.85*  0.87   0.83

5 Conclusions and future work

This article presented the participation of UNSL at the eRisk 2018 pilot tasks on early detection of depression and anorexia. We used two completely different approaches to deal with those tasks: one based on the FTVT representation and the other on a simple method named SIC. These approaches proved to be very effective in both types of tasks, obtaining the best ERDE_5 value over all participants in both tasks and the best precision value in the anorexia task. Besides, although we did not achieve the best ERDE_50 value, our results were very close to it. Thus, the performance of our systems seems to indicate that the methods used are very robust approaches for ERD tasks.

There are, moreover, other aspects of our systems that we consider relevant. First of all, they are completely independent of the domain because they rely only on the terms present in the training set. That is to say, they do not require a costly feature engineering process or very complex hand-crafted features specific to the problem under consideration. That independence was evident in this Lab, where parameter setting was carried out on only one of the data sets (depression) and the same configuration was used on the other one (anorexia). The excellent results obtained in both cases provide strong evidence of this independence and robustness. Another aspect that deserves special attention is that both approaches use very simple rules to decide when to stop reading and classify a user as positive. That contrasts with other approaches that require very complex and difficult-to-understand methods to make those decisions.

As future work we plan to extend the use of FTVT and SIC to other ERD problems such as the identification of sexual predators, people with suicidal tendencies, and early rumour detection. In those cases, we consider that the ease with which our methods can be migrated from one domain to another makes these applications a rather trivial process.

References

1. Marcelo L. Errecalde, Ma. Paula Villegas, Dario G. Funez, Ma. José Garciarena Ucelay, and Leticia C. Cagnina. Temporal variation of terms as concept space for early risk prediction. In Working Notes of CLEF 2017 - Conference and Labs of the Evaluation Forum, Vol. 1866, 2017.
2. Zhixing Li, Zhongyang Xiong, Yufang Zhang, Chunyong Liu, and Kuan Li. Fast text categorization using concise semantic analysis. Pattern Recognition Letters, 32(3):441–448, February 2011.
3. Adrián Pastor López-Monroy, Manuel Montes-y-Gómez, Hugo Jair Escalante, Luis Villaseñor-Pineda, and Efstathios Stamatatos. Discriminative subprofile-specific representations for author profiling in social media. Knowledge-Based Systems, 89:134–147, 2015.
4. David E. Losada and Fabio Crestani. A test collection for research on depression and language use. In International Conference of the Cross-Language Evaluation Forum for European Languages, pages 28–39. Springer, 2016.
5. David E. Losada, Fabio Crestani, and Javier Parapar. eRisk 2017: CLEF lab on early risk prediction on the internet: Experimental foundations. In International Conference of the Cross-Language Evaluation Forum for European Languages, pages 346–360. Springer, 2017.
6. David E. Losada, Fabio Crestani, and Javier Parapar. Overview of eRisk – Early Risk Prediction on the Internet. In Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Ninth International Conference of the CLEF Association (CLEF 2018), Avignon, France, 2018.
7. Zhengzheng Xing, Jian Pei, and Eamonn Keogh. A brief survey on sequence classification. ACM SIGKDD Explorations Newsletter, 12(1):40–48, 2010.