<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Overview of eRisk 2018: Early Risk Prediction on the Internet (extended lab overview)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>David E. Losada</string-name>
          <email>david.losada@usc.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabio Crestani</string-name>
          <email>fabio.crestani@usi.ch</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Javier Parapar</string-name>
          <email>javierparapar@udc.es</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Centro Singular de Investigación en Tecnoloxías da Información (CiTIUS)</institution>
          ,
          <addr-line>Universidade de Santiago de Compostela</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Faculty of Informatics, Universitá della Svizzera italiana (USI)</institution>
          ,
          <country country="CH">Switzerland</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Information Retrieval Lab, University of A Coruña</institution>
        </aff>
      </contrib-group>
      <issue>1</issue>
      <abstract>
        <p>This paper provides an overview of eRisk 2018. This was the second year that this lab was organized at CLEF. The main purpose of eRisk was to explore issues of evaluation methodology, effectiveness metrics and other processes related to early risk detection. Early detection technologies can be employed in different areas, particularly those related to health and safety. The second edition of eRisk had two tasks: a task on early risk detection of depression and a task on early risk detection of anorexia.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The main purpose of this lab is to explore issues of evaluation methodologies,
performance metrics and other aspects related to building test collections and defining
challenges for early risk detection. Early detection technologies are potentially useful in
different areas, particularly those related to safety and health. For example, early alerts
could be sent when a person starts showing signs of a mental disorder, when a sexual
predator starts interacting with a child, or when a potential offender starts publishing
antisocial threats on the Internet. In 2017, our main goal was to pioneer a new
interdisciplinary research area that would be potentially applicable to a wide variety of profiles,
such as potential paedophiles, stalkers, individuals with a latent tendency to fall into the
hands of criminal organisations, people with suicidal inclinations, or people susceptible
to depression.</p>
      <p>
        The 2017 lab had two possible ways to participate. One of them followed a classical
workshop pattern. This workshop was open to the submission of papers describing test
collections or data sets suitable for early risk prediction or early risk prediction
challenges, tasks and evaluation metrics. This open submission format was discontinued
in 2018. eRisk 2017 also included an exploratory task on early detection of depression.
This pilot task was based on the evaluation methodology and test collection presented in
a CLEF 2016 paper [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The interaction between depression and language use is
interesting for early risk detection algorithms. We shared this collection with all participating
teams and the 2017 participants approached the problem with multiple technologies and
models (e.g. Natural Language Processing, Machine Learning, Information Retrieval,
etc.). However, the effectiveness of all participating systems was relatively low [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. For
example, the highest F1 was 64%. This suggests that the 2017 task was challenging and
there was still much room for improvement.
      </p>
      <p>In 2018, the lab followed a standard campaign-style format. It was composed of two
different tasks: early risk detection of depression and early risk detection of anorexia.
The first task is a continuation of the eRisk 2017 pilot task. The teams had access to
the eRisk 2017 data as training data, and new depression and non-depression test cases
were extracted and provided to the participants during the test stage. The second task
followed the same format as the depression task. The organizers of the task collected
data on anorexia and language use, the data were divided into a training subset and a
test subset, and the task followed the same iterative evaluation schedule implemented
in 2017 (see below).</p>
    </sec>
    <sec id="sec-2">
      <title>Task 1: Early Detection of Signs of Depression</title>
      <p>This is an exploratory task on early detection of signs of depression. The challenge
consists of sequentially processing pieces of evidence –in the form of writings posted
by depressed or non-depressed users– and learning to detect early signs of depression as
soon as possible. The lab focuses on Text Mining solutions and, thus, it concentrates on
Social Media submissions (posts or comments in a Social Media website). Texts should
be processed by the participating systems in the order they were created. In this way,
systems that effectively perform this task could be applied to sequentially track user
interactions in blogs, social networks, or other types of online media.</p>
      <p>
        The test collection for this task has the same format as the collection described in
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. It is a collection of submissions or writings (posts or comments) done by Social
Media users. There are two classes of users, depressed and non-depressed. For each user,
the collection contains their sequence of submissions (in chronological order) and this
sequence was split into 10 chunks. The first chunk has the oldest 10% of the submissions,
the second chunk has the second oldest 10%, and so forth.
      </p>
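      <p>As an illustration of this chunking, consider the following minimal sketch (the function name and the policy for distributing a remainder across chunks are our assumptions; they are not part of the official data release):</p>

```python
def split_into_chunks(submissions, n_chunks=10):
    """Split a user's chronologically ordered submissions into n_chunks
    consecutive chunks: chunk 1 holds the oldest writings, chunk 10 the
    most recent ones."""
    n = len(submissions)
    chunks = []
    for i in range(n_chunks):
        start = i * n // n_chunks
        end = (i + 1) * n // n_chunks
        chunks.append(submissions[start:end])
    return chunks

# Example: a user with 25 submissions, oldest first
history = [f"post_{i:02d}" for i in range(25)]
chunks = split_into_chunks(history)
```

      <p>With 25 submissions, each chunk ends up with 2 or 3 writings; the first chunk contains the oldest posts and the last chunk the most recent ones.</p>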
      <p>The task was organized into two different stages:
– Training stage. Initially, the teams that participated in this task had access to some
training data. In this stage, the organizers of the task released the entire history of
submissions done by a set of training users. All chunks of all training users were
sent to the participants. Additionally, the actual class (depressed or non-depressed)
of each training user was also provided (i.e. whether or not the user explicitly
mentioned that they were diagnosed with depression). In 2018, the training data
consisted of all 2017 users (2017 training split + 2017 test split). The participants could
therefore tune their systems with the training data and build up from 2017’s results.</p>
      <p>The training dataset was released on Nov 30th, 2017.
– Test stage. The test stage had 10 releases of data (one release per week). The first
week we gave the 1st chunk of data to the teams (oldest submissions of all test
users), the second week we gave the 2nd chunk of data (second oldest submissions
of all test users), and so forth. After each release, the teams had to process the data
and, before the next week, each team had to choose between: a) emitting a decision
on the user (i.e. depressed or non-depressed), or b) making no decision (i.e. waiting
to see more chunks). This choice had to be made for each user in the test split. If
the team emitted a decision then the decision was considered as final. The systems
were evaluated based on the accuracy of the decisions and the number of chunks
required to take the decisions (see below). The first release of test data was done on
Feb 6th, 2018 and the last (10th) release of test data was done on April 10th, 2018.</p>
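      <p>The weekly release-and-decide loop described above can be sketched as follows (a toy simulation for one user; the function, the stand-in classifier, and the fallback label at the last chunk are our illustrative assumptions, not the official setup):</p>

```python
def run_test_stage(user_chunks, classify, default="non-depressed"):
    """Simulate the iterative test stage for one user. After each weekly
    chunk release, `classify` sees all writings released so far and either
    returns a final decision string or None (wait). Returns the pair
    (decision, number of chunks seen). The fallback label emitted when a
    system never decides is an assumption made for this illustration."""
    seen = []
    for week, chunk in enumerate(user_chunks, start=1):
        seen.extend(chunk)            # accumulate the evidence released so far
        decision = classify(seen)
        if decision is not None:
            return decision, week     # emitted decisions are final
    return default, len(user_chunks)  # a decision is mandatory at the last chunk

# Toy stand-in model: flag a user once 5 or more writings have been seen
chunks = [["w1", "w2"], ["w3", "w4"], ["w5", "w6"], ["w7", "w8"]]
toy = lambda seen: "depressed" if len(seen) >= 5 else None
```

      <p>With these toy chunks of two writings each, the stand-in model emits its final decision after the third release; a model that never decides is forced to the default label at the last chunk.</p>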
      <p>Table 1 reports the main statistics of the train and test collections. The two splits are
unbalanced (there are more non-depression cases than depression cases). In the training
collection the percentage of depressed cases was about 15% and in the test collection
this percentage was about 9%. The number of users is not large, but each user has a long
history of submissions (on average, the collections have several hundred submissions
per user). Additionally, the mean range of dates from the first submission to the last
submission is wide (more than 500 days). Such a wide history makes it possible to analyze the
evolution of the language from the oldest post or comment to the most recent one.</p>
      <p>2.1 Evaluation measures</p>
      <p>
        The evaluation of the tasks considered standard classification measures, such as F1,
Precision and Recall (computed with respect to the positive class –depression or anorexia,
respectively–) and an early risk detection measure proposed in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The standard
classification measures can be employed to assess the teams’ estimations with respect to
golden truth judgments that inform us about users that are really positive cases. We
include them in our evaluation report because these metrics are well-known and easily
interpretable.
      </p>
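      <p>For reference, a minimal sketch of how these measures are computed with respect to the positive class (the function name and the toy data are ours, not the official evaluation script):</p>

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, Recall and F1 computed with respect to the positive
    class (depressed or anorexic users)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if p != positive and t == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy golden truth and predictions for 8 users (1 = positive class)
truth = [1, 1, 1, 0, 0, 0, 0, 0]
preds = [1, 1, 0, 1, 0, 0, 0, 0]
p, r, f1 = precision_recall_f1(truth, preds)
```

      <p>With this toy data there are 2 true positives, 1 false positive and 1 false negative, so Precision, Recall and F1 all equal 2/3; the many true negatives do not enter any of the three measures.</p>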
      <p>
        However, standard classification measures are time-unaware and do not penalize
late decisions. Therefore, the evaluation of the tasks also considered a newer measure
of performance that rewards early alerts. More specifically, we employed ERDE, an
error measure for early risk detection [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] for which the fewer writings required to make
the alert, the better. For each user the evaluation proceeds as follows. Given a chunk of
data, if a team’s system does not emit a decision then it has access to the next chunk
of data (i.e. more submissions from the same user). However, the team’s system gets a
penalty for late emission.
      </p>
      <p>ERDE, which stands for early risk detection error, takes into account the
correctness of the (binary) decision and the delay taken by the system to make the decision.
The delay is measured by counting the number (k) of distinct submissions (posts or
comments) seen before taking the decision. For instance, imagine a user u who posted
a total number of 250 posts or comments (i.e. exactly 25 submissions per chunk to
simplify the example). If a team’s system emitted a decision for user u after the second
chunk of data then the delay k would be 50 (because the system needed to see 50 pieces
of evidence in order to make its decision).</p>
      <p>Another important factor is that data are unbalanced (many more negative cases
than positive cases) and, thus, the evaluation measure needs to weight different errors
in a different way. Consider a binary decision d taken by a team’s system with delay
k. Given golden truth judgments, the prediction d can be a true positive (TP), true
negative (TN), false positive (FP) or false negative (FN). Given these four cases, the
ERDE measure is defined as:</p>
      <p>ERDEo(d; k) =
8&gt; cfp if d=positive AND ground truth=negative (FP)
&gt;&lt; cfn if d=negative AND ground truth=positive (FN)</p>
      <p>lco(k) ctp if d=positive AND ground truth=positive (TP)
&gt;:&gt; 0 if d=negative AND ground truth=negative (TN)</p>
      <p>How to set cfp and cfn depends on the application domain and the implications
of FP and FN decisions. We will often deal with detection tasks where the number of
negative cases is several orders of magnitude larger than the number of positive cases.
Hence, if we want to avoid building trivial systems that always say no, we need to have
cfn &gt;&gt; cfp. In evaluating the systems, we fixed cfn to 1 and cfp was set according to
the proportion of positive cases in 2017’s test data (i.e. we set cfp to 0.1296).</p>
      <p>The factor lco(k) ∈ [0; 1] represents a cost associated to the delay in detecting true
positives. We set ctp to cfn (i.e. ctp was set to 1) because late detection can have severe
consequences (as a late detection is considered as equivalent to not detecting the case
at all).</p>
      <p>The function lco(k) is a monotonically increasing function of k:</p>
      <p>lco(k) = 1 - 1 / (1 + e^(k - o))    (1)</p>
      <p>The function is parameterised by o, which controls the place in the X axis where the
cost grows more quickly (Figure 1 plots lc5(k) and lc50(k)).</p>
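      <p>The two definitions above can be transcribed directly into code (a minimal sketch; the function names are ours and this is not the official evaluation script):</p>

```python
import math

def lc(k, o):
    """Latency cost lc_o(k) = 1 - 1 / (1 + e^(k - o)); monotonically
    increasing in k, with o controlling where the cost grows quickly."""
    return 1.0 - 1.0 / (1.0 + math.exp(min(k - o, 700.0)))  # min() guards overflow

def erde(decision, truth, k, o, c_fp=0.1296, c_fn=1.0, c_tp=1.0):
    """ERDE_o for one user. decision and truth are 'positive' or 'negative';
    k is the number of writings seen before deciding. Weights follow the
    text: c_fn = c_tp = 1, and c_fp reflects the 2017 positive-case ratio."""
    if decision == "positive" and truth == "negative":
        return c_fp                # false positive: fixed cost
    if decision == "negative" and truth == "positive":
        return c_fn                # false negative: a missed positive case
    if decision == "positive" and truth == "positive":
        return lc(k, o) * c_tp     # true positive: cost grows with delay k
    return 0.0                     # true negative: no cost
```

      <p>The overall ERDE reported for a run is the mean of these per-user values. Note that a true positive emitted exactly at k = o costs half as much as missing the case entirely, and a very late true positive approaches the cost of a false negative.</p>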
      <p>The latency cost factor was only used for the true positives because we understand
that late detection is not an issue for true negatives. True negatives are non-risk cases
that, of course, would not demand early intervention (i.e. these cases just need to be
effectively filtered out from the positive cases). The systems must therefore focus on
detecting risk cases early and on detecting non-risk cases (regardless of when these
non-risk cases are detected).</p>
      <p>All cost weights are in [0; 1] and, thus, ERDE is in the range [0; 1]. Systems had
to take one decision for each subject and the overall error is the mean of the ERDE
values over all subjects.</p>
      <p>Each team could submit up to 5 runs or variants. We received 45 contributions from 11
different institutions. This is a substantial increase with respect to eRisk 2017, which had
8 institutions and 30 contributed runs. Table 3 reports the institutions that contributed
to eRisk 2018 and the labels associated to their runs.</p>
      <p>First, we briefly describe the main characteristics of the early detection systems
implemented by these participants:</p>
      <p>- FHDO, Germany. This is a joint effort by several institutions in
Germany (led by the University of Applied Sciences and Arts Dortmund). This team
submitted results for four machine learning models, together with an ensemble model
that combined different base predictions. The models employ user-level linguistic
metadata, bag of words, neural word embeddings, and convolutional neural networks.</p>
      <p>- IRIT, France. This is a team composed of researchers from IRIT and LIMSI. Their
experiments focused on investigating two types of textual representations: linguistic
features vs vectorization. The team combined the representations in different ways and
trained a number of machine learning models.</p>
      <p>- LIRMM, France. This team, composed of researchers from different institutions in
Montpellier, paid special attention to the temporal dimension. Their models try to
capture temporal mood variation by sequentially analysing the available user submissions.
The resulting models have two learning stages and employ standard text vectorization
methods.</p>
      <p>- PEIMEX, Mexico &amp; USA. This team submitted several runs as a result of a joint
collaboration between multiple Mexican institutions and the University of Houston. Their
approach makes a sentence-level analysis to detect sentences where users refer to
themselves. The main intuition is that those sentences contain terms that better expose their
interests and habits and, thus, they might reveal personality and psychological states.
This extraction of sentences was followed by a novel feature selection process and a
subsequent term weighting method.</p>
      <p>- UDC, Spain. This team performed a standard machine learning treatment of the
challenge. They formalized the task as a classification task and experimented with
different features (text-based, semantic-based and writing-based). They implemented two
independent models. The first was oriented to predict depression cases and the
second was oriented to detect non-depression cases. To meet these aims, these researchers
designed two variants, named Duplex Model Chunk Dependent and Duplex Model
Writing Dependent.</p>
      <p>- UNSL, Argentina &amp; Mexico. This is a team composed of researchers from a
couple of Argentinian institutions (UNSL and CONICET) and INAOE, from Mexico. This
team implemented a variant based on a model of flexible temporal variation of terms
and another variant based on sequential incremental classification. The first model
follows a semantic representation of documents that explicitly considers that the
information available at each chunk is partial. The second model is a novel text classification
approach that incrementally estimates the association of each individual to each class
based on the accumulated evidence.</p>
      <p>- UPF, Spain. This team, from Univ. Pompeu Fabra in Barcelona, implemented
several machine learning models that follow a dynamic and incremental representation of
the user’s submissions. The main focus of the experimentation was on testing
different types of features, including linguistic features, domain-specific vocabularies and
psychology-based features.</p>
      <p>- UQAM, Canada. This team implemented a topic extraction approach and
experimented with Latent Dirichlet Allocation and Neural Networks. The submitted runs
represented the texts using unigrams, bigrams and trigrams and the team worked with 30
latent topics. The final estimations were supplied by a multilayer perceptron, together
with a decision-based algorithm that classifies the users in a time-aware manner.</p>
      <p>- TBS, Taiwan. This team is composed of researchers from two different institutions
in Taiwan. Their models combine tf/idf evidence with convolutional neural networks
(CNNs). The CNNs work with chunk-level evidence and are responsible for emitting the
depression decisions. These decisions are based on classifying individual submissions
made by each user.</p>
      <p>- TUA1, Japan. The University of Tokushima in Japan sent results for a
support vector machine classifier that works with tf/idf representations, a deep neural
network, and a simple keyword-based method.</p>
      <p>Now, let us analyze the behaviour of the systems in terms of how fast they emitted
decisions. Figure 2 shows a boxplot graph of the number of chunks required to make
the decisions. The test collection has 820 users and, thus, each boxplot represents the
statistics of 820 cases.</p>
      <p>Some systems (RKMVERIB, RKMVERIC, RKMVERID, RKMVERIE, TBSA,
UPFC, UPFD) took all decisions after the last chunk (i.e. did not emit any earlier
decision). These variants were extremely conservative: they waited to see the whole history
of submissions for all users and, next, they emitted their decisions. Remember that all
teams were forced to emit a decision for each user at the last chunk.</p>
      <p>Many other runs also took most of the decisions after the last chunk. For example,
FHDO-BCSGA assigned a decision at the last chunk in 725 out of 820 users. Only a
few runs were really quick at emitting decisions. Notably, most of UDC’s runs and LIIRA
had a median of 1 chunk needed to emit decisions.</p>
      <p>Figure 3 shows a boxplot of the number of submissions required by each run in
order to emit decisions. Most of the time the teams waited to see hundreds of writings
for each user. Only a few submissions (UDCA, UDCB, UDCD, UDCE, UNSLD, some
LIIRx runs) had a median number of writings analyzed below 100. It appears that the
teams have concentrated on accuracy (rather than delay) and, thus, they did not care
much about penalties for late decisions. A similar behaviour was found in the runs
submitted in 2017.</p>
      <p>The number of user submissions has a high variance. Some users have only 10
submissions, while other users have more than a thousand submissions. It would be
interesting to study the interaction between the number of user submissions and the
effectiveness of the estimations done by the participating systems. This study could
help to shed light on issues such as the usefulness of a long (vs short) history of
submissions and the effect of off-topic submissions (e.g. submissions totally unrelated to
depression).</p>
      <p>Another intriguing issue relates to potential false positives. For instance, a doctor
who is active on the depression community because he gives support to people
suffering from depression, or a wife whose husband has been diagnosed with depression.
These people would often write about depression and possibly use a style that might
imply they are depressed, but obviously they are not. The collection contains this type
of non-depressed users and these cases are challenging for automatic classification.
Arguably, these non-depressed users are much different from other non-depressed users
who do not engage in any depression-related conversation. In any case, this issue
requires further investigation. For example, it will be interesting to do error analysis with
the systems’ decisions and check the characteristics of the false positives.</p>
      <p>Figure 4 helps to analyze another aspect of the decisions emitted by the teams.
For each user class, it plots the percentage of correct decisions against the number of
users. For example, the last two bars of the upper plot show that about 5 users were
correctly identified by more than 90% of the runs. Similarly, the rightmost bar of the
lower plot means that a few non-depressed users were correctly classified by all runs
(100% correct decisions). The graphs show that the teams tend to be more effective
with non-depressed users. This is as expected because most non-depressed cases do not
engage in depression-related conversations and, therefore, they are easier to distinguish
from depressed users. The distribution of correct decisions for non-depressed users has
many cases where more than 80% of the systems are correct. The distribution of correct
decisions for depressed users is flatter, and many depressed users are only identified
by a low percentage of the runs. This suggests that the teams implemented a wide
range of strategies that detect different portions of the depression class. Furthermore,
no depressed user was correctly identified by all systems. However, an
interesting point is that no depressed user has 0% of correct decisions. This means that
every depressed user was classified as such by at least one run.</p>
      <p>Let us now analyze the effectiveness results (see Table 4). The first conclusion we
can draw is that the task is as difficult as in 2017. In terms of F1, performance is again
low. The highest F1 is 0.64 and the highest precision is 0.67. This might be related to
the effect of false positives discussed above. The lowest ERDE50 was achieved by the
FHDO-BCSG team, which also submitted the runs that performed the best in terms of
F1. The run with the lowest ERDE5 was submitted by the UNSLA team and the run
with the highest precision was submitted by RKMVERI. The UDC team submitted a
high recall run (0.95) but its precision was extremely low.</p>
      <p>In terms of ERDE5, the best performing run is UNSLA, which has poor F1,
Precision and Recall. This run was not good at identifying many depressed users but, still, it
has low ERDE5. This suggests that the true positives were emitted by this run at earlier
chunks (quick emissions). ERDE5 is extremely stringent with delays (after 5 writings,
penalties grow quickly, see Fig 1). This promotes runs that emit few but quick
depression decisions. ERDE50, instead, gives smoother penalties to delays. As a result,
the run with the lowest ERDE50, FHDO-BCSGB, has much higher F1 and Precision.
Such difference between ERDE5 and ERDE50 is highly relevant in practice. For
example, a mental health agency seeking an automatic tool for screening depression could
set the penalty weights depending on the consequences of late detection of signs of
depression.</p>
    </sec>
    <sec id="sec-3">
      <title>Task 2: Early Detection of Signs of Anorexia</title>
      <p>Task 2 was an exploratory task on early detection of signs of anorexia. The format of
the task, data extraction methods and evaluation methodology (training stage followed
by a test stage with sequential releases of user data) were the same as those used for Task 1.
This task was introduced in 2018 and, therefore, all users (training+test) were collected
just for this new task.</p>
      <p>Table 2 reports the main statistics of the train and test collections of Task 2. The
collection shares the main characteristics of Task 1’s collections: the two splits are
unbalanced (of course, there are more non-anorexia cases than anorexia cases). As in
the depression task, the number of users is not large (and, again, each user has a long
history of submissions). The mean range of dates from the first submission to the last
submission is also wide (more than 500 days).</p>
      <p>Each team could submit up to 5 runs or variants. We received 35 contributions from
9 different institutions. All institutions participating in Task 2 had also sent results for
Task 1. Table 5 reports the institutions that contributed to this second task of eRisk 2018
and the labels associated to their runs.</p>
      <p>Most of the teams implemented the same type of models and used them for both tasks
(with minor modifications, such as the inclusion of anorexia-related lexica). We refer to
section 2.2, where the reader can see a brief description of each group’s variants. The
interested reader is also referred to the working note papers to see a full description of
the experiments performed by each team.</p>
      <p>The behaviour of the systems in terms of how fast they emitted decisions is shown in
Figure 5, which includes boxplot graphs of the number of chunks required to make the
decisions. The test collection of Task 2 has 320 users and, thus, each boxplot represents
the statistics of 320 cases. The trends are similar to those found in Task 1. Most of the
systems emitted decisions at a late stage with only a few exceptions (notably, LIIRA
and LIIRB). LIIRA and LIIRB had a median number of chunks analyzed of 3 and 6,
respectively. The rest of the systems had a median number of chunks analyzed equal to
or near 10.</p>
      <p>Figure 6 shows a boxplot of the number of submissions required by each run in
order to emit decisions. Again, most of the variants analyzed hundreds of submissions
before emitting decisions. Only the two LIIR runs discussed above and LIRMMD opted
for emitting decisions after fewer user submissions. In Task 2, again, most
of the teams have ignored the penalties for late decisions and they have mostly focused
on classification accuracy.</p>
      <p>Figure 7 plots the percentage of correct decisions against the number of users. The
plot shows again a clear distinction between the positive class (anorexia) and the
negative class (non-anorexia). Most of the non-anorexia users are correctly identified by
most of the systems (nearly all non-anorexia users fall in the range 80%-100%,
meaning that at least 80% of the systems labeled them as non-anorexic). In contrast, the
distribution of anorexia users is flatter and, in many cases, they are only identified by less
than half of the systems. An interesting result is that all anorexia users were identified
by at least 10% of the systems.</p>
      <p>Table 6 reports the effectiveness of the systems. In general, performance is
remarkably higher than that achieved by the systems for Task 1. There could be a number
of reasons for such an outcome. First, the proportion of potential false positives (e.g.
people engaging in anorexia-related conversations) might be lower in Task 2’s test
collection. This hypothesis would need to be investigated through a careful analysis of the
data. Second, the submissions of anorexia users might be extremely focused on eating
habits, losing weight, etc. If they do not often engage in general (anorexia-unrelated)
conversations then it would be easier for the systems to distinguish them from other
users. In any case, these are only speculations and this issue requires further research.</p>
      <p>The highest F1 is 0.85 and the highest precision is 0.91. The lowest ERDE50 was
achieved by FHDO-BCSGD, which also has the highest recall (0.88). The run with the
lowest ERDE5 was submitted by the UNSL team (UNSLB), which shows again that
this team paid more attention to emitting early decisions (at least for the true positives).</p>
      <p>Overall, the results obtained by the teams are promising. The high performance
achieved suggests that it is feasible to design automatic text analysis tools that raise
early alerts of signs of eating disorders.</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
      <p>This paper provided an overview of eRisk 2018. This was the second year that this
lab was organized at CLEF and the lab’s activities concentrated on two tasks (early
detection of signs of depression and early detection of signs of anorexia). Overall, the
tasks received 80 variants or runs and the teams focused on tuning different
classification solutions. The tradeoff between early detection and accuracy was ignored by most
participants.</p>
      <p>The effectiveness of the solutions implemented to detect early signs of depression
is similar to that achieved in eRisk 2017. This performance is still modest, suggesting
that it is challenging to tell depressed and non-depressed users apart. In contrast, the
effectiveness of the systems that detect signs of anorexia was much higher. This
promising result encourages us to further explore the creation of benchmarks for text-based
screening of eating disorders. In the future, we also want to instigate more research on
the tradeoff between accuracy and delay.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
      <p>We thank the support obtained from the Swiss National Science Foundation (SNSF)
under the project “Early risk prediction on the Internet: an evaluation corpus”, 2015.</p>
      <p>We also thank the financial support obtained from the i) “Ministerio de Economía
y Competitividad” of the Government of Spain and FEDER Funds under the research
project TIN2015-64282-R, ii) Xunta de Galicia (project GPC 2016/035), and iii) Xunta
de Galicia – “Consellería de Cultura, Educación e Ordenación Universitaria” and the
European Regional Development Fund (ERDF) through the following 2016-2019
accreditations: ED431G/01 (“Centro singular de investigacion de Galicia”) and ED431G/08.</p>
      <p>Institution: Submitted files
FH Dortmund, Germany: FHDO-BCSGA, FHDO-BCSGB, FHDO-BCSGC, FHDO-BCSGD, FHDO-BCSGE
IRIT, France: LIIRA, LIIRB, LIIRC, LIIRD, LIIRE
LIRMM, University of Montpellier, France: LIRMMA, LIRMMB, LIRMMC, LIRMMD, LIRMME
Instituto Tecnológico Superior del Oriente del Estado de Hidalgo (Mexico), Instituto Nacional de Astrofísica, Óptica y Electrónica (Mexico), Universidad de Houston (USA) &amp; Universidad Autónoma del Estado de Hidalgo (Mexico): PEIMEXA, PEIMEXB, PEIMEXC, PEIMEXD, PEIMEXE
Ramakrishna Mission Vivekananda Educational and Research Institute, Belur Math, West Bengal, India: RKMVERIA, RKMVERIB, RKMVERIC, RKMVERID, RKMVERIE
University of A Coruña, Spain: UDCA, UDCB, UDCC, UDCD, UDCE
Universidad Nacional de San Luis (Argentina), CONICET (Argentina) &amp; INAOE (Mexico): UNSLA, UNSLB, UNSLC, UNSLD, UNSLE
Universitat Pompeu Fabra, Spain: UPFA, UPFB, UPFC, UPFD
Université du Québec à Montréal, Canada: UQAMA
The Black Swan, Taiwan: TBSA
Tokushima University, Japan: TUA1A, TUA1B, TUA1C, TUA1D
Table 3. Task 1 (depression). Participating institutions and submitted results
[Figure 2: boxplots of the number of chunks (#chunks) required by each run to emit a decision]</p>
      <p>PEIMEXE
n
1it000
i
r
500
0
2000
1500
s
g
w
#
500
0</p>
      <p>FHDO-BCSGA FHDO-BCSGB FHDO-BCSGC FHDO-BCSGD FHDO-BCSGE</p>
      <p>LIRMMA</p>
      <p>LIRMMB</p>
      <p>LIRMMC</p>
      <p>LIRMMD</p>
      <p>LIRMME
RKMVERIA</p>
      <p>RKMVERIB</p>
      <p>RKMVERIC</p>
      <p>RKMVERID</p>
      <p>RKMVERIE</p>
      <p>UDCA</p>
      <p>UDCB</p>
      <p>UDCC</p>
      <p>UDCD</p>
      <p>UDCE
TBSA</p>
      <p>UNSLA</p>
      <p>UNSLB</p>
      <p>UNSLC</p>
      <p>UNSLD</p>
      <p>UNSLE</p>
      <p>UPFA</p>
      <p>UPFB</p>
      <p>UPFC</p>
      <p>UPFD
LI RA</p>
      <p>LI RB</p>
      <p>LI RC</p>
      <p>LI RD</p>
      <p>LI RE</p>
      <p>PEIMEXA</p>
      <p>PEIMEXB</p>
      <p>PEIMEXC</p>
      <p>PEIMEXD</p>
      <table-wrap id="tbl5">
        <label>Table 5</label>
        <caption>
          <p>Task 2 (anorexia). Participating institutions and submitted results.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Institution</th><th>Submitted files</th></tr>
          </thead>
          <tbody>
            <tr><td>FH Dortmund, Germany</td><td>FHDO-BCSGA, FHDO-BCSGB, FHDO-BCSGC, FHDO-BCSGD, FHDO-BCSGE</td></tr>
            <tr><td>IRIT, France</td><td>LIIRA, LIIRB</td></tr>
            <tr><td>LIRMM, University of Montpellier, France</td><td>LIRMMA, LIRMMB, LIRMMC, LIRMMD, LIRMME</td></tr>
            <tr><td>Instituto Tecnol&#243;gico Superior del Oriente del Estado de Hidalgo, Mexico &#38; Instituto Nacional de Astrof&#237;sica, &#211;ptica y Electr&#243;nica, Mexico &#38; Universidad de Houston, USA &#38; Universidad Aut&#243;noma del Estado de Hidalgo, Mexico</td><td>PEIMEXA, PEIMEXB, PEIMEXC, PEIMEXD, PEIMEXE</td></tr>
            <tr><td>Ramakrishna Mission Vivekananda Educational and Research Institute, Belur Math, West Bengal, India</td><td>RKMVERIA, RKMVERIB, RKMVERIC, RKMVERID, RKMVERIE</td></tr>
            <tr><td>Universidad Nacional de San Luis, Argentina &#38; CONICET, Argentina &#38; INAOE, Mexico</td><td>UNSLA, UNSLB, UNSLC, UNSLD, UNSLE</td></tr>
            <tr><td>Universitat Pompeu Fabra, Spain</td><td>UPFA, UPFB, UPFC, UPFD</td></tr>
            <tr><td>The Black Swan, Taiwan</td><td>TBSA</td></tr>
            <tr><td>Tokushima University, Japan</td><td>TUA1A, TUA1B, TUA1C</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>Fig. 5. Number of chunks required by each contributing run in order to emit a decision.</p>
    </sec>
  </body>
  <back>
  </back>
</article>