Temporal Variation of Terms as concept space for early risk prediction

Marcelo L. Errecalde¹, Ma. Paula Villegas¹, Dario G. Funez¹, Ma. José Garciarena Ucelay¹, and Leticia C. Cagnina¹,²

¹ LIDIC Research Group, Universidad Nacional de San Luis, Argentina
² Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET)
{merrecalde,villegasmariapaula74,funezdario}@gmail.com
{mjgarciarenaucelay,lcagnina}@gmail.com

Abstract. Early risk prediction involves three different aspects to be considered when an automatic classifier is implemented for this task: a) support for classification with partial information read up to different time steps, b) support for dealing with unbalanced data sets, and c) a policy to decide when a document can be classified as belonging to the relevant class with reasonable confidence. In this paper we propose an approach that naturally copes with the first two aspects and shows good prospects for dealing with the last one. Our proposal, named temporal variation of terms (TVT), is based on using the variation of vocabulary along the different time steps as the concept space to represent the documents. Results on the eRisk 2017 data set show a better performance of TVT in comparison with other successful semantic analysis approaches and the standard BoW representation. Besides, it also reaches the best results reported so far for the ERDE5 and ERDE50 error evaluation measures.

Keywords: Early Risk Detection, Unbalanced Data Sets, Text Representations, Semantic Analysis Techniques.

1 Introduction

Early risk detection (ERD) is a new research area potentially applicable to a wide variety of situations such as detection of potential paedophiles, people with suicidal inclinations, or people susceptible to depression, among others. In an ERD scenario, data are sequentially read as a stream and the challenge consists in detecting risk cases as soon as possible. A usual situation in these cases is that the target class (the risky one) is clearly under-sampled with respect to the control class (the non-risky one). That unequal distribution between the positive (minority) class and the negative one is a well-known problem in categorization tasks, popularly referred to as unbalanced data sets (UDS).

Besides dealing with the UDS problem, an ERD system needs to consider the problem of assigning a class to documents when only partial information is available. A document is processed as a sequence of terms, and the goal is to devise a method that can make predictions with the information read up to a specific point in time. That aspect, which could be named classification with partial information (CPI), might be addressed with a simple approach that consists in training with complete documents as usual and treating the partial documents read up to the classification point as standard "complete" documents. In [3] the CPI aspect was considered by analysing the robustness of the Naïve Bayes algorithm in dealing with partial information.

Last, but not least, an ERD system needs to consider not only which class should be assigned to a document, but also when to make that assignment. This aspect, which we will refer to as the classification time decision (CTD) issue, has been addressed with very simple heuristic rules (for instance, exceeding a specific confidence threshold in the prediction of the classifier [9]), although more elaborate approaches might be used.

In this article we propose an original idea that explicitly considers the sequentiality of data to deal with the unbalanced data sets problem.
In a nutshell, we use the temporal variation of terms as the concept space of a recent concise semantic analysis (CSA) approach [7]. CSA is an interesting document representation technique which models words and documents in a small concept space whose concepts are obtained from category labels. CSA has obtained good results in author profiling tasks [8], and the variant proposed in this article, named temporal variation of terms (TVT), seems to show some interesting characteristics for dealing with the ERD problem. In fact, it obtained a robust performance on the eRisk 2017 data set and reached the best (lowest) results reported so far for the ERDE5 and ERDE50 error evaluation measures.

The rest of this document is organized as follows: Section 2 describes our proposed method for the ERD problem. Section 3 shows the results obtained with our method on the eRisk 2017 data set. Finally, Section 4 outlines potential future work and the conclusions obtained.

2 The proposed method

Our method is based on the concise semantic analysis (CSA) technique proposed in [7] and later extended in [8] for author profiling tasks. Therefore, we first present in Subsection 2.1 the key aspects of CSA and then explain in Subsection 2.2 how we instantiate CSA with concepts derived from the terms used in the temporal chunks analysed by an ERD system at different time steps.

2.1 Concise Semantic Analysis

Standard text representation methods such as Bag of Words (BoW) suffer from two well-known drawbacks. First, their high dimensionality and sparsity; second, they do not capture relationships among words. CSA is a semantic analysis technique that aims at dealing with those shortcomings by interpreting words and documents in a space of concepts. Differently from other semantic analysis approaches such as latent semantic analysis (LSA) [2] and explicit semantic analysis (ESA) [4], which usually incur huge computing costs, CSA interprets words and text fragments in a space of concepts that are close (or equal) to the category labels. For instance, if the documents in the data set are labeled with q different category labels (usually no more than 100 elements), words and documents will be represented in a q-dimensional space. That space is usually much smaller than the one of standard BoW representations, whose size directly depends on the vocabulary size (more than 10000 or 20000 elements in general).

To explain the main concepts of the CSA technique we first introduce some basic notation that will be used in the rest of this work. Let D = {⟨d_1, y_1⟩, ..., ⟨d_n, y_n⟩} be a training set formed by n pairs of documents (d_i) and variables (y_i) that indicate the concept each document is associated with, y_i ∈ C, where C = {c_1, ..., c_q} is the concept space. For the moment, consider that these concepts correspond to standard category labels although, as we will see later, they might represent more elaborate aspects. In this context, we will denote by V = {t_1, ..., t_m} the vocabulary of terms of the collection being analysed.

Representing terms in the concept space. In CSA, each term t_i ∈ V is represented as a vector t_i ∈ R^q, t_i = ⟨t_{i,1}, ..., t_{i,q}⟩. Here, t_{i,j} represents the degree of association between the term t_i and the concept c_j, and its computation requires some basic steps that are explained below. First, the raw term-concept association between the i-th term and the j-th concept, denoted w_{ij}, is obtained. If D_{c_u} ⊆ D, D_{c_u} = {d_r | ⟨d_r, y_s⟩ ∈ D ∧ y_s = c_u}, is the subset of the training instances whose label is the concept c_u, then w_{ij} might be defined as:

    $w_{ij} = \log_2 \Big( 1 + \sum_{d_k \in D_{c_j}} \frac{tf_{ik}}{len(d_k)} \Big)$    (1)

where tf_{ik} is the number of occurrences of the term t_i in the document d_k and len(d_k) is the length (number of terms) of d_k. As noted in [7] and [8], the direct use of w_{ij} to represent terms in the vector t_i could be sensitive to highly unbalanced data. Thus, some kind of normalization is usually required and, in our case, we selected the one proposed in [8], a normalization over the m terms followed by one over the q concepts:

    $t'_{ij} = \frac{w_{ij}}{\sum_{i=1}^{m} w_{ij}}$    (2)        $t_{ij} = \frac{t'_{ij}}{\sum_{j=1}^{q} w_{ij}}$    (3)

With this last conversion we finally obtain, for each term t_i ∈ V, a q-dimensional vector t_i = ⟨t_{i,1}, ..., t_{i,q}⟩ defined over a space of q concepts. Up to now, those concepts correspond to the original categories used to label the documents. Later, we will use other more elaborate concepts.
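To fix ideas, the following is a minimal Python sketch of the term-weighting steps of Equations (1)-(3). It assumes documents are already tokenized into lists of terms; the function name and data layout are our own illustration, not an implementation from [7] or [8]:

```python
import math
from collections import Counter, defaultdict

def term_concept_matrix(docs, labels, concepts):
    """Concept-space term vectors following Eqs. (1)-(3).

    docs:     tokenized documents (lists of terms)
    labels:   the concept label of each document
    concepts: the ordered list of the q concept labels
    Returns a dict mapping each term to its q-dimensional vector t_i.
    """
    q = len(concepts)
    col = {c: j for j, c in enumerate(concepts)}

    # Eq. (1): w_ij = log2(1 + sum over d_k in D_cj of tf_ik / len(d_k))
    acc = defaultdict(lambda: [0.0] * q)
    for doc, y in zip(docs, labels):
        n = len(doc)
        for t, f in Counter(doc).items():
            acc[t][col[y]] += f / n
    w = {t: [math.log2(1 + v) for v in row] for t, row in acc.items()}

    # Eq. (2): normalize each concept column by its sum over the m terms
    csum = [sum(row[j] for row in w.values()) for j in range(q)]
    tp = {t: [row[j] / csum[j] if csum[j] else 0.0 for j in range(q)]
          for t, row in w.items()}

    # Eq. (3): divide by the per-term sum of raw weights over the q concepts
    return {t: [tp[t][j] / sum(w[t]) if sum(w[t]) else 0.0 for j in range(q)]
            for t in w}
```

For TVT, described in Subsection 2.2, docs and labels will be the chunk-derived documents and their temporal concept labels.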
Representing documents in the concept space. Once the terms are represented in the q-dimensional concept space, those vectors can be used to represent documents in the same space. In CSA, documents are represented as the central vector of all the term vectors they contain [7]. Terms have different importance for different documents, so it is not a good idea to compute that vector as the simple average of all the term vectors of the document. Previous works on BoW [6] have considered different statistical techniques to weight the importance of terms in a document, such as tf-idf, tf-ig, tf-χ² or tf-rf, among others. Here, we will use the approach used in [8] for author profiling, which represents each document d_k as the weighted aggregation of the representations (vectors) of the terms it contains:

    $\mathbf{d}_k = \sum_{t_i \in d_k} \frac{tf_{ik}}{len(d_k)} \cdot \mathbf{t}_i$    (4)

Thus, documents are also represented in a q-dimensional concept space (i.e., d_k ∈ R^q), which is much smaller in dimensionality than the one required by standard BoW approaches (q ≪ m).
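Continuing the sketch above, Equation (4) reduces to a tf-weighted sum of term vectors. How out-of-vocabulary terms are handled is our assumption (they are simply skipped here):

```python
from collections import Counter

def document_vector(doc, term_vecs, q):
    """Eq. (4): represent a tokenized document as the tf-weighted
    aggregation of the concept-space vectors of its terms."""
    d = [0.0] * q
    n = len(doc)
    for t, f in Counter(doc).items():
        if t in term_vecs:  # skip terms unseen during training (our choice)
            for j in range(q):
                d[j] += (f / n) * term_vecs[t][j]
    return d
```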
2.2 Temporal Variation of Terms

In Subsection 2.1 we said that the concept space C usually corresponds to the standard category names used to label the training instances in supervised text categorization tasks. In this scenario, which in [7] is referred to as direct derivation, each category label simply corresponds to a concept. However, [7] also proposes other alternatives such as split derivation and combined derivation. The former uses the low-level labels in hierarchical corpora and the latter is based on combining semantically related labels into a unique concept. In [8] those ideas are extended by first clustering each category of the corpora and then using those subgroups (sub-clusters) as the new concept space (in that work, concepts are referred to as profiles and subgroups as sub-profiles).

As we can see, the idea common to all the above approaches is that once a set of documents is identified as belonging to a group/category, that category can be considered as a concept and CSA can be applied in the usual way. We take a similar view by considering that the positive (minority) class in ERD problems can be augmented with concepts derived from the sets of partial documents read along the different time steps. In order to understand this idea it is necessary to first introduce a sequential work scheme like the one proposed in [9] for research on ERD systems for depression cases.

Following [9], we will assume a corpus of documents written by p different individuals ({I_1, ..., I_p}). For each individual I_l (l ∈ {1, ..., p}), the n_l documents that he has written are provided in chronological order (from the oldest text to the most recent one): D_{I_l,1}, D_{I_l,2}, ..., D_{I_l,n_l}. In this context, given these p streams of messages, the ERD system has to process every sequence of messages (in the chronological order they are produced) and make a binary decision (as early as possible) on whether or not the individual might be a positive case of depression. Evaluation metrics for this task must be time-aware, so an early risk detection error (ERDE) is proposed. This metric takes into account not only the correctness of the (binary) decision but also the delay taken by the system to make that decision.

In a usual supervised text categorization task, we would only have two category labels: positive (risk/depressive case) and negative (non-risk/non-depressive case). That would only give two concepts for a CSA representation. However, in ERD problems there is additional temporal information that could be used to obtain an improved concept space. For instance, the training set could be split into h "chunks", Ĉ1, Ĉ2, ..., Ĉh, in such a way that Ĉ1 contains the oldest writings of all users (the first (100/h)% of submitted posts or comments), chunk Ĉ2 contains the second oldest writings, and so forth. Each chunk Ĉk can be partitioned into two subsets Ĉk⁺ and Ĉk⁻, Ĉk = Ĉk⁺ ∪ Ĉk⁻, where Ĉk⁺ contains the positive cases of chunk Ĉk and Ĉk⁻ the negative ones.

It is interesting to note that we can also consider the data sets that result from concatenating chunks that are contiguous in time, using the notation Ĉi−j to refer to the chunk obtained from concatenating all the (original) chunks from the i-th chunk to the j-th chunk (inclusive). Thus, Ĉ1−h will represent the data set with the complete streams of messages of all the p individuals. In this case, Ĉ1−h⁺ and Ĉ1−h⁻ will have the obvious semantics specified above for the complete documents of the training set.

The classic way of constructing a classifier would be to take the complete documents of the p individuals (Ĉ1−h) and use an inductive learning algorithm such as SVM or Naïve Bayes to obtain that classifier. As we mentioned earlier, another important aspect of ERD systems is that the classification problem being addressed is usually highly unbalanced (the UDS problem). That is, the number of documents of the majority/negative class (non-depression) is significantly larger than that of the minority/positive class (depression). More formally, following the previously specified notation, |Ĉ1−h⁻| ≫ |Ĉ1−h⁺|.

An alternative to alleviate the UDS problem would be to consider that the minority class is formed not only by the complete documents of the individuals but also by the partial documents obtained in the different chunks. Following the general ideas posed in CSA, we could consider that the partial documents read in the different chunks represent temporal concepts that should be taken into account. In this context, one might think that the variations of the terms used in these different sequential stages of the documents carry relevant information for the classification task. With this idea in mind arises the method proposed in this work, named temporal variation of terms (TVT), which consists in enriching the documents of the minority class with the partial documents read in the first chunks. These first chunks of the minority class, along with its complete documents, will be considered as a new concept space for a CSA method. Therefore, in TVT we first determine the number f of initial chunks that will be used to enrich the minority (positive) class. Then, we use the document sets Ĉ1⁺, Ĉ1−2⁺, ..., Ĉ1−f⁺ and Ĉ1−h⁺ as concepts for the positive class and Ĉ1−h⁻ for the negative class. Finally, we represent terms and documents in this new (f + 2)-dimensional space using the CSA approach explained in Section 2.1.
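A sketch of how this enriched, concept-labeled training corpus could be assembled is shown below. It assumes each user's writings are tokenized and chronologically ordered, and that chunk boundaries are computed per user as equal fractions of their writings, as in [9]; the function and label names are hypothetical:

```python
def tvt_training_corpus(users, h=10, f=4):
    """Assemble the TVT concept-labeled corpus from chronological streams.

    users: list of (writings, label) pairs; writings is the chronologically
           ordered list of a user's tokenized posts, label is "pos" or "neg".
    Positive users contribute one document per cumulative chunk
    C_1, C_1-2, ..., C_1-f plus their full stream C_1-h; negative users
    contribute only their full stream.
    """
    docs, labels = [], []
    for writings, label in users:
        full = [t for post in writings for t in post]
        if label == "pos":
            for j in range(1, f + 1):
                cut = max(1, round(len(writings) * j / h))  # first j of h chunks
                docs.append([t for post in writings[:cut] for t in post])
                labels.append(f"pos_1-{j}")  # temporal concept label
            docs.append(full)
            labels.append(f"pos_1-{h}")      # the complete positive stream
        else:
            docs.append(full)
            labels.append("neg")
    return docs, labels  # feed these to the CSA sketch of Subsection 2.1
```

With h = 10 and f = 4, this yields the f + 2 = 6 concepts (5 positive, 1 negative) used in the experiments below.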
3 Experimental Analysis

3.1 Data Set

Our approach was tested on the data set used in the eRisk 2017 pilot task (http://early.irlab.org/task.html) and described in [9]. It is a collection of writings (posts or comments) from a set of Social Media users. There are two categories of users, depressed and non-depressed, and, for each user, the collection contains a sequence of writings (in chronological order). For each user, the collection of writings has been divided into 10 chunks. The first chunk contains the oldest 10% of the messages, the second chunk the second oldest 10%, and so forth. This collection was split into a training and a test set that we will refer to as TR_DS and TE_DS respectively. The (training) TR_DS set contained 486 users (83 positive, 403 negative) and the (test) TE_DS set contained 401 users (52 positive, 349 negative). The users labeled as positive are those that have explicitly mentioned that they have been diagnosed with depression.

This task was divided into a training stage and a testing stage. In the first one, the participating teams had access to the TR_DS set with all chunks of all training users, so they could tune their systems with the training data. To reproduce the same conditions of the pilot task, we used the training set (TR_DS) to generate a new corpus divided into a training set (that we will refer to as TR_DS-train) and a test set (named TR_DS-test) with the same categories (depressed and non-depressed) for each sequence of writings of the users in the collection. Those sets maintained the same proportions of posts per user and words per user as described in [9]. TR_DS-train and TR_DS-test were generated by randomly selecting around 70% of the users for the former and the remaining 30% for the latter. Thus, TR_DS-train resulted in 351 users (63 positive, 288 negative) while TR_DS-test contains 135 users (20 positive, 115 negative). In the pilot task the collection of writings was divided into 10 chunks, so we made the same division on TR_DS-train and TR_DS-test.

3.2 Experimental Results

We reproduced the same conditions faced by the participants of the eRisk pilot task, so we first worked on the data set released for the training stage (TR_DS) and then the obtained models were tested on the test stage (TE_DS). The activities carried out at each stage are described below.

Training stage. CSA is a document representation that aims at addressing some drawbacks of classical representations such as BoW. TVT, in turn, is supposed to extend CSA by defining concepts that capture the sequential aspects of ERD problems and the variations of vocabulary observed in the distinct stages of the individuals' writings. Thus, CSA and BoW arise as obvious candidates against which to compare TVT on the data set used in the pilot task. Those three representations were evaluated with different learning algorithms such as SVM, Naïve Bayes and Random Forest, among others. In each case, the best parameters were selected for each algorithm-representation combination ("model") and the reported results correspond to the best obtained values.

We tested BoW with different weighting schemes and learning algorithms but, in all cases, the best results were obtained with binary representations and the Naïve Bayes algorithm; from now on, all references to BoW will stand for that setting.
We use CSA with term representations normalized according to Equations (2) and (3) and document representations obtained from Equation (4), as proposed in [8] for author profiling tasks. We named this setting CSA⋆. For the TVT representation, a decision must be made regarding the number f of chunks that will enrich the minority (positive) class. In our studies we use f = 4 and, in consequence, the positive class was represented by 5 concepts. In that way, the number of documents in the depressed class was increased by a factor of 5 with respect to the original size, from 83 positive instances to 415. As we can see, with this technique we also obtain some kind of balancing in the sizes of both classes, thereby addressing the usual problem that we previously referred to as the UDS problem.

A particularity that ERD methods must consider is the criterion used to decide when (in what situations) the classification generated by the system is considered the final/definitive decision on the evaluated instances (the classification time decision (CTD) issue). We will start our evaluation of the different document representations and algorithms assuming that the classification is made on a static "chunk by chunk" basis. That is, for each chunk Ĉi provided to the ERD systems, we will evaluate their performance considering that all the models are (simultaneously) applied to the writings received up to chunk Ĉi. With this kind of information it will be possible to observe to what extent the different approaches are robust to partial information in the different stages, at which moment they start to obtain acceptable results, and other interesting statistics.

Tables 1, 2 and 3 show the results of the experiments for this static "chunk by chunk" classification scheme. Values of precision (π), recall (ρ) and the F1-measure (F1) of the target (depressed) class are reported for each considered model. Statistics also include the early risk detection error (ERDE) measure proposed in [9]. This measure considers not only the correctness of the decision made by the system but also the delay in making that decision. ERDE uses specific costs to penalize false positives and false negatives. However, ERDE treats the two possible successful predictions (true negatives and true positives) differently. True negatives have no cost (cost = 0), but ERDE associates with true positives a cost for the delay in their detection that monotonically increases with the number k of textual items seen before giving the answer. In a nutshell, that cost is low when k is lower than a threshold value o but rapidly approaches 1 when k > o. In that way, o represents some type of urgency in detecting depression cases: the lower the o value, the higher the urgency in detecting the positive cases. A more detailed description of ERDE can be found in [9]. In our study we consider the two values of o used in the pilot task: o = 5 (ERDE5) and o = 50 (ERDE50).
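As a reference, the per-subject error can be sketched as follows. This follows our reading of the definition in [9] (the reported ERDE is the mean over all subjects, expressed as a percentage); treat the exact cost constants as assumptions:

```python
import math

def erde(decision, truth, k, o, c_fp=52/401, c_fn=1.0, c_tp=1.0):
    """Per-subject early risk detection error, after the definition in [9].

    decision, truth: 1 = positive (depressed), 0 = negative
    k: number of textual items read before the decision was emitted
    o: urgency parameter (5 or 50 in the pilot task)
    The c_fp default (proportion of positive test users) is our assumption
    about how [9] instantiates this cost.
    """
    if decision == 1 and truth == 0:
        return c_fp
    if decision == 0 and truth == 1:
        return c_fn
    if decision == 1 and truth == 1:
        # latency cost 1/(1 + e^(o - k)): near 0 while k < o, rapidly
        # approaching 1 once k exceeds o (this form avoids overflow)
        return c_tp / (1.0 + math.exp(o - k))
    return 0.0  # true negatives have no cost
```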
In each chunk, classifiers usually produce their predictions with some kind of "confidence", in general the estimated probability of the predicted class. In those cases, we can select different thresholds tr and consider that an instance (document) is assigned to the target class when its associated probability p is greater than or equal to a certain threshold tr (p ≥ tr). In this study we evaluated 5 different settings for the probabilities assigned by each classifier: p = 1, p ≥ 0.9, p ≥ 0.8, p ≥ 0.7 and p ≥ 0.6. Due to space constraints, only the best results obtained with a particular setting are shown (all the tables generated for the different probabilities can be downloaded from https://sites.google.com/site/lcagnina/research/Tables_eRisk17.rar).

Table 1 shows the results obtained with a BoW representation and a Naïve Bayes classifier. Those values correspond to the setting where an instance is considered as depressive if the classifier assigns to the target/positive class a probability greater than or equal to 0.8 (p ≥ 0.8). Surprisingly, the best results for all the considered measures are obtained on the first chunk. Even in this chunk, the model only recovers 45% of the depressed individuals. However, this is not the worst aspect: only 12% of the individuals classified as depressed effectively had this condition, resulting in a very low F1 measure (0.19).

Table 1. Model: BoW + Naïve Bayes (p ≥ 0.8). Chunk-by-chunk setting. ERDE5, ERDE50, F1-measure (F1), precision (π) and recall (ρ) of the depressed class.

          ch1    ch2    ch3    ch4    ch5    ch6    ch7    ch8    ch9    ch10
ERDE5    18.09  20.98  21.5   21.73  21.95  21.95  21.95  21.95  22.17  22.17
ERDE50   15.17  16.84  20.77  20.25  21.21  21.95  21.95  21.52  22.17  22.17
F1        0.19   0.16   0.09   0.11   0.09   0.09   0.09   0.09   0.09   0.13
π         0.12   0.11   0.06   0.17   0.06   0.06   0.06   0.06   0.06   0.08
ρ         0.45   0.35   0.2    0.25   0.2    0.2    0.2    0.2    0.2    0.3

Table 2 shows similar results when a CSA⋆-RF (random forest) combination with p ≥ 0.6 is used to classify the writings of the individuals. Here, the F1 measure is also low, but we can observe a deterioration in the ERDE5 and ERDE50 error values with respect to the previous model.

Table 2. Model: CSA⋆ + RF (p ≥ 0.6). Chunk-by-chunk setting. ERDE5, ERDE50, F1-measure (F1), precision (π) and recall (ρ) of the depressed class.

          ch1    ch2    ch3    ch4    ch5    ch6    ch7    ch8    ch9    ch10
ERDE5    21.93  25.64  25.46  25.57  26.12  25.68  25.68  25.46  25.35  25.68
ERDE50   19.47  24.94  25.46  23.35  25.37  24.2   23.46  22.5   22.39  23.47
F1        0.19   0.08   0.05   0.1    0.06   0.08   0.13   0.16   0.16   0.14
π         0.11   0.05   0.03   0.06   0.04   0.05   0.07   0.09   0.09   0.08
ρ         0.6    0.25   0.15   0.3    0.2    0.25   0.4    0.5    0.5    0.45

Finally, Table 3 shows the results of TVT with a Naïve Bayes algorithm and p ≥ 0.6. There, we can see a remarkable improvement in the performance of the classifier in chunk 3, with excellent values of ERDE50 (7.02), precision π (0.63), recall ρ (0.85) and F1 measure (0.72). Analysing the results along the 10 considered chunks, we observe how the measures keep improving from chunk 1 up to the best values in chunk 3 and, from then on, start to deteriorate chunk by chunk, with the worst results on the last two chunks.

Table 3. Model: TVT + Naïve Bayes (p ≥ 0.6). Chunk-by-chunk setting. ERDE5, ERDE50, F1-measure (F1), precision (π) and recall (ρ) of the depressed class.

          ch1    ch2    ch3    ch4    ch5    ch6    ch7    ch8    ch9    ch10
ERDE5    14.24  14.27  14.59  14.83  15.17  15.51  15.74  15.84  16.21  16.13
ERDE50   10.80   7.22   7.02   9.24   9.25   9.97  10.73  10.73  11.06  10.96
F1        0.42   0.65   0.72   0.67   0.67   0.67   0.64   0.64   0.57   0.58
π         0.39   0.58   0.63   0.60   0.60   0.60   0.58   0.58   0.50   0.52
ρ         0.45   0.75   0.85   0.75   0.75   0.75   0.70   0.70   0.65   0.65

As weak points of those results, we can say that the best ERDE5 value, obtained in chunk 1, is not very good. Besides, even though the ERDE50 values are acceptable for most of the considered chunks, TVT needs at least two chunks to show a competitive performance. That aspect looks reasonable if we consider that TVT is based on the variation of terms between consecutive chunks, information that is not available in the first chunk.

As a general conclusion of the chunk-by-chunk analysis, we could say that imbalanced classes seem to affect the different methods in different ways. BoW and CSA directly depend on the vocabulary of the positive and negative classes. In the first chunk, where texts are supposed to be the shortest, relevant words of the positive class appearing in the posts will probably have more chance of being balanced with respect to the words appearing in the negative class. That makes classifiers more sensitive to the positive class and, in consequence, the recall and the general performance are improved. As more information is read, words related to the negative class become more likely to occur, introducing noise and consequently hurting the performance. TVT does not seem to be so affected by this problem, showing a more stable performance along all the chunks, with the best results in the third chunk and a slight deterioration from then on. Those results could be giving evidence that the variation of terms (with f = 4) allows a better detection of the occurrence of relevant words of the positive class in the first chunks. However, TVT also seems to be affected by the imbalance problem in subsequent chunks, although to a lesser degree than the BoW and CSA representations. Unfortunately, verifying those hypotheses would require considering more balanced settings and different f values, which is out of the scope of this paper. However, that important aspect will be addressed in future work.
Another approach to the CTD issue could be to directly use the probability (or some measure of confidence) assigned by the classifier to decide when to stop reading a stream and emit its classification. That approach, which in [9] is referred to as "dynamic", simply requires that this probability exceed some particular threshold in order to classify the instance/individual as positive. That means that different streams of messages could be classified as depressed at different stages (chunks). Table 4 shows those statistics for the BoW, CSA⋆ and TVT representations with the learning algorithms and probability thresholds that obtained the best performance.

Table 4. Dynamic models for BoW-NB, CSA⋆-NB and TVT-NB.

                   ERDE5  ERDE50  F1    π     ρ
BoW-NB (p ≥ 0.8)   21.05  18.13   0.24  0.14  0.75
CSA⋆-NB (p = 1)    23.09  23.07   0.06  0.04  0.15
TVT-NB (p = 1)     14.13  11.25   0.40  0.47  0.35

There, we can see that the TVT representation, with Naïve Bayes and classifying instances as depressed when the assigned probability is 1, obtains the best results for the measures we are most interested in: ERDE5, ERDE50 and the F1-measure. In this context, BoW gets a better recall value but at the expense of lowering the precision values, resulting in a poor F1-measure.
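A dynamic policy of this kind is straightforward to sketch (our own illustration; how undecided subjects are handled at the end of the stream is an assumption):

```python
def dynamic_decision(probs_per_chunk, tr=0.6):
    """Emit a positive decision at the first chunk whose estimated
    positive-class probability reaches the threshold tr.

    probs_per_chunk: P(positive | writings up to chunk i), chronological.
    Returns (decision, chunk); subjects never crossing the threshold are
    labeled negative after the last chunk (our assumption).
    """
    for chunk, p in enumerate(probs_per_chunk, start=1):
        if p >= tr:
            return 1, chunk
    return 0, len(probs_per_chunk)
```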
Testing stage. The previous results were obtained by training the classifiers on the TR_DS-train data set and testing them on the TR_DS-test data set. The obvious question now is whether similar results are obtained by training with the full training set of the pilot task (TR_DS) and using the classifiers on the TE_DS data set, which was incrementally released during the testing phase of the pilot task. In this new scenario, the TVT representation was used with a simple rule for the CTD issue that consists in classifying an individual as positive (depressed) in chunk 3 if a Naïve Bayes classifier produced a probability greater than or equal to 0.6 for the positive class. That strategy, which we will refer to as TVT³ (p ≥ 0.6), is motivated by the good results shown by TVT in Table 3. We also tested the BoW, CSA⋆ and TVT representations with dynamic strategies, using the probability thresholds that obtained the best values in the training stage. As baselines we also tested two approaches described in [9] that will be named Ran and Min. Ran simply emits a random decision (depressed/non-depressed) for each user in the first chunk. Min, on the other hand, stands for "minority" and consists in classifying every user as depressed in the first chunk.

Table 5 shows the performance of all the above mentioned approaches on the test set of the pilot task (TE_DS). We also included the results reported on the eRisk page for the systems that obtained the best ERDE5 (FHDO-BCSGB), ERDE50 (UNSLA) and F1 (FHDO-BCSGA) measures in the pilot task.

Table 5. Results on the TE_DS test set.

                   ERDE5  ERDE50  F1    π     ρ
Ran                16.83  14.63   0.17  0.11  0.4
Min                21.67  15.03   0.23  0.13  1
BoW (p ≥ 0.8)      16.45  10.87   0.38  0.25  0.77
CSA⋆-NB (p = 1)    20.58  19.58   0.05  0.03  0.15
TVT³ (p ≥ 0.6)     13.64  10.17   0.53  0.46  0.62
TVT-NB (p = 1)     12.38   9.84   0.42  0.50  0.37
FHDO-BCSGA         12.82   9.69   0.64  0.61  0.67
FHDO-BCSGB         12.70  10.39   0.55  0.69  0.46
UNSLA              13.66   9.68   0.59  0.48  0.79

Here we can observe that the results obtained with TVT³ (p ≥ 0.6) are not as good as those obtained in the training stage. However, the setting TVT-NB (p = 1) would have obtained the best ERDE5 score and the third best ERDE50 value, with a small difference with respect to the best reported value (9.84 versus 9.68).

Those good results of TVT were achieved using the best parameters obtained in the training stage. However, it is also interesting to analyse what TVT's performance would have been if other parameter settings had been selected. Table 6 reports the results obtained with different learning algorithms (Naïve Bayes and Random Forest) and different probability values for dynamic approaches to the CTD aspect.

Table 6. Results of TVT with different learning algorithms and probability values.

                   ERDE5  ERDE50  F1    π     ρ
TVT-NB (p ≥ 0.6)   13.59   8.40   0.50  0.37  0.75
TVT-NB (p ≥ 0.7)   13.43   8.24   0.51  0.39  0.75
TVT-NB (p ≥ 0.8)   13.13   8.17   0.54  0.42  0.73
TVT-NB (p ≥ 0.9)   13.07   8.35   0.52  0.42  0.69
TVT-NB (p = 1)     12.38   9.84   0.42  0.50  0.37
TVT-RF (p ≥ 0.6)   12.46   8.37   0.55  0.49  0.63
TVT-RF (p ≥ 0.7)   12.49   8.52   0.55  0.50  0.62
TVT-RF (p ≥ 0.8)   12.30   8.95   0.56  0.54  0.58
TVT-RF (p ≥ 0.9)   12.34  10.28   0.47  0.55  0.40
TVT-RF (p = 1)     12.82  11.82   0.20  0.67  0.12

The results are conclusive in this case. TVT shows a high robustness in the ERDE measures, independently of the algorithm used to learn the model and of the probability used in the dynamic approaches. Most of the ERDE5 values are low and in 7 out of 10 settings the ERDE50 values are lower than the best one reported in the pilot task (UNSLA: 9.68). In this context, TVT achieves the best ERDE5 value reported up to now (12.30) with the setting TVT-RF (p ≥ 0.8) and the lowest ERDE50 value (8.17) with the model TVT-NB (p ≥ 0.8).

4 Conclusions and future work

In this article we present temporal variation of terms (TVT), an approach for early risk detection based on using the variation of vocabulary along the different time steps as the concept space for document representation. TVT naturally copes with the sequential nature of ERD problems and also provides a tool for dealing with unbalanced data sets.
Preliminary results on the eRisk 2017 data set show a better performance of TVT in comparison with other successful semantic analysis approaches and the standard BoW representation. TVT also shows a robust performance along different parameter settings and reaches the best results reported so far for the ERDE5 and ERDE50 error evaluation measures.

As future work, we plan to apply the TVT approach to other problems that can be directly tackled as ERD problems, such as sexual predation and suicide discourse identification. Our first option will be the corpus used in the PAN-2012 competition on sexual predator identification [5], which shares several characteristics with the data set used in the present work, such as the sequentiality of data, unbalanced classes and the requirement of detecting the minority class (predator) as soon as possible, among others.

TVT is explicitly based on the enrichment of the minority class with new concepts derived from the partial information obtained from the initial chunks. However, some improvements could be achieved by also clustering the negative class, as proposed in [8] for author profiling tasks. We carried out some initial experiments combining TVT with the clustering of the negative class, but more study is required to determine how both approaches can be effectively integrated. Besides, in the present work, the choice of f = 4 mainly aimed at obtaining balanced positive and negative classes. In future work, different f values will be considered to see how they impact TVT's performance.

TVT provides, as a side effect, an interesting tool for dealing with the unbalanced data set problem. We plan to apply TVT to unbalanced data sets that do not necessarily correspond to the ERD field and compare it against other well-known methods in this area, such as SMOTE [1]. Finally, it would be interesting to compare the concept space used in our approach against other recent and effective representations based on word embeddings. In this context, it could also be analysed how our concept space representation can be extended/improved with information provided by those embeddings.

References

1. N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. Ph. Kegelmeyer. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res., 16:321-357, 2002.
2. S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the ASIS, 41(6):391-407, 1990.
3. H. Jair Escalante, M. Montes-y-Gómez, L. Villaseñor Pineda, and M. Errecalde. Early text classification: a naïve solution. In A. Balahur, E. Van der Goot, P. Vossen, and A. Montoyo, editors, Proc. of WASSA@NAACL-HLT 2016, San Diego, California, USA, pages 91-99. The Association for Computational Linguistics, 2016.
4. E. Gabrilovich and S. Markovitch. Wikipedia-based semantic interpretation for natural language processing. JAIR, 34(1):443-498, March 2009.
5. G. Inches and F. Crestani. Overview of the international sexual predator identification competition at PAN-2012. In P. Forner, J. Karlgren, and C. Womser-Hacker, editors, CLEF (Online Working Notes/Labs/Workshop), pages 1-12, 2012.
6. M. Lan, Ch. Tan, J. Su, and Y. Lu. Supervised and traditional term weighting methods for automatic text categorization. IEEE TPAMI, 31(4):721-735, 2009.
7. Z. Li, Z. Xiong, Y. Zhang, Ch. Liu, and K. Li. Fast text categorization using concise semantic analysis. Pattern Recogn. Lett., 32(3):441-448, February 2011.
8. A. Pastor López-Monroy, M. Montes-y-Gómez, H. Jair Escalante, L. Villaseñor-Pineda, and E. Stamatatos.
Discriminative subprofile-specific representations for author profiling in social media. Knowledge-Based Systems, 89:134-147, 2015.
9. D. E. Losada and F. Crestani. A test collection for research on depression and language use. In Experimental IR Meets Multilinguality, Multimodality, and Interaction - 7th Int. Conf. of the CLEF Association, Portugal, pages 28-39, 2016.