Temporal Variation of Terms as concept space for early risk prediction

Marcelo L. Errecalde¹, Ma. Paula Villegas¹, Dario G. Funez¹, Ma. José Garciarena Ucelay¹, and Leticia C. Cagnina¹,²

¹ LIDIC Research Group, Universidad Nacional de San Luis, Argentina
² Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET)
{merrecalde,villegasmariapaula74,funezdario}@gmail.com
{mjgarciarenaucelay,lcagnina}@gmail.com

Abstract. Early risk prediction involves three different aspects to be considered when an automatic classifier is implemented for this task: a) support for classification with partial information read up to different time steps, b) support for dealing with unbalanced data sets, and c) a policy to decide when a document can be classified as belonging to the relevant class with reasonable confidence. In this paper we propose an approach that naturally copes with the first two aspects and shows good prospects for dealing with the last one. Our proposal, named temporal variation of terms (TVT), is based on using the variation of vocabulary along the different time steps as the concept space to represent the documents. Results on the eRisk 2017 data set show a better performance of TVT in comparison with other successful semantic analysis approaches and the standard BoW representation. Besides, it also reaches the best results reported so far for the ERDE5 and ERDE50 error evaluation measures.

Keywords: Early Risk Detection, Unbalanced Data Sets, Text Representations, Semantic Analysis Techniques.

1 Introduction

Early risk detection (ERD) is a new research area potentially applicable to a wide variety of situations such as detection of potential paedophiles, people with suicidal inclinations, or people susceptible to depression, among others. In an ERD scenario, data are sequentially read as a stream and the challenge consists in detecting risk cases as soon as possible. A usual situation in these cases is that the target class (the risky one) is clearly under-sampled with respect to the control class (the non-risky one). That unequal distribution between the positive (minority) class and the negative one is a well-known problem in categorization tasks, popularly referred to as unbalanced data sets (UDS).

Besides dealing with the UDS problem, an ERD system needs to consider the problem of assigning a class to documents when only partial information is available. A document is processed as a sequence of terms, and the goal is to devise a method that can make predictions with the information read up to a specific point in time. That aspect, which could be named classification with partial information (CPI), might be addressed with a simple approach that consists in training with complete documents as usual and treating the partial documents read up to the classification point as standard "complete" documents. In [3] the CPI aspect was considered by analysing the robustness of the Naïve Bayes algorithm in dealing with partial information.

Last, but not least, an ERD system needs to consider not only which class should be assigned to a document, but also when to make that assignment. This aspect, which we will refer to as the classification time decision (CTD) issue, has been addressed with very simple heuristic rules (for instance, exceeding a specific confidence threshold in the prediction of the classifier [9]), although more elaborate approaches might be used.

In this article we propose an original idea that explicitly considers the sequentiality of data to deal with the unbalanced data sets problem.
In a nutshell, we use the temporal variation of terms as the concept space of a recent concise semantic analysis (CSA) approach [7]. CSA is an interesting document representation technique which models words and documents in a small concept space whose concepts are obtained from category labels. CSA has obtained good results in author profiling tasks [8], and the variant proposed in this article, named temporal variation of terms (TVT), seems to show some interesting characteristics for dealing with the ERD problem. In fact, it obtained a robust performance on the eRisk 2017 data set and reached the best (lowest) results reported so far for the ERDE5 and ERDE50 error evaluation measures.

The rest of this document is organized as follows: Section 2 describes our proposed method for the ERD problem. Section 3 shows the results obtained with our method on the eRisk 2017 data set. Finally, Section 4 outlines potential future work and the conclusions obtained.

2 The proposed method

Our method is based on the concise semantic analysis (CSA) technique proposed in [7] and later extended in [8] for author profiling tasks. Therefore, we first present in Subsection 2.1 the key aspects of CSA and then explain in Subsection 2.2 how we instantiate CSA with concepts derived from the terms used in the temporal chunks analysed by an ERD system at different time steps.

2.1 Concise Semantic Analysis

Standard text representation methods such as Bag of Words (BoW) suffer from two well-known drawbacks. First, their high dimensionality and sparsity; second, they do not capture relationships among words. CSA is a semantic analysis technique that aims at dealing with those shortcomings by interpreting words and documents in a space of concepts. Differently from other semantic analysis approaches such as latent semantic analysis (LSA) [2] and explicit semantic analysis (ESA) [4], which usually incur huge computing costs, CSA interprets words and text fragments in a space of concepts that are close (or equal) to the category labels. For instance, if the documents in the data set are labeled with q different category labels (usually no more than 100 elements), words and documents will be represented in a q-dimensional space. That space is usually much smaller than the one of standard BoW representations, whose size directly depends on the vocabulary size (more than 10000 or 20000 elements in general).

To explain the main concepts of the CSA technique we first introduce some basic notation that will be used in the rest of this work. Let D = {⟨d_1, y_1⟩, ..., ⟨d_n, y_n⟩} be a training set formed by n pairs of documents (d_i) and variables (y_i) that indicate the concept each document is associated with, y_i ∈ C, where C = {c_1, ..., c_q} is the concept space. For the moment, consider that these concepts correspond to standard category labels although, as we will see later, they might represent more elaborate aspects. In this context, we will denote by V = {t_1, ..., t_m} the vocabulary of terms of the collection being analysed.

Representing terms in the concept space. In CSA, each term t_i ∈ V is represented as a vector t_i ∈ R^q, t_i = ⟨t_{i,1}, ..., t_{i,q}⟩. Here, t_{i,j} represents the degree of association between the term t_i and the concept c_j, and its computation requires some basic steps that are explained below. First, the raw term-concept association between the i-th term and the j-th concept, denoted w_{ij}, is obtained. If D_{c_u} ⊆ D, D_{c_u} = {d_r | ⟨d_r, y_s⟩ ∈ D ∧ y_s = c_u}, is the subset of the training instances whose label is the concept c_u, then w_{ij} might be defined as:

    $w_{ij} = \log_2 \Big( 1 + \sum_{d_k \in D_{c_j}} \frac{tf_{ik}}{len(d_k)} \Big)$    (1)

where tf_{ik} is the number of occurrences of the term t_i in the document d_k and len(d_k) is the length (number of terms) of d_k. As noted in [7] and [8], the direct use of w_{ij} to represent terms in the vector t_i could be sensitive to highly unbalanced data. Thus, some kind of normalization is usually required and, in our case, we selected the one proposed in [8], a normalization over the m terms followed by one over the q concepts:

    $t'_{ij} = \frac{w_{ij}}{\sum_{i=1}^{m} w_{ij}}$    (2)        $t_{ij} = \frac{t'_{ij}}{\sum_{j=1}^{q} w_{ij}}$    (3)

With this last conversion we finally obtain, for each term t_i ∈ V, a q-dimensional vector t_i = ⟨t_{i,1}, ..., t_{i,q}⟩ defined over a space of q concepts. Up to now, those concepts correspond to the original categories used to label the documents. Later, we will use other more elaborate concepts.
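To fix ideas, the following is a minimal Python sketch of the term-weighting steps of Equations (1)-(3). It assumes documents are already tokenized into lists of terms; the function name and data layout are our own illustration, not an implementation from [7] or [8]:

```python
import math
from collections import Counter, defaultdict

def term_concept_matrix(docs, labels, concepts):
    """Concept-space term vectors following Eqs. (1)-(3).

    docs:     tokenized documents (lists of terms)
    labels:   the concept label of each document
    concepts: the ordered list of the q concept labels
    Returns a dict mapping each term to its q-dimensional vector t_i.
    """
    q = len(concepts)
    col = {c: j for j, c in enumerate(concepts)}

    # Eq. (1): w_ij = log2(1 + sum over d_k in D_cj of tf_ik / len(d_k))
    acc = defaultdict(lambda: [0.0] * q)
    for doc, y in zip(docs, labels):
        n = len(doc)
        for t, f in Counter(doc).items():
            acc[t][col[y]] += f / n
    w = {t: [math.log2(1 + v) for v in row] for t, row in acc.items()}

    # Eq. (2): normalize each concept column by its sum over the m terms
    csum = [sum(row[j] for row in w.values()) for j in range(q)]
    tp = {t: [row[j] / csum[j] if csum[j] else 0.0 for j in range(q)]
          for t, row in w.items()}

    # Eq. (3): divide by the per-term sum of raw weights over the q concepts
    return {t: [tp[t][j] / sum(w[t]) if sum(w[t]) else 0.0 for j in range(q)]
            for t in w}
```

For TVT, described in Subsection 2.2, docs and labels will be the chunk-derived documents and their temporal concept labels.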
Representing documents in the concept space. Once the terms are represented in the q-dimensional concept space, those vectors can be used to represent documents in the same space. In CSA, documents are represented as the central vector of all the term vectors they contain [7]. Terms have different importance for different documents, so it is not a good idea to compute that vector as the simple average of all the term vectors of the document. Previous works on BoW [6] have considered different statistical techniques to weight the importance of terms in a document, such as tf-idf, tf-ig, tf-χ² or tf-rf, among others. Here, we will use the approach used in [8] for author profiling, which represents each document d_k as the weighted aggregation of the representations (vectors) of the terms it contains:

    $\mathbf{d}_k = \sum_{t_i \in d_k} \frac{tf_{ik}}{len(d_k)} \cdot \mathbf{t}_i$    (4)

Thus, documents are also represented in a q-dimensional concept space (i.e., d_k ∈ R^q), which is much smaller in dimensionality than the one required by standard BoW approaches (q ≪ m).
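Continuing the sketch above, Equation (4) reduces to a tf-weighted sum of term vectors. How out-of-vocabulary terms are handled is our assumption (they are simply skipped here):

```python
from collections import Counter

def document_vector(doc, term_vecs, q):
    """Eq. (4): represent a tokenized document as the tf-weighted
    aggregation of the concept-space vectors of its terms."""
    d = [0.0] * q
    n = len(doc)
    for t, f in Counter(doc).items():
        if t in term_vecs:  # skip terms unseen during training (our choice)
            for j in range(q):
                d[j] += (f / n) * term_vecs[t][j]
    return d
```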
2.2 Temporal Variation of Terms

In Subsection 2.1 we said that the concept space C usually corresponds to the standard category names used to label the training instances in supervised text categorization tasks. In this scenario, which in [7] is referred to as direct derivation, each category label simply corresponds to a concept. However, [7] also proposes other alternatives such as split derivation and combined derivation. The former uses the low-level labels in hierarchical corpora and the latter is based on combining semantically related labels into a unique concept. In [8] those ideas are extended by first clustering each category of the corpora and then using those subgroups (sub-clusters) as the new concept space (in that work, concepts are referred to as profiles and subgroups as sub-profiles).

As we can see, the idea common to all the above approaches is that once a set of documents is identified as belonging to a group/category, that category can be considered as a concept and CSA can be applied in the usual way. We take a similar view by considering that the positive (minority) class in ERD problems can be augmented with concepts derived from the sets of partial documents read along the different time steps. In order to understand this idea it is necessary to first introduce a sequential work scheme like the one proposed in [9] for research on ERD systems for depression cases.

Following [9], we will assume a corpus of documents written by p different individuals ({I_1, ..., I_p}). For each individual I_l (l ∈ {1, ..., p}), the n_l documents that he has written are provided in chronological order (from the oldest text to the most recent one): D_{I_l,1}, D_{I_l,2}, ..., D_{I_l,n_l}. In this context, given these p streams of messages, the ERD system has to process every sequence of messages (in the chronological order they are produced) and make a binary decision (as early as possible) on whether or not the individual might be a positive case of depression. Evaluation metrics for this task must be time-aware, so an early risk detection error (ERDE) is proposed. This metric takes into account not only the correctness of the (binary) decision but also the delay taken by the system to make that decision.

In a usual supervised text categorization task, we would only have two category labels: positive (risk/depressive case) and negative (non-risk/non-depressive case). That would only give two concepts for a CSA representation. However, in ERD problems there is additional temporal information that could be used to obtain an improved concept space. For instance, the training set could be split into h "chunks", Ĉ1, Ĉ2, ..., Ĉh, in such a way that Ĉ1 contains the oldest writings of all users (the first (100/h)% of submitted posts or comments), chunk Ĉ2 contains the second oldest writings, and so forth. Each chunk Ĉk can be partitioned into two subsets Ĉk⁺ and Ĉk⁻, Ĉk = Ĉk⁺ ∪ Ĉk⁻, where Ĉk⁺ contains the positive cases of chunk Ĉk and Ĉk⁻ the negative ones.

It is interesting to note that we can also consider the data sets that result from concatenating chunks that are contiguous in time, using the notation Ĉi−j to refer to the chunk obtained from concatenating all the (original) chunks from the i-th chunk to the j-th chunk (inclusive). Thus, Ĉ1−h will represent the data set with the complete streams of messages of all the p individuals. In this case, Ĉ1−h⁺ and Ĉ1−h⁻ will have the obvious semantics specified above for the complete documents of the training set.

The classic way of constructing a classifier would be to take the complete documents of the p individuals (Ĉ1−h) and use an inductive learning algorithm such as SVM or Naïve Bayes to obtain that classifier. As we mentioned earlier, another important aspect of ERD systems is that the classification problem being addressed is usually highly unbalanced (the UDS problem). That is, the number of documents of the majority/negative class (non-depression) is significantly larger than that of the minority/positive class (depression). More formally, following the previously specified notation, |Ĉ1−h⁻| ≫ |Ĉ1−h⁺|.

An alternative to alleviate the UDS problem would be to consider that the minority class is formed not only by the complete documents of the individuals but also by the partial documents obtained in the different chunks. Following the general ideas posed in CSA, we could consider that the partial documents read in the different chunks represent temporal concepts that should be taken into account. In this context, one might think that the variations of the terms used in these different sequential stages of the documents carry relevant information for the classification task. With this idea in mind arises the method proposed in this work, named temporal variation of terms (TVT), which consists in enriching the documents of the minority class with the partial documents read in the first chunks. These first chunks of the minority class, along with its complete documents, will be considered as a new concept space for a CSA method. Therefore, in TVT we first determine the number f of initial chunks that will be used to enrich the minority (positive) class. Then, we use the document sets Ĉ1⁺, Ĉ1−2⁺, ..., Ĉ1−f⁺ and Ĉ1−h⁺ as concepts for the positive class and Ĉ1−h⁻ for the negative class. Finally, we represent terms and documents in this new (f + 2)-dimensional space using the CSA approach explained in Section 2.1.
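A sketch of how this enriched, concept-labeled training corpus could be assembled is shown below. It assumes each user's writings are tokenized and chronologically ordered, and that chunk boundaries are computed per user as equal fractions of their writings, as in [9]; the function and label names are hypothetical:

```python
def tvt_training_corpus(users, h=10, f=4):
    """Assemble the TVT concept-labeled corpus from chronological streams.

    users: list of (writings, label) pairs; writings is the chronologically
           ordered list of a user's tokenized posts, label is "pos" or "neg".
    Positive users contribute one document per cumulative chunk
    C_1, C_1-2, ..., C_1-f plus their full stream C_1-h; negative users
    contribute only their full stream.
    """
    docs, labels = [], []
    for writings, label in users:
        full = [t for post in writings for t in post]
        if label == "pos":
            for j in range(1, f + 1):
                cut = max(1, round(len(writings) * j / h))  # first j of h chunks
                docs.append([t for post in writings[:cut] for t in post])
                labels.append(f"pos_1-{j}")  # temporal concept label
            docs.append(full)
            labels.append(f"pos_1-{h}")      # the complete positive stream
        else:
            docs.append(full)
            labels.append("neg")
    return docs, labels  # feed these to the CSA sketch of Subsection 2.1
```

With h = 10 and f = 4, this yields the f + 2 = 6 concepts (5 positive, 1 negative) used in the experiments below.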
3 Experimental Analysis

3.1 Data Set

Our approach was tested on the data set used in the eRisk 2017 pilot task (http://early.irlab.org/task.html) and described in [9]. It is a collection of writings (posts or comments) from a set of Social Media users. There are two categories of users, depressed and non-depressed, and, for each user, the collection contains a sequence of writings (in chronological order). For each user, the collection of writings has been divided into 10 chunks. The first chunk contains the oldest 10% of the messages, the second chunk the second oldest 10%, and so forth. This collection was split into a training and a test set that we will refer to as TR_DS and TE_DS respectively. The (training) TR_DS set contained 486 users (83 positive, 403 negative) and the (test) TE_DS set contained 401 users (52 positive, 349 negative). The users labeled as positive are those that have explicitly mentioned that they have been diagnosed with depression.

This task was divided into a training stage and a testing stage. In the first one, the participating teams had access to the TR_DS set with all chunks of all training users, so they could tune their systems with the training data. To reproduce the same conditions of the pilot task, we used the training set (TR_DS) to generate a new corpus divided into a training set (that we will refer to as TR_DS-train) and a test set (named TR_DS-test) with the same categories (depressed and non-depressed) for each sequence of writings of the users in the collection. Those sets maintained the same proportions of posts per user and words per user as described in [9]. TR_DS-train and TR_DS-test were generated by randomly selecting around 70% of the users for the former and the remaining 30% for the latter. Thus, TR_DS-train resulted in 351 users (63 positive, 288 negative) while TR_DS-test contains 135 users (20 positive, 115 negative). In the pilot task the collection of writings was divided into 10 chunks, so we made the same division on TR_DS-train and TR_DS-test.

3.2 Experimental Results

We reproduced the same conditions faced by the participants of the eRisk pilot task, so we first worked on the data set released for the training stage (TR_DS) and then the obtained models were tested on the test stage (TE_DS). The activities carried out at each stage are described below.

Training stage. CSA is a document representation that aims at addressing some drawbacks of classical representations such as BoW. TVT, in turn, is supposed to extend CSA by defining concepts that capture the sequential aspects of ERD problems and the variations of vocabulary observed in the distinct stages of the individuals' writings. Thus, CSA and BoW arise as obvious candidates against which to compare TVT on the data set used in the pilot task. Those three representations were evaluated with different learning algorithms such as SVM, Naïve Bayes and Random Forest, among others. In each case, the best parameters were selected for each algorithm-representation combination ("model") and the reported results correspond to the best obtained values.

We tested BoW with different weighting schemes and learning algorithms but, in all cases, the best results were obtained with binary representations and the Naïve Bayes algorithm; from now on, all references to BoW will stand for that setting.
We use CSA with term representations normalized according to Equations (2) and (3) and document representations obtained from Equation (4), as proposed in [8] for author profiling tasks. We named this setting CSA⋆. For the TVT representation, a decision must be made regarding the number f of chunks that will enrich the minority (positive) class. In our studies we use f = 4 and, in consequence, the positive class was represented by 5 concepts. In that way, the number of documents in the depressed class was increased by a factor of 5 with respect to the original size, from 83 positive instances to 415. As we can see, with this technique we also obtain some kind of balancing in the sizes of both classes, thereby addressing the usual problem that we previously referred to as the UDS problem.

A particularity that ERD methods must consider is the criterion used to decide when (in what situations) the classification generated by the system is considered the final/definitive decision on the evaluated instances (the classification time decision (CTD) issue). We will start our evaluation of the different document representations and algorithms assuming that the classification is made on a static "chunk by chunk" basis. That is, for each chunk Ĉi provided to the ERD systems, we will evaluate their performance considering that all the models are (simultaneously) applied to the writings received up to chunk Ĉi. With this kind of information it will be possible to observe to what extent the different approaches are robust to partial information in the different stages, at which moment they start to obtain acceptable results, and other interesting statistics.

Tables 1, 2 and 3 show the results of the experiments for this static "chunk by chunk" classification scheme. Values of precision (π), recall (ρ) and the F1-measure (F1) of the target (depressed) class are reported for each considered model. Statistics also include the early risk detection error (ERDE) measure proposed in [9]. This measure considers not only the correctness of the decision made by the system but also the delay in making that decision. ERDE uses specific costs to penalize false positives and false negatives. However, ERDE treats the two possible successful predictions (true negatives and true positives) differently. True negatives have no cost (cost = 0), but ERDE associates with true positives a cost for the delay in their detection that monotonically increases with the number k of textual items seen before giving the answer. In a nutshell, that cost is low when k is lower than a threshold value o but rapidly approaches 1 when k > o. In that way, o represents some type of urgency in detecting depression cases: the lower the o value, the higher the urgency in detecting the positive cases. A more detailed description of ERDE can be found in [9]. In our study we consider the two values of o used in the pilot task: o = 5 (ERDE5) and o = 50 (ERDE50).
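As a reference, the per-subject error can be sketched as follows. This follows our reading of the definition in [9] (the reported ERDE is the mean over all subjects, expressed as a percentage); treat the exact cost constants as assumptions:

```python
import math

def erde(decision, truth, k, o, c_fp=52/401, c_fn=1.0, c_tp=1.0):
    """Per-subject early risk detection error, after the definition in [9].

    decision, truth: 1 = positive (depressed), 0 = negative
    k: number of textual items read before the decision was emitted
    o: urgency parameter (5 or 50 in the pilot task)
    The c_fp default (proportion of positive test users) is our assumption
    about how [9] instantiates this cost.
    """
    if decision == 1 and truth == 0:
        return c_fp
    if decision == 0 and truth == 1:
        return c_fn
    if decision == 1 and truth == 1:
        # latency cost 1/(1 + e^(o - k)): near 0 while k < o, rapidly
        # approaching 1 once k exceeds o (this form avoids overflow)
        return c_tp / (1.0 + math.exp(o - k))
    return 0.0  # true negatives have no cost
```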
In each chunk, classifiers usually produce their predictions with some kind of "confidence", in general the estimated probability of the predicted class. In those cases, we can select different thresholds tr and consider that an instance (document) is assigned to the target class when its associated probability p is greater than or equal to a certain threshold tr (p ≥ tr). In this study we evaluated 5 different settings for the probabilities assigned by each classifier: p = 1, p ≥ 0.9, p ≥ 0.8, p ≥ 0.7 and p ≥ 0.6. Due to space constraints, only the best results obtained with a particular setting are shown (all the tables generated for the different probabilities can be downloaded from https://sites.google.com/site/lcagnina/research/Tables_eRisk17.rar).

Table 1 shows the results obtained with a BoW representation and a Naïve Bayes classifier. Those values correspond to the setting where an instance is considered as depressive if the classifier assigns to the target/positive class a probability greater than or equal to 0.8 (p ≥ 0.8). Surprisingly, the best results for all the considered measures are obtained on the first chunk. Even in this chunk, the model only recovers 45% of the depressed individuals. However, this is not the worst aspect: only 12% of the individuals classified as depressed effectively had this condition, resulting in a very low F1 measure (0.19).

Table 1. Model: BoW + Naïve Bayes (p ≥ 0.8). Chunk-by-chunk setting. ERDE5, ERDE50, F1-measure (F1), precision (π) and recall (ρ) of the depressed class.

          ch1    ch2    ch3    ch4    ch5    ch6    ch7    ch8    ch9    ch10
ERDE5    18.09  20.98  21.5   21.73  21.95  21.95  21.95  21.95  22.17  22.17
ERDE50   15.17  16.84  20.77  20.25  21.21  21.95  21.95  21.52  22.17  22.17
F1        0.19   0.16   0.09   0.11   0.09   0.09   0.09   0.09   0.09   0.13
π         0.12   0.11   0.06   0.17   0.06   0.06   0.06   0.06   0.06   0.08
ρ         0.45   0.35   0.2    0.25   0.2    0.2    0.2    0.2    0.2    0.3

Table 2 shows similar results when a CSA⋆-RF (random forest) combination with p ≥ 0.6 is used to classify the writings of the individuals. Here, the F1 measure is also low, but we can observe a deterioration in the ERDE5 and ERDE50 error values with respect to the previous model.

Table 2. Model: CSA⋆ + RF (p ≥ 0.6). Chunk-by-chunk setting. ERDE5, ERDE50, F1-measure (F1), precision (π) and recall (ρ) of the depressed class.

          ch1    ch2    ch3    ch4    ch5    ch6    ch7    ch8    ch9    ch10
ERDE5    21.93  25.64  25.46  25.57  26.12  25.68  25.68  25.46  25.35  25.68
ERDE50   19.47  24.94  25.46  23.35  25.37  24.2   23.46  22.5   22.39  23.47
F1        0.19   0.08   0.05   0.1    0.06   0.08   0.13   0.16   0.16   0.14
π         0.11   0.05   0.03   0.06   0.04   0.05   0.07   0.09   0.09   0.08
ρ         0.6    0.25   0.15   0.3    0.2    0.25   0.4    0.5    0.5    0.45

Finally, Table 3 shows the results of TVT with a Naïve Bayes algorithm and p ≥ 0.6. There, we can see a remarkable improvement in the performance of the classifier in chunk 3, with excellent values of ERDE50 (7.02), precision π (0.63), recall ρ (0.85) and F1 measure (0.72). Analysing the results along the 10 considered chunks, we observe how the measures keep improving from chunk 1 up to the best values in chunk 3 and, from then on, start to deteriorate chunk by chunk, with the worst results on the last two chunks.

Table 3. Model: TVT + Naïve Bayes (p ≥ 0.6). Chunk-by-chunk setting. ERDE5, ERDE50, F1-measure (F1), precision (π) and recall (ρ) of the depressed class.

          ch1    ch2    ch3    ch4    ch5    ch6    ch7    ch8    ch9    ch10
ERDE5    14.24  14.27  14.59  14.83  15.17  15.51  15.74  15.84  16.21  16.13
ERDE50   10.80   7.22   7.02   9.24   9.25   9.97  10.73  10.73  11.06  10.96
F1        0.42   0.65   0.72   0.67   0.67   0.67   0.64   0.64   0.57   0.58
π         0.39   0.58   0.63   0.60   0.60   0.60   0.58   0.58   0.50   0.52
ρ         0.45   0.75   0.85   0.75   0.75   0.75   0.70   0.70   0.65   0.65

As weak points of those results, we can say that the best ERDE5 value, obtained in chunk 1, is not very good. Besides, even though the ERDE50 values are acceptable for most of the considered chunks, TVT needs at least two chunks to show a competitive performance. That aspect looks reasonable if we consider that TVT is based on the variation of terms between consecutive chunks, information that is not available in the first chunk.

As a general conclusion of the chunk-by-chunk analysis, we could say that imbalanced classes seem to affect the different methods in different ways. BoW and CSA directly depend on the vocabulary of the positive and negative classes. In the first chunk, where texts are supposed to be the shortest, relevant words of the positive class appearing in the posts will probably have more chance of being balanced with respect to the words appearing in the negative class. That makes classifiers more sensitive to the positive class and, in consequence, the recall and the general performance are improved. As more information is read, words related to the negative class become more likely to occur, introducing noise and consequently hurting the performance. TVT does not seem to be so affected by this problem, showing a more stable performance along all the chunks, with the best results in the third chunk and a slight deterioration from then on. Those results could be giving evidence that the variation of terms (with f = 4) allows a better detection of the occurrence of relevant words of the positive class in the first chunks. However, TVT also seems to be affected by the imbalance problem in subsequent chunks, although to a lesser degree than the BoW and CSA representations. Unfortunately, verifying those hypotheses would require considering more balanced settings and different f values, which is out of the scope of this paper. However, that important aspect will be addressed in future work.
Another approach to the CTD issue could be to directly use the probability (or some measure of confidence) assigned by the classifier to decide when to stop reading a stream and emit its classification. That approach, which in [9] is referred to as "dynamic", simply requires that this probability exceed some particular threshold in order to classify the instance/individual as positive. That means that different streams of messages could be classified as depressed at different stages (chunks). Table 4 shows those statistics for the BoW, CSA⋆ and TVT representations with the learning algorithms and probability thresholds that obtained the best performance.

Table 4. Dynamic models for BoW-NB, CSA⋆-NB and TVT-NB.

                   ERDE5  ERDE50  F1    π     ρ
BoW-NB (p ≥ 0.8)   21.05  18.13   0.24  0.14  0.75
CSA⋆-NB (p = 1)    23.09  23.07   0.06  0.04  0.15
TVT-NB (p = 1)     14.13  11.25   0.40  0.47  0.35

There, we can see that the TVT representation, with Naïve Bayes and classifying instances as depressed when the assigned probability is 1, obtains the best results for the measures we are most interested in: ERDE5, ERDE50 and the F1-measure. In this context, BoW gets a better recall value but at the expense of lowering the precision values, resulting in a poor F1-measure.
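A dynamic policy of this kind is straightforward to sketch (our own illustration; how undecided subjects are handled at the end of the stream is an assumption):

```python
def dynamic_decision(probs_per_chunk, tr=0.6):
    """Emit a positive decision at the first chunk whose estimated
    positive-class probability reaches the threshold tr.

    probs_per_chunk: P(positive | writings up to chunk i), chronological.
    Returns (decision, chunk); subjects never crossing the threshold are
    labeled negative after the last chunk (our assumption).
    """
    for chunk, p in enumerate(probs_per_chunk, start=1):
        if p >= tr:
            return 1, chunk
    return 0, len(probs_per_chunk)
```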
Testing stage. The previous results were obtained by training the classifiers on the TR_DS-train data set and testing them on the TR_DS-test data set. The obvious question now is whether similar results are obtained by training with the full training set of the pilot task (TR_DS) and using the classifiers on the TE_DS data set, which was incrementally released during the testing phase of the pilot task. In this new scenario, the TVT representation was used with a simple rule for the CTD issue that consists in classifying an individual as positive (depressed) in chunk 3 if a Naïve Bayes classifier produced a probability greater than or equal to 0.6 for the positive class. That strategy, which we will refer to as TVT³ (p ≥ 0.6), is motivated by the good results shown by TVT in Table 3. We also tested the BoW, CSA⋆ and TVT representations with dynamic strategies, using the probability thresholds that obtained the best values in the training stage. As baselines we also tested two approaches described in [9] that will be named Ran and Min. Ran simply emits a random decision (depressed/non-depressed) for each user in the first chunk. Min, on the other hand, stands for "minority" and consists in classifying every user as depressed in the first chunk.

Table 5 shows the performance of all the above mentioned approaches on the test set of the pilot task (TE_DS). We also included the results reported on the eRisk page for the systems that obtained the best ERDE5 (FHDO-BCSGB), ERDE50 (UNSLA) and F1 (FHDO-BCSGA) measures in the pilot task.

Table 5. Results on the TE_DS test set.

                   ERDE5  ERDE50  F1    π     ρ
Ran                16.83  14.63   0.17  0.11  0.4
Min                21.67  15.03   0.23  0.13  1
BoW (p ≥ 0.8)      16.45  10.87   0.38  0.25  0.77
CSA⋆-NB (p = 1)    20.58  19.58   0.05  0.03  0.15
TVT³ (p ≥ 0.6)     13.64  10.17   0.53  0.46  0.62
TVT-NB (p = 1)     12.38   9.84   0.42  0.50  0.37
FHDO-BCSGA         12.82   9.69   0.64  0.61  0.67
FHDO-BCSGB         12.70  10.39   0.55  0.69  0.46
UNSLA              13.66   9.68   0.59  0.48  0.79

Here we can observe that the results obtained with TVT³ (p ≥ 0.6) are not as good as those obtained in the training stage. However, the setting TVT-NB (p = 1) would have obtained the best ERDE5 score and the third best ERDE50 value, with a small difference with respect to the best reported value (9.84 versus 9.68).

Those good results of TVT were achieved using the best parameters obtained in the training stage. However, it is also interesting to analyse what TVT's performance would have been if other parameter settings had been selected. Table 6 reports the results obtained with different learning algorithms (Naïve Bayes and Random Forest) and different probability values for dynamic approaches to the CTD aspect.

Table 6. Results of TVT with different learning algorithms and probability values.

                   ERDE5  ERDE50  F1    π     ρ
TVT-NB (p ≥ 0.6)   13.59   8.40   0.50  0.37  0.75
TVT-NB (p ≥ 0.7)   13.43   8.24   0.51  0.39  0.75
TVT-NB (p ≥ 0.8)   13.13   8.17   0.54  0.42  0.73
TVT-NB (p ≥ 0.9)   13.07   8.35   0.52  0.42  0.69
TVT-NB (p = 1)     12.38   9.84   0.42  0.50  0.37
TVT-RF (p ≥ 0.6)   12.46   8.37   0.55  0.49  0.63
TVT-RF (p ≥ 0.7)   12.49   8.52   0.55  0.50  0.62
TVT-RF (p ≥ 0.8)   12.30   8.95   0.56  0.54  0.58
TVT-RF (p ≥ 0.9)   12.34  10.28   0.47  0.55  0.40
TVT-RF (p = 1)     12.82  11.82   0.20  0.67  0.12

The results are conclusive in this case. TVT shows a high robustness in the ERDE measures, independently of the algorithm used to learn the model and of the probability used in the dynamic approaches. Most of the ERDE5 values are low and in 7 out of 10 settings the ERDE50 values are lower than the best one reported in the pilot task (UNSLA: 9.68). In this context, TVT achieves the best ERDE5 value reported up to now (12.30) with the setting TVT-RF (p ≥ 0.8) and the lowest ERDE50 value (8.17) with the model TVT-NB (p ≥ 0.8).

4 Conclusions and future work

In this article we present temporal variation of terms (TVT), an approach for early risk detection based on using the variation of vocabulary along the different time steps as the concept space for document representation. TVT naturally copes with the sequential nature of ERD problems and also provides a tool for dealing with unbalanced data sets.
Preliminary results on the eRisk 2017 data set show a better performance of TVT in comparison with other successful semantic analysis approaches and the standard BoW representation. TVT also shows a robust performance along different parameter settings and reaches the best results reported so far for the ERDE5 and ERDE50 error evaluation measures.

As future work, we plan to apply the TVT approach to other problems that can be directly tackled as ERD problems, such as sexual predation and suicide discourse identification. Our first option will be the corpus used in the PAN-2012 competition on sexual predator identification [5], which shares several characteristics with the data set used in the present work, such as the sequentiality of data, unbalanced classes and the requirement of detecting the minority class (predator) as soon as possible, among others.

TVT is explicitly based on the enrichment of the minority class with new concepts derived from the partial information obtained from the initial chunks. However, some improvements could be achieved by also clustering the negative class, as proposed in [8] for author profiling tasks. We carried out some initial experiments combining TVT with the clustering of the negative class, but more study is required to determine how both approaches can be effectively integrated. Besides, in the present work, the choice of f = 4 mainly aimed at obtaining balanced positive and negative classes. In future work, different f values will be considered to see how they impact TVT's performance.

TVT provides, as a side effect, an interesting tool for dealing with the unbalanced data set problem. We plan to apply TVT to unbalanced data sets that do not necessarily correspond to the ERD field and compare it against other well-known methods in this area, such as SMOTE [1]. Finally, it would be interesting to compare the concept space used in our approach against other recent and effective representations based on word embeddings. In this context, it could also be analysed how our concept space representation can be extended/improved with information provided by those embeddings.

References

1. N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. Ph. Kegelmeyer. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res., 16:321-357, 2002.
2. S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the ASIS, 41(6):391-407, 1990.
3. H. Jair Escalante, M. Montes-y-Gómez, L. Villaseñor Pineda, and M. Errecalde. Early text classification: a naïve solution. In A. Balahur, E. Van der Goot, P. Vossen, and A. Montoyo, editors, Proc. of WASSA@NAACL-HLT 2016, San Diego, California, USA, pages 91-99. The Association for Computational Linguistics, 2016.
4. E. Gabrilovich and S. Markovitch. Wikipedia-based semantic interpretation for natural language processing. JAIR, 34(1):443-498, March 2009.
5. G. Inches and F. Crestani. Overview of the international sexual predator identification competition at PAN-2012. In P. Forner, J. Karlgren, and C. Womser-Hacker, editors, CLEF (Online Working Notes/Labs/Workshop), pages 1-12, 2012.
6. M. Lan, Ch. Tan, J. Su, and Y. Lu. Supervised and traditional term weighting methods for automatic text categorization. IEEE TPAMI, 31(4):721-735, 2009.
7. Z. Li, Z. Xiong, Y. Zhang, Ch. Liu, and K. Li. Fast text categorization using concise semantic analysis. Pattern Recogn. Lett., 32(3):441-448, February 2011.
8. A. Pastor López-Monroy, M. Montes-y-Gómez, H. Jair Escalante, L. Villaseñor-Pineda, and E. Stamatatos.
Discriminative subprofile-specific representations for author profiling in social media. Knowledge-Based Systems, 89:134-147, 2015.
9. D. E. Losada and F. Crestani. A test collection for research on depression and language use. In Experimental IR Meets Multilinguality, Multimodality, and Interaction - 7th Int. Conf. of the CLEF Association, Portugal, pages 28-39, 2016.