specific point in time. This aspect, which could be named classification with partial information (CPI), might be addressed with a simple approach that consists in training with complete documents as usual and treating the partial documents read up to the classification point as standard complete documents. In [3] the CPI aspect was considered by analysing the robustness of the Naïve Bayes algorithm when dealing with partial information.

Last, but not least, an ERD system needs to consider not only which class should be assigned to a document, but also when to make that assignment. This aspect, which we will refer to as the classification time decision (CTD) issue, has been addressed with very simple heuristic rules³, although more elaborate approaches might be used.

In this article we propose an original idea that explicitly considers the sequentiality of data to deal with the unbalanced data sets problem. In a nutshell, we use the temporal variation of terms as the concept space of a recent concise semantic analysis (CSA) approach [7]. CSA is an interesting document representation technique which models words and documents in a small concept space whose concepts are obtained from category labels. CSA has obtained good results in author profiling tasks [8], and the variant proposed in this article, named temporal variation of terms (TVT), seems to show some interesting characteristics for dealing with the ERD problem. In fact, it obtained a robust performance on the eRisk 2017 data set and reached the best (lowest) results reported up to the moment for the $ERDE_5$ and $ERDE_{50}$ error evaluation measures.

The rest of this document is organized as follows: Section 2 describes our proposed method for the ERD problem. Section 3 shows the results obtained with our method on the eRisk 2017 data set. Finally, Section 4 presents potential future work and the conclusions obtained.

³ For instance, exceeding a specific confidence threshold in the prediction of the classifier [9].

2.1 Concise Semantic Analysis

Standard text representation methods such as Bag of Words (BoW) suffer from two well-known drawbacks: first, their high dimensionality and sparsity; second, they do not capture relationships among words. CSA is a semantic analysis technique that aims at dealing with those shortcomings by interpreting words and documents in a space of concepts. Unlike other semantic analysis approaches such as latent semantic analysis (LSA) [2] and explicit semantic analysis (ESA) [4], which usually require huge computing costs, CSA interprets words and text fragments in a space of concepts that are close (or equal) to the category labels. For instance, if the documents in the data set are labeled with q different category labels (usually no more than 100 elements), words and documents will be represented in a q-dimensional space. That space is usually much smaller than standard BoW representations, whose size directly depends on the vocabulary size (more than 10000 or 20000 elements in general).

To explain the main concepts of the CSA technique we first introduce some basic notation that will be used in the rest of this work. Let $D = \{\langle d_1, y_1\rangle, \ldots, \langle d_n, y_n\rangle\}$ be a training set formed by $n$ pairs of documents ($d_i$) and variables ($y_i$) that indicate the concept the document is associated with, where $y_i \in C$ and $C = \{c_1, \ldots, c_q\}$ is the concept space. For the moment, consider that these concepts correspond to standard category labels although, as we will see later, they might represent more elaborate aspects. In this context, we will denote by $V = \{t_1, \ldots, t_m\}$ the vocabulary of terms of the collection being analysed.
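The idea of representing terms and documents in a q-dimensional concept space can be illustrated with a minimal Python sketch. The weighting used here (log-scaled term frequency normalised by document length, accumulated per concept) and the function names `term_concept_vectors` and `document_vector` are assumptions made for illustration; they are a simplification and not necessarily the exact weighting and normalisation used by CSA.

```python
import math
from collections import Counter, defaultdict


def term_concept_vectors(docs, labels, concepts):
    """Map every term to a q-dimensional vector with one weight per concept.

    docs: list of token lists (d_1 ... d_n)
    labels: list of concept labels (y_1 ... y_n), each one in `concepts`
    concepts: ordered list of the q concepts (c_1 ... c_q)
    """
    index = {c: j for j, c in enumerate(concepts)}
    vectors = defaultdict(lambda: [0.0] * len(concepts))
    for tokens, label in zip(docs, labels):
        length = len(tokens)
        for term, tf in Counter(tokens).items():
            # Assumed weighting: log-scaled, length-normalised term frequency.
            vectors[term][index[label]] += math.log2(1 + tf / length)
    return dict(vectors)


def document_vector(tokens, vectors, q):
    """Represent a document as the frequency-weighted sum of its term vectors."""
    doc = [0.0] * q
    for term, tf in Counter(tokens).items():
        # Terms unseen during training contribute nothing.
        for j, weight in enumerate(vectors.get(term, [0.0] * q)):
            doc[j] += tf * weight
    return doc


# Toy collection with q = 2 concepts.
concepts = ["positive", "negative"]
docs = [["sad", "tired", "sad"], ["happy", "game"], ["tired", "game"]]
labels = ["positive", "negative", "negative"]

vectors = term_concept_vectors(docs, labels, concepts)
new_doc = document_vector(["sad", "game"], vectors, len(concepts))
```

Whatever the vocabulary size, every document ends up with exactly q components, which is the dimensionality reduction the text refers to.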

⁴ In that work, concepts are referred to as profiles and subgroups as sub-profiles.

2.2 Temporal Variation of Terms

An alternative to try to alleviate the UDS problem would be to consider that the minority class is formed not only by the complete documents of the individuals but also by the partial documents obtained in the different chunks. Following the general ideas posed in CSA, we could consider that the partial documents read in the different chunks represent temporal concepts that should be taken into account. In this context, one might think that variations of the terms used in these different sequential stages of the documents may carry relevant information for the classification task. With this idea in mind, the method proposed in this work, named temporal variation of terms (TVT), arises. It consists in enriching the documents of the minority class with the partial documents read in the first chunks. These first chunks of the minority class, along with their complete documents, will be considered as a new concept space for a CSA method.
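The enrichment step just described can be sketched as follows. This is only an illustration under stated assumptions: the function name `build_tvt_training_set`, the parameter `n_first_chunks`, and the label names are hypothetical, and a partial document is assumed to be the concatenation of everything read up to a given chunk.

```python
def build_tvt_training_set(users, n_first_chunks):
    """Build a training set whose labels form a TVT-style concept space.

    users: list of (chunks, is_minority) pairs, where `chunks` is the
           chronologically ordered list of token lists written by one user.
    n_first_chunks: how many initial chunks become temporal concepts
                    (hypothetical parameter name).
    """
    docs, labels = [], []
    for chunks, is_minority in users:
        complete = [token for chunk in chunks for token in chunk]
        if not is_minority:
            docs.append(complete)
            labels.append("negative")
            continue
        # Complete documents of the minority class keep their own concept.
        docs.append(complete)
        labels.append("positive")
        # Assumption: a partial document is everything read up to chunk i.
        partial = []
        for i, chunk in enumerate(chunks[:n_first_chunks], start=1):
            partial = partial + chunk
            docs.append(partial)
            labels.append(f"positive_chunk_{i}")
    return docs, labels


users = [
    ([["a"], ["b"], ["c"]], True),   # minority-class user, 3 chunks
    ([["x"], ["y"]], False),         # majority-class user, 2 chunks
]
docs, labels = build_tvt_training_set(users, n_first_chunks=2)
```

With two initial chunks the minority class contributes three concepts instead of one, so it is represented by more training documents than in the original data set.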

3 Experimental Analysis

3.1 Data Set

3.2 Experimental Results

In our study we consider the two values of $o$ used in the pilot task: $o = 5$ ($ERDE_5$) and $o = 50$ ($ERDE_{50}$). In each chunk, classifiers usually produce their predictions with some kind of confidence, in general the estimated probability of the predicted class. In those cases, we can select different thresholds $tr$, considering that an instance (document) is assigned to the target class when its associated probability $p$ is greater than (or equal to) a certain threshold $tr$ ($p \geq tr$). In this study we evaluated 5 different settings for the probabilities assigned by each classifier: $p = 1$, $p \geq 0.9$, $p \geq 0.8$, $p \geq 0.7$ and $p \geq 0.6$. Due to space constraints, only the best results obtained with a particular setting are shown.⁶

Table 1 shows the results obtained with a BoW representation and a Naïve Bayes classifier. Those values correspond to the setting where an instance is considered depressive if the classifier assigns to the target/positive class a probability greater than or equal to 0.8 ($p \geq 0.8$). Surprisingly, the best results for all the considered measures are obtained on the first chunk. In this chunk, we can observe that this model only recovers 45% of the depressed individuals. However, this is not the worst aspect. Only 12% of the individuals classified as depressed actually had this condition, which results in a very low $F_1$ measure (0.19). Table 2 shows similar results when a CSA⋆-RF (random forest) combination with $p \geq 0.6$ is used to classify the writings of the individuals. Here, the $F_1$ measure is also low, but we can observe a deterioration in the $ERDE_5$ and $ERDE_{50}$ error values with respect to the previous model.

⁶ All the tables generated for the different probabilities can be downloaded from https://sites.google.com/site/lcagnina/research/Tables_eRisk17.rar
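The threshold rule and the relation between the reported precision, recall and $F_1$ values can be checked with a short Python sketch. The function names and the example probabilities are hypothetical; only the thresholds and the 12%, 45% and 0.19 figures come from the text.

```python
def first_positive_chunk(chunk_probabilities, tr):
    """Return the 1-based index of the first chunk where p >= tr, or None.

    chunk_probabilities: probability assigned to the target class after
                         reading each successive chunk (illustrative input).
    tr: decision threshold, e.g. 0.6, 0.7, 0.8, 0.9 or 1.0.
    """
    for i, p in enumerate(chunk_probabilities, start=1):
        if p >= tr:
            return i
    return None  # never classified as positive


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical probabilities for one individual over three chunks.
probabilities = [0.30, 0.65, 0.85]
decision_at_08 = first_positive_chunk(probabilities, tr=0.8)  # third chunk
decision_at_06 = first_positive_chunk(probabilities, tr=0.6)  # second chunk

# Precision 0.12 and recall 0.45 give an F1 of about 0.19.
f1 = f1_score(0.12, 0.45)
```

A lower threshold triggers the decision earlier, which matters for the ERDE measures because they penalise late positive decisions, at the price of more false positives.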

1. N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. Ph. Kegelmeyer. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res., 16:321-357, 2002.

2. S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the ASIS, 41(6):391-407, 1990.

3. H. Jair Escalante, M. Montes-y-Gómez, L. Villaseñor-Pineda, and M. Errecalde. Early text classification: a naïve solution. In A. Balahur, E. Van der Goot, P. Vossen, and A. Montoyo, editors, Proc. of WASSA@NAACL-HLT 2016, San Diego, California, USA, pages 91-99. The Association for Computer Linguistics, 2016.

4. E. Gabrilovich and S. Markovitch. Wikipedia-based semantic interpretation for natural language processing. JAIR, 34(1):443-498, Mar 2009.

5. G. Inches and F. Crestani. Overview of the international sexual predator identification competition at PAN-2012. In P. Forner, J. Karlgren, and C. Womser-Hacker, editors, CLEF (Online Working Notes/Labs/Workshop), pages 1-12, 2012.

6. M. Lan, Ch. Tan, J. Su, and Y. Lu. Supervised and traditional term weighting methods for automatic text categorization. IEEE TPAMI, 31(4):721-735, 2009.

7. Z. Li, Z. Xiong, Y. Zhang, Ch. Liu, and K. Li. Fast text categorization using concise semantic analysis. Pattern Recogn. Lett., 32(3):441-448, February 2011.

8. A. Pastor López-Monroy, M. Montes-y-Gómez, H. Jair Escalante, L. Villaseñor-Pineda, and E. Stamatatos. Discriminative subprofile-specific representations for author profiling in social media. Knowledge-Based Systems, 89:134-147, 2015.

9. D. E. Losada and F. Crestani. A test collection for research on depression and language use. In Experimental IR Meets Multilinguality, Multimodality, and Interaction - 7th Int. Conf. of the CLEF Association, Portugal, pages 28-39, 2016.