<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <fpage>2</fpage>
      <lpage>13</lpage>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>[...] of potential paedophiles, people with suicidal inclinations, or people susceptible to depression, among others. Situations of this type have started to be studied in a new research field known as early risk detection (ERD), which has received increasing interest from researchers worldwide because of the impact it could have on many relevant, current real-world problems. In this context, the first early risk prediction conference, eRisk 2017, was organized this year as part of the CLEF 2017 Workshop. A pilot task was organized within this event, and the present article describes our participation in this task.</p>
      <p>The eRisk 2017 pilot task was organized into two different stages: training and test. It also assumed an ERD scenario, that is, data are read sequentially as a stream and the challenge consists in detecting risk cases as soon as possible. In order to reproduce this scenario, the eRisk 2017 organizers released the test data set following a sequential, chunk-by-chunk criterion: the first chunk contained the oldest 10% of the messages, the second chunk contained the second oldest 10%, and so forth, up to a total of 10 chunks that represent the full writings of the analysed individuals.</p>
      <p>To deal with this problem we implemented an original proposal that we named temporal variation of terms (TVT) [3]. However, at the training stage TVT seemed to show some weakness at specific chunks, so we decided to complement it with other standard methods to help our approach in these specific situations. Thus, our implemented ERD system is in fact a combined system that strongly depends on TVT but also uses other methods as additional sources of opinions. The resulting system had a very acceptable performance on the data set released in the test stage, obtaining the lowest ERDE<sub>50</sub> error out of a total of 30 submissions from 8 different institutions. However, once the golden truth of the test data set was released, we could verify that TVT alone might have obtained very robust results and the lowest reported ERDE error for both thresholds used in the pilot task. For this reason, we also include an additional section in this work with preliminary results obtained with TVT alone on the test set, in order to observe the potential of this method for general ERD problems.</p>
      <p>The rest of the article is organized as follows: Section 2 describes some general aspects of the data set used in the pilot task and the methods used in our ERD system. Next, Section 3 describes the activities carried out in the training stage and presents the justifications of the main design decisions made on our ERD system. Section 4 shows the results obtained with our proposal on the eRisk 2017 data set released in the test stage. Then, some complementary results are presented in Section 5, which shows interesting aspects of the performance of TVT working alone on the test set. Finally, Section 6 outlines potential future work and the conclusions obtained.</p>
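      <p>As a minimal sketch of the chunk-by-chunk release described above, the following Python fragment splits one user's chronologically ordered writings into 10 chunks of roughly 10% each and then replays them cumulatively, as a system in an ERD scenario would receive them. The names <monospace>split_into_chunks</monospace> and <monospace>replay</monospace>, and the rounding applied when the number of writings is not a multiple of 10, are assumptions made for illustration; the organizers' exact procedure is not specified here.</p>
      <preformat>
```python
from typing import Iterator, List, Sequence, Tuple


def split_into_chunks(writings: Sequence[str], n_chunks: int = 10) -> List[List[str]]:
    """Split writings (oldest first) into n_chunks contiguous, near-equal chunks."""
    if not n_chunks > 0:
        raise ValueError("n_chunks must be a positive integer")
    total = len(writings)
    # Integer boundaries guarantee that every writing falls in exactly one chunk.
    bounds = [i * total // n_chunks for i in range(n_chunks + 1)]
    return [list(writings[bounds[i]:bounds[i + 1]]) for i in range(n_chunks)]


def replay(chunks: List[List[str]]) -> Iterator[Tuple[int, List[str]]]:
    """Yield (chunk number, all writings released so far), oldest chunk first."""
    seen: List[str] = []
    for number, chunk in enumerate(chunks, start=1):
        seen.extend(chunk)
        yield number, list(seen)


if __name__ == "__main__":
    writings = [f"post {i}" for i in range(25)]  # hypothetical user history
    for number, seen in replay(split_into_chunks(writings)):
        print(number, len(seen))
```
      </preformat>
      <p>A classifier evaluated in this setting only ever sees the cumulative list yielded for the current chunk, which is what makes an early decision a trade-off between evidence and delay.</p>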
    </sec>
    <sec id="sec-2">
      <title>2 Data set and methods</title>
      <p>2.1 Data Set</p>
      <p>2.2 Methods</p>
      <p>Document representations</p>
    </sec>
    <sec id="sec-3">
      <p>Bag of Words. The traditional Bag of Words (BoW) representation is one of the language models most used in text categorization tasks. In BoW representations, features are words and documents are simply treated as collections of unordered words. Formally, a document d is represented by the vector of weights d<sub>BoW</sub> = (w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>), where n is the size of the vocabulary of the dataset. Each weight w<sub>i</sub> is a value assigned to each feature (word) according to whether the word appears in the document or how frequently it appears. This popular representation is simple to implement, fast to obtain, and can be used under different weighting schemes: boolean, term frequency (tf), or term frequency - inverse document frequency (tf-idf). In our study we used BoW with the boolean weighting scheme.</p>
      <p>Character 3-grams. An n-gram is a sequence of n characters obtained from each text in the dataset. In the n-gram model, the dictionary contains all n-grams that occur in any term of the vocabulary. Representations based on character n-grams have proved effective in many applications [5], where n-grams play the role that terms play in BoW representations.</p>
      <p>LIWC-based features. Features derived from Linguistic Inquiry and Word Count (LIWC) [11, 10] have been used in several studies related to psychological aspects of individuals. LIWC has been successfully used to distinguish depressed from non-depressed people by analyzing linguistic markers of depression such as the use of personal pronouns and positive and negative emotions [12]; the presence of words related to death (e.g., dead, kill, suicide), sex (e.g., arouse, makeout, orgasm) and ingestion (e.g., chew, drink, hunger), besides emotions, has also proved useful [1]. Studies on suicidal individuals have incorporated LIWC as a valuable tool to extract information related to suicide and suicidal ideation by analyzing the categories death, health, sad, they, I, sexual, filler, swear, anger, and negative emotions [2, 4]. Because we wanted to consider more meaningful features, in preliminary studies we also considered the most informative features belonging to linguistic dimensions (for example, personal pronouns and number of function words), summary language variables (for example, dictionary words and words with more than 6 letters), psychological processes (for example, negative emotions and affective processes), grammar (verbs and numbers) and punctuation (number of apostrophes).</p>
      <p>Learning algorithms. In preliminary studies we tested different learning algorithms and obtained the best results with Random Forest and Naïve Bayes in several comparisons with other popular methods such as LIBSVM. Our system also used a decision tree (Weka's J48), obtained by first selecting the 100 words with the highest information gain and then removing from that list those words considered dependent on specific domains (names of politicians like Obama and countries like China). The J48 algorithm trained on this subset of features produced a decision tree of only 39 nodes containing some interesting words like meds, depression, therapist and cry, among others. As we will see later, this decision tree was used to assist the TVT method in the initial chunks.</p>
      <p>Even though we carried out several comparative studies with LIWC features, character 3-grams, CSA representations and the LIBSVM algorithm, we decided to select only those approaches that seemed effective in assisting TVT in some specific situations. For this reason, our ERD system only used the Random Forest and Naïve Bayes algorithms with BoW and TVT representations, plus the J48 decision tree explained above.</p>
      <p>3 Training Stage</p>
      <p>Because we wanted to focus our predictions on the chunks where the best results were obtained (chunks 3 and 4), and assuming that after those chunks the penalization component in the ERDE computation would affect TVT's results, we decided to set some chunk-by-chunk rules that satisfy certain basic properties in the different chunks:</p>
      <p>1. Chunk 1: Here the ERD system should be extremely conservative and only classify an instance as positive (depressed) if there is strong evidence of that. For this case we classified an instance as positive only if both models obtained with TVT-NB and TVT-RF classified the instance as positive with probability p = 1 and the text included all the words in a white list. That list was obtained from the words with the highest information gain in the documents of the first chunk. It included the words depression and diagnosed.</p>
      <p>2. Chunk 2: Here the white-list restriction could be relaxed and the TVT methods complemented with a more general approach. To this end, we used the predictions of the J48 decision tree explained above and classified an instance as positive if the three classifiers classified it as positive.</p>
      <p>3. Chunk 3: In this chunk most of the classifiers obtained their best precision values. However, they obtained low recall values. To address this aspect, an instance was classified as positive if at least two classifiers classified it as positive with probability p ≥ 0.9.</p>
      <p>Fig. 2. Precision values obtained with TVT models.</p>
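      <p>The rules for the first three chunks can be sketched in Python as follows. This is an illustrative reading of the rules, not the implementation used in the system: the function names, the dictionary keys, the 0.5 default decision threshold used in chunk 2, and the treatment of the J48 output as a probability are all assumptions. The white list below contains only the two words named in the text, since the full list is not given.</p>
      <preformat>
```python
from typing import Dict, Iterable

# Only the two white-list words named in the text; the full list is not given.
WHITE_LIST = frozenset({"depression", "diagnosed"})

# Assumption: a classifier "classifies as positive" when its probability for
# the positive class reaches 0.5; the text does not state this threshold.
DEFAULT_POSITIVE = 0.5


def rule_chunk1(probs: Dict[str, float], words: Iterable[str]) -> bool:
    """Both TVT models certain (p = 1) and every white-list word present."""
    vocabulary = {word.lower() for word in words}
    both_certain = probs["TVT-NB"] >= 1.0 and probs["TVT-RF"] >= 1.0
    return both_certain and WHITE_LIST.issubset(vocabulary)


def rule_chunk2(probs: Dict[str, float]) -> bool:
    """TVT-NB, TVT-RF and the J48 tree must all vote positive."""
    names = ("TVT-NB", "TVT-RF", "J48")
    return all(probs[name] >= DEFAULT_POSITIVE for name in names)


def rule_chunk3(probs: Dict[str, float]) -> bool:
    """At least two classifiers vote positive with probability p >= 0.9."""
    return sum(1 for p in probs.values() if p >= 0.9) >= 2


def classify(chunk: int, probs: Dict[str, float], words: Iterable[str]) -> bool:
    """Dispatch to the rule of the given chunk (only chunks 1 to 3 are covered)."""
    if chunk == 1:
        return rule_chunk1(probs, words)
    if chunk == 2:
        return rule_chunk2(probs)
    if chunk == 3:
        return rule_chunk3(probs)
    raise ValueError("this sketch only covers chunks 1 to 3")
```
      </preformat>
      <p>For example, <monospace>classify(1, {"TVT-NB": 1.0, "TVT-RF": 1.0, "J48": 0.2}, text.split())</monospace> is positive only when the hypothetical <monospace>text</monospace> contains both white-list words, which reflects how conservative the first rule is meant to be.</p>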
    </sec>
    <sec id="sec-4">
      <title>3.1 Analysis of results</title>
      <p>4. Chunk 4: In this chunk BoW obtained the highest recall values but low precision. We also observed that, when it was combined with the TVT method, the latter acted as a filter, giving good results when both methods classified an instance as positive. On the other hand, many instances not detected by this combination were classified as positive by the J48 method. For this reason, any instance classified as positive by both the BoW and TVT methods with probability p ≥ 0.7, or by the J48 method, was classified as positive.</p>
      <p>5. Chunks 5 to 10: From chunk 5 onward, we assumed that most of the relevant classifications had already been made in the previous chunks. However, we kept a monitoring system to identify those cases that needed much more information to determine whether an individual was depressive or not. For this purpose, we used the same rule as in chunk 4 for chunks 5 and 6, but increasing the probability of BoW and TVT to p ≥ 0.8. From chunks 7 to 10 an instance was classified as positive if at least two classifiers classified it as positive with probability p ≥ 0.9.</p>
      <p>
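<table-wrap>
<table>
<thead>
<tr><th>F1</th><th>Precision</th><th>Recall</th></tr>
</thead>
<tbody>
<tr><td>0.24</td><td>0.14</td><td>0.75</td></tr>
<tr><td>0.22</td><td>0.43</td><td>0.14</td></tr>
<tr><td>0.4</td><td>0.47</td><td>0.35</td></tr>
<tr><td>0.4</td><td>0.33</td><td>0.5</td></tr>
<tr><td>0.59</td><td>0.48</td><td>0.75</td></tr>
<tr><td>0.52</td><td>0.42</td><td>0.69</td></tr>
<tr><td>0.42</td><td>0.50</td><td>0.37</td></tr>
<tr><td>0.50</td><td>0.37</td><td>0.75</td></tr>
<tr><td>0.56</td><td>0.54</td><td>0.58</td></tr>
<tr><td>0.54</td><td>0.42</td><td>0.73</td></tr>
<tr><td>0.47</td><td>0.55</td><td>0.40</td></tr>
<tr><td>0.20</td><td>0.67</td><td>0.12</td></tr>
<tr><td>0.55</td><td>0.49</td><td>0.63</td></tr>
<tr><td>0.51</td><td>0.39</td><td>0.75</td></tr>
<tr><td>0.55</td><td>0.50</td><td>0.62</td></tr>
</tbody>
</table>
</table-wrap>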
Our ERD system tested in the pilot task was derived from our analysis of the weaknesses and strengths of the TVT method on the training data. However, once the golden-truth information of the TE<sub>DS</sub> set was made available by the organizers, we could analyze what the performance of the models would have been, in particular the TVT method working alone, if different probabilities had been selected.</p>
      <p>Table 6 shows this type of information by reporting the results obtained with the TVT representation, using Naïve Bayes and Random Forest as learning algorithms. In addition, different probability values were tested for the dynamic approaches to the CTD aspect. The results obtained are conclusive in this case. TVT shows high robustness in the ERDE measures, independently of the algorithm used to learn the model and of the probability threshold. Most of TVT's ERDE<sub>5</sub> values were low and, in 7 out of 10 settings, the ERDE<sub>50</sub> values were lower than the best one reported in the pilot task (the combined methods (UNSLA), with 9.68). In this context, TVT achieves the best ERDE<sub>5</sub> value reported up to now (12.30) with the setting TVT-RF (p ≥ 0.8) and the lowest ERDE<sub>50</sub> value (8.17) with the model TVT-NB (p ≥ 0.8). The performance of the TVT methods on the test set was surprising for us because we could not obtain such good results in the training stage. This leads us to conclude that, if the TVT method had participated alone in the pilot task, it would have obtained similar or better results than the ones obtained with the combined methods.</p>
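      <p>To make the ERDE<sub>5</sub> and ERDE<sub>50</sub> values discussed above easier to interpret, the following Python sketch computes the early risk detection error as defined by Losada and Crestani [8]: a false positive costs c_fp, a false negative costs c_fn, a true negative costs 0, and a true positive costs lc_o(k) * c_tp, where k is the number of writings seen before deciding and lc_o(k) = 1 - 1/(1 + e^(k - o)). The function and parameter names are assumptions made for illustration, as are the default values c_fn = c_tp = 1; the cost values used by the organizers are not restated here, so c_fp is left as a required argument.</p>
      <preformat>
```python
import math
from typing import List, Tuple


def latency_cost(k: int, o: int) -> float:
    """lc_o(k) = 1 - 1 / (1 + exp(k - o)), written so that exp never overflows."""
    x = float(k - o)
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def erde(decision: bool, truth: bool, k: int, o: int,
         c_fp: float, c_fn: float = 1.0, c_tp: float = 1.0) -> float:
    """Error of one decision taken after seeing k writings, for threshold o."""
    if decision and truth:
        return latency_cost(k, o) * c_tp  # late true positives are penalized
    if decision:
        return c_fp                       # false positive
    if truth:
        return c_fn                       # false negative
    return 0.0                            # true negative


def mean_erde(cases: List[Tuple[bool, bool, int]], o: int, c_fp: float) -> float:
    """Average ERDE over (decision, truth, k) triples, one per user."""
    costs = [erde(decision, truth, k, o, c_fp) for decision, truth, k in cases]
    return sum(costs) / len(costs)
```
      </preformat>
      <p>With o = 5 a correct positive decision issued after only a handful of writings already incurs a noticeable cost, whereas o = 50 tolerates longer delays, which is why the two measures can rank the same models differently.</p>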
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. M. R. Capecelatro, M. D. Sacchet, P. F. Hitchcock, S. M. Miller, and W. B. Britton. <article-title>Major depression duration reduces appetitive word use: An elaborated verbal recall of emotional photographs</article-title>. <source>Journal of Psychiatric Research</source>, <volume>47</volume>(<issue>6</issue>):<fpage>809</fpage>-<lpage>815</lpage>, <year>2013</year>.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>2. M. De Choudhury, E. Kiciman, M. Dredze, G. Coppersmith, and M. Kumar. <article-title>Discovering shifts to suicidal ideation from mental health content in social media</article-title>. In <source>Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, CHI '16</source>, pages <fpage>2098</fpage>-<lpage>2110</lpage>, New York, NY, USA, <year>2016</year>. ACM.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>3. M. L. Errecalde, M. P. Villegas, D. G. Funez, M. J. Garciarena Ucelay, and L. C. Cagnina. <article-title>Temporal Variation of Terms as concept space for early risk prediction</article-title>. In G. J. F. Jones, S. Lawless, J. Gonzalo, L. Kelly, L. Goeuriot, T. Mandl, L. Cappellato, and N. Ferro, editors, <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction</source>, volume <volume>10456</volume>. Springer, <year>2017</year>.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>4. A. García-Caballero, J. Jiménez, M. Fernandez-Cabana, and I. García-Lado. <article-title>Last words: an LIWC analysis of suicide notes from Spain</article-title>. <source>European Psych.</source>, <volume>27</volume>:<fpage>1</fpage>, <year>2012</year>.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>5. M. Koppel, J. Schler, and S. Argamon. <article-title>Computational methods in authorship attribution</article-title>. <source>Journal of the American Society for Information Science and Technology</source>, <volume>60</volume>(<issue>1</issue>):<fpage>9</fpage>-<lpage>26</lpage>, <year>2009</year>.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>6. Z. Li, Z. Xiong, Y. Zhang, C. Liu, and K. Li. <article-title>Fast text categorization using concise semantic analysis</article-title>. <source>Pattern Recogn. Lett.</source>, <volume>32</volume>(<issue>3</issue>):<fpage>441</fpage>-<lpage>448</lpage>, <year>February 2011</year>.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>7. A. P. López-Monroy, M. Montes y Gómez, H. J. Escalante, L. Villaseñor-Pineda, and Efstathios Stamatatos. <article-title>Discriminative subprofile-specific representations for author profiling in social media</article-title>. <source>Knowledge-Based Systems</source>, <volume>89</volume>:<fpage>134</fpage>-<lpage>147</lpage>, <year>2015</year>.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>8. D. E. Losada and F. Crestani. <article-title>A Test Collection for Research on Depression and Language Use</article-title>, pages <fpage>28</fpage>-<lpage>39</lpage>. Springer International Publishing, Cham, <year>2016</year>.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>9. D. E. Losada, F. Crestani, and J. Parapar. <article-title>eRISK 2017: CLEF Lab on Early Risk Prediction on the Internet: Experimental foundations</article-title>. In <source>Proceedings Conference and Labs of the Evaluation Forum CLEF 2017</source>, Dublin, Ireland, <year>2017</year>.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>10. J. W. Pennebaker, R. L. Boyd, K. Jordan, and K. Blackburn. <source>The development and psychometric properties of LIWC2015</source>. <year>2015</year>.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>11. J. W. Pennebaker, M. R. Mehl, and K. G. Niederhoffer. <article-title>Psychological aspects of natural language use: Our words, our selves</article-title>. <source>Annual Review of Psychology</source>, <volume>54</volume>(<issue>1</issue>):<fpage>547</fpage>-<lpage>577</lpage>, <year>2003</year>.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>12. N. Ramirez-Esparza, C. K. Chung, E. Kacewicz, and J. W. Pennebaker. <article-title>The psychology of word use in depression forums in English and in Spanish: Testing two text analytic approaches</article-title>. In <source>Proc. ICWSM 2008</source>, <year>2008</year>.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>