<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>UACH-INAOE participation at eRisk2017</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alan A. Farías-Anzaldúa</string-name>
          <email>alan.alexis.fa@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Manuel Montes-y-Gómez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>A. Pastor López-Monroy</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luis C. González-Gurrola</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Instituto Nacional de Astrofísica</institution>
          ,
          <addr-line>Óptica y Electrónica</addr-line>
          ,
          <country country="MX">Mexico</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universidad Autónoma de Chihuahua</institution>
          ,
          <country country="MX">Mexico</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Houston</institution>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Depression is a mental disorder that affects 1 in 10 people worldwide, so it is important to react effectively before fatal decisions are taken. Moreover, with the rise of social media, identifying patterns of depressed users in a timely fashion has become an important task that is attracting the attention of the NLP community. Accordingly, CLEF 2017 launched a challenge to identify depressed users in reddit forums. Our proposal for this task was based on a two-step classification procedure: we first look at the post level to create basic features, which are then applied at the user level to build a profile for each user. To evaluate this methodology, we applied 10-fold cross-validation on the training data, obtaining an F-score of 0.73 in the identification of depressed users. In the CLEF challenge, our team reached fourth place, obtaining an F-score of 0.48.</p>
      </abstract>
      <kwd-group>
        <kwd>natural language processing</kwd>
        <kwd>depression</kwd>
        <kwd>social media</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>According to the National Institute of Mental Health (NIMH), 1 in 10 adults
will exhibit some kind of depression at some point in life [15]. People affected
by depression suffer from feelings of hopelessness and loss of motivation, have sleep
and eating disorders, and feel disconnected from other people. If untreated,
depression can lead to suicide. Psychological criteria for the detection of depression are
presented in the Diagnostic and Statistical Manual of Mental Disorders, Fifth
Edition (DSM-5), and consist of long-term (two or more years) behavioral
analysis performed by a professional. This requires individuals to be diagnosed by
a psychologist, something that many people are not willing or able to do.</p>
      <p>
        Online social platforms are quickly rising in popularity, becoming nearly
ubiquitous in countries where internet access is readily available. These platforms allow
people to share and express their thoughts and feelings freely and publicly with
other people. As might be expected, this information is a rich source for Natural
Language Processing tasks such as Opinion Mining [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], Sentiment Analysis [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
or inferring mental health issues [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Following this last route, in this manuscript
we approach the problem of predicting depressive users based on the analysis of
the posts that they share. The motivation for this task is that if depressive symptoms
are accurately identified in a timely fashion, professionals could intervene
before depression progresses. The approach that we introduce in this manuscript
was submitted to the CLEF eRisk 2017 Challenge (Early Risk Prediction on
the Internet). For this task, we were provided with a dataset [14] built from
reddit posts, grouped by anonymized user and ordered chronologically. reddit
is an online discussion forum that covers a wide variety of topics. Content is
organized into areas of interest called "subreddits", and it is there that threads
are created and commented on.
      </p>
      <p>The proposed approach is based on a two-step classification procedure: the
first step focuses on the analysis of individual posts by their content, whereas
the second step considers the classification of users by their behavior, as expressed
by the kinds of posts they write. The main idea behind our approach consists in modeling
each user post, instead of directly modeling each user, under the assumption
that each post is a window into the mind of a person at a particular point in
time. Then, if we collect enough of these posts, we can observe how that person's
mind changes, or does not, through time, in order to improve the model of each
user. In our first set of experiments, we applied 10-fold cross-validation on the
training data, obtaining an F1 score of 0.73 for predicting depressed individuals
and above 0.90 for non-depressed individuals. For the challenge task, we ended
up in fourth place (out of eight teams) with an F-score of 0.48. Since our
participation in the challenge was late, we only submitted predictions for the last
part of the competition, so we would like to further analyze our model by simulating
its application at different levels of evidence for each user.</p>
      <p>The organization of this paper is as follows. Section 2 presents related
work. Section 3 describes our approach. The experimental settings are
presented in Section 4 and the results in Section 5. Finally, our conclusions
are presented in Section 6.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Social media platforms have become an attractive place to infer health-related
issues for entire communities or individual users. A study presented in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
compared online depression communities against control online communities,
reporting that topics and style markers, used as features, showed good predictive power
to differentiate between users. Two of the most common online platforms that
have been used to mine this kind of data are Facebook and Twitter. On Facebook,
some efforts have been made to identify different contexts of depression. The study
in [13] predicts postpartum depression from shared data, while the study in [12]
focused on evaluating depressive symptoms through an
application developed for Facebook. In [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], the authors ask what public health information
can be learned from Twitter, with depression being
one of the most common ailments that could be associated with a
certain group of words. One of the earliest works to characterize depressed users
on Twitter is [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], where the authors proposed attributes to measure depressive
behavior, such as social engagement, language and emotion. The authors built a
Support Vector Machine that achieved 70% accuracy in predicting depressive users,
even prior to onset. In [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], Park et al. followed a qualitative study to
investigate perception differences between depressed and non-depressed Twitter
users. They found that depressive users were more prone to perceive Twitter as
a tool for social awareness and emotional interaction. More recently, Nadeem et
al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] compiled over 2.5M tweets to identify Major Depressive Disorder (MDD),
with a reported accuracy of 81%. The authors claim that their method could
even help estimate the risk of an individual being depressed. With a good
number of studies focused on detecting depression, new works suggest the possibility
of detecting other types of social phenomena, such as suicidal ideation [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. To the
best of our knowledge, no previous work has used reddit as a platform; the
dataset created by Losada and Crestani [14] is the first attempt to fill this gap.
We use this dataset for our experiments.
      </p>
    </sec>
    <sec id="sec-3">
      <title>The two-step classification approach</title>
      <p>This section presents an overview of the proposed two-step classification
approach. The first step considers the analysis of individual posts by their content.
Its aim is to discriminate between messages produced by depressed and non-depressed
users. Then, a second step analyzes this information with the aim of modeling
the respective users' behaviors based on their category. The two steps are
explained in detail below.</p>
      <sec id="sec-3-1">
        <title>Post Level Analysis</title>
        <p>The main goal of the first step is to classify each post as likely produced by
a depressed or a non-depressed person. For this, we train a classifier on user
posts, considering that all posts inherit the label of their respective author.
More formally, let $D = \{(D_1, y_1), \ldots, (D_n, y_n)\}$ be a training set of labeled
user-documents; that is, $D$ is a collection of $n$ tuples of user-documents $D_i$
(i.e., the posts of a user) and category labels $y_i$, where
$y_i \in \{\text{depressed}, \text{non-depressed}\}$. Also let
$D_i = \{(P_1, y_i), \ldots, (P_h, y_i)\}$ be the set of $h$ posts from user $D_i$. We represent
each post $P_j$ by a feature vector $\mathbf{P}_j$ that combines three different kinds of
features: unigrams, bigrams and trigrams. For each individual feature space we
considered the 10k most frequent items and then selected all the features with
information gain greater than zero. Additionally, we included two extra time
attributes: the hour of the post and a binary attribute indicating whether or not it was
posted on a weekend. Therefore, the final post representation $\mathbf{P}_j$ is expressed as
follows:</p>
        <p>$$\mathbf{P}_j = \langle P_j^{\text{1gram}}, P_j^{\text{2gram}}, P_j^{\text{3gram}}, P_j^{\text{time}}, P_j^{\text{weekend}} \rangle \qquad (1)$$</p>
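        <p>As an illustration of the post representation in Equation (1), the following minimal sketch builds a sparse feature dictionary for a single post. It is not the code used in our experiments: the function names, the key prefixes and the whitespace tokenization are assumptions made for the example, and the selection of the 10k most frequent items and the information gain filter are omitted.</p>
        <preformat>
```python
from collections import Counter
from datetime import datetime


def word_ngrams(tokens, n):
    """Return the word n-grams of a token list, each joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def post_features(text, timestamp):
    """Build a sparse feature dict for one post (hypothetical helper).

    N-gram counts are stored under keys prefixed by their feature space;
    the hour and the weekend flag are the two time attributes.
    """
    tokens = text.split()  # assumption: whitespace tokenization
    features = Counter()
    for n in (1, 2, 3):
        for gram in word_ngrams(tokens, n):
            features[f"{n}gram={gram}"] += 1
    features["time_hour"] = timestamp.hour
    # datetime.weekday(): Monday is 0, so Saturday and Sunday are 5 and 6.
    features["is_weekend"] = int(timestamp.weekday() >= 5)
    return dict(features)


example = post_features("i feel so alone", datetime(2017, 5, 20, 23, 15))
```
        </preformat>
        <p>In this example the post was written on a Saturday at 23:15, so the weekend flag is set and the hour attribute is 23.</p>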
      </sec>
      <sec id="sec-3-2">
        <title>User Level Analysis</title>
        <p>The intuition behind analyzing posts individually is that users can be
understood in terms of their behavior, rather than by their vocabulary. Hence, in
this second step we build the user representation from statistics
obtained in the first classification step.</p>
        <p>Given a document $D_i$, we first classify all its posts using the previous classifier
(refer to Section 3.1). Then, we model $D_i$ as the sequence of predicted labels
for each post $P_j \in D_i$, that is, $D_i' = (y_1, \ldots, y_h)$, where $y_j$ corresponds to the
predicted label of post $P_j$. From this sequence of labels, we represent each user by
a 12-dimensional vector $\mathbf{D}_i$ that includes the features described below. For simplicity,
we refer to a post as "depressive" (or "non-depressive") depending on whether
it was generated by a depressed or a non-depressed user.
1. Percentage of posts that were classified as "depressive".
2. Percentage of posts that were classified as "non-depressive".
3. Percentage of times that a "non-depressive" post is followed by another
"non-depressive" post.
4. Percentage of times that a "non-depressive" post is followed by a
"depressive" post.
5. Percentage of times that a "depressive" post is followed by a "non-depressive"
post.
6. Percentage of times that a "depressive" post is followed by another
"depressive" post.
7. Percentage of "depressive" posts written in the morning.
8. Percentage of "non-depressive" posts written in the morning.
9. Percentage of "depressive" posts written in the evening.
10. Percentage of "non-depressive" posts written in the evening.
11. Percentage of "depressive" posts written at night.
12. Percentage of "non-depressive" posts written at night.</p>
        <p>The time periods were defined after analyzing heatmaps of users' activity (see
Figure 1). Accordingly, the periods that we consider are i) morning [6am-2pm], ii)
evening [2pm-10pm], and iii) night [10pm-6am].</p>
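        <p>The 12 user-level features can be sketched as follows. This is an illustrative sketch under stated assumptions rather than our exact implementation: the text does not fix the denominators, so here the label and time-period features are fractions of all posts of the user, and the transition features are fractions of all consecutive post pairs. The period boundaries are treated as half-open intervals (2pm counts as evening, 10pm as night), and all function names are hypothetical.</p>
        <preformat>
```python
DEP, NON = "depressive", "non-depressive"


def period_of_day(hour):
    """Map an hour (0-23) to morning, evening or night."""
    if hour in range(6, 14):
        return "morning"
    if hour in range(14, 22):
        return "evening"
    return "night"


def ratio(part, whole):
    """Return part / whole, or 0.0 when there is nothing to divide by."""
    return part / whole if whole else 0.0


def user_vector(labels, hours):
    """Build the 12 features from chronological post labels and posting hours."""
    n = len(labels)
    # Features 1-2: share of posts with each predicted label.
    vec = [ratio(labels.count(DEP), n), ratio(labels.count(NON), n)]
    # Features 3-6: label transitions between consecutive posts.
    pairs = list(zip(labels, labels[1:]))
    for pair in ((NON, NON), (NON, DEP), (DEP, NON), (DEP, DEP)):
        vec.append(ratio(pairs.count(pair), len(pairs)))
    # Features 7-12: each label within each time period.
    tagged = [(label, period_of_day(hour)) for label, hour in zip(labels, hours)]
    for period in ("morning", "evening", "night"):
        for label in (DEP, NON):
            vec.append(ratio(tagged.count((label, period)), n))
    return vec


vector = user_vector([DEP, DEP, NON, NON], [7, 23, 15, 3])
```
        </preformat>
        <p>The resulting vector follows the order of the feature list above, so it can be passed directly to the user-level classifier.</p>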
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experimental Settings</title>
      <p>Dataset. For the evaluation we use the dataset presented in [14]. This dataset
is in XML format, where each user has a list of posts, and each post contains
the following fields: title, date, info, and text. The info field simply contained the
string "reddit post" and was discarded because it provides no relevant
information. The title and text fields both contained useful information; they were merged
into a single "text" field.</p>
      <p>Preprocessing. Uppercase letters were converted to lowercase, numbers were
removed, and negations were joined to the following word (e.g., "don't want"
becomes a single token).</p>
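      <p>A minimal sketch of these preprocessing steps is given below. The list of negation words and the underscore used to join a negation to the next word are assumptions made for the example; the text above does not specify either.</p>
      <preformat>
```python
import re

# Assumption: an illustrative, incomplete list of negation words.
NEGATIONS = {"not", "no", "never", "don't", "can't", "won't", "didn't", "isn't"}


def preprocess(text, joiner="_"):
    """Lowercase, drop digits, and merge each negation with the next word."""
    text = re.sub(r"\d+", "", text.lower())
    tokens = text.split()
    merged = []
    i = 0
    while i != len(tokens):
        # Merge only when a following word exists.
        if tokens[i] in NEGATIONS and i + 1 != len(tokens):
            merged.append(tokens[i] + joiner + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


tokens = preprocess("I DON'T want 2 go")
```
      </preformat>
      <p>Joining the negation to its neighbor lets the unigram features distinguish "want" from its negated form.</p>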
      <p>Classification. For post and user classification we used a Naïve Bayes classifier
(NBC). For validation we applied stratified 10-fold cross-validation.</p>
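      <p>Stratification keeps the proportion of depressed and non-depressed users roughly equal in every fold. The sketch below shows one simple way to obtain such folds with a deterministic round-robin assignment per class; it is an illustration only, the function name is hypothetical, and in practice a library routine such as scikit-learn's StratifiedKFold would normally be used, with shuffling. The classifier itself is not shown.</p>
      <preformat>
```python
from collections import defaultdict


def stratified_folds(labels, k=10):
    """Split example indices into k folds with similar class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_class):
        # Deal the examples of each class to the folds in turn.
        for position, index in enumerate(by_class[label]):
            folds[(offset + position) % k].append(index)
        # Continue where the previous class stopped to balance fold sizes.
        offset += len(by_class[label])
    return folds


folds = stratified_folds(["depressed"] * 20 + ["non-depressed"] * 80, k=10)
```
      </preformat>
      <p>Each fold is used once as the test set while the remaining nine folds form the training set.</p>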
    </sec>
    <sec id="sec-5">
      <title>Results</title>
      <p>To evaluate the appropriateness of the proposed approach for identifying depressed
users in online forums, we compared its performance against a traditional text
classification approach using a bag-of-words (BOW) representation. The first part of Table 1
shows the results of this comparison; they indicate that the two-step
classification approach outperformed the baseline by more than 40%. To
evaluate the usefulness of individual components of the proposed approach, we
carried out experiments with some of them disabled. In particular, we ran one
experiment without the time information in the post representation
and another using only word unigrams. The last two rows of Table 1 present the
results of these two experiments; they clearly show that word sequences and
time information are key elements in the classification of depressed posts and
therefore in the identification of depressed users.</p>
      <p>An analysis of feature relevance showed that the most
discriminative n-grams were those related to the health condition of the users, such
as depression, diagnosis and therapy; this result was somewhat expected given
the nature of the corpus. However, other n-grams related to feelings, news and
interpersonal relationships were also highly discriminative for both kinds of users.
Table 2 shows the 40 most discriminative word n-grams.</p>
      <p>Table 2. The 40 most discriminative word n-grams:
"depression", "pope", "to talk to", "my depression", "miserable", "conservative",
"depression ,", "global", "depresesion and", "therapist", "gop", "obama",
"depression .", "anxiety", "suicidal", "meds", "$ billion", "not alone",
"diagnosed", "my life", "myself", "depressed", "when i was", "conspiracy",
"diagnosed with", "because i have", "boyfriend", "feel like i",
"hollywood", "iraq", "like i", "report", "president", "cnn",
"makes me feel", "of my life", "isis", "me feel", "vote", "helped me".</p>
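      <p>The discriminative power of an n-gram can be quantified with information gain, the criterion used for feature selection in Section 3.1. The following sketch computes it for a binary feature (whether the n-gram occurs in a post) as the class entropy minus the expected class entropy after splitting on the feature. The function names and the toy data are assumptions made for the example.</p>
      <preformat>
```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(has_feature, labels):
    """Class entropy minus the expected entropy after splitting on the feature."""
    gain = entropy(labels)
    for value in (True, False):
        subset = [y for f, y in zip(has_feature, labels) if f == value]
        if subset:
            gain -= len(subset) / len(labels) * entropy(subset)
    return gain


classes = ["depressed", "depressed", "control", "control"]
perfect = information_gain([True, True, False, False], classes)
useless = information_gain([True, False, True, False], classes)
```
      </preformat>
      <p>A feature that separates the two classes perfectly attains the full class entropy, whereas a feature that is independent of the class has zero gain and would be discarded by the selection step.</p>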
      <p>We also analyzed the posting behavior of the users. The heatmaps in Figure
1 show a clear difference in the posting behavior of the two kinds of
users, particularly on weekends. We expected depressed users to be more active
at night, as people with depression often suffer from insomnia or have poor
sleeping habits overall, but the evidence showed the contrary:
non-depressed users were more active overall, except on weekends.</p>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>In this paper we presented our approach to identify depression through online
publications, which is based on a two-step analysis. The analysis of individual
posts to characterize users' behavior proved very useful in a second step that analyzes
the users from a higher-level point of view. In this work, the general behavior of each
user is captured by observing the posting history, especially the posting
behavior and the time period of posting.</p>
      <p>We presented experimental results showing that our method is useful
compared with other methodologies. For this, we studied the user-documents from
different perspectives, including individual posts and the time period of posting. We found
that both kinds of users can be differentiated by exploiting the time of
publication and, more importantly, the changes in their behavior related to depressed
and non-depressed posts. We also found that the two kinds of users show different
activity during the weekend and during the day, which is valuable for future
research.</p>
      <p>Future work includes developing an early risk assessment strategy. We
joined the contest late, and thus did not develop a
strategy for early risk assessment.</p>
      <p>Acknowledgements. This work was supported by CONACYT under
scholarship 735623/599448 and project grant PDCPN-2014-01-247870.</p>
      <p>12. Park, S., Lee, S. W., Kwak, J., Cha, M., Jeong, B.: Activities on Facebook
reveal the depressive state of users. J. Med. Internet Res. 15, e217 (2013). doi:10.2196/jmir.2718</p>
      <p>13. De Choudhury, M., Counts, S., Horvitz, E. J., Hoff, A.: Characterizing and predicting
postpartum depression from shared Facebook data. In: ACM Computer Supported
Cooperative Work 2014, February 15-19, 2014, Baltimore, MD, pp. 626-638 (2014)</p>
      <p>14. Losada, D. E., Crestani, F.: A test collection for research on
depression and language use. In: CLEF (2016)</p>
      <p>15. National Institute of Mental Health: Depression Research (1999)</p>
      <p>16. Michalek, R., Tarantello, G.: Subharmonic solutions with prescribed minimal
period for nonautonomous Hamiltonian systems. J. Diff. Eq. 72, 28-55 (1988)</p>
      <p>17. Tarantello, G.: Subharmonic solutions for Hamiltonian systems via a $\mathbb{Z}_p$
pseudoindex theory. Annali di Matematica Pura (to appear)</p>
      <p>18. Rabinowitz, P.: On subharmonic solutions of a Hamiltonian system. Comm. Pure
Appl. Math. 33, 609-633 (1980)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Clarke</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ekeland</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Nonlinear oscillations and boundary-value problems for Hamiltonian systems</article-title>
          .
          <source>Arch. Rat. Mech. Anal</source>
          .
          <volume>78</volume>
          ,
          <fpage>315</fpage>
          -
          <lpage>333</lpage>
          (
          <year>1982</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Clarke</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ekeland</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Solutions périodiques, du période donnée, des équations hamiltoniennes</article-title>
          .
          <source>Note CRAS Paris 287, 1013-1015</source>
          (
          <year>1978</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Choudhury</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gamon</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Counts</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Horvitz</surname>
            ,
            <given-names>E:</given-names>
          </string-name>
          <article-title>Predicting depression via social media</article-title>
          .
          <source>Proceedings of the 7th International AAAI Conference on Weblogs and Social Media</source>
          , Boston, MA, July 8-11
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Paul</surname>
            <given-names>MJ</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dredze</surname>
            <given-names>M.:</given-names>
          </string-name>
          <article-title>You are what you Tweet: Analyzing Twitter for public health</article-title>
          .
          <source>ICWSM. 2011 Jul</source>
          <volume>17</volume>
          ;
          <fpage>20</fpage>
          :
          <fpage>265</fpage>
          -
          <lpage>72</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Nguyen</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Phung</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dao</surname>
            ,
            <given-names>B</given-names>
          </string-name>
          and
          <string-name>
            <surname>Venkatesh</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Berk</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Affective and content analysis of online depression communities</article-title>
          .
          <source>IEEE Transactions on Affective Computing</source>
          ,
          <volume>5</volume>
          (
          <issue>3</issue>
          ):
          <fpage>217</fpage>
          -
          <lpage>226</lpage>
          . (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Coppersmith</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dredze</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harman</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Hollingshead</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          (
          <year>2015</year>
          ).
          <article-title>From ADHD to SAD: Analyzing the language of mental health on Twitter through self-reported diagnoses</article-title>
          .
          <source>NAACL HLT</source>
          <year>2015</year>
          ,
          <article-title>1</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Nadeem</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Horn</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Coppersmith</surname>
            <given-names>G</given-names>
          </string-name>
          (
          <year>2016</year>
          )
          <article-title>Identifying depression on Twitter</article-title>
          .
          <source>arXiv:1607.07384 [cs, stat]</source>
          . Available at http://arxiv.org/abs/1607.07384. Accessed on May 23,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Kouloumpis</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wilson</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Moore</surname>
            ,
            <given-names>J. D.</given-names>
          </string-name>
          (
          <year>2011</year>
          ).
          <article-title>Twitter sentiment analysis: The good the bad and the OMG!</article-title>
          .
          <source>ICWSM</source>
          ,
          <volume>11</volume>
          (
          <fpage>538</fpage>
          -
          <lpage>541</lpage>
          ),
          <fpage>164</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Park</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McDonald</surname>
            <given-names>DW</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cha</surname>
            <given-names>M.</given-names>
          </string-name>
          <article-title>Perception differences between the depressed and non-depressed users in Twitter</article-title>
          .
          <source>Seventh International AAAI Conference on Weblogs and Social Media; July 8-11</source>
          ,
          <year>2013</year>
          ; Massachusetts, USA.
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>G. B.</given-names>
            <surname>Colombo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Burnap</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hodorog</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Scourfield</surname>
          </string-name>
          .
          <article-title>Analysing the connectivity and communication of suicidal users on twitter</article-title>
          .
          <source>Computer Communications</source>
          ,
          <volume>73</volume>
          :
          <fpage>291</fpage>
          -
          <lpage>300</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Hu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          (
          <year>2004</year>
          , July).
          <article-title>Mining opinion features in customer reviews</article-title>
          .
          <source>In AAAI</source>
          (Vol.
          <volume>4</volume>
          , No.
          <issue>4</issue>
          , pp.
          <fpage>755</fpage>
          -
          <lpage>760</lpage>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>