<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Idriss Abdou Malam</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mohamed Arziki</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mohammed Nezar Bellazrak</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Farah Benamara</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Assafa El Kaidi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bouchra Es-Saghir</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zhaolong He</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mouad Housni</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Veronique Moriceau</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Josiane Mothe</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Faneva Ramiandrisoa</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>(1) IRIT, UMR5505, CNRS &amp; ENSEEIHT</institution>
          ,
          <addr-line>France</addr-line>
          ;
          <institution>(2) IRIT, UMR5505, CNRS &amp; Univ. Toulouse</institution>
          ,
          <addr-line>France</addr-line>
          ;
          <institution>(3) LIMSI, CNRS, Univ. Paris-Sud, Université Paris-Saclay</institution>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we present the method we developed when participating in the e-Risk pilot task. We use machine learning to address the early detection of depressive users in social media, relying on various features that we detail in this paper. We submitted 4 models, whose differences are also detailed. The best results were obtained when using a combination of lexical and statistical features.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The WHO (World Health Organization) reports that "the number of people
suffering from depression and/or anxiety increased by almost 50% from 416
million to 615 million" from 1990 to 2013<sup>1</sup>. The Depression and Bipolar Support
Alliance also estimates that "major depressive disorder affects approximately
14.8 million American adults" and that the "annual toll on U.S. businesses amounts to
about $70 billion in medical expenditures, lost productivity and other costs"
(http://www.dbsalliance.org).</p>
      <p>
        Depression detection is crucial and many studies are devoted to this challenge
[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. While there are clinical factors that can help with the early detection of patients
at risk for depression [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], in this paper we present our approach to early
depression detection from social media analysis, as part of our participation in
the CLEF e-Risk 2017 pilot task [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        Recent related work focuses on the analysis of people's communication and social media
posts to detect depression. The study by Rude et al. shows that depressed people tend to
use the personal pronoun "I" more intensively than others [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Other features
have also been noticed. For example, De Choudhury et al. noticed that
depressive people show less activity during the day and more activity during
the night [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Schwartz et al. reported that depressive people tend to use swear
words and talk more about the past [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>These previous studies show that some cues and features extracted from
social media posts can be related to depression. In this paper, we report our
investigations on using various features in order to answer the e-Risk challenge
as described in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The e-Risk pilot task aims to detect a depressive person as
soon as possible by analysing his or her posts on Reddit<sup>2</sup>, which are provided as a
simulated data flow.
        <fn id="fn1"><label>1</label><p>http://www.who.int/mediacentre/news/releases/2016/depression-anxiety-treatment/fr/</p></fn>
      </p>
      <p>In our submitted runs, the features we used to characterize posts are of two
types: lexicon-based features (extracted using the NLTK toolkit<sup>3</sup>) and numerical features.
These features are used in a machine learning method using Weka.</p>
      <p>The remainder of this paper is organized as follows: Section 2 provides an
overview of the model we used. Section 3 details the different features we
implemented to train different models. In Section 4 we detail the 4 runs we
submitted and the underlying models, and present the results. In Section 5 we
discuss the results and outline future work.</p>
    </sec>
    <sec id="sec-2">
      <title>Model overview</title>
      <p>
        <fn id="fn2"><label>2</label><p>https://www.reddit.com/</p></fn>
        <fn id="fn3"><label>3</label><p>NLTK is a platform for building Python programs for natural language
processing that interfaces easily with text processing and machine learning libraries
(www.nltk.org).</p></fn>
      </p>
      <p>The model is composed of three modules. In the first one, we pre-process
the XML files that contain the users' posts. The second module
extracts the features. Note that while some features capture information from any
textual part, others focus either on the Title part (corresponding to the initial
post) or on the Text part, which corresponds to comments on the initial post.
The feature extraction module is extensible: while we developed some features,
new features can easily be added. Then, in the formatting module, we select a
subset of the features to be used in the model.</p>
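      <p>The extensibility of the feature extraction module and the subset selection of the formatting module can be illustrated with a minimal Python sketch. It assumes a registry of per-text feature functions; the names FEATURES, register and extract, as well as the small self-reference word list, are hypothetical illustrations and are not taken from the actual implementation.</p>
      <preformat preformat-type="code">
```python
from typing import Callable, Dict, List

# Hypothetical registry: feature name -> function scoring one piece of text.
FEATURES: Dict[str, Callable[[str], float]] = {}


def register(name: str):
    """Decorator that adds a feature function to the registry."""
    def wrap(func: Callable[[str], float]) -> Callable[[str], float]:
        FEATURES[name] = func
        return func
    return wrap


# Illustrative word list only; not the lexicon used in the paper.
SELF_WORDS = {"i", "me", "my", "mine", "myself"}


@register("self_reference")
def self_reference(text: str) -> float:
    """Share of whitespace-separated tokens that are self-reference words."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(t in SELF_WORDS for t in tokens) / len(tokens)


@register("word_count")
def word_count(text: str) -> float:
    """Number of whitespace-separated tokens."""
    return float(len(text.split()))


def extract(text: str, selected: List[str]) -> Dict[str, float]:
    """Formatting step: compute only the selected subset of features."""
    return {name: FEATURES[name](text) for name in selected}
```
      </preformat>
      <p>Under these assumptions, adding a new feature only requires writing one decorated function, and a model is defined by the list of feature names passed to extract.</p>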
    </sec>
    <sec id="sec-3">
      <title>Features and models</title>
      <p>We developed different types of features. Some have a linguistic foundation while
others are more statistically based. We distinguish lexicon-based features from
other numerical features.</p>
      <p>
        For lexicon-based features, we rely either on previous observations of
depressive subjects' behaviour [
        <xref ref-type="bibr" rid="ref11 ref3">3, 11</xref>
        ] or on hypotheses that we wanted to evaluate.
      </p>
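      <p>As an illustration of one lexicon-based feature, the Sentiment feature of Table 1 turns a polarity score into a label with the thresholds -0.05 and 0.05. The sketch below assumes that the score is a single number such as the compound score returned by a Vader analyser; the function name polarity_label is hypothetical.</p>
      <preformat preformat-type="code">
```python
def polarity_label(score: float) -> str:
    """Map a polarity score to a label using the thresholds given in Table 1.

    Assumption: `score` is one number per post, e.g. a Vader compound score.
    """
    if score > 0.05:
        return "positive"
    if -0.05 > score:
        return "negative"
    return "neutral"
```
      </preformat>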
      <p>Features are calculated for each user as follows: we first calculate the feature
value for each of his or her posts or comments, then we average the value over his
or her posts in the chunk; when several chunks are used, we average the feature
values obtained for each chunk for the considered user.</p>
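      <p>The two-level averaging described above can be sketched as follows. This is a minimal illustration, assuming that a user is represented as a list of chunks, each chunk being a list of post texts, and that chunks without posts are skipped (the text does not specify how empty chunks are handled). The name user_feature is hypothetical.</p>
      <preformat preformat-type="code">
```python
from statistics import mean
from typing import Callable, List


def user_feature(chunks: List[List[str]], feature: Callable[[str], float]) -> float:
    """Average a per-post feature within each chunk, then across chunks."""
    # Assumption: chunks with no posts for this user are ignored.
    chunk_means = [mean(feature(post) for post in chunk) for chunk in chunks if chunk]
    return mean(chunk_means) if chunk_means else 0.0


# Example with a word-count feature: the chunk means are 3.0 and 1.0,
# so the value for the user is 2.0.
chunks = [["a b", "a b c d"], ["a"]]
value = user_feature(chunks, lambda post: float(len(post.split())))
```
      </preformat>
      <p>Note that this average of chunk averages differs from a single average over all posts (which would be 7/3 in the example) whenever chunks contain different numbers of posts.</p>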
      <p>
        We also used some other numerical features that are described in Table 2.
The details of the features are described in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>We submitted 4 runs corresponding to 4 models. The features that were
used for each model are listed in Table 3. While we used model GPLA to start
with, the other models were introduced later on. The second column of Table 3
indicates the chunk number when each model was introduced. The 4 runs
corresponding to our 4 models were performed with the Random Forest learning
algorithm under the Weka platform using the default parameters.</p>
      <p>In order to decide whether to issue a decision for a subject or wait for more
chunks, we used the prediction confidence rate that Weka generates for each
prediction. We set a threshold (estimated using samples of depressive subjects)
and only issued decisions whose prediction confidence exceeded the
selected threshold. The evolution of the threshold for each model through the
runs and according to the chunks can be tracked using Table 4; a threshold of
0.5 basically means that all predictions are considered.</p>
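      <p>The decision rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code used for the runs: the function name decide is hypothetical, and the comparison is written as "greater than or equal" so that, for a two-class problem where the winning class always has a confidence of at least 0.5, a threshold of 0.5 lets every prediction through.</p>
      <preformat preformat-type="code">
```python
from typing import Optional


def decide(label: str, confidence: float, threshold: float) -> Optional[str]:
    """Return the predicted label if it is confident enough, else None.

    None means that no decision is issued yet and the system waits
    for the next chunk.
    """
    if confidence >= threshold:
        return label
    return None
```
      </preformat>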
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption>
          <p>Details of the features based on lexicons.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Num</th><th>Name</th><th>Hypothesis or tool/resource used</th></tr>
          </thead>
          <tbody>
            <tr><td>1</td><td>Self-Reference</td><td>High frequency of self-reference words.</td></tr>
            <tr><td>2</td><td>Over generalization</td><td>Depressive users use words like "everyone", "everywhere" and "everything" a lot.</td></tr>
            <tr><td>3</td><td>Sentiment</td><td>Vader analyser [<xref ref-type="bibr" rid="ref4">4</xref>] used to assign a polarity score to users' posts: negative if the score is &lt; -0.05, positive if it is &gt; 0.05, neutral otherwise.</td></tr>
            <tr><td>4</td><td>Emotion</td><td>High frequency of emotionally negative words. WordNet-Affect [12] is used to assign a label to each word (Negative, Positive or Ambiguous); we then calculate the frequency of each category.</td></tr>
            <tr><td>5</td><td>Past words</td><td>High frequency of past words.</td></tr>
            <tr><td>6</td><td>Specific verbs</td><td>High frequency of "were", "was", "like", "have" and "being".</td></tr>
            <tr><td>7</td><td>Targeted "I"</td><td>Depressive people tend to target themselves more in subjective contexts, especially using adjectives.</td></tr>
            <tr><td>8</td><td>Negative words</td><td>High frequency of negative words. SentiWordNet [<xref ref-type="bibr" rid="ref1">1</xref>] is used to detect negative words in texts.</td></tr>
            <tr><td>9</td><td>Part-Of-Speech frequency</td><td>Higher usage of verbs and adverbs and lower usage of nouns.</td></tr>
            <tr><td>10</td><td>Relevant 3-grams</td><td>Higher frequency of 3-grams described by Colombo et al. [<xref ref-type="bibr" rid="ref2">2</xref>] and suggested ones.</td></tr>
            <tr><td>11</td><td>Relevant 5-grams</td><td>Higher frequency of 5-grams described by Colombo et al. [<xref ref-type="bibr" rid="ref2">2</xref>] and suggested ones.</td></tr>
            <tr><td>12</td><td>Depression symptoms &amp; related drugs</td><td>From De Choudhury et al. [<xref ref-type="bibr" rid="ref3">3</xref>] and Wikipedia list<sup>4</sup>.</td></tr>
            <tr><td>13</td><td>Relevant 1-grams</td><td>Higher frequency of 1-grams described by Colombo et al. [<xref ref-type="bibr" rid="ref2">2</xref>] and suggested ones.</td></tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <p>
        The evaluation takes into account not only the correctness of the output of the
system (i.e. whether or not the user is depressed) but also the delay taken to
emit its decision. To this end, the ERDE (Early Risk Detection Error) metric
proposed in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] is used. This measure rewards early alerts and the delay taken by
the system to make its decision is measured by counting the number of distinct
textual items seen before giving the answer.
      </p>
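      <p>To make the role of the delay concrete, the following Python sketch follows the definition of ERDE given in [<xref ref-type="bibr" rid="ref5">5</xref>] as we understand it: false positives and false negatives receive fixed costs, true negatives cost nothing, and a true positive is charged a latency cost that grows from 0 to 1 as the number k of items seen passes the parameter o. The function and parameter names are hypothetical, and the cost values are left as parameters because they are set by the task organizers rather than by this paper.</p>
      <preformat preformat-type="code">
```python
import math


def latency_cost(k: int, o: int) -> float:
    """Latency cost lc_o(k) = 1 - 1 / (1 + exp(k - o)).

    Written in the equivalent form 1 / (1 + exp(o - k)) so that a large
    delay k does not overflow the exponential.
    """
    return 1.0 / (1.0 + math.exp(o - k))


def erde(decision: bool, truth: bool, k: int, o: int,
         c_fp: float, c_fn: float = 1.0, c_tp: float = 1.0) -> float:
    """Error for one user: decision and truth are True for 'depressed'.

    k is the number of textual items seen before the decision was emitted.
    """
    if decision and not truth:
        return c_fp                       # false positive
    if truth and not decision:
        return c_fn                       # false negative
    if decision and truth:
        return latency_cost(k, o) * c_tp  # true positive, penalized by delay
    return 0.0                            # true negative
```
      </preformat>
      <p>With o = 5, for instance, a correct alert emitted after exactly 5 items is charged half of the true-positive cost, and a correct alert emitted much later costs almost as much as a miss.</p>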
      <p>Our best results in terms of the ERDE measures are obtained with model
GPLC, which uses neither POS results nor the most frequent n-grams. Including
them in the model slightly improves the F1 measure, mainly because of higher recall
(0.60 against 0.50) (see Table 5). Our run GPLB had the 2nd best recall (0.83)
across participants and GPLA the 5th.</p>
      <table-wrap id="tab2">
        <label>Table 2</label>
        <caption>
          <p>Details of the other numerical features.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Num</th><th>Name</th><th>Hypothesis or tool/resource used</th></tr>
          </thead>
          <tbody>
            <tr><td>14</td><td>Variation of the number of posts</td><td>For depressive people, the variation of the number of posts is generally small.</td></tr>
            <tr><td>15</td><td>Average number of posts</td><td>Depressive users have a much lower number of posts.</td></tr>
            <tr><td>16</td><td>Average number of words per post</td><td>The two groups of users have different means.</td></tr>
            <tr><td>17</td><td>Minimum number of posts</td><td>Depressive users have a lower value in general.</td></tr>
            <tr><td>18</td><td>Variation of the number of comments</td><td>For depressive people, the variation of the number of comments is generally small.</td></tr>
            <tr><td>19</td><td>Average number of comments</td><td>Depressive users have a much lower number of comments.</td></tr>
            <tr><td>20</td><td>Average number of words per comment</td><td>The two groups of users have different variances.</td></tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-5">
      <title>Conclusion and Future Work</title>
      <p>In the runs we submitted we consider 19 features. However, some additional
features are worth studying. In future work, we aim at considering temporal
features such as the date of the posts, the part of the day, etc. Moreover, we would
like to modify the way features are calculated: in the case of lexicon-based
features, each lexicon item would be a distinct feature. In this way, we would
obtain a richer representation of each user and potentially better detection.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. <string-name><given-names>S.</given-names> <surname>Baccianella</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Esuli</surname></string-name>, and <string-name><given-names>F.</given-names> <surname>Sebastiani</surname></string-name>.
          <article-title>SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining</article-title>.
          <institution>Istituto di Scienza e Tecnologie dell'Informazione, Consiglio Nazionale delle Ricerche</institution>, Via Giuseppe Moruzzi 1, 56124 Pisa, Italy, <year>2010</year>.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2. <string-name><given-names>G. B.</given-names> <surname>Colombo</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Burnap</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Hodorog</surname></string-name>, and <string-name><given-names>J.</given-names> <surname>Scourfield</surname></string-name>.
          <article-title>Analysing the connectivity and communication of suicidal users on Twitter</article-title>.
          <source>Computer Communications</source>, <year>2015</year>.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. <string-name><given-names>M.</given-names> <surname>De Choudhury</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Gamon</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Counts</surname></string-name>, and <string-name><given-names>E.</given-names> <surname>Horvitz</surname></string-name>.
          <article-title>Predicting depression via social media</article-title>.
          In <source>ICWSM</source>, page 2, <year>2013</year>.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4. <string-name><given-names>C.</given-names> <surname>Hutto</surname></string-name> and <string-name><given-names>E.</given-names> <surname>Gilbert</surname></string-name>.
          <article-title>VADER: A parsimonious rule-based model for sentiment analysis of social media text</article-title>.
          <source>Eighth International Conference on Weblogs and Social Media (ICWSM-14)</source>. Ann Arbor, MI, June <year>2014</year>.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5. <string-name><given-names>D. E.</given-names> <surname>Losada</surname></string-name> and <string-name><given-names>F.</given-names> <surname>Crestani</surname></string-name>.
          <article-title>A test collection for research on depression and language use</article-title>.
          In <source>Conference and Labs of the Evaluation Forum</source>, page 12. Springer, <year>2016</year>.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6. <string-name><given-names>D. E.</given-names> <surname>Losada</surname></string-name>, <string-name><given-names>F.</given-names> <surname>Crestani</surname></string-name>, and <string-name><given-names>J.</given-names> <surname>Parapar</surname></string-name>.
          <article-title>eRISK 2017: CLEF Lab on Early Risk Prediction on the Internet: Experimental foundations</article-title>.
          In <source>Proceedings Conference and Labs of the Evaluation Forum CLEF 2017</source>, Dublin, Ireland, <year>2017</year>.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. <string-name><given-names>L.-S. A.</given-names> <surname>Low</surname></string-name>, <string-name><given-names>N. C.</given-names> <surname>Maddage</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Lech</surname></string-name>, <string-name><given-names>L. B.</given-names> <surname>Sheeber</surname></string-name>, and <string-name><given-names>N. B.</given-names> <surname>Allen</surname></string-name>.
          <article-title>Detection of clinical depression in adolescents' speech during family interactions</article-title>.
          <source>IEEE Transactions on Biomedical Engineering</source>, <volume>58</volume>(<issue>3</issue>):<fpage>574</fpage>-<lpage>586</lpage>, <year>2011</year>.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8. <string-name><given-names>I. A.</given-names> <surname>Malam</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Arziki</surname></string-name>, <string-name><given-names>M. N.</given-names> <surname>Bellazrak</surname></string-name>, <string-name><given-names>A.</given-names> <surname>El Kaidi</surname></string-name>, <string-name><given-names>B.</given-names> <surname>Es-Saghir</surname></string-name>, and <string-name><given-names>M.</given-names> <surname>Housni</surname></string-name>.
          <article-title>Automatic detection of depression in social networks</article-title>.
          <source>Technical report</source>, Université de Toulouse, France, <year>07 2017</year>.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9. <string-name><given-names>S.</given-names> <surname>Rude</surname></string-name>, <string-name><given-names>E.-M.</given-names> <surname>Gortner</surname></string-name>, and <string-name><given-names>J.</given-names> <surname>Pennebaker</surname></string-name>.
          <article-title>Language use of depressed and depression-vulnerable college students</article-title>.
          <source>Cognition &amp; Emotion</source>, <volume>18</volume>(<issue>8</issue>):<fpage>1121</fpage>-<lpage>1133</lpage>, <year>2004</year>.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10. <string-name><given-names>U.</given-names> <surname>Sagen</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Finset</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Moum</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Mørland</surname></string-name>, <string-name><given-names>T. G.</given-names> <surname>Vik</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Nagy</surname></string-name>, and <string-name><given-names>T.</given-names> <surname>Dammen</surname></string-name>.
          <article-title>Early detection of patients at risk for anxiety, depression and apathy after stroke</article-title>.
          <source>General Hospital Psychiatry</source>, <volume>32</volume>(<issue>1</issue>):<fpage>80</fpage>-<lpage>85</lpage>, <year>2010</year>.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11. <string-name><given-names>H. A.</given-names> <surname>Schwartz</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Eichstaedt</surname></string-name>, <string-name><given-names>M. L.</given-names> <surname>Kern</surname></string-name>, <string-name><given-names>G.</given-names> <surname>Park</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Sap</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Stillwell</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Kosinski</surname></string-name>, and <string-name><given-names>L.</given-names> <surname>Ungar</surname></string-name>.
          <article-title>Towards assessing changes in degree of depression through Facebook</article-title>.
          In <source>Proceedings of the Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality</source>, pages <fpage>118</fpage>-<lpage>125</lpage>, <year>2014</year>.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12. <string-name><given-names>C.</given-names> <surname>Strapparava</surname></string-name> and <string-name><given-names>A.</given-names> <surname>Valitutti</surname></string-name>.
          <article-title>WordNet-Affect: an affective extension of WordNet</article-title>.
          <source>Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC 2004)</source>, Lisbon, May 2004, pp. <fpage>1083</fpage>-<lpage>1086</lpage>, <year>2004</year>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>