<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>INAOE-CIMAT at eRisk 2020: Detecting Signs of Self-Harm using Sub-Emotions and Words</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mario Ezra Aragón</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>A. Pastor López-Monroy</string-name>
          <email>pastor.lopez@cimat.mx</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Manuel Montes-y-Gómez</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>Centro de Investigación en Matemáticas (CIMAT)</institution>
          ,
          <country country="MX">Mexico</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Instituto Nacional de Astrofísica, Óptica y Electrónica (INAOE)</institution>
          ,
          <country country="MX">Mexico</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we present our approach to the detection of self-harm at eRisk 2020. The main objective of this shared task was to identify as soon as possible whether a user presents signs of committing self-harm by using their posts on Reddit. To tackle this problem, we used a representation called Bag of Sub-Emotions (BoSE), an approach that represents the posts of the users as a set of sub-emotions, in combination with a Bag of Words. With this strategy, we were able to capture the sub-emotions and topics that users with signs of self-harm tend to express. For the early classification, we chose five different strategies based on the temporal stability shown by the users through their posts. Our approach showed competitive performance in comparison with other participants. Additionally, the interpretability and simplicity of our representation present an opportunity for the analysis and detection of different mental disorders in social media.</p>
      </abstract>
      <kwd-group>
        <kwd>Self-harm Detection</kwd>
        <kwd>Analysis</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Self-harm is defined as the direct and intentional injuring of body tissue with
the intent to commit suicide [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. People who self-harm commonly use a
sharp object to cut their own skin. The practice also includes other behaviors such
as burning, hitting body parts, ingesting toxic substances, or scratching. The
desire to self-harm is a common symptom of mental disorders such as
depression, anxiety, eating disorders, and post-traumatic stress disorder. The 2020
eRisk@CLEF shared task 1 tackled the problem of detecting users who present
signs of committing self-harm using Natural Language Processing (NLP) techniques
and machine learning approaches [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. To accomplish this, participants needed to
process the post history of the users as pieces of evidence and make predictions
as soon as possible. The posts were processed in chronological order applying
different analyses of the user’s interactions in their social media platforms.
      </p>
      <p>
        In this work, we describe the joint participation of INAOE-CIMAT, two
research centers from Mexico, at eRisk@CLEF shared task 1. For this
participation, we used a representation called Bag of Sub-Emotions (BoSE), an approach
based on the use of fine-grained emotions to capture specific emotional topics in
posts [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. This representation consists of changing the users’ posts to a masked
string of sub-emotions. It uses a clustering algorithm to create the sub-emotions
from a lexical resource of emotions, and then generates a histogram of these new
fine-grained emotions. For our participation, we evaluated the BoSE
representation using five different strategies for generating early decisions.
      </p>
      <p>The remainder of this paper is organized as follows: Section 2 presents some related work
on the self-harm detection task and early predictions. Section 3 describes our
text representation. Section 4 presents the experimental settings
as well as the obtained results. Lastly, Section 5 presents our conclusions.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        As previously described, self-harm is a mental disorder associated with the intent
of committing suicide or directly damaging the body. Most works on the
detection of signs of self-harm in social media focus on the analysis
of the post content, mostly considering different kinds of word-based features
[
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]. For example, in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] the authors implemented a word-based approach that
estimates the risk of committing self-harm based on several term statistics such as
their class frequency and inter-class significance. Another approach focused on
identifying personal phrases and on extracting content-based features from them
[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. In that work, words and word n-grams are selected and weighted according to
their co-occurrence with personal pronouns. In [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], the authors implemented
a method aimed at modeling temporal mood variation. That work presented
a two-stage approach: the first stage employs attention-based deep learning models to
represent the temporal mood variation, and the second stage makes the final
decision based on Bayesian inference.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Representation</title>
      <p>
        Psychology has established a correlation between emotions and
mental disorders, and the study of how emotions manifest in language is an
active research area [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Motivated by these findings, and as in the previous
year, our approach for this year’s participation consisted of using emotions at a
fine-grained level as basic elements for the representation of users’ posts. In the
following paragraphs, we briefly describe the creation of the sub-emotion
vocabulary and how we converted the posts’ content into sub-emotion sequences.
      </p>
      <p>
        Generate Sub-Emotions. The creation of sub-emotions used the lexical
resource from [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. This lexical resource consists of eight recognized emotions
[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and two sentiments: Anger, Anticipation, Disgust, Fear, Joy, Sadness,
Surprise, Trust, Positive and Negative, respectively. Each emotion consists of a set
of words that are associated with it. Given the set of words associated with
each emotion, first, we obtained a word vector for each word using pre-trained
word embeddings from FastText. Then, we generated sub-groups of words using
the Affinity Propagation clustering algorithm [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. This algorithm determines the
number of clusters from the data provided, and each resulting centroid represents a
different sub-emotion. With this approach, we were able to separate the words
of each emotion into different topics and represent each emotion by what we call
sub-emotions. Figure 1 illustrates this whole process.
      </p>
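      <p>The clustering step described above can be sketched as follows. This is a minimal, dependency-free illustration and not the implementation used for the submission: the toy two-dimensional vectors stand in for FastText embeddings, and the similarity measure (negative squared Euclidean distance), the median preference, the damping factor, and the number of iterations are all assumptions, since the text does not specify them.</p>
      <preformat>
```python
from statistics import median


def affinity_propagation(points, damping=0.5, iterations=200):
    """Cluster points (at least two) and return (exemplars, labels)."""
    n = len(points)
    # Assumed similarity: negative squared Euclidean distance.
    sim = [[-sum((a - b) ** 2 for a, b in zip(p, q)) for q in points]
           for p in points]
    # Assumed preference: median of the off-diagonal similarities.
    pref = median(sim[i][k] for i in range(n) for k in range(n) if i != k)
    for i in range(n):
        sim[i][i] = pref
    resp = [[0.0] * n for _ in range(n)]
    avail = [[0.0] * n for _ in range(n)]
    for _ in range(iterations):
        # Responsibility: how well suited k is as exemplar for i.
        for i in range(n):
            scores = [avail[i][k] + sim[i][k] for k in range(n)]
            for k in range(n):
                best_other = max(scores[j] for j in range(n) if j != k)
                new = sim[i][k] - best_other
                resp[i][k] = damping * resp[i][k] + (1 - damping) * new
        # Availability: how appropriate it is for i to choose k.
        for k in range(n):
            positive = [max(0.0, resp[i][k]) for i in range(n)]
            total = sum(positive) - positive[k]
            for i in range(n):
                if i == k:
                    new = total
                else:
                    new = min(0.0, resp[k][k] + total - positive[i])
                avail[i][k] = damping * avail[i][k] + (1 - damping) * new
    exemplars = [k for k in range(n) if resp[k][k] + avail[k][k] > 0]
    if not exemplars:
        # Fallback so that every point can still be assigned.
        exemplars = [max(range(n), key=lambda k: resp[k][k] + avail[k][k])]
    labels = [i if i in exemplars
              else max(exemplars, key=lambda k: sim[i][k])
              for i in range(n)]
    return exemplars, labels


# Hypothetical 2-D "embeddings" for six words of a single emotion.
toy_vectors = [(0.0, 0.1), (0.1, 0.0), (0.05, 0.05),
               (5.0, 5.1), (5.1, 5.0), (5.05, 5.05)]
exemplars, labels = affinity_propagation(toy_vectors)
print(exemplars, labels)
```
      </preformat>
      <p>In this sketch, each exemplar would play the role of one sub-emotion of the emotion whose word list was clustered; the number of exemplars is not fixed in advance, which mirrors the property of the algorithm mentioned above.</p>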
      <p>As described above, the obtained sub-groups of words allow separating each
coarse emotion into different topics. These topics help to capture more specific
emotions expressed by the users in their posts. Figure 2 shows some examples
of the kind of sub-emotions automatically generated by the proposed approach.
Analyzing this figure in detail, it can be appreciated that words with similar
contexts tend to group together. It can also be noticed that, even for the same
emotion, each group of words shows a different topic. For example, the Anger
emotion has one group related to the topic of "fighting and battles" and another
group about "loud noises or growls". In another example, the Surprise emotion
has one group related to "art and museums", whereas other groups contain words related
to "accidents and disasters" or "magic and illusion".</p>
      <p>Text to Sub-Emotions. Once the sub-emotions were generated, we masked the
users’ posts by replacing each word with the label of its closest sub-emotion.
To do this, we first calculated the word embedding vector of each word
in the vocabulary of the users using FastText. Then, we computed the cosine
similarity between each word vector and the centroid of each sub-emotion. Finally, the closest
sub-emotion was selected to replace the word.</p>
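      <p>The masking step can be illustrated with the short sketch below. It is only an illustration under stated assumptions: the word vectors and sub-emotion centroids are hypothetical two-dimensional values instead of FastText embeddings, the function names are ours, and skipping words that have no vector is an assumption that the text does not make.</p>
      <preformat>
```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mask_post(tokens, word_vectors, sub_emotions):
    """Replace each token with the label of its closest sub-emotion."""
    masked = []
    for token in tokens:
        vector = word_vectors.get(token)
        if vector is None:
            continue  # assumption: words without a vector are skipped
        closest = max(sub_emotions,
                      key=lambda name: cosine(vector, sub_emotions[name]))
        masked.append(closest)
    return masked


# Hypothetical vectors; real ones would come from FastText.
word_vectors = {"hate": (0.9, 0.1), "nasty": (0.8, 0.3), "teen": (0.1, 0.95)}
sub_emotions = {"fear5": (1.0, 0.0), "negative18": (0.0, 1.0)}

masked = mask_post(["i", "hate", "teen", "nasty"], word_vectors, sub_emotions)
print(masked)
```
      </preformat>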
      <p>
        After the documents were masked, we built the BoSE representation by using
histograms of sub-emotions. Each document was represented as a vector
of weights associated with sub-emotions, where the weights are computed in the tf-idf
fashion. Figure 3 describes the whole process to create the representation. In [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ],
the whole process is explained in more detail.
      </p>
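      <p>As a rough illustration of the histogram step, the sketch below turns masked documents into tf-idf weighted vectors over the sub-emotion labels. The exact tf-idf variant is not stated in the text, so the relative term frequency and the smoothed inverse document frequency used here are assumptions, as are the toy documents.</p>
      <preformat>
```python
import math
from collections import Counter


def bose_vectors(masked_docs):
    """Return (vocabulary, vectors) with one tf-idf vector per document."""
    vocabulary = sorted({label for doc in masked_docs for label in doc})
    n_docs = len(masked_docs)
    # Number of documents in which each sub-emotion label appears.
    doc_freq = Counter(label for doc in masked_docs for label in set(doc))
    # Assumed idf variant: log(N / df) + 1.
    idf = {label: math.log(n_docs / doc_freq[label]) + 1.0
           for label in vocabulary}
    vectors = []
    for doc in masked_docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid dividing by zero for empty documents
        vectors.append([counts[label] / total * idf[label]
                        for label in vocabulary])
    return vocabulary, vectors


# Toy masked documents (sequences of sub-emotion labels).
masked_docs = [["fear5", "fear5", "negative18"], ["trust22", "negative18"]]
vocabulary, vectors = bose_vectors(masked_docs)
print(vocabulary)
print(vectors)
```
      </preformat>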
    </sec>
    <sec id="sec-4">
      <title>Experiments</title>
      <p>
        This year’s shared task was a continuation of the eRisk 2019 T1 task [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]; it
consisted of detecting traces of self-harm in Reddit users as soon as possible.
To observe these traces, we processed the users’ posts sequentially. The
server iteratively provided users’ writings in chronological order, and for each
user we needed to respond with a positive or negative prediction, indicating
whether the user presents signs of committing self-harm. After we sent the
predictions, the server continued with the next set of writings for each user.
To generate the predictions, we used the following five classification
strategies.
      </p>
      <p><bold>Used classification strategies</bold></p>
      <list list-type="order">
        <list-item>
          <p>Run 0: It considered only the training set from the previous edition and
employed the BoSE representation. A user was classified as committing
self-harm if his/her probability of belonging to the positive class was higher than
0.60 in two consecutive predictions.</p>
        </list-item>
        <list-item>
          <p>Run 1: It is similar to Run 0, but the model was trained using the depression
dataset from the eRisk 2018 task.</p>
        </list-item>
        <list-item>
          <p>Run 2: It combined a BoW representation with BoSE, and trained the
classification model using the self-harm and depression datasets together.</p>
        </list-item>
        <list-item>
          <p>Run 3: It is similar to Run 2, except that the training was done using only the
self-harm dataset.</p>
        </list-item>
        <list-item>
          <p>Run 4: It employed the BoW and BoSE representations trained with the
depression and self-harm datasets; a positive prediction was generated when
its probability was higher than 0.55.</p>
        </list-item>
      </list>
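      <p>The decision rule described for Run 0 (a probability above 0.60 in two consecutive predictions) can be sketched as below. The function name and the example probabilities are hypothetical, and treating the 0.55 threshold of Run 4 as a single-prediction rule is our assumption, since the text does not say whether consecutive predictions were also required there.</p>
      <preformat>
```python
def first_alert(probabilities, threshold=0.60, consecutive=2):
    """Return the 1-based round at which a positive decision is emitted.

    probabilities holds the positive-class probability obtained after
    each round of writings; None is returned if no alert is ever raised.
    """
    streak = 0
    for round_number, probability in enumerate(probabilities, start=1):
        # Count how many rounds in a row exceeded the threshold.
        streak = streak + 1 if probability > threshold else 0
        if streak >= consecutive:
            return round_number
    return None


# Hypothetical probabilities for one user over five rounds.
run0_round = first_alert([0.70, 0.50, 0.65, 0.61, 0.90])
# Assumed single-prediction variant with the 0.55 threshold.
run4_round = first_alert([0.50, 0.56], threshold=0.55, consecutive=1)
print(run0_round, run4_round)
```
      </preformat>
      <p>Requiring consecutive positive predictions delays the alert slightly but filters out isolated spikes in the classifier output.</p>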
      <p>Here, it is important to note that the approach used presents two main
differences with respect to our previous year’s strategy. On the one hand, we added
the users’ vocabulary to the BoSE representation, which allows capturing
some specific words related to self-harm; on the other hand, we used
the 2018 depression dataset in the training phase, which aims to build a
more robust classification model by taking advantage of the existing relationship
between self-harm and depression. Table 1 shows the strategy used for each
run.</p>
      <p>We first trained and evaluated our model using the 2019 eRisk dataset. These
experiments helped us to select the best parameters before sending the
predictions to the server. The 2019 dataset contains two categories of users: self-harm
and control. For this configuration experiment, we used the users’ whole post
histories, applied cross-validation, and considered the F1 over the
positive class as the evaluation measure. Table 2 presents the obtained results. It
compares the BoSE and BoW representations trained
exclusively with the self-harm dataset and trained with the depression dataset added. These
results show that adding information from the depression collection helped to
improve the classification performance. This could be due to the scarcity of data when using
only the self-harm dataset. In Table 3, we present five of the most relevant words
of each dataset (depression and self-harm). We can appreciate that the model
captures some differences between the two problems. For example, for self-harm, the
most relevant words are related to physical damage together with some mental problems,
whereas for depression the words are more related to emotional problems.</p>
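      <p>For reference, the F1 over the positive class mentioned above can be computed as in the following sketch. The label encoding (1 for self-harm, 0 for control) and the example predictions are assumptions made only for illustration.</p>
      <preformat>
```python
def f1_positive(y_true, y_pred, positive=1):
    """F1 score computed only over the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical gold labels and predictions for five users.
score = f1_positive([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
print(round(score, 3))
```
      </preformat>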
      <p>For the submission of results, we trained the model using all the information
from all the users of the training dataset. Then, using the five classification
strategies previously mentioned, we detected the users who present signs of
committing self-harm. Table 4 shows the results obtained by the five strategies
over the 2020 test dataset. The strategy named Run 3 obtained
the best results; it uses BoSE and BoW trained only
on the self-harm dataset. In this strategy, a user was identified as committing
self-harm if the probability of the positive class was higher than 0.60 in two
consecutive predictions, indicating a temporal emotional stability of the user.
We can appreciate that this approach also obtains the best ERDE score,
which implies correct predictions made with relatively little information. The use of
both representations indicates that not only the emotional information but also
the presence of certain words (associated with certain topics) is important for
the detection of people who commit self-harm. In Table 5 we show some of the
most relevant sub-emotions related to self-harm and the topics they capture.
Some topics are related to negative aspects, like hate, criticise or refuse. An
interesting sub-emotion captured automatically by our model is related to young
people; people who commit self-harm are usually teenagers or close to
that age.</p>
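      <table-wrap id="tab5">
        <label>Table 5</label>
        <table>
          <thead>
            <tr><th colspan="2">Self-harm</th></tr>
          </thead>
          <tbody>
            <tr><td>anger11</td><td>unsociable, crowd, mischievous</td></tr>
            <tr><td>disgust17</td><td>condemn, criticise, refuse, repudiate</td></tr>
            <tr><td>fear5</td><td>dreadful, hate, bad, nasty</td></tr>
            <tr><td>negative18</td><td>adolescence, teen, juvenile</td></tr>
            <tr><td>trust22</td><td>impatient, desire, anxious</td></tr>
          </tbody>
        </table>
      </table-wrap>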
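      <p>Since the results are discussed in terms of ERDE, the sketch below shows how this early-detection error is commonly defined in the eRisk overviews, to the best of our understanding: a false positive costs c_fp, a false negative costs c_fn, and a true positive is penalised by a latency cost that grows with the number k of writings seen before the decision. The default cost values and the overflow guard are assumptions for illustration; the official evaluation settings should be taken from the task overview.</p>
      <preformat>
```python
import math


def erde(decision, truth, k, o, c_fp, c_fn=1.0, c_tp=1.0):
    """Early risk detection error for a single user.

    decision and truth are booleans, k is the number of writings seen
    before the decision, and o is the point where the latency cost of a
    true positive reaches one half.
    """
    if decision and not truth:
        return c_fp                  # false positive
    if truth and not decision:
        return c_fn                  # false negative
    if decision and truth:
        exponent = min(k - o, 700)   # guard against overflow in exp
        latency_cost = 1.0 - 1.0 / (1.0 + math.exp(exponent))
        return latency_cost * c_tp   # true positive, penalised if late
    return 0.0                       # true negative


# Hypothetical user flagged correctly after 5 writings, with o = 5.
late_hit = erde(True, True, k=5, o=5, c_fp=0.1)
print(late_hit)
```
      </preformat>
      <p>The per-user values would then be averaged over all users; a lower value means that correct alerts were raised after fewer writings.</p>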
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>In this paper, we presented our approach for the eRisk 2020 shared task 1, which
consisted of deciding as soon as possible whether a user presents signs of self-harm by
using his/her post history in chronological order. For this, we proposed the use
of a representation that combines a bag of words with a bag of sub-emotions,
which was created using a lexical resource of emotions and FastText sub-word
embeddings. The main idea of our approach is to capture specific fine-grained
emotions and topics that a user committing self-harm tends to express through
his/her posts. Our approach differs from other methods in its simplicity and
interpretability, particularly compared with approaches that use several different features
and complex classification models. On the test set, it obtained competitive results,
showing an opportunity for a deeper exploration of the usefulness of modeling
the emotional information from users who are at risk of committing self-harm
or suffering from another mental disorder.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This research was supported by CONACyT-Mexico (Scholarship 654803).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Laye-Gindhu</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schonert-Reichl</surname>
            ,
            <given-names>Kimberly A.</given-names>
          </string-name>
          :
          <article-title>Nonsuicidal Self-Harm Among Community Adolescents: Understanding the "Whats" and "Whys" of Self-Harm</article-title>
          .
          <source>Journal of Youth and Adolescence</source>
          . (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Aragón</surname>
          </string-name>
          , ME.,
          <string-name>
            <surname>López-Monroy</surname>
          </string-name>
          , AP.,
          <string-name>
            <surname>González-Gurrola</surname>
          </string-name>
          , LC.,
          <string-name>
            <surname>Montes-</surname>
            y-Gómez,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Detecting Depression in Social Media using Fine-Grained Emotions</article-title>
          .
          <source>Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers). (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Trifan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oliveira</surname>
          </string-name>
          , JL.: BioInfo@UAVR at eRisk 2019:
          <article-title>delving into social media texts for the early detection of mental and food disorders</article-title>
          .
          <source>Proceedings of the 10th International Conference of the CLEF Association, CLEF</source>
          <year>2019</year>
          , Lugano,
          Switzerland.
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Van</given-names>
            <surname>Rijen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            ,
            <surname>Teodoro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            ,
            <surname>Naderi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            ,
            <surname>Mottin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            ,
            <surname>Knafou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Jeffryes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Ruch</surname>
          </string-name>
          ,
          <string-name>
            <surname>P.</surname>
          </string-name>
          :
          <article-title>A Data-Driven Approach for Measuring the Severity of the Signs of Depression using Reddit Posts</article-title>
          .
          <source>Proceedings of the 10th International Conference of the CLEF Association, CLEF</source>
          <year>2019</year>
          , Lugano,
          Switzerland.
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Ragheb</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aze</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bringay</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Servajean</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Attentive Multi-stage Learning for Early Risk Detection of Signs of Anorexia and Self-harm on Social Media</article-title>
          .
          <source>Proceedings of the 10th International Conference of the CLEF Association, CLEF</source>
          <year>2019</year>
          , Lugano,
          Switzerland.
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Burdisso</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Errecalde</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montes-</surname>
            y-Gomez,
            <given-names>M.</given-names>
          </string-name>
          : UNSL at eRisk
          <year>2019</year>
          :
          <article-title>a Unified Approach for Anorexia, Self-harm and Depression Detection in Social Media</article-title>
          .
          <source>Proceedings of the 10th International Conference of the CLEF Association, CLEF</source>
          <year>2019</year>
          , Lugano,
          Switzerland.
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Ortega-Mendoza</surname>
          </string-name>
          , RM.,
          <string-name>
            <surname>Hernandez-Farias</surname>
          </string-name>
          , DI.,
          <string-name>
            <surname>Montes-</surname>
            y-Gomez,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>LTL-INAOE's Participation at eRisk 2019: Detecting Anorexia in Social Media through Shared Personal Information</article-title>
          .
          <source>Proceedings of the 10th International Conference of the CLEF Association, CLEF</source>
          <year>2019</year>
          , Lugano,
          Switzerland.
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Ekman</surname>
          </string-name>
          , PE.,
          <string-name>
            <surname>Davidson</surname>
          </string-name>
          , RJ.:
          <article-title>The nature of emotion: Fundamental questions</article-title>
          . New York, NY, US: Oxford University Press. (
          <year>1994</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Coppersmith</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dredze</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harman</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Quantifying mental health signals in Twitter</article-title>
          .
          <source>Proceedings of the Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality</source>
          . (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Mohammad</surname>
            ,
            <given-names>S.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Turney</surname>
          </string-name>
          , P.D.:
          <article-title>Crowdsourcing a Word-Emotion Association Lexicon</article-title>
          .
          <source>Computational Intelligence</source>
          . (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Walck</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Hand-book on Statistical Distributions for experimentalists</article-title>
          . University of Stockholm,
          <source>Internal Report SUF-PFY/96-01</source>
          . (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12. Losada, DE.,
          <string-name>
            <surname>Crestani</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parapar</surname>
          </string-name>
          , J.:
          <article-title>Overview of eRisk 2018: Early Risk Prediction on the Internet (extended lab overview)</article-title>
          .
          <source>Proceedings of the 9th International Conference of the CLEF Association, CLEF 2018</source>
          , Avignon, France. (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13. Losada, DE.,
          <string-name>
            <surname>Crestani</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parapar</surname>
          </string-name>
          , J.:
          <article-title>Overview of eRisk 2019: Early Risk Prediction on the Internet</article-title>
          .
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. 10th International Conference of the CLEF Association, CLEF 2019</source>
          , Lugano, Switzerland. (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14. Losada, DE.,
          <string-name>
            <surname>Crestani</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parapar</surname>
          </string-name>
          , J.:
          <article-title>Overview of eRisk 2020: Early Risk Prediction on the Internet</article-title>
          .
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Eleventh International Conference of the CLEF Association (CLEF 2020)</source>
          . (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15. American Psychiatric Association:
          <article-title>Diagnostic and Statistical Manual of Mental Disorders</article-title>
          .
          <source>Fourth Edition</source>
          . Washington, DC: American Psychiatric Press. (
          <year>1994</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Thavikulwat</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Affinity Propagation: A clustering algorithm for computerassisted business simulation and experimental exercises</article-title>
          .
          <source>Developments in Business Simulation and Experiential Learning</source>
          . (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>