Anorexia Topical Trends in Self-declared Reddit Users
          Razan Masood♣ , Mengjiao Hu♣ , Hermenegildo Fabregat♦ , Ahmet Aker♣ , and Norbert Fuhr♣
                                                 ♣ University of Duisburg-Essen, Duisburg, Germany
                                         ♦ Universidad Nacional de Educación a Distancia, Madrid, Spain

                                                               firstname.lastname@uni-due.de
                                                                   gildo.fabregat@lsi.uned.es

ABSTRACT
Social Media platforms have been a vital environment to share experiences and seek knowledge.
People with various interests form online communities in which they can accumulate many
experiences from many peers. Among these communities are the mental health-related ones that
have been growing on Social Media in the last few years. However, users can show alarming
behavioral signs at a stage of their mental illness that should be identified before it is too
late. Hence, equipping social media platforms with the needed tools to monitor their users,
identify risks, and intervene on time has been of great concern recently. In this paper, we
target users who self-disclose as being diagnosed with an eating disorder, namely Anorexia. We
provide a dataset of manually labeled Reddit users’ posts, focused on the extraction of some
potentially relevant topics for the study of eating disorders, e.g., diets, exercises, and body
image. These topics can be utilized to find patterns in Anorexic users’ behaviors to distinguish
them from users who are less likely to have Anorexia. They can also be used to interpret
afflicted users’ attitudes. We support our labeling with baseline experiments to learn how to
differentiate between these topics.

CCS CONCEPTS
• Human-centered computing → Social networking sites; • Applied computing → Psychology.

KEYWORDS
mental health, Reddit, social media, Anorexia, machine learning

1    INTRODUCTION
Humanity has come a long way in maintaining a high level of societies’ and individuals’
well-being, including physical health, education and freedom. Still, a lot needs to be done in
the mental health domain, which is getting more attention in the modern age of prospering
technologies [17]. More people with mental health issues resort to Social Media (SM) platforms
either to directly seek support and information or to communicate their thoughts and feelings
indirectly. Recently, the data that such users produce on SM has been shown to predict their
mental health state and its severity [7]. Besides, it provides precious resources for
practitioners and experts as a possible tool for mental health-related research. Moreover,
predicting mental health issues in the early stages is essential to provide the needed support
in alarming situations like preventing suicide, self-harming, and eating disorders [12, 16, 28].
   In this paper, we target SM users who have explicitly stated that they were clinically
diagnosed with Anorexia Nervosa (AN). Anorexia is “an eating disorder characterized by
abnormally low body weight, an intense fear of gaining weight, and a distorted perception of
weight. People with anorexia place a high value on controlling their weight and shape” 1 . We
use posts extracted from Reddit, “an online network of communities based on people’s
interests”. The different Reddit communities are referred to as subreddits. Each subreddit is
devoted to a specific topic. Plenty of subreddits are related to AN and other eating disorders,
such as the EatingDisorders and AnorexiaRecovery subreddits. People resort to such communities
for many purposes. Some communities promote sharing recovery experiences and emotional support,
and others can cause more harm, like pro-Anorexia communities, which promote unhealthy
body-image and diets. Hence, the chances are that there are users who may face serious risks,
which obliges SM platforms to keep their environments under control and provide possible
intervention when needed.
   By investigating the specific type of information or topics AN-diagnosed users post about,
we observed that the most frequently discussed topics are diet and eating routines, weight,
family and relationship issues, anxiety, and depression problems. Figure 1 shows an example of
a pair of a positive user (diagnosed with AN) and a negative user (not diagnosed with AN). The
timeline of the first 50 posts for the two users and their post topics are plotted. The example
shows that the positive user (blue) posts more frequently on topics related to their mental
health state (4 times), eating disorders (2 times), diet (4 times) and physical pain (1 time).
On the other hand, the negative user (orange) posts about family and exercises, among other
non-significant topics. Based on this analysis, we suggest that we can use particular topical
patterns to analyze and explain AN users’ behaviour in a more understandable way, which can be
helpful to distinguish risky users. Furthermore, when topical patterns are combined with, e.g.,
emotions [5, 8] or other aspects like stance, they can help to reveal the severity level of the
illness [26]. Besides, these patterns could be extended and adapted to other mental health
issues such as substance abuse and depression. Hence, we believe that post-level classification
could be useful for medical researchers and psychiatrists to analyze topical extracts of SM
history and evaluate the prevalence of the pattern of certain topics among AN sufferers.
   Our main contributions in this paper are as follows: (1) We define topics of importance to
identify Reddit users who are more likely to have AN. (2) We provide a dataset of users’ posts
annotated with defined relevant topics. (3) We present baselines to predict the different post
categories based on the labeled dataset 2 .

"Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0)."
1 https://www.mayoclinic.org/diseases-conditions/anorexia-nervosa/symptoms-causes/syc-20353591
2 The dataset and the code of the best-performing models are released for research purposes
https://github.com/razanmasood/Anorexia_Topical_Trends_in_Self_declared_
                                                                                                                                                               Masood, et al.


Figure 1: Positive vs. negative users’ posts topics shown for the first 50 posts in two users’
timeline. The x-axis shows the order of the posts, and the y-axis the assigned topics.

2    RELATED WORK
Online Social Media has driven a wide range of investigations on mental health by exploiting
the growing users’ data [14, 27]. Many datasets have been collected from SM platforms for
language or communication analysis and risk prediction [10, 26, 30]. The dataset collection is
based on different rules and methods [15]. One method is to identify users affected by mental
illnesses using psychiatric surveys assessed by experts. The selected users’ SM accounts are
then explored based on the results of the survey [13].
   Another method is to consider users who have mentioned that they have been diagnosed with a
mental illness on their social media as positive cases [9, 18]. Nonetheless, the datasets
mentioned above are labeled at the user level, i.e., a user has or does not have the targeted
mental illness.
   A third method is to manually annotate posts based on the signals they hold that
characterize the mental illness in question. The annotations are either determined from the
data or based on theory [15]. The vast majority of post-level annotations regarding eating
disorder characteristics were done as part of content analysis work by experts. Mowrey et al.
defined an annotation scheme for labeling tweets according to depressive symptoms and
psychosocial stressors [21]. The goal of the final corpus is to understand the depression
language and to identify the differences between psychological factors. Moreover, Sowles et al.
extended the annotation to coding the attitude and the support behaviors of the comments [23].
On the other hand, the frequent topics brought up by mental health online communities have been
explored using topic modeling such as LDA (Latent Dirichlet Allocation) and other methods
[7, 22, 29].
   However, the problem with automatic topic modeling, when applied to Reddit posts, is that
posts as documents are not long enough for topic modeling. Moreover, when all posts of a user
are joined in one document, it is more likely to undergo topic shifts, variation in tone, and
hence, be out of context [10]. To our knowledge, a manually annotated post-level Reddit dataset
for topics related to Anorexia is not available as a basis for both enhanced supervised and
unsupervised classification models. Besides, our topic annotations are more descriptive than
bare automatic topics. Our manual annotation criteria are defined based on the dataset
observations and on the previous work that defined frequently mentioned topics by people who
show symptoms of eating disorders online. Unlike the expert-based annotations, we do not involve
the attitude of the writer towards the mentioned topics in the post when assigning the labels.

3    DATASET
We use the eRisk 2019 dataset 3 . eRisk is a part of CLEF (Conference and Labs of the Evaluation
Forum) 2019 labs. The lab has designated a task for the early detection of Reddit users with
signs of Anorexia [19]. The training data is a set of users under two categories. One comprises
the users who stated in at least one of their posts that they were diagnosed with Anorexia, and
the other comprises users who did not. For each user in the dataset, all posts and comments made
by that user (which are up to 1000 posts and 1000 comments) are chronologically sorted [18]. The
post in which a user declares their diagnosis was filtered out. Posts and comments of the users
can belong to any subreddit. For our purposes, we selected 55 positive users and labeled 50-100
posts starting from the earliest post/comment. This research is oriented towards investigating
the topics of interest of positive users, but to examine the occurrence of similar topics in
negative users’ posts, we picked ten negative users to label their posts using the same
criteria.

3.1    Labels
To choose the posts’ labels, we manually examined the different topics that appeared more
frequently than other topics in positive users’ posts. Then, we verified and expanded the topics
using related work that analyzed posts of social media users, which indicate symptoms of
Anorexia [6, 7, 29]. The selected topics are related to mental health disorders and anxiety,
self-harm, suicidal thoughts, pain, and hints of the desire to be skinny. In addition, we
included further topics that were shown to be related to general mental health evaluation, like
family and sleep, as found in [11, 25]. The selected set of topics was rearranged under seven
labels of posts with related AN topics and an additional label for posts that cannot be labeled
under any of the seven defined topics. The number of labels is suitable for automatic
classification experiments. The post labels are as follows 4 : (1) Eating disorder: Posts with
explicit mentions of experiences that indicate eating disorder (Anorexia, Bulimia, ED), and
behaviors like binging and induced throw-ups. (2) General mental health: Under this label are
posts with mentions of signs of mental disturbances and inconveniences. Examples of these are:
a. Posts with indications of depression, anxiety, and sadness expressions. b. Posts with signs
of harming oneself and suicidal expressions. c. Posts with mentions of issues related to sleep
like lack of sleep or oversleep. d. Posts with mentions of alcohol drinking problems and other
addiction issues like drugs and smoking. (3) Medication: Posts with mentions of medication
names. Some medications could be used for treatment reasons or for inducing throw-ups.
(4) Family & friends: Posts that contain stories on friends or family members. (5) Diets & Food:
Posts with mentions of specific foods, recipes, and diets that include fasting, skipping meals,
and purging. (6) Body shape & exercises: Posts with mentions of the body’s weight, height, BMI,
and other body-image expressions. In addition to posts that mention exercise routines and

Reddit_Users. The labels are released by the IDs of the original dataset because of the signed
user agreement.
3 https://early.irlab.org/2019/index.html
4 According to the user agreement signed with eRisk organizers, it is not allowed to show
contents from the dataset.


Table 1: Labels with the number of instances (#) for each and                          obtained using well-known classification approaches based on Lo-
Fleiss’ Kappa (κ) agreement scores. The number                                         gistic Regression Classifier, LSTM (Long Short-Term Memory), and
of instances for each label in Training, Development and                               CNN (Convolutional Neural Networks). Firstly, we divided the cor-
Test sets are shown in the corresponding columns.                                      pus into three sets (training, development, and test), where each
                                                                                       set comprised different users to avoid learning user-specific fea-
    Label                            κ        #         Train      Dev     Test        tures like individual writing styles. We choose to experiment with
    Eating disorder                  0.72     126       85         24      17          the main label only rather than dealing with multiple labels for a
    General mental health            0.49     170       71         43      56          post. The distribution of posts can be seen in Table 1. Then, we
    Medication                       0.4      60        33         22      5           pre-process the posts’ textual content by lemmatizing, lower-casing,
    Family & friends                 0.49     146       84         36      26          and removing expressions related to the Reddit platform like tag-
    Diets & food                     0.69     234       171        39      24          ging forums and users. The cleaned posts and comments are the
    Body shape & exercises           0.75     280       188        48      47          input to the ML models, and the assigned labels are the targets to
    Physical pain & sickness         0.49     130       108        8       14          be learned.
    Other                            0.69     3959      2549       765     645            To analyze the task at different levels of complexity, we con-
                                                                                       sider two experimental frameworks, namely, binary and multiclass
                                                                                       classifiers. For the binary task, we transform our eight labels into
other physical activities. (7) Physical pain & sickness: Posts with                    two labels, one is the Related label that has all the seven labels
mentions of physical sickness or illness. (8) Other: Any post not                      relevant to AN, and the Unrelated label that has the Other label.
related to the categories mentioned above.                                             The second task is a fine-grained multi-classification task that is
   We define the topics in a way that makes it more straightforward                    set to distinguish the eight labels individually.
for non-clinician annotators. The labels’ definitions do not involve                      The three models are configured as follows:
judgment on the severity, emotions, or attitude the writer has                            Logistic Regression with TF-IDF (LR-TFIDF). We use the lo-
towards the reported topics. This separation is necessary to ensure                    gistic regression implementation from Python’s scikit-learn package. We
annotation with fewer inaccuracies and to separate emotions and                        feed the classifier with Term Frequency-Inverse Document Fre-
attitude factors from the plain topic labels.                                          quency (TF-IDF) features of word unigrams and bigrams.
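The binary relabeling and the LR-TFIDF baseline described in this section can be sketched as follows. This is a minimal illustration, assuming scikit-learn is installed; the toy posts are invented (eRisk content cannot be shown), the helper `to_binary` is a hypothetical name, and hyperparameters such as `max_iter` are assumptions rather than the authors' settings. Lemmatization and the removal of Reddit-specific expressions are omitted.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented toy posts (NOT taken from eRisk), each with one main label.
toy_posts = [
    ("i only ate an apple today and skipped dinner again", "Diets & food"),
    ("my bmi dropped and i ran five miles this morning", "Body shape & exercises"),
    ("feeling anxious and could not sleep all week", "General mental health"),
    ("what a great game last night, the ending was wild", "Other"),
    ("does anyone know a good laptop for programming", "Other"),
    ("new trailer for the movie looks amazing", "Other"),
]


def to_binary(label):
    """Map the eight topic labels onto the two labels of the binary task."""
    return "Unrelated" if label == "Other" else "Related"


texts = [text for text, _ in toy_posts]
binary_targets = [to_binary(label) for _, label in toy_posts]

# TF-IDF over word unigrams and bigrams, fed into logistic regression.
lr_tfidf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), lowercase=True),
    LogisticRegression(max_iter=1000),
)
lr_tfidf.fit(texts, binary_targets)

prediction = lr_tfidf.predict(["skipped dinner and ran five miles"])[0]
print(prediction)
```

For the multiclass task, the same pipeline would be fitted on the eight original labels instead of the output of `to_binary`.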
                                                                                          LSTM with inner-attention (LSTM-Att). For this model, we
3.2     Annotation                                                                     represented each post by its term embeddings extracted using
Five master’s students from the Computer Science department anno-                      GloVe [3]. Then each term was weighted by the average value
tated the data. The annotators were paid per hour. We dedicated a                      of the embeddings of certain recurrent terms. The recurrent terms
session to train the annotators and made sure that they followed the                   are selected by extracting the significant terms for each label against
definition of the labels through a selected sample of posts. The posts                 the other labels by the Chi-square test on TF-IDF features of uni-
were annotated with as many labels as the topics mentioned. Taking                     gram words. We selected the most significant 200 terms for each
into account possible further experiments, we fixed one label for                      label. We then calculated the average embedding GloVe vector for
each post as the main label. We define the main label as the one                       each set of terms for each label. The inner attention mechanism
that the annotator found to be the most representative/dominant                        is based on weighting each term of a post/comment by the aver-
label of the post [4].                                                                 age vector of each label. The model was implemented as in [1, 24].
    In the case of multiple labels, we calculated the agreement reached                For the LSTM, we used a single forward layer, eight neurons, and
on individual labels using Fleiss’ Kappa for multiple raters using                     Hyperbolic Tangent activation function.
the Python library statsmodels 0.11.0 5 . Because each post can have                      CNN with Meta-Map (CNN-MM). With this model, we explored the
multiple labels, we calculate the agreement for each label separately,                 addition of more focused knowledge using concepts extracted by
i.e., to observe the agreement between raters to choose a specific                     Meta-Map, an NLP tool focused on information retrieval from the
label for the post. The agreement results are shown in the second                      biomedical domain and enriched with several thesauri [2]. As Meta-
column of Table 1. The lowest agreement value we obtained is on the label              Map provides for each identified concept the semantic category to
Medication, which can be due to fewer posts available on the topic.                    which it belongs, we explored an approach using this knowledge.
Another reason is that the annotators are not experts, and in many                     In total, we studied 50 semantic groups manually selected based
cases, it is hard to recognize medication names easily. The third                      on their relationship with the labels, including Activity, Behavior,
column of Table 1 shows the number of instances for each label                         Disease or Syndrome. In short, each post has been represented as
taken as a main label only. We use Fleiss’ Kappa as well to compute                    a sequence of terms and its respective category. We used GloVe
the agreement on the main label. The agreement score obtained is                       to represent the words and a trainable embedding vector of 50 di-
0.65, which is considered to be in an acceptable range [20].                           mensions to represent the semantic categories. The CNN is applied
                                                                                       with a fixed window of 5 elements and a total of 128 neurons.
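The per-label agreement of Section 3.2 can be illustrated with a small, self-contained computation of Fleiss' Kappa. The paper reports using the statsmodels implementation; the pure-Python version below is only a sketch of the same statistic. The vote counts are invented, and the function names `fleiss_kappa` and `label_table` are chosen here for illustration. For one label, each post becomes a row with two categories: how many of the five raters assigned the label and how many did not.

```python
def fleiss_kappa(table):
    """Fleiss' Kappa; table[i][j] = number of raters placing item i in category j.

    Every row must sum to the same number of raters.
    """
    n_items = len(table)
    n_raters = sum(table[0])
    n_cats = len(table[0])
    total = n_items * n_raters

    # Share of all ratings that fall into each category.
    p_cat = [sum(row[j] for row in table) / total for j in range(n_cats)]
    # Observed agreement per item.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in table
    ]
    p_bar = sum(p_item) / n_items
    p_exp = sum(p * p for p in p_cat)
    return (p_bar - p_exp) / (1 - p_exp)


def label_table(votes, n_raters=5):
    """Turn 'raters who assigned the label' counts into a two-category table."""
    return [[v, n_raters - v] for v in votes]


# Invented example: for six posts, how many of five raters chose one label.
votes_for_label = [5, 0, 4, 0, 1, 0]
kappa = fleiss_kappa(label_table(votes_for_label))
print(round(kappa, 2))
```

Repeating this for each of the eight labels gives one κ value per label, which is how the second column of Table 1 is organised.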
4     AUTOMATIC TOPIC CLASSIFICATION
To further understand the complexity of classifying users’ posts                       4.1    Results and Error Analysis
according to the defined topic labels, we present baseline results
                                                                                       The overall classification performance including precision (P), recall
5 https://www.statsmodels.org/stable/generated/statsmodels.stats.inter_rater.fleiss_   (R) and F1 shown in Table 3 confirms that the multiclass classifica-
kappa.html                                                                             tion is trickier than the binary one. CNN, combined with Meta-Map


Table 2: Performance of the models reported on each label individually on the test set. (1) Logistic Regression with TF-IDF
features (LR-TFIDF), (2) LSTM with attention (LSTM-Att), and (3) CNN with MetaMap (CNN-MM)

                                        General mental                                                                        Body shape          Physical pain
                  Eating disorder                           Medication         Family & friends         Diets & Food                                                       Other
                                           health                                                                             & exercises         & sickness
                  P      R      F1      P     R     F1    P      R       F1     P      R      F1        P     R      F1     P     R       F1    P       R     F1    P       R       F1
 M/LR-TFIDF      0.64   0.53   0.58   0.80 0.07 0.13       0      0       0    0.27   0.12   0.16     0.88   0.29   0.44   0.65 0.28 0.39      0.50 0.07 0.12      0.83    0.99    0.90
 M/LSTM-Att      0.36   0.29   0.32    0.4   0.04 0.07     0      0       0    0.24   0.27   0.25     0.54   0.54   0.54   0.50 0.32 0.39      0.17 0.43 0.24      0.88    0.95    0.91
 M/CNN-MM        0.62   0.59   0.61   0.50 0.07 0.12     0.50   0.20    0.29   0.17   0.04   0.06     0.59   0.54   0.57   0.65 0.43 0.51      0.22 0.14 0.17      0.86    0.98    0.92
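As a reading aid for Tables 2 and 3: the multiclass test F1 values reported in Table 3 appear to equal the unweighted (macro) average of the per-label F1 values in Table 2. The paper does not state the averaging scheme, so this is an observation from the numbers rather than a documented fact. The short check below copies the F1 columns of Table 2 and compares their means with the Table 3 test F1.

```python
# Per-label test F1 from Table 2, in label order (1)-(8).
per_label_f1 = {
    "LR-TFIDF": [0.58, 0.13, 0.00, 0.16, 0.44, 0.39, 0.12, 0.90],
    "LSTM-Att": [0.32, 0.07, 0.00, 0.25, 0.54, 0.39, 0.24, 0.91],
    "CNN-MM": [0.61, 0.12, 0.29, 0.06, 0.57, 0.51, 0.17, 0.92],
}

# Multiclass (M) test F1 as reported in Table 3.
reported_test_f1 = {"LR-TFIDF": 0.34, "LSTM-Att": 0.34, "CNN-MM": 0.41}

# Unweighted mean over the eight labels, rounded to two decimals.
macro_f1 = {
    model: round(sum(scores) / len(scores), 2)
    for model, scores in per_label_f1.items()
}

for model, value in macro_f1.items():
    print(model, value, reported_test_f1[model])
```

Because a macro average weights all labels equally, the rare labels with near-zero F1 (for example Medication) pull the multiclass scores down even though the dominant Other label is predicted well.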


Figure 2: Confusion matrices obtained on the test set. The numbers refer to the labels in the
order they are mentioned in Section 3.1.
           (a)                        (b)                         (c)

Table 3: Results using Binary (B) and Multi-class (M) classification with each of the models
Logistic Regression with TF-IDF features (LR-TFIDF), LSTM with attention (LSTM-Att), and CNN
with MetaMap (CNN-MM)

                            Dev                      Test
    Model           P       R       F1       P       R       F1
    B/LR-TFIDF      0.83    0.80    0.81     0.77    0.76    0.77
    B/LSTM-Att      0.84    0.78    0.80     0.85    0.80    0.82
    B/CNN-MM        0.85    0.80    0.82     0.88    0.79    0.83
    M/LR-TFIDF      0.42    0.29    0.33     0.57    0.29    0.34
    M/LSTM-Att      0.39    0.33    0.33     0.39    0.35    0.34
    M/CNN-MM        0.52    0.37    0.41     0.51    0.37    0.41

semantic groups, performed significantly better than the other two
models for the multiclass task 6 .
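The significance claim rests on McNemar's test with a Bonferroni-corrected threshold of 0.0125. A minimal sketch of such a test is given below. The paper does not say whether the exact or the chi-square variant was used, so the exact binomial form is an assumption; the disagreement counts are invented; and reading 0.0125 as 0.05 divided by four comparisons is an inference from the threshold, not something the paper states.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b: posts model A classified correctly and model B incorrectly.
    c: posts model A classified incorrectly and model B correctly.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Binomial tail with success probability 0.5 under the null hypothesis.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Assumed family of four comparisons at an overall level of 0.05.
alpha = 0.05 / 4

# Hypothetical disagreement counts between two classifiers on a test set.
p_value = mcnemar_exact(2, 20)
print(p_value < alpha)
```

Only the posts on which the two models disagree enter the test, which makes it suitable for comparing classifiers evaluated on the same test set.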
   We further list detailed results on each of the labels in multi-
class settings in Table 2. The highest performance according to F1                                5    CONCLUSIONS AND FUTURE WORK
measure are on Eating disorder, Diets & food, Body shape & Exer-                                  In this paper, we report the annotation process of Reddit posts
cises and Other label. The unbalanced distribution of labels highly                               and comments by self-declared users with Anorexia Nervosa. We
influences the performance. Hence, the high performance on the                                    define an annotation scheme of fine-grained labels according to
Other labels. CNN-MM performed well on the labels that contained                                  topics related to the diagnosis of Anorexia Nervosa. We show that
terms related to the medical semantic groups in Meta-Map in their                                 our annotation is rather robust as the Fleiss’ Kappa agreement
assigned posts, e.g., Body parts terms and Eating disorders terms.                                values are in an acceptable range. We further test the possibility of
However, CNN-MM performed better than the other models on the                                     predicting post topics automatically. The classification results show
Medication label despite the few items in the training data, owing to the medication terms used
for feature encoding. However, the performance on the General mental health label is lower than
expected. This may be because the label spans multiple domains, which confused the classifiers.
Specifically, the confusion matrix in Figure 2(a) shows that the LSTM-Att model confused the
General mental health label (2) mostly with the Physical pain & sickness label (7) and with the
Other label (8). A likely reason is that many posts mentioning general mental health issues, such
as lack of sleep, also mention pain, which led the classifier to choose the Physical pain &
sickness label (7). LSTM-Att also performs better on the Family & Friends label (4) because it
uses terms related to this topic to weight the post terms, unlike the CNN model (Figure 2(b)),
for which Meta-Map provides no such terms.
   The classification results show that terms play an important role in identifying the topics.
This is shown by the improvement in performance achieved when supporting the models with targeted
related terms, compared with using TF-IDF features alone (Figure 2(c)). Nevertheless, a post,
especially a longer one, can discuss many topics. Therefore, we suggest ML models that output
scores for multiple labels. In addition, the labeling process can be improved by highlighting the
sentences related to each label rather than labeling whole long posts, which makes the labeled
text more focused.

that predicting one main label for long posts is difficult to do accurately. Hence, using the
multiple-label annotations to predict several labels per post is a possible solution, in addition
to providing sentence-level annotations. The annotation scheme presented in this paper covers the
topics that self-declared Reddit users with AN mention more frequently than other users. However,
these topics alone are not enough to distinguish risky users and might lead to false-positive
predictions, because many users join these communities to help someone close to them who has been
diagnosed with AN. In future work, we will explore the possibility of employing the topics to
predict risky users. The prediction models can be enriched with the sequential development of
emotions and stances that accompany the topics [5, 8]. Furthermore, different feature combinations
may allow estimating the severity of the targeted illness. Another interesting question is how to
make these models adaptable enough to learn different forms of mental illness.

ACKNOWLEDGMENTS
This work was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) -
GRK 2167, Research Training Group “User-Centred Social Media”. The work has also been partially
supported by the Spanish Ministry of Science and Innovation within the projects PROSA-MED
(TIN2016-77820-C3-2-R) and EXTRAE-II (IMIENS 2019).

6 McNemar’s test, p < 0.0125 after Bonferroni correction.
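The significance check named in footnote 6 can be illustrated with a minimal sketch. It is not the authors' code: it assumes the exact (binomial) form of McNemar's test on paired classifier predictions, and it assumes that the 0.0125 threshold corresponds to a family-wise level of 0.05 divided over four comparisons. The function names and the toy data are hypothetical.

```python
from math import comb


def mcnemar_exact_p(y_true, pred_a, pred_b):
    """Two-sided exact McNemar p-value for two classifiers on the same items."""
    # b: items classifier A gets right and classifier B gets wrong.
    b = sum(1 for t, a, c in zip(y_true, pred_a, pred_b) if a == t and c != t)
    # c: items classifier A gets wrong and classifier B gets right.
    c = sum(1 for t, a, c_ in zip(y_true, pred_a, pred_b) if a != t and c_ == t)
    n = b + c
    if n == 0:
        return 1.0  # the classifiers never disagree in correctness
    k = min(b, c)
    # Under H0, discordant items split as Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


def bonferroni_alpha(family_alpha, n_comparisons):
    """Per-comparison significance level under Bonferroni correction."""
    return family_alpha / n_comparisons


if __name__ == "__main__":
    # Toy data (hypothetical): A is right on all items, B is wrong on the first 8.
    y_true = [1] * 8 + [0] * 2
    pred_a = list(y_true)
    pred_b = [0] * 10
    p = mcnemar_exact_p(y_true, pred_a, pred_b)
    alpha = bonferroni_alpha(0.05, 4)  # assumed: 4 comparisons
    print(p, alpha, p < alpha)
```

Only the items on which the two classifiers disagree in correctness enter the test, which is why it suits comparing models evaluated on the same test posts.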
REFERENCES
 [1] Ahmet Aker, Alfred Sliwa, Fahim Dalvi, and Kalina Bontcheva. 2019. Rumour
     verification through recurring information and an inner-attention mechanism.
     Online Social Networks and Media 13 (2019), 100045.
 [2] Alan R Aronson. 2001. Effective mapping of biomedical text to the UMLS
     Metathesaurus: the MetaMap program. In Proceedings of the AMIA Symposium.
     American Medical Informatics Association, 17.
 [3] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine
     translation by jointly learning to align and translate. arXiv preprint
     arXiv:1409.0473 (2014).
 [4] Victoria Bobicev and Marina Sokolova. 2017. Inter-Annotator Agreement in
     Sentiment Analysis: Machine Learning Perspective. In RANLP. 97–102.
 [5] Craig J Bryan, Jonathan E Butner, Sungchoon Sinclair, Anna Belle O Bryan,
     Christina M Hesse, and Andree E Rose. 2018. Predictors of emerging suicide
     death among military personnel on social media networks. Suicide and
     Life-Threatening Behavior 48, 4 (2018), 413–430.
 [6] Patricia A Cavazos-Rehg, Melissa J Krauss, Shaina J Costello, Nina Kaiser,
     Elizabeth S Cahn, Ellen E Fitzsimmons-Craft, and Denise E Wilfley. 2019. “I just
     want to be skinny.”: A content analysis of tweets expressing eating disorder
     symptoms. PloS one 14, 1 (2019), e0207506.
 [7] Stevie Chancellor, Zhiyuan Lin, Erica L Goodman, Stephanie Zerwas, and Munmun
     De Choudhury. 2016. Quantifying and predicting mental illness severity in online
     pro-eating disorder communities. In Proceedings of the 19th ACM Conference on
     Computer-Supported Cooperative Work & Social Computing. ACM, 1171–1184.
 [8] Xuetong Chen, Martin D Sykora, Thomas W Jackson, and Suzanne Elayan. 2018.
     What about mood swings: Identifying depression on twitter with temporal
     measures of emotions. In Companion Proceedings of the The Web Conference 2018.
     International World Wide Web Conferences Steering Committee, 1653–1660.
 [9] Arman Cohan, Bart Desmet, Andrew Yates, Luca Soldaini, Sean MacAvaney, and
     Nazli Goharian. 2018. SMHD: a large-scale resource for exploring online language
     usage for multiple mental health conditions. arXiv preprint arXiv:1806.05258
     (2018).
[10] Arman Cohan, Sydney Young, Andrew Yates, and Nazli Goharian. 2017. Triaging
     content severity in online mental health forums. Journal of the Association for
     Information Science and Technology 68, 11 (2017), 2675–2689.
[11] Pricewaterhouse Coopers. 2015. The costs of eating disorders: Social, health and
     economic impacts. B-eat, Norwich (2015).
[12] Glen Coppersmith, Mark Dredze, and Craig Harman. 2014. Quantifying mental
     health signals in Twitter. In Proceedings of the workshop on computational
     linguistics and clinical psychology: From linguistic signal to clinical reality. 51–60.
[13] Munmun De Choudhury, Michael Gamon, Scott Counts, and Eric Horvitz. 2013.
     Predicting depression via social media. In Seventh international AAAI conference
     on weblogs and social media.
[14] Barbara Silveira Fraga, Ana Paula Couto da Silva, and Fabricio Murai. 2018.
     Online Social Networks in Health Care: A Study of Mental Disorders on Reddit.
     In 2018 IEEE/WIC/ACM International Conference on Web Intelligence (WI). IEEE,
     568–573.
[15] Sharath Chandra Guntuku, David B Yaden, Margaret L Kern, Lyle H Ungar, and
     Johannes C Eichstaedt. 2017. Detecting depression and mental illness on social
     media: an integrative review. Current Opinion in Behavioral Sciences 18 (2017),
     43–49.
[16] Michal Kosinski, David Stillwell, and Thore Graepel. 2013. Private traits and
     attributes are predictable from digital records of human behavior. Proceedings of
     the National Academy of Sciences 110, 15 (2013), 5802–5805.
[17] James Lake and Mason Spain Turner. 2017. Urgent need for improved mental
     health care and a more collaborative model of care. The Permanente Journal 21
     (2017).
[18] David E. Losada and Fabio Crestani. 2016. A Test Collection for Research on
     Depression and Language use. In Conference Labs of the Evaluation Forum. Springer,
     28–39. https://doi.org/10.1007/978-3-319-44564-9_3
[19] David E. Losada, Fabio Crestani, and Javier Parapar. 2019. Overview of eRisk
     2019: Early Risk Prediction on the Internet. In Experimental IR Meets
     Multilinguality, Multimodality, and Interaction. 10th International Conference of
     the CLEF Association, CLEF 2019. Springer International Publishing, Lugano,
     Switzerland.
[20] Mary L McHugh. 2012. Interrater reliability: the kappa statistic. Biochemia
     medica 22, 3 (2012), 276–282.
[21] Danielle Mowery, Hilary Smith, Tyler Cheney, Greg Stoddard, Glen Coppersmith,
     Craig Bryan, and Mike Conway. 2017. Understanding depressive symptoms
     and psychosocial stressors on Twitter: a corpus-based study. Journal of medical
     Internet research 19, 2 (2017), e48.
[22] Philip Resnik, William Armstrong, Leonardo Claudino, Thang Nguyen, Viet-An
     Nguyen, and Jordan Boyd-Graber. 2015. Beyond LDA: exploring supervised topic
     modeling for depression-related language in Twitter. In Proceedings of the 2nd
     Workshop on Computational Linguistics and Clinical Psychology: From Linguistic
     Signal to Clinical Reality. 99–107.
[23] Shaina J Sowles, Monique McLeary, Allison Optican, Elizabeth Cahn, Melissa J
     Krauss, Ellen E Fitzsimmons-Craft, Denise E Wilfley, and Patricia A
     Cavazos-Rehg. 2018. A content analysis of an online pro-eating disorder
     community on Reddit. Body image 24 (2018), 137–144.
[24] Christian Stab, Tristan Miller, and Iryna Gurevych. 2018. Cross-topic argument
     mining from heterogeneous sources using attention-based neural networks. arXiv
     preprint arXiv:1802.05758 (2018).
[25] Andrew Toulis and Lukasz Golab. 2017. Social Media Mining to Understand
     Public Mental Health. In VLDB Workshop on Data Management and Analytics for
     Medicine and Healthcare. Springer, 55–70.
[26] Tao Wang, Markus Brede, Antonella Ianni, and Emmanouil Mentzakis. 2018.
     Social interactions in online eating disorder communities: A network perspective.
     PloS one 13, 7 (2018).
[27] Andrew Yates, Arman Cohan, and Nazli Goharian. 2017. Depression and
     self-harm risk assessment in online forums. arXiv preprint arXiv:1709.01848 (2017).
[28] Wu Youyou, Michal Kosinski, and David Stillwell. 2015. Computer-based
     personality judgments are more accurate than those made by humans. Proceedings of
     the National Academy of Sciences 112, 4 (2015), 1036–1040.
[29] Sicheng Zhou, Yunpeng Zhao, Rubina Rizvi, Jiang Bian, Ann F Haynos, and Rui
     Zhang. 2019. Analysis of Twitter to Identify Topics Related to Eating Disorder
     Symptoms. In 2019 IEEE International Conference on Healthcare Informatics (ICHI).
     IEEE, 1–4.
[30] Ayah Zirikly, Philip Resnik, Ozlem Uzuner, and Kristy Hollingshead. 2019.
     CLPsych 2019 shared task: Predicting the degree of suicide risk in Reddit posts.
     In Proceedings of the Sixth Workshop on Computational Linguistics and Clinical
     Psychology. 24–33.