The psycho-env corpus: research articles annotated for knowledge discovery on correlating mental diseases and environmental factors

Hui Wang1, Quan Sun2, Anika Oellrich3, Honghan Wu3 and Richard Dobson3
1 Institute of Psychiatry, Psychology & Neuroscience, King’s College London, United Kingdom
2 Department of Informatics, King’s College London, United Kingdom
3 Department of Biostatistics and Medical Informatics, King’s College London, United Kingdom
{hui.1.wang, quan.sun, anika.oellrich, honghan.wu, richard.j.dobson}@kcl.ac.uk

Abstract

While the published scientific literature is used in a biomedical context such as building gene networks for disease gene discovery, it seems to be an undervalued resource with respect to mental illnesses. It has rarely been explored for the purpose of gaining psychopathology insights. This limits our capability of better understanding the underlying mechanisms of mental disorders. In this paper we describe the psycho-env corpus, which aims at annotating published studies to facilitate knowledge discovery on pathologies of mental diseases. Specifically, this corpus focuses on the correlations between mental diseases and environmental factors. We report the first preliminary work of psycho-env on annotating 20 articles about two mental illnesses (bipolar disorder and depression) and two particular environmental factors - light and sunlight. The corpus is available at https://github.com/KHP-Informatics/psycho-env.

1 Introduction

The success stories of cognitive computing (e.g., IBM Watson’s Jeopardy game) and deep learning (e.g., DeepMind’s AlphaGo) have sparked a massive wave of using artificial intelligence (AI) to improve numerous aspects of our daily life. Not surprisingly, healthcare is among the hottest areas. For example, IBM Watson is now utilised in decision support for lung cancer at the Memorial Sloan Kettering Cancer Center. However, AI models require data to derive a better understanding of the underlying mechanisms of diseases before they can really improve existing treatments or increase the recovery rate. Unfortunately, the lack of data is a major hurdle in many areas of the clinical domain, such as understanding the pathologies of mental illnesses.

As with other diseases, it has been established that mental illnesses are influenced in their origins and pathology by environmental factors. For example, it has been found that higher rates of schizophrenia occur in people of Caribbean origin than in the general population living in the UK [Fung et al., 2006]. To date, no complete list of environmental factors for all existing mental illnesses has been compiled that can be used for patient screening and planning treatment strategies [Rutter, 2005].

While the published scientific literature is used in a biomedical context such as building gene networks for disease gene discovery [Lage et al., 2007] or symptom networks of inheritable human disorders [Zhou et al., 2014], it seems to be an undervalued resource with respect to mental illnesses. It has rarely been explored for the purpose of gaining psychopathology insights. The potential of this resource lies within the amount and variety of data available: all journals that publish scientific results are covered, mostly since 1966, though some even date back to 1809. Although there is a body of work trying to identify “extended” phenotypes [Oellrich et al., 2016; Groza et al., 2015; Collier et al., 2015], none of these efforts included environmental factors, which are necessary to understand gene-phenotype relationships. In order to make use of this tremendous resource for finding potential environmental factors that (i) cause, (ii) contribute to and (iii) influence the origin and pathology of mental illnesses, (AI backed) automated methods are needed to digest the large quantities of existing data.

In order to facilitate this endeavour, data collection and annotation are required to identify relevant studies and the representation of environmental factors in the published literature. In this paper we describe the psycho-env corpus1, which is a manually curated dataset from the abstracts of 20 published studies on associations between two mental illnesses (bipolar disorder and depression) and one particular environmental factor - light. We believe this is the first effort to produce a curated corpus for knowledge discovery on associations between mental illness and environmental factors.

In the next section, we introduce the article selection, annotation process, annotation tool used and data format of annotations. In section 3, we describe the psycho-env corpus and discuss the limitations of this work. Finally, we conclude our work in section 4.

1 https://github.com/KHP-Informatics/psycho-env


2 Materials and methods

2.1 Article selection

In this preliminary study, we limited our scope to two types of mental disorders (i.e., bipolar disorder and depression) and one particular environmental factor - light (including sunlight and light in general). A manual retrieval method was adopted to search and select articles from various bibliographic databases and search engines. This was to ensure that we were able to identify the most relevant and representative investigations in this domain for the pilot study. The search and selection processes are briefly described in the following.

Literature search

The bibliographic databases and search engines used were MEDLINE (accessed via the PubMed search engine), Web of Science and Google Scholar. The aim was to look for relevant and representative research articles including clinical studies, case reports and clinical trials published during the period from May 1877 to May 2017.

The terms used for searching disorders included: bipolar, manic and depression, while terms for environmental factors included sunlight, “light therapy” and phototherapy. In some situations, extra constraints were added to narrow down the search results, e.g., clinical trial or case report.

In general, we found that PubMed combined with Google Scholar produced the most comprehensive list for our searches. For example, when searching sunlight and bipolar disorder, PubMed results contained 7 relevant hits, Google Scholar had 6, and Web of Science gave 5. All combined, there were 8 distinct relevant hits. The overlap between PubMed and Google Scholar was 5 - PubMed brought in 2 new results and Google Scholar added 1, while all results from Web of Science were covered by the other two search services.

Also, we found that the terminologies used in the literature are quite heterogeneous. For example, when denoting the usage of light in therapy, many different terms were used - bright-light therapy, light therapy and phototherapy. Therefore, we found it necessary to follow the reference graph of articles to check and include more articles or search terms.

Article selection

The studies were selected based on the following inclusion criteria:
    • published as an original article in a peer-reviewed journal
    • designed as a clinical trial, pilot study or case report
    • used light or sunlight as one of the investigation aspects or treatment alternatives
    • subjects were diagnosed with bipolar disorder or depression

Table 1 contains the distribution of the psycho-env articles.

Table 1: Articles in psycho-env corpus
    Mental disorders and light factor   Articles
    Sunlight to bipolar disorder        7
    Light to bipolar disorder           5
    Sunlight to depression              4
    Light to depression                 4

2.2 Annotation guidelines and process

When reviewing the articles, curators were asked to extract the following information to create a correlation between mental illnesses and environmental factors. When combined, the annotated items should be able to a) capture the most important aspects for deriving the correlations and b) form a concise description of the study. For well-defined clinical concepts like disorders, phenotypes and clinical measurements, the curators were asked to map them to UMLS (Unified Medical Language System)2 concepts using a UMLS search tool.

2 https://www.nlm.nih.gov/research/umls/

  1. The most important finding(s) of the study (e.g., Bipolar inpatients in E rooms (exposed to direct sunlight in the morning) had a mean 3.67-day shorter hospital stay than patients in W rooms [Benedetti et al., 2001]).
  2. Environmental factors. Although this preliminary study focused on light only, other types of environmental factors might need to be annotated as well because they were used in the study to derive or measure light factors, such as “latitudes 6.3 to 63.4 degrees from the equator”. Types of environmental factors include, but are not limited to: sunlight exposure, seasonal pattern, sunlight in springtime, natural light, 36 collection sites from 23 countries, and monthly climate variables.
  3. Environmental factor classification or measurement. This type of information includes the conceptual classification or quantity metrics for environmental factors investigated in the study, such as meteorological data on light intensity, the amount of sunlight exposure (i.e. insolation), maximum monthly increase in solar insolation, etc.
  4. Mental disorders. As mentioned earlier, two types of diseases were to be curated in this work: bipolar disorder and depression. Any diseases that are specific types of these diseases need to be annotated, which include, but are not limited to, bipolar I disorder, recurrent depression, non-seasonal depression, and rapid cycling bipolar.
  5. Investigation aspects of disorders - the aspects of disease pathologies or phenotypes that were investigated in the study, such as the onset of bipolar disorder, mood swings, length of hospitalization and plasma melatonin levels.
  6. Diagnosis methods (if available), such as Young Mania Rating Scale (YMRS).
  7. Patient cohort information including number of patients, patient demographic information, and control/case settings.


  8. Data collection methods and data sources, such as patient records and/or direct interviews and NASA Surface Meteorology and Solar Energy (SSE) database.
  9. Data analysis methodologies, such as Autoregressive Integrated Moving Average (ARIMA) method.

To the best of our knowledge, this is the first attempt to curate literature in this particular domain. Many aspects of the curation task were unknown to us beforehand, for example, what aspects of diseases were studied and how they were quantified, what terminologies were used to describe both clinical and environmental concepts, how environmental factors were measured, etc. Considering this underdeveloped nature, we adopted an agile curation process, which was designed to be adaptive and able to achieve continuous improvement. The idea was borrowed from agile software development. Technically, articles were partitioned into several subsets and curation was conducted on one subset at a time. After each curation step, a curator meetup was arranged to discuss problems encountered and lessons learned, and subsequently to propose amendments to the curation guidelines for improving the next rounds. We found this iterative process and efficient inter-curator communication very helpful and effective.

2.3 Annotation tool and annotation data format

A browser based annotation tool, PsychoEnv annotator, was used for annotating articles. The tool is backed with an automated article highlighting service described in [Wu et al., 2017]. PsychoEnv annotator is available on GitHub: https://github.com/KHP-Informatics/psycho-env. Figure 1 is a screenshot of PsychoEnv annotator being used for annotating a PubMed article. Features of the tool include:
   • Easy to set up: the annotation tool is a Chrome extension and the backend service is cloud based. Any article with an online XHTML version (e.g., PubMed article abstracts) is available for annotating immediately without the need for any preprocessing.
   • Easy to use: all curation operations are browser based, which minimises the learning curve of the curation process. In addition, the free text labelling allows project-wise acronyms, which speeds up the process.
   • Easy to share: associating annotations with web-addressed articles makes the annotations directly retrievable either for browser visualisation by using PsychoEnv annotator or for software agents by RESTful API calls.
   • Structure preserving: compared to most existing annotation tools, PsychoEnv annotator is distinguished by its capability of locating annotations on the XHTML DOM tree of the articles’ web pages (see annotation node locator in Table 2). This associates the annotations with semi-structured DOM trees and, in turn, makes these tree structures available as additional and easy-to-consume features for software models.

Figure 1: PsychoEnv Annotator User Interface: yellow highlights are the annotated texts; grey popups are labels (types) of the highlights; the popup dialog allows adding/changing labels and deleting annotations.

The annotation data format is a 5-element tuple as described in Table 2.

Table 2: Annotation data format
    Article URI               The web URL of the article’s web version, e.g.
                              https://www.ncbi.nlm.nih.gov/pubmed/24953482
    Annotation node locator   The locator is composed of two components:
                                1. a jQuery3 selector that locates the parent element of
                                   the text node, where the annotation appears;
                                2. an integer number that indicates the index of the text
                                   node within its parent’s children list.
                              For example: a locator can be
                              { Selector: ABSTRACTTEXT:eq(1), Index: 0 }
    Annotation offsets        The offsets have two integer components: start offset and
                              end offset, where start offset indicates the start position
                              of the annotated text in its annotation node’s text content
                              and end offset indicates the end position.
    Text                      The text content of the annotation
    Type                      The type of the annotation

3 Results and discussion

3.1 Corpus description

The psycho-env corpus resulted in 27 annotated text nodes that mark mental disorder mentions, 30 annotated text nodes that mark environmental factors, 25 annotated text nodes that mark environmental factor classifications/measurements and 23 annotated sentences marked as important findings. These numbers are summarized in Table 3 which also shows the average number of annotations and range of annotations per article in the 20 articles in the corpus.

The psycho-env corpus was selected to represent bipolar disorder and depression associated with two environmental factors - sunlight and (general) light. The aim was to have a similar coverage of each of the four sub-domains (as shown in Table 1) so that we could cover relatively diverse topics within a preliminary study. We summarised the major types of annotations in Table 4. Duplicated instances have been removed using a syntactic approach - string comparison. The first observation is that the environmental concepts seem to be very heterogeneous (1.4 per article for light factors and 2.05 per article for light measurements) even when we limited the scope to light only. However, a close inspection of

the list of instances revealed that many different terms might mean the same concepts. This suggests the necessity of having a consistent terminology so that different mentions of the same instances can be mapped. The second interesting observation is that the numbers of specific disorders, phenotypes and their measurements are relatively large considering only 2 disorders were selected. This suggests very little overlap between studies, which might imply that the curation could be very efficient in terms of delivering new knowledge.

Table 3: Three major annotation types; averages were computed over the set of articles that contained that annotation type.
    Type                                          # anns   # articles   Avg.
    Mental disorders                              39       20           1.95
    Environmental factors                         33       19           1.73
    Environmental classification / measurement    43       18           2.38

Table 4: Major annotation types and their distinct instance numbers.
    Annotation Type                               Number
    Mental disorders                              31
    Disorder phenotypes                           21
    Phenotype measurements                        17
    Diagnosis criteria                            7
    Environmental factors (Light)                 28
    Environmental factors (Other)                 6
    Light/Sunlight classification / measurement   41
    Analysis methodologies                        7

3.2 Discussion

The main purpose of this preliminary study was to conduct a small scale case study on limited types of mental illnesses and environmental factors. Therefore, the number of documents annotated is rather small. Nevertheless, it resulted in valuable experience, which gave us a good understanding of the quality and representation of environmental factors and their associations with mental disorders. In particular, the typed annotations summarised in Table 4 can be used to populate controlled vocabularies or ontologies to represent knowledge in this domain.

The corpus covers four subdomains of associations of mental disorders and environmental factors as depicted in Table 1. The authors are confident that they have covered the most representative studies in the first three subdomains. However, regarding the last subdomain - Light to depression, due to a relatively large body of available studies, the selected four articles might not cover the most representative studies.

4 Conclusion

In order to facilitate knowledge discovery on the pathologies of mental disorders, we initiated work on the psycho-env corpus, which is dedicated to curating the associations between mental illnesses and environmental factors from published literature. The first version reported in this paper focused on bipolar disorder and depression associated with light, and was curated from abstracts of 20 articles. Both the annotation tool and the corpus are open source and publicly available.

Acknowledgments

The work was supported by NIHR Biomedical Research Centre for Mental Health, the Biomedical Research Unit for Dementia at the South London and Maudsley NHS Foundation Trust and King’s College London, and European Union’s Horizon 2020 research and innovation programme under grant agreement No 644753 (KConnect).

References

[Benedetti et al., 2001] Francesco Benedetti, Cristina Colombo, Barbara Barbini, Euridice Campori, and Enrico Smeraldi. Morning sunlight reduces length of hospitalization in bipolar depression. Journal of affective disorders, 62(3):221–223, 2001.
[Collier et al., 2015] Nigel Collier, Anika Oellrich, and Tudor Groza. Concept selection for phenotypes and diseases using learn to rank. Journal of biomedical semantics, 6(1):24, 2015.
[Fung et al., 2006] WL Alan Fung, Dinesh Bhugra, and Peter B Jones. Ethnicity and mental health: the example of schizophrenia in migrant populations across europe. Psychiatry, 5(11):396–401, 2006.
[Groza et al., 2015] Tudor Groza, Sebastian Köhler, Sandra Doelken, Nigel Collier, Anika Oellrich, Damian Smedley, Francisco M Couto, Gareth Baynam, Andreas Zankl, and Peter N Robinson. Automatic concept recognition using the human phenotype ontology reference and test suite corpora. Database, 2015:bav005, 2015.
[Lage et al., 2007] Kasper Lage, E Olof Karlberg, Zenia M Størling, Páll I Olason, Anders G Pedersen, Olga Rigina, Anders M Hinsby, Zeynep Tümer, Flemming Pociot, Niels Tommerup, et al. A human phenome-interactome network of protein complexes implicated in genetic disorders. Nature biotechnology, 25(3):309–316, 2007.
[Oellrich et al., 2016] Anika Oellrich, Nigel Collier, Tudor Groza, Dietrich Rebholz-Schuhmann, Nigam Shah, Olivier Bodenreider, Mary Regina Boland, Ivo Georgiev, Hongfang Liu, Kevin Livingston, et al. The digital revolution in phenotyping. Briefings in bioinformatics, 17(5):819–830, 2016.
[Rutter, 2005] Michael Rutter. How the environment affects mental health. The British Journal of Psychiatry, 186(1):4–6, 2005.
[Wu et al., 2017] Honghan Wu, Anika Oellrich, Christine Girges, Bernard de Bono, Tim J.P. Hubbard, and Richard J.B. Dobson. Automated PDF highlighting to support faster curation of literature for Parkinson’s and Alzheimer’s disease. Database, 2017(1):bax027, 2017.
[Zhou et al., 2014] XueZhong Zhou, Jörg Menche, Albert-László Barabási, and Amitabh Sharma. Human symptoms–disease network. Nature communications, 5, 2014.

