The psycho-env corpus: research articles annotated for knowledge discovery on correlating mental diseases and environmental factors

Hui Wang¹, Quan Sun², Anika Oellrich³, Honghan Wu³ and Richard Dobson³
¹ Institute of Psychiatry, Psychology & Neuroscience, King’s College London, United Kingdom
² Department of Informatics, King’s College London, United Kingdom
³ Department of Biostatistics and Medical Informatics, King’s College London, United Kingdom
{hui.1.wang, quan.sun, anika.oellrich, honghan.wu, richard.j.dobson}@kcl.ac.uk

Abstract

While the published scientific literature is used in a biomedical context such as building gene networks for disease gene discovery, it seems to be an undervalued resource with respect to mental illnesses. It has rarely been explored for the purpose of gaining psychopathology insights. This limits our capability to better understand the underlying mechanisms of mental disorders. In this paper we describe the psycho-env corpus, which aims at annotating published studies to facilitate knowledge discovery on pathologies of mental diseases. Specifically, this corpus focuses on the correlations between mental diseases and environmental factors. We report the first preliminary work of psycho-env on annotating 20 articles about two mental illnesses (bipolar disorder and depression) and two particular environmental factors - light and sunlight. The corpus is available at https://github.com/KHP-Informatics/psycho-env.

1 Introduction

The success stories of cognitive computing (e.g., IBM Watson’s Jeopardy game) and deep learning (e.g., DeepMind’s AlphaGo) have sparked a massive wave of using artificial intelligence (AI) to improve numerous aspects of our daily life. Not surprisingly, healthcare is among the hottest areas. For example, IBM Watson is now utilised in decision support for lung cancer at the Memorial Sloan Kettering Cancer Center. However, AI models require data to derive a better understanding of the underlying mechanisms of diseases before they can really improve existing treatments or increase the recovery rate. Unfortunately, the lack of data is a major hurdle in many areas of the clinical domain, such as understanding the pathologies of mental illnesses.

As with other diseases, it has been established that mental illnesses are influenced in their origins and pathology by environmental factors. For example, it has been found that higher rates of schizophrenia occur in people of Caribbean origin than in the general population living in the UK [Fung et al., 2006]. To date, no complete list of environmental factors for all existing mental illnesses has been compiled that can be used for patient screening and planning treatment strategies [Rutter, 2005].

While the published scientific literature is used in a biomedical context such as building gene networks for disease gene discovery [Lage et al., 2007] or symptom networks of inheritable human disorders [Zhou et al., 2014], it seems to be an undervalued resource with respect to mental illnesses. It has rarely been explored for the purpose of gaining psychopathology insights. The potential of this resource lies within the amount and variety of data available: all journals that publish scientific results are covered mostly since 1966, though some even date back to 1809. Although there is a body of work trying to identify “extended” phenotypes [Oellrich et al., 2016; Groza et al., 2015; Collier et al., 2015], none of these efforts included environmental factors, which are necessary to understand gene-phenotype relationships. In order to make use of this tremendous resource for finding potential environmental factors that (i) cause, (ii) contribute to and (iii) influence the origin and pathology of mental illnesses, (AI backed) automated methods are needed to digest the large quantities of existing data.

In order to facilitate this endeavour, data collection and annotation are required to identify relevant studies and the representation of environmental factors in the published literature. In this paper we describe the psycho-env corpus¹, which is a manually curated dataset from the abstracts of 20 published studies on associations between two mental illnesses (bipolar disorder and depression) and one particular environmental factor - light. We believe this is the first effort to produce a curated corpus for knowledge discovery on associations between mental illness and environmental factors.

In the next section, we introduce the article selection, annotation process, annotation tool used and data format of annotations. In section 3, we describe the psycho-env corpus and discuss the limitations of this work. Finally, we conclude our work in section 4.

¹ https://github.com/KHP-Informatics/psycho-env

2 Materials and methods

2.1 Article selection

In this preliminary study, we limited our scope to two types of mental disorders (i.e., bipolar and depression) and one particular environmental factor - light (including sunlight and light in general). A manual retrieval method was adopted to search and select articles from various bibliographic databases and search engines. This was to ensure that we were able to identify the most relevant and representative investigations in this domain for the pilot study. The search and selection processes are briefly described below.

Literature search

The bibliographic databases and search engines used were MEDLINE (accessed via the PubMed search engine), Web of Science and Google Scholar. The aim was to look for relevant and representative research articles including clinical studies, case reports and clinical trials published during the period from May 1877 to May 2017.

The terms used for searching disorders included: bipolar, manic and depression, while terms for environmental factors included sunlight, “light therapy” and phototherapy. In some situations, extra constraints were added to narrow down the search results, e.g., clinical trial, case reports, etc.

In general, we found that PubMed combined with Google Scholar can produce the most comprehensive list for our searches. For example, when searching sunlight and bipolar disorder, PubMed results contained 7 relevant hits, Google Scholar had 6, and Web of Science gave 5. All combined, there were 8 distinct relevant hits. The overlap between PubMed and Google Scholar was 5: PubMed brought in 2 results not found by Google Scholar and Google Scholar added 1 not found by PubMed (7 + 6 - 5 = 8), while all results from Web of Science were covered by the other two search services.

Also, we found the terminologies used in the literature to be quite heterogeneous. For example, when denoting the usage of light in therapy, many different terms were used - bright-light therapy, light therapy and phototherapy. Therefore, we found it necessary to follow the reference graph of articles to check and include more articles or search terms.

Article selection

The studies were selected based on the following inclusion criteria:
• published as an original article in a peer-reviewed journal
• designed as a clinical trial, pilot study or case report
• used light or sunlight as one of the investigation aspects or treatment alternatives
• subjects were diagnosed with bipolar disorder or depression

Table 1 contains the distribution of the psycho-env articles.

Table 1: Articles in psycho-env corpus

Mental disorders and light factor    Articles
Sunlight to bipolar disorder         7
Light to bipolar disorder            5
Sunlight to depression               4
Light to depression                  4

2.2 Annotation guidelines and process

When reviewing the articles, curators were asked to extract the following information to create a correlation between mental illnesses and environmental factors. When combined together, the annotated items should be able to a) capture the most important aspects for deriving the correlations and b) form a concise description of the study. For well-defined clinical concepts like disorders, phenotypes and clinical measurements, the curators were asked to map them to UMLS (Unified Medical Language System)² concepts using a UMLS search tool.

1. The most important finding(s) of the study (e.g., Bipolar inpatients in E rooms (exposed to direct sunlight in the morning) had a mean 3.67-day shorter hospital stay than patients in W rooms [Benedetti et al., 2001]).
2. Environmental factors. Although this preliminary study focused on light only, other types of environmental factors might need to be annotated as well because they were used in the study to derive or measure light factors, such as “latitudes 6.3 to 63.4 degrees from the equator”. Types of environmental factors include, but are not limited to: sunlight exposure, seasonal pattern, sunlight in springtime, natural light, 36 collection sites from 23 countries, and monthly climate variables.
3. Environmental factor classification or measurement. This type of information includes the conceptual classification or quantity metrics for environmental factors investigated in the study, such as meteorological data on light intensity, the amount of sunlight exposure (i.e. insolation), maximum monthly increase in solar insolation, etc.
4. Mental disorders. As mentioned earlier, two types of diseases were to be curated in this work: bipolar and depression disorders. Any diseases that are specific types of these diseases need to be annotated, including, but not limited to, bipolar I disorder, recurrent depression, non-seasonal depression, and rapid cycling bipolar.
5. Investigation aspects of disorders - the aspects of disease pathologies or phenotypes that were investigated in the study, such as the onset of bipolar disorder, mood swings, length of hospitalization and plasma melatonin levels.
6. Diagnosis methods (if available), such as the Young Mania Rating Scale (YMRS).
7. Patient cohort information including number of patients, patient demographic information, and control/case settings.
8. Data collection methods and data sources, such as patient records and/or direct interviews and the NASA Surface Meteorology and Solar Energy (SSE) database.
9. Data analysis methodologies, such as the Autoregressive Integrated Moving Average (ARIMA) method.

To the best of our knowledge, this is the first attempt to curate literature in this particular domain. A large part of the curation was unknown to us at the outset, for example, what aspects of diseases were studied and how they were quantified, what terminologies were used to describe both clinical and environmental concepts, and how environmental factors were measured.

² https://www.nlm.nih.gov/research/umls/

Figure 1: PsychoEnv Annotator User Interface: yellow highlights are the annotated texts; grey popups are labels (types) of the highlights; the popup dialog allows labels to be added or changed and annotations to be deleted.

Table 2: Annotation data format

Article URI
    The web URL of the article’s web version, e.g. https://www.ncbi.nlm.nih.gov/pubmed/24953482
Annotation node locator
    The locator is composed of two components:
    1. a jQuery selector that locates the parent element of the text node where the annotation appears;
    2. an integer number that indicates the index of the text node within its parent’s children list.
    For example, a locator can be { Selector: ABSTRACTTEXT:eq(1), Index: 0 }
Annotation offsets
    The offsets have two integer components: start offset and end offset, where start offset indicates the start position of the annotated text in its annotation node’s text content and end offset indicates the end position.
Text
    The text content of the annotation
Type
    The type of the annotation
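To make the format in Table 2 concrete, the following is a minimal Python sketch of one annotation record, not the PsychoEnv annotator's actual implementation. The class and field names are illustrative assumptions, the end offset is assumed to be exclusive (the table does not say), and the sentence used as the text node is invented for the example rather than taken from any article.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class NodeLocator:
    """Where the annotated text node sits in the article's DOM tree."""
    selector: str  # jQuery selector for the parent element of the text node
    index: int     # index of the text node within the parent's children list


@dataclass(frozen=True)
class Annotation:
    """The five elements of Table 2 (field names are illustrative only)."""
    article_uri: str
    locator: NodeLocator
    offsets: Tuple[int, int]  # (start, end); end is ASSUMED to be exclusive
    text: str
    type: str

    def is_consistent(self, node_text: str) -> bool:
        """True if the offsets select exactly `text` in the node's content."""
        start, end = self.offsets
        return node_text[start:end] == self.text


# Invented sentence standing in for a text node; not quoted from any article.
node_text = "Light therapy was given to patients with bipolar disorder."
mention = "bipolar disorder"
start = node_text.index(mention)

annotation = Annotation(
    article_uri="https://www.ncbi.nlm.nih.gov/pubmed/24953482",
    locator=NodeLocator(selector="ABSTRACTTEXT:eq(1)", index=0),
    offsets=(start, start + len(mention)),
    text=mention,
    type="Mental disorders",
)

print(annotation.is_consistent(node_text))  # True
```

Storing the text alongside the offsets is redundant, but it allows a consumer to detect when a web page has changed since the annotation was made, because the offsets then no longer select the recorded text.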
Because so much about curating this domain was unknown at the outset, we adopted an agile curation process, which was designed to be adaptive and able to achieve continuous improvement. The idea was borrowed from agile software development. Technically, articles were partitioned into several subsets and curation was conducted on one subset at a time. After each curation step, a curator meetup was arranged to discuss problems encountered and lessons learned, and subsequently to propose amendments to the curation guidelines for improving the next rounds. We found this iterative process and efficient inter-curator communication very helpful and effective.

2.3 Annotation tool and annotation data format

A browser based annotation tool, PsychoEnv annotator, was used for annotating articles. The tool is backed by an automated article highlighting service described in [Wu et al., 2017]. PsychoEnv annotator is available on GitHub: https://github.com/KHP-Informatics/psycho-env. Figure 1 is a screenshot of PsychoEnv annotator being used for annotating a PubMed article. Features of the tool include:

• Easy to set up: the annotation tool is a Chrome extension and the backend service is cloud based. Any article with an online XHTML version (e.g., PubMed article abstracts) is available for annotating immediately without the need for any preprocessing.
• Easy to use: all curation operations are browser based, which minimises the learning curve of the curation process. In addition, the free text labelling allows project-wise acronyms, which speeds up the process.
• Easy to share: associating annotations with web-addressed articles makes the annotations directly retrievable either for browser visualisation by using PsychoEnv annotator or for software agents by RESTful API calls.
• Structure preserving: compared to most existing annotation tools, PsychoEnv annotator is distinguished by its capability of locating annotations on the XHTML DOM tree of the articles’ web pages (see annotation node locator in Table 2). This associates the annotations with semi-structured DOM trees and, in turn, brings these tree structures as additional and easy-to-consume features to software models.

The annotation data format is a 5-element tuple as described in Table 2.

3 Results and discussion

3.1 Corpus description

The psycho-env corpus resulted in 27 annotated text nodes that mark mental disorder mentions, 30 annotated text nodes that mark environmental factors, 25 annotated text nodes that mark environmental factor classifications/measurements and 23 annotated sentences marked as important findings. These numbers are summarized in Table 3, which also shows the average number of annotations and range of annotations per article in the 20 articles in the corpus.

Table 3: Three major annotation types; averages were computed over the set of articles that contained that annotation type.

Type                                          # anns    # articles    Avg.
Mental disorders                              39        20            1.95
Environmental factors                         33        19            1.73
Environmental classification / measurement    43        18            2.38

The psycho-env corpus was selected to represent bipolar and depression disorders associated with two environmental factors - sunlight and (general) light. The aim was to have a similar coverage on each of the four sub-domains (as shown in Table 1) so that we could cover relatively diverse topics within a preliminary study. We summarised the major types of annotations in Table 4. Duplicated instances have been removed using a syntactic approach (string comparison). The first observation is that the environmental concepts seem to be very heterogeneous (1.4 per article for light factors and 2.05 per article for light measurements) even when we limited the scope to light only. However, a close inspection of the list of instances revealed that many different terms might mean the same concepts. This suggests the necessity of having a consistent terminology so that different mentions of the same instances can be mapped. The second interesting observation is that the numbers of specific disorders, phenotypes and their measurements are relatively large considering only 2 disorders were selected. This suggests very little overlap between studies, which might imply that the curation could be very efficient in terms of delivering new knowledge.

Table 4: Major annotation types and their distinct instance numbers.

Annotation Type                                Number
Mental disorders                               31
Disorder phenotypes                            21
Phenotype measurements                         17
Diagnosis criteria                             7
Environmental factors (Light)                  28
Environmental factors (Other)                  6
Light/Sunlight classification / measurement    41
Analysis methodologies                         7

3.2 Discussion

The main purpose of this preliminary study is to conduct a small scale case study on limited types of mental illnesses and environmental factors. Therefore, the number of documents annotated is rather small. But it has resulted in very valuable experience, which gave us a good understanding of the quality and representation of environmental factors and their associations with mental disorders. Particularly, the typed annotations as summarised in Table 4 can be used to populate controlled vocabularies or ontologies to represent knowledge in this domain.

The corpus covers four subdomains of associations of mental disorders and environmental factors as depicted in Table 1. The authors are confident that they have covered the most representative studies in the top 3 subdomains. However, regarding the last subdomain - Light to depression - due to a relatively large body of available studies, the selected four articles might not cover the most representative studies.

4 Conclusion

In order to facilitate knowledge discovery on the pathologies of mental disorders, we initiated work on the psycho-env corpus, which is dedicated to curating the associations between mental illnesses and environmental factors from published literature. The first version reported in this paper focused on bipolar and depression disorders associated with light, and was curated from abstracts of 20 articles. Both the annotation tool and the corpus are open source and publicly available.

Acknowledgments

The work was supported by the NIHR Biomedical Research Centre for Mental Health, the Biomedical Research Unit for Dementia at the South London and Maudsley NHS Foundation Trust and King’s College London, and the European Union’s Horizon 2020 research and innovation programme under grant agreement No 644753 (KConnect).

References

[Benedetti et al., 2001] Francesco Benedetti, Cristina Colombo, Barbara Barbini, Euridice Campori, and Enrico Smeraldi. Morning sunlight reduces length of hospitalization in bipolar depression. Journal of affective disorders, 62(3):221–223, 2001.

[Collier et al., 2015] Nigel Collier, Anika Oellrich, and Tudor Groza. Concept selection for phenotypes and diseases using learn to rank. Journal of biomedical semantics, 6(1):24, 2015.

[Fung et al., 2006] WL Alan Fung, Dinesh Bhugra, and Peter B Jones. Ethnicity and mental health: the example of schizophrenia in migrant populations across Europe. Psychiatry, 5(11):396–401, 2006.

[Groza et al., 2015] Tudor Groza, Sebastian Köhler, Sandra Doelken, Nigel Collier, Anika Oellrich, Damian Smedley, Francisco M Couto, Gareth Baynam, Andreas Zankl, and Peter N Robinson. Automatic concept recognition using the human phenotype ontology reference and test suite corpora. Database, 2015:bav005, 2015.

[Lage et al., 2007] Kasper Lage, E Olof Karlberg, Zenia M Størling, Páll I Olason, Anders G Pedersen, Olga Rigina, Anders M Hinsby, Zeynep Tümer, Flemming Pociot, Niels Tommerup, et al. A human phenome-interactome network of protein complexes implicated in genetic disorders. Nature biotechnology, 25(3):309–316, 2007.

[Oellrich et al., 2016] Anika Oellrich, Nigel Collier, Tudor Groza, Dietrich Rebholz-Schuhmann, Nigam Shah, Olivier Bodenreider, Mary Regina Boland, Ivo Georgiev, Hongfang Liu, Kevin Livingston, et al. The digital revolution in phenotyping. Briefings in bioinformatics, 17(5):819–830, 2016.

[Rutter, 2005] Michael Rutter. How the environment affects mental health. The British Journal of Psychiatry, 186(1):4–6, 2005.

[Wu et al., 2017] Honghan Wu, Anika Oellrich, Christine Girges, Bernard de Bono, Tim J.P. Hubbard, and Richard J.B. Dobson. Automated PDF highlighting to support faster curation of literature for Parkinson’s and Alzheimer’s disease. Database, 2017(1):bax027, 2017.

[Zhou et al., 2014] XueZhong Zhou, Jörg Menche, Albert-László Barabási, and Amitabh Sharma. Human symptoms–disease network. Nature communications, 5, 2014.