=Paper= {{Paper |id=Vol-1650/smbm16GutierrezSacristan |storemode=property |title=Text Mining and Expert Curation to Develop a Database on Psychiatric Diseases and their Genes |pdfUrl=https://ceur-ws.org/Vol-1650/smbm16GutierrezSacristan.pdf |volume=Vol-1650 |authors=Alba Gutiérrez Sacristán,Álex Bravo,Marta Portero,Olga Valverde,Antonio Armario,M.C. Blanco-Gandía,Adriana Farré,Lierni Fernández-Ibarrondo,Francina Fonseca,Jesús Giraldo,Angela Leis,Anna Mané,M.A. Mayer,Sandra Montagud-Romero,Roser Nadal,Jordi Ortiz,Francisco Javier Pavon,Ezequiel Perez,Marta Rodríguez-Arias,Antonia Serrano,Marta Torrens,Vincent Warnault,Ferran Sanz,Laura I. Furlong |dblpUrl=https://dblp.org/rec/conf/smbm/Gutierrez-Sacristan16 }} ==Text Mining and Expert Curation to Develop a Database on Psychiatric Diseases and their Genes== https://ceur-ws.org/Vol-1650/smbm16GutierrezSacristan.pdf
Text mining and expert curation to develop a database on psychiatric
                     diseases and their genes
Alba Gutiérrez-Sacristán               Àlex Bravo                     Marta Portero
         GRIB                               GRIB                           GReNeC
      IMIM - UPF                        IMIM - UPF                        IMIM - UPF

     Olga Valverde                  Antonio Armario               M.Carmen Blanco-Gandía
       GReNeC                      Universitat Autònoma           Universitat de Valencia
      IMIM - UPF                       de Barcelona

     Adriana Farré            Lierni Fernández-Ibarrondo             Francina Fonseca
    Parc de Salut Mar           Program in Cancer Research             Parc de Salut Mar
          UAB                             IMIM                               UAB

     Jesús Giraldo                     Angela Leis                      Anna Mané
  Universitat Autònoma                   GRIB                         Parc de Salut Mar
      de Barcelona                      IMIM - UPF                           UAB

   Miguel A. Mayer              Sandra Montagud-Romero                    Roser Nadal
       GRIB                       Universitat de Valencia          Institut de Neurociències
     IMIM - UPF                                                               UAB

      Jordi Ortiz                 Francisco Javier Pavón               Ezequiel Perez
   School of Medicine             Instituto de Investigación          Parc de Salut Mar
          UAB                       Biomédica de Málaga                    UAB

Marta Rodríguez-Arias                Antonia Serrano                   Marta Torrens
Universitat de Valencia           Instituto de Investigación          Parc de Salut Mar
                                    Biomédica de Málaga                    UAB

   Vincent Warnault                     Ferran Sanz                    Laura I. Furlong
       GReNeC                              GRIB                             GRIB
      IMIM - UPF                        IMIM - UPF                       IMIM - UPF

Abstract

During the last years there has been growing research in the genetics of psychiatric diseases. However, there is still a limited understanding of the cellular and molecular mechanisms leading to these diseases, which has hampered the application of this wealth of knowledge in clinical practice to improve diagnosis and treatment of psychiatric patients. PsyGeNET (http://www.psygenet.org/) has been developed to improve the understanding of psychiatric diseases, by facilitating access to the vast amount of their genetic information in a structured manner and providing a set of analysis and visualization tools. In this communication we describe the protocol we put in place for the sustainable update of this knowledge resource. It includes the recruitment of a team of experts to perform the curation of the data previously extracted by text mining. Annotation guidelines and a web-based annotation tool were developed to support the curators' tasks. A curation workflow was designed including a pilot phase and two rounds of curation and analysis phases. We report the results of the application of this workflow to the task of curation of gene-disease associations for PsyGeNET, including the analysis of the inter-annotator agreement, and suggest that this model is a suitable approach for the sustainable development and update of knowledge resources.


1   Introduction

Psychiatric disorders have a great impact on morbidity and mortality (Murray and Lopez, 2013; Whiteford et al., 2013). According to the World Health Organization (WHO), one of every four people will suffer mental or neurological disorders (Kessler et al., 2005; Baldacchino et al., 2009). It has been suggested that most psychiatric disorders display a strong genetic component (Sullivan et al., 2012). During the last years there has been growing research in the genetics of psychiatric disorders, and its findings have been reported in hundreds of thousands of publications. This literature constitutes a rich and diverse source of information essential for any psychiatric research line. However, the huge amount and continuous growth of the number of publications prevent scientists from efficiently exploring such a large volume of data.

PsyGeNET (Psychiatric disorders Gene association NETwork) (Gutiérrez-Sacristán et al., 2015) has been developed to establish a curated resource on psychiatric diseases and their associated genes. PsyGeNET integrates knowledge extracted from the scientific literature by text mining, which has been curated by experts in psychiatry and neurosciences.

In this communication we describe the process put in place for the update of the PsyGeNET database. This involved i) the recruitment of a team of experts to curate the information extracted by text mining; ii) the extraction of information on gene-disease associations (GDAs) from the literature using the text mining system BeFree (Bravo et al., 2015); iii) the development of a curation workflow (Figure 1); iv) the development of a web-based annotation tool in order to facilitate the curation task; and v) the definition of detailed guidelines to assist the curation task.

In particular, we present the results of the Pilot phase and Curation I phase of the workflow, including the analysis of the inter-annotator agreement, and suggest that this protocol is a suitable approach for the sustainable development and update of knowledge resources.

Figure 1: PsyGeNET curation workflow

2   Methods

2.1   Curation team

A team of 22 experts from different domains (such as psychiatry, neuroscience, medicine, psychology and biology) was recruited from the Spanish Network of Addiction and other collaborators of the coordination team (Research Group on Integrative Biomedical Informatics (GRIB)) to participate in the curation process. The incentives for participation were to be part of the PsyGeNET team and to be co-authors in the publication(s) originated from the project. The curators were trained during an initial session where the PsyGeNET annotation guidelines were presented and then during the Pilot phase. Communication with the coordination team through e-mail was established to resolve questions during the curation process. In addition, online and face-to-face meetings were organized after key points of the curation process (analysis phases) to share experiences among all curators and solve curation issues.

2.2   Defining the Psychiatric Diseases in terms of UMLS concepts

In PsyGeNET, the psychiatric diseases are identified by UMLS Metathesaurus concepts. Three experts reviewed the terminology included in more than 2,000 UMLS concepts related to the psychiatric disorders of interest, and assigned them to the following psychiatric disease categories (DCs): 1) Depressive disorders, 2) Bipolar disorders and related disorders, 3) Substance/drug induced depressive disorder, 4) Schizophrenia spectrum and other psychotic disorders, 5) Drug-induced psychosis, 6) Alcohol use disorders, 7) Cannabis use disorders and 8) Cocaine use disorders. This information was used both for text mining of gene-disease associations by BeFree (see below) and for identification of disease classes during the curation.

2.3   Text mining of gene-disease associations

BeFree, a text-mining tool that exploits morphosyntactic information from the text to identify relationships between biomedical concepts, was used to identify associations between genes and the psychiatric diseases of interest from a corpus of ∼1M MEDLINE abstracts focused on human genetic diseases. The diseases were identified using the UMLS concepts that define each disorder, whereas an in-house developed gene dictionary was used to identify the genes, as described in (Bravo et al., 2015). The identified disorders were grouped according to the eight psychiatric disease categories (described in section 2.2). As a result, BeFree identified 6,349 associations between genes and DCs (gene-disease category association or GDCA) supported by 4,065 publications. A subset of the associations was initially evaluated by our group to identify the most frequent text mining errors. For instance, the word depression is often used in contexts other than psychiatry. This initial evaluation was performed to identify this kind of error and improve the text mining system before the identification of GDCAs. We then applied a number of filters to reduce the size of the curation task and make it feasible with the resources at hand. For instance, we removed associations already present in curated resources (DisGeNET (Piñero et al., 2015) and the previous release of PsyGeNET), kept only those associations published recently (after the year 2000) in journals with Impact Factor greater than 1, and we did not take into account reviews. After this process we obtained 2,507 GDCAs, which were submitted to expert curation.

2.4   Annotation Guidelines

The PsyGeNET annotation guidelines were developed with the purpose of guiding the manual curation process. The guidelines included the definition of a gene-disease association, how it should be classified according to the level of evidence, what information should be considered for the annotation, and real examples of the association types. Finally, they also included a tutorial on how to use the PsyGeNET annotation tool. The goal of the curation was to validate the association of a gene to a particular disease. We consider that a gene is associated to a disease if the gene or the product of the gene plays a role in the disease pathogenesis, or is a marker for the disease. The PsyGeNET annotation tool was used to help in this curation task. For each gene-disease association identified by text mining, the annotation tool displayed the evidence that supports the association, more concretely the abstracts and the sentences in which the gene-disease association is stated. Then, by inspecting the evidence, the curator had to determine the type of association (Association, No Association, False, Error and Not Clear). The types of association are described as follows: i) Association: the publication clearly states that there is an association between the gene and the disease - it can be a causative association (e.g. a mutation in the gene causes the disease), or a marker association (e.g. a SNP in the gene identified in a GWAS study); ii) No Association: the publication clearly states that there is no association between the gene and the disease (e.g. a publication that reports a negative finding on the association between the gene and the disease); iii) False Association: the gene and the disease are found co-occurring in a sentence, but there is no clear evidence from the publication that the gene plays a role or is a marker of the disease; and iv) Error: when there is a text mining error in the correct identification of the gene and/or the disease.

Table 1 shows some examples of the association types considered in PsyGeNET. In the example for False Association, the study is on children that do not meet the criteria for the disease (FASD), therefore the association between the gene and the disease has to be classified as false. In the example of Error, note that in this abstract OCT is not a gene but an acronym of optical coherence tomography (OCT). The document describing the guidelines is available on the PsyGeNET web page (http://www.psygenet.org/Psytool_manual_v5.0.pdf). Here we provide the general instructions for the curation of the gene-disease associations in PsyGeNET:

  1. The curation has to be performed at abstract
     level. For those cases in which the abstract is not clear enough, the full text article should be reviewed.
  2. Annotate only relationships between the gene and disease. Other types of relationships should not be annotated.
  3. Annotate relationships according to the provided categories: association, no association, error, and false.

Association Type   PMID       Sentence
Association        267012     The D-amino acid oxidase activator gene (G72) has been found associated with several psychiatric disorders such as schizophrenia, major depression, and bipolar disorder.
No Association     17692928   There was no association between TPH-2 gene variants and MD in the same population that had shown a strong association with TPH-1.
False              25225167   Two children referred for suspicion of FASD (neither of which were exposed to alcohol or met the criteria for FASD) had a pathogenic microstructural chromosomal rearrangement (del16p11.2 of 542 KB and dup1q44 of 915 KB).
Error              21174530   OCT demonstrated loss of foveal depression with distortion of the foveal architecture in the macula in all patients

Table 1: Examples of Association types. Disease and genes that have to be evaluated are highlighted in the sentence in green and orange, respectively.

2.5   Annotation tool

A user-friendly web-based tool was developed to assist both the definition of the psychiatric disorders of interest and curation of gene-disease associations. The tool was designed to support a multi-user environment by user and password assignment. Figure 2 shows a screenshot of the tool for the curation of GDCAs. The tool shows the GDCA to be evaluated (in this example the association between the ETNPPL gene and Bipolar disorders class), and one publication at a time. The curator has to review the publication and decide if the association of the gene and the disease class holds, and decide on the association type using the drop-down menu. To aid the curator's task, the tool displays the terminology for the gene according to standard resources (NCBI Gene, UniProt and HGNC), and highlights the sentences in which BeFree identified an association between the gene and the disease under consideration. If required, the curator can access the full text article using the PubMed hyperlink. The curator is also asked to select a sentence that best represents their validation decision, if available. This was implemented in order to collect example sentences to improve the performance of the BeFree system. In addition, the tool also provides a progress bar indicating the number of validations and associations performed by the expert, and allows previous annotations to be reviewed. We use the term validation for each publication supporting a particular GDCA. Note that each publication can have more than one GDCA.

Figure 2: Annotation web-based tool

2.6   Curation workflow

We put in place a curation workflow including a pilot phase and two curation and analysis phases
(see Figure 1). During the pilot phase, the initial
training of the curators was carried out including
how to use the curation tool. A set of 100 abstracts
was validated and analyzed during the pilot phase.
After this process both the curation tool and the
annotation guidelines were improved and the first
curation phase was launched (Curation Phase I),
to evaluate 2,507 GDCAs identified by text min-
ing and supported by 4,065 publications. The re-
sults of the curation were analyzed to estimate the
inter-annotator agreement at the level of abstract.
The validations for which an agreement was not
found in Curation Phase I are then reviewed by a
third expert during Curation Phase II (results not
reported here). Four experts are participating in
this phase. Only the validations for which agree-
ment of at least 2 experts is found will be included
in the database.

Figure 3: Psychiatric disease categories and the number of associated genes.
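
The inclusion rule of the workflow (a validation enters the database only when at least two experts agree on its annotation) can be sketched as follows. This is a minimal illustration under stated assumptions, not the PsyGeNET implementation: the function name `consensus_label`, its `min_agree` parameter and the label strings are hypothetical and chosen only for the example.

```python
from collections import Counter
from typing import Optional, Sequence


def consensus_label(labels: Sequence[str], min_agree: int = 2) -> Optional[str]:
    """Return the label given by at least `min_agree` experts, else None.

    `labels` holds one association type per expert for a single validation
    (one publication supporting one GDCA). Hypothetical helper for illustration.
    """
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= min_agree else None


# Curation Phase I: both experts agree, so the validation is accepted.
phase1 = consensus_label(["Association", "Association"])

# Curation Phase I: the experts disagree, so no label is accepted yet.
pending = consensus_label(["Association", "False"])

# Curation Phase II: a third expert sides with one of the first two.
resolved = consensus_label(["Association", "False", "Association"])

# Three different labels: still no agreement of at least two experts.
unresolved = consensus_label(["Association", "False", "Error"])

print(phase1, pending, resolved, unresolved)
```

A validation for which the function returns `None` after the third annotation would simply be left out of the database under this rule.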
3   Results and discussion
Firstly, three experts reviewed the terminology of 2,523 UMLS concepts related to psychiatric disorders of interest. As a result, 1,942 UMLS concepts were assigned to one of the 8 disease categories, with alcohol use disorder, depression and schizophrenia each defined by more than 300 concepts (321, 368 and 488, respectively). On the other hand, 581 UMLS concepts were excluded at this stage. Then, BeFree was used to identify gene-disease associations from the literature based on the above disease definition, and a subset of the associations focused on the disorders of interest was selected (see methods section 2.3). The 2,507 genes associated to DCs identified by BeFree were submitted to expert curation. These genes were unevenly distributed across the disease categories, with schizophrenia being the disease category with the most associations, followed by depression and alcohol use disorders (see Figure 3).

Of note, most of the GDCAs were supported by only one publication (70.6%). We included up to the 5 most recent publications for each GDCA for the validation process. This led to 242-284 GDCAs to be validated by each curator, depending on the disease category. Since most of the GDCAs are supported by only one publication, the number of publications to be reviewed by the curators ranged between 322 and 491. Before starting the curation of the 2,507 GDCAs, a Pilot curation phase was designed with the purpose of training the curators, testing the PsyGeNET annotation tool, and reviewing the PsyGeNET annotation guidelines. One hundred publications were reviewed during the Pilot phase, distributed as 10 publications per pair of experts. The average agreement between the expert pairs in the Pilot Phase was 60%. The main sources of discrepancies were the handling of speculations, the proper identification of text mining errors, in particular for genes, and the distinction between False and Error Association types. The annotation tool was modified to show the terminology of the genes, in order to help the curators find potential errors in the identification of genes, and its Review function was improved. Then, the curation proper (Curation Phase I in the workflow in Figure 1) was launched and it was completed in 33 days. During Curation Phase I, 2,507 GDCAs supported by 4,065 publications were reviewed by the curators. Each expert was assigned a set of approx. 275 GDCAs (corresponding to 450 publications) according to their field of expertise (e.g. Major depression vs Schizophrenia). Some curators evaluated associations from all the disease categories, while others focused on a single category. The results of Curation Phase I were analyzed to identify agreements and disagreements between the experts. Table 2 shows the number of abstracts validated by each curator team (composed of two experts) and the agreement achieved. The average agreement between all the experts was 68.95%, higher than the one obtained in the Pilot Phase. For one curator team the agreement was higher (89%) than for the rest of the teams. We can attribute this higher agreement to the fact that there was some communication between the two experts to discuss the curation criteria during Curation Phase I.
  Teams Validations Agreem. Disagr. % Agreem.
  Team 1    494       325    169      65.79
  Team 2    319       194    125      60.82
  Team 3    489       342    147      69.94
  Team 4    450       402     48      89.33
  Team 5    492       308    184      62.60
  Team 6    508       341    167      67.12
  Team 7    463       317    146      68.46
  Team 8    516       363    153      70.35
  Team 9    334       221    113      66.17

    Table 2: Agreement for each expert pair.
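
The percentages in Table 2 are raw percent agreement, that is, agreements divided by validations. As a minimal sketch, the following recomputes them from the counts in the table. The variable and function names are illustrative assumptions, and no chance-corrected statistic (such as Cohen's kappa) is computed, since the text reports only raw agreement.

```python
# (validations, agreements) per curator team, copied from Table 2.
TABLE_2 = {
    "Team 1": (494, 325),
    "Team 2": (319, 194),
    "Team 3": (489, 342),
    "Team 4": (450, 402),
    "Team 5": (492, 308),
    "Team 6": (508, 341),
    "Team 7": (463, 317),
    "Team 8": (516, 363),
    "Team 9": (334, 221),
}


def percent_agreement(validations: int, agreements: int) -> float:
    """Raw percent agreement: share of validations with identical labels."""
    return 100.0 * agreements / validations


# Per-team agreement, as in the last column of Table 2.
per_team = {team: percent_agreement(v, a) for team, (v, a) in TABLE_2.items()}

# Unweighted mean over the nine teams.
mean_of_teams = sum(per_team.values()) / len(per_team)

# Pooled agreement over all validations, which weights teams by workload.
total_validations = sum(v for v, _ in TABLE_2.values())
total_agreements = sum(a for _, a in TABLE_2.values())
pooled = percent_agreement(total_validations, total_agreements)

for team, pct in per_team.items():
    print(f"{team}: {pct:.2f}%")
print(f"Mean of team percentages: {mean_of_teams:.2f}%")
print(f"Pooled: {total_agreements}/{total_validations} = {pooled:.2f}%")
```

The column totals are 4,065 validations and 2,813 agreements, matching the figures given in the text, and the unweighted mean of the team percentages is about 68.95%. For Team 2 the counts give 194/319, or about 60.82%. The pooled figure (about 69.20%) differs slightly from the mean because the teams validated different numbers of abstracts.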
From the validations in which agreement was found (2,813 validations), 1,880 were classified as Association or No Association; 901 were classified as False or Error; and only in 32 of them the evidence extracted from the publication was not enough to classify them within any of the previous categories, so they fell in the Not clear category (Figure 4). The set of 1,880 validations will be part of the next release of PsyGeNET. Notably, an important fraction of these associations (24.7%) are classified as No association, meaning that there is evidence reporting negative findings on the association between the gene and the disease. This highlights the importance of recording negative findings from the literature in knowledge resources. In addition, collecting this information is relevant for the development of corpora for training text mining systems able to identify negative findings regarding gene-disease associations from the literature.

Figure 4: Summary of the agreement results. Each bar in the barplot represents the number of validations annotated as: Association, No association, False, Error and Not clear, respectively.

We observe that for 30% of the total GDCAs validated, agreement between curators was not found. A substantial fraction of the disagreements involved the annotation of an association as False by one of the experts (53.28%, see Figure 5). The results of Curation Phase I were discussed with the experts in order to identify the main difficulties during the annotation. The main sources of the discrepancies between curators were the following: i) difficulty in assessing whether the studies using animal models capture the disease pathophysiology well, ii) the studies focused on pharmacogenomics or response to drug treatments, iii) studies assessing disease phenotypes (e.g. low mood) in otherwise normal populations, and iv) the assessment of the validity of the statistical analysis in some studies (e.g. GWAS studies). In the first case, the decision on the association type will depend on the expertise of the curator on animal model research in psychiatry, which was not the same among the team of experts. In the other three cases the experts expressed difficulties in correctly identifying whether an association has to be annotated or not. Overall, although the curation task was very focused on the domain of genetics of psychiatric diseases, the wide variety of studies covered by the publications (GWAS studies, sequencing studies, animal models, etc.) requires an equivalent diversity of expertise among the experts. We think that this complexity in the task is one of the main reasons for the inter-annotator agreement achieved. Ongoing work includes revisiting the annotation guidelines to further clarify the curation issues raised, in order to improve the agreement in the annotations.

In recent years, many efforts have been made to develop and contribute novel corpora in the biomedical domain. Nevertheless, the number of corpora annotated with information on gene-disease associations is particularly low (Neves, 2014). For example, the Craven corpus (Craven et al., 1999) contains annotations of gene-disease associations, but there is no information on data quality such as inter-annotator agreement in the original publication. The EU-ADR corpus (Van Mulligen et al., 2012) includes associations between genes and diseases from 100 MEDLINE abstracts, with an inter-annotator agreement of 86%. Wiegers et al. presented the manual curation of a chemical-gene-disease network for the Comparative Toxicogenomics Database (CTD) (Wiegers et al., 2009). For this study 112 articles were distributed between three curators (each one revised fewer than 60 articles), achieving an inter-annotator agreement of 77%. The CoMAGC corpus (Lee et al., 2013), focused on genes associated to prostate, breast and ovarian cancer, is based on 821 sentences. The authors report an agreement of 72%. In another study, agreement over 70% was reported in the development of a sentence-based corpus on prostate cancer-gene associations (Chun et al., 2006). In summary, compared to other corpora annotation initiatives, our inter-annotator agreement results are lower. As described in the paragraphs above, we think that the agreement obtained is due to the complexity of the annotation task. In addition, the large number of experts (for instance, 22 in our case vs 5 in the case of the EU-ADR corpus) and also the large size of our corpus (4,065 publications vs approx. 100 in the EU-ADR and CTD corpora) could also explain the lower agreement obtained compared to other curation initiatives.

Figure 5: Summary of the disagreement results at the abstract level. Each cell in the heatmap represents the number of abstracts in which disagreement was found for each pair of experts. The darker the blue, the higher the disagreement. For example, there were 100 abstracts that one expert annotated as Association while the paired expert annotated as No association.

Curation Phase II is aimed at reviewing the associations in which no agreement was found between the two experts in the first phase of curation. Currently, this involves 1,252 validations, which are being reviewed by a third expert (ongoing work at the time of writing). Finally, the information that will be included in PsyGeNET is the set of associations in which at least two experts agreed on the annotation.

4   Conclusions

In this communication we report the development of a protocol for the sustainable update of a knowledge resource on the genetics of psychiatric diseases, PsyGeNET. We combined state-of-the-art text mining, data filtering and curation by a community of domain experts for the release of a new version of the database. We designed a protocol that includes curators' training and the iterative improvement of both the tools and annotation guidelines. The proposed approach allows the database to be updated in a timely manner with expert-validated information. Importantly, our curation protocol included the identification of negative findings from the literature. Note that 24.7% of the GDCAs were classified as No association, indicating the importance of properly annotating this information in a knowledge resource. This information will be taken into account for the ranking of the gene-disease associations in the next release of PsyGeNET. In addition, the corpus of annotated sentences and abstracts developed during the curation constitutes a valuable resource for the development and evaluation of relation extraction systems. In this era of biomedical big data, we present this approach involving the expert community for the curation of the information as a suitable approach for the development and maintenance of knowledge resources.

5   Funding

We received support from ISCIII-FEDER (PI13/00082, CP10/00524), IMI-JU under grant agreements no. 115002 (eTOX), no. 115191 (Open PHACTS), no. 115372 (EMIF) and no. 115735 (iPiE), resources of which are composed of financial contribution from the EU-FP7 (FP7/2007-2013) and EFPIA companies in kind contribution, and the EU H2020 Programme 2014-2020 under grant agreements no. 634143 (MedBioinformatics) and no. 676559 (Elixir-Excelerate). The Research Programme on Biomedical Informatics (GRIB) is a node of the Spanish National Institute
 of Bioinformatics (INB).                                       Laura I Furlong. 2015. Disgenet: a discovery
                                                                platform for the dynamical exploration of human
                                                                diseases and their genes. Database, 2015:bav028.
 References                                                 [Sullivan et al.2012] Patrick F Sullivan, Mark J Daly,
[Baldacchino et al.2009] A Baldacchino, N Groussard-            and Michael O’Donovan. 2012. Genetic archi-
    Escaffre, C Clancy, C Lack, K Sieroslavrska, C-L            tectures of psychiatric disorders: the emerging pic-
    Hodges, L-B Merinder, T Greacen, M Sorsa, H Lai-            ture and its implications. Nature Reviews Genetics,
    jarvi, et al. 2009. Epidemiological issues in comor-        13(8):537–551.
    bidity: lessons learnt from a pan-european isadora
    project. Mental Health and Substance Use: Dual          [Van Mulligen et al.2012] Erik M Van Mulligen, Annie
    Diagnosis, 2(2):88–100.                                    Fourrier-Reglat, David Gurwitz, Mariam Molokhia,
                                                               Ainhoa Nieto, Gianluca Trifiro, Jan A Kors, and
[Bravo et al.2015] Àlex Bravo, Janet Piñero, Núria          Laura I Furlong. 2012. The eu-adr corpus: anno-
    Queralt-Rosinach, Michael Rautschka, and Laura I           tated drugs, diseases, targets, and their relationships.
    Furlong. 2015. Extraction of relations between             Journal of biomedical informatics, 45(5):879–884.
    genes and diseases from text and large-scale data
                                                            [Whiteford et al.2013] Harvey A Whiteford, Louisa
    analysis: implications for translational research.
                                                               Degenhardt, Jürgen Rehm, Amanda J Baxter, Alize J
    BMC bioinformatics, 16(1):1.
                                                               Ferrari, Holly E Erskine, Fiona J Charlson, Rosana E
[Chun et al.2006] Hong-Woo Chun, Yoshimasa Tsu-                Norman, Abraham D Flaxman, Nicole Johns, et al.
   ruoka, Jin-Dong Kim, Rie Shiba, Naoki Nagata,               2013. Global burden of disease attributable to men-
   Teruyoshi Hishiki, and Jun’ichi Tsujii. 2006. Au-           tal and substance use disorders: findings from the
   tomatic recognition of topic-classified relations be-       global burden of disease study 2010. The Lancet,
   tween prostate cancer and genes using medline ab-           382(9904):1575–1586.
   stracts. BMC bioinformatics, 7(3):1.
                                                            [Wiegers et al.2009] Thomas C Wiegers, Allan P Davis,
[Craven et al.1999] Mark Craven, Johan Kumlien, et al.         K Bretonnel Cohen, Lynette Hirschman, and Car-
    1999. Constructing biological knowledge bases by           olyn J Mattingly. 2009. Text mining and manual
    extracting information from text sources. In ISMB,         curation of chemical-gene-disease networks for the
    volume 1999, pages 77–86.                                  comparative toxicogenomics database (ctd). BMC
                                                               bioinformatics, 10(1):326.
[Gutiérrez-Sacristán et al.2015] Alba       Gutiérrez-
   Sacristán, Solène Grosdidier, Olga Valverde, Marta
   Torrens, Àlex Bravo, Janet Piñero, Ferran Sanz, and
   Laura I Furlong. 2015. Psygenet: a knowledge
   platform on psychiatric disorders and their genes.
   Bioinformatics, page btv301.

[Kessler et al.2005] Ronald C Kessler,        Patricia
   Berglund, Olga Demler, Robert Jin, Kathleen R
   Merikangas, and Ellen E Walters. 2005. Life-
   time prevalence and age-of-onset distributions
   of dsm-iv disorders in the national comorbidity
   survey replication. Archives of general psychiatry,
   62(6):593–602.

[Lee et al.2013] Hee-Jin Lee, Sang-Hyung Shim, Mi-
   Ryoung Song, Hyunju Lee, and Jong C Park. 2013.
   Comagc: a corpus with multi-faceted annotations
   of gene-cancer relations. BMC bioinformatics,
   14(1):1.

[Murray and Lopez2013] Christopher JL Murray and
   Alan D Lopez. 2013. Measuring the global bur-
   den of disease. New England Journal of Medicine,
   369(5):448–457.

[Neves2014] Mariana Neves. 2014. An analysis
   on the entity annotations in biological corpora.
   F1000Research, 3.

[Piñero et al.2015] Janet Piñero, Núria Queralt-
     Rosinach, Àlex Bravo, Jordi Deu-Pons, Anna
     Bauer-Mehren, Martin Baron, Ferran Sanz, and