=Paper=
{{Paper
|id=Vol-1650/smbm16GutierrezSacristan
|storemode=property
|title=Text Mining and Expert Curation to Develop a Database on Psychiatric Diseases and their Genes
|pdfUrl=https://ceur-ws.org/Vol-1650/smbm16GutierrezSacristan.pdf
|volume=Vol-1650
|authors=Alba Gutiérrez Sacristán,Álex Bravo,Marta Portero,Olga Valverde,Antonio Armario,M.C. Blanco-Gandía,Adriana Farré,Lierni Fernández-Ibarrondo,Francina Fonseca,Jesús Giraldo,Angela Leis,Anna Mané,M.A. Mayer,Sandra Montagud-Romero,Roser Nadal,Jordi Ortiz,Francisco Javier Pavon,Ezequiel Perez,Marta Rodríguez-Arias,Antonia Serrano,Marta Torrens,Vincent Warnault,Ferran Sanz,Laura I. Furlong
|dblpUrl=https://dblp.org/rec/conf/smbm/Gutierrez-Sacristan16
}}
==Text Mining and Expert Curation to Develop a Database on Psychiatric Diseases and their Genes==
Alba Gutiérrez-Sacristán (GRIB, IMIM - UPF)
Àlex Bravo (GRIB, IMIM - UPF)
Marta Portero (GReNeC, IMIM - UPF)
Olga Valverde (GReNeC, IMIM - UPF)
Antonio Armario (Universitat Autònoma de Barcelona)
M. Carmen Blanco-Gandía (Universitat de Valencia)
Adriana Farré (Parc de Salut Mar, UAB)
Lierni Fernández-Ibarrondo (Program in Cancer Research, IMIM)
Francina Fonseca (Parc de Salut Mar, UAB)
Jesús Giraldo (Universitat Autònoma de Barcelona)
Angela Leis (GRIB, IMIM - UPF)
Anna Mané (Parc de Salut Mar, UAB)
Miguel A. Mayer (GRIB, IMIM - UPF)
Sandra Montagud-Romero (Universitat de Valencia)
Roser Nadal (Institut de Neurociències, UAB)
Jordi Ortiz (School of Medicine, UAB)
Francisco Javier Pavón (Instituto de Investigación Biomédica de Málaga)
Ezequiel Perez (Parc de Salut Mar, UAB)
Marta Rodríguez-Arias (Universitat de Valencia)
Antonia Serrano (Instituto de Investigación Biomédica de Málaga)
Marta Torrens (Parc de Salut Mar, UAB)
Vincent Warnault (GReNeC, IMIM - UPF)
Ferran Sanz (GRIB, IMIM - UPF)
Laura I. Furlong (GRIB, IMIM - UPF)
===Abstract===
In recent years, research on the genetics of psychiatric diseases has grown considerably. However, the cellular and molecular mechanisms leading to these diseases are still poorly understood, which has hampered the translation of this wealth of knowledge into clinical practice to improve the diagnosis and treatment of psychiatric patients. PsyGeNET (http://www.psygenet.org/) has been developed to improve the understanding of psychiatric diseases by facilitating access to the vast amount of genetic information about them in a structured manner and by providing a set of analysis and visualization tools. In this communication we describe the protocol we put in place for the sustainable update of this knowledge resource. It includes the recruitment of a team of experts to curate the data previously extracted by text mining. Annotation guidelines and a web-based annotation tool were developed to support the curators' tasks. A curation workflow was designed that includes a pilot phase and two rounds of curation and analysis phases. We report the results of applying this workflow to the curation of gene-disease associations for PsyGeNET, including the analysis of the inter-annotator agreement, and suggest that this model is a suitable approach for the sustainable development and update of knowledge resources.
===1 Introduction===
Psychiatric disorders have a great impact on morbidity and mortality (Murray and Lopez, 2013; Whiteford et al., 2013). According to the World Health Organization (WHO), one in four people will suffer from a mental or neurological disorder (Kessler et al., 2005; Baldacchino et al., 2009). It has been suggested that most psychiatric disorders display a strong genetic component (Sullivan et al., 2012). In recent years, research on the genetics of psychiatric disorders has grown, and its findings have been reported in hundreds of thousands of publications. This literature constitutes a rich and diverse source of information essential for any psychiatric research line. However, the huge number of publications and its continuous growth prevent scientists from efficiently exploring such a large volume of data.

PsyGeNET (Psychiatric disorders Gene association NETwork) (Gutiérrez-Sacristán et al., 2015) has been developed to establish a curated resource on psychiatric diseases and their associated genes. PsyGeNET integrates knowledge extracted from the scientific literature by text mining, which has been curated by experts in psychiatry and neurosciences.

In this communication we describe the process put in place for the update of the PsyGeNET database. This involved i) the recruitment of a team of experts to curate the information extracted by text mining; ii) the extraction of information on gene-disease associations (GDAs) from the literature using the text mining system BeFree (Bravo et al., 2015); iii) the development of a curation workflow (Figure 1); iv) the development of a web-based annotation tool to facilitate the curation task; and v) the definition of detailed guidelines to assist the curation task.

In particular, we present the results of the Pilot phase and Curation I phase of the workflow, including the analysis of the inter-annotator agreement, and suggest that this protocol is a suitable approach for the sustainable development and update of knowledge resources.

Figure 1: PsyGeNET curation workflow

===2 Methods===
====2.1 Curation team====
A team of 22 experts from different domains (such as psychiatry, neuroscience, medicine, psychology and biology) was recruited from the Spanish Network of Addiction and other collaborators of the coordination team (Research Group on Integrative Biomedical Informatics (GRIB)) to participate in the curation process. The incentives for participation were to be part of the PsyGeNET team and to be co-authors of the publication(s) originating from the project. The curators were trained during an initial session in which the PsyGeNET annotation guidelines were presented, and then during the Pilot phase. Communication with the coordination team through e-mail was established to resolve questions during the curation process. In addition, online and face-to-face meetings were organized after key points of the curation process (analysis phases) to share experiences among all curators and solve curation issues.

====2.2 Defining the Psychiatric Diseases in terms of UMLS concepts====
In PsyGeNET, the psychiatric diseases are identified by UMLS Metathesaurus concepts. Three experts reviewed the terminology included in more than 2,000 UMLS concepts related to the psychiatric disorders of interest, and assigned them to the following psychiatric disease categories (DCs): 1) Depressive disorders, 2) Bipolar disorders and related disorders, 3) Substance/drug induced depressive disorder, 4) Schizophrenia spectrum and other psychotic disorders, 5) Drug-induced psychosis, 6) Alcohol use disorders, 7) Cannabis use disorders and 8) Cocaine use disorders. This information was used both for text mining of gene-disease associations by BeFree (see below) and for identification of disease classes during the curation.

====2.3 Text mining of gene-disease associations====
BeFree, a text-mining tool that exploits morphosyntactic information from the text to identify relationships between biomedical concepts, was used to identify associations between genes and the psychiatric diseases of interest from a corpus of ∼1M MEDLINE abstracts focused on human genetic diseases. The diseases were identified using the UMLS concepts that define each disorder, whereas an in-house developed gene dictionary was used to identify the genes, as described in (Bravo et al., 2015). The identified disorders were grouped according to the eight psychiatric disease categories (described in section 2.2). As a result, BeFree identified 6,349 associations between genes and DCs (gene-disease category association or GDCA) supported by 4,065 publications.

A subset of the associations was initially evaluated by our group to identify the most frequent text mining errors. For instance, the word depression is often used in contexts other than psychiatry. This initial evaluation was performed to identify this kind of error and improve the text mining system before the identification of GDCAs. We then applied a number of filters to reduce the size of the curation task and make it feasible with the resources at hand. For instance, we removed associations already present in curated resources (DisGeNET (Piñero et al., 2015) and the previous release of PsyGeNET), kept only those associations published recently (after the year 2000) in journals with an Impact Factor greater than 1, and did not take reviews into account. After this process we obtained 2,507 GDCAs, which were submitted to expert curation.

====2.4 Annotation Guidelines====
The PsyGeNET annotation guidelines were developed with the purpose of guiding the manual curation process. The guidelines included the definition of a gene-disease association, how it should be classified according to the level of evidence, what information should be considered for the annotation, and real examples of the association types. Finally, they also included a tutorial on how to use the PsyGeNET annotation tool.

The goal of the curation was to validate the association of a gene to a particular disease. We consider that a gene is associated to a disease if the gene or the product of the gene plays a role in the disease pathogenesis, or is a marker for the disease. The PsyGeNET annotation tool was used to help in this curation task. For each gene-disease association identified by text mining, the annotation tool displayed the evidence that supports the association, namely the abstracts and the sentences in which the gene-disease association is stated. Then, by inspecting the evidence, the curator had to determine the type of association (Association, No Association, False, Error and Not Clear). The types of association are described as follows: i) Association: the publication clearly states that there is an association between the gene and the disease; it can be a causative association (e.g. a mutation in the gene causes the disease) or a marker association (e.g. a SNP in the gene identified in a GWAS study); ii) No Association: the publication clearly states that there is no association between the gene and the disease (e.g. a publication that reports a negative finding on the association between the gene and the disease); iii) False Association: the gene and the disease are found co-occurring in a sentence, but there is no clear evidence from the publication that the gene plays a role in or is a marker of the disease; and iv) Error: there is a text mining error in the identification of the gene and/or the disease.

Table 1 shows some examples of the association types considered in PsyGeNET. In the example for False Association, the study is on children that do not meet the criteria for the disease (FASD); therefore the association between the gene and the disease has to be classified as false. In the example of Error, note that in this abstract OCT is not a gene but an acronym of optical coherence tomography (OCT). The document describing the guidelines is available on the PsyGeNET web page (http://www.psygenet.org/Psytool_manual_v5.0.pdf). Here we provide the general instructions for the curation of the gene-disease associations in PsyGeNET:

1. The curation has to be performed at abstract level. For those cases in which the abstract is not clear enough, the full text article should be reviewed.
2. Annotate only relationships between the gene and disease. Other types of relationships should not be annotated.
3. Annotate relationships according to the provided categories: association, no association, error, and false.

{| class="wikitable"
|+ Table 1: Examples of Association types. Diseases and genes that have to be evaluated are highlighted in the sentence in green and orange, respectively.
! Association Type !! PMID !! Sentence
|-
| Association || 267012 || The D-amino acid oxidase activator gene (G72) has been found associated with several psychiatric disorders such as schizophrenia, major depression, and bipolar disorder.
|-
| No Association || 17692928 || There was no association between TPH-2 gene variants and MD in the same population that had shown a strong association with TPH-1.
|-
| False || 25225167 || Two children referred for suspicion of FASD (neither of which were exposed to alcohol or met the criteria for FASD) had a pathogenic microstructural chromosomal rearrangement (del16p11.2 of 542 KB and dup1q44 of 915 KB).
|-
| Error || 21174530 || OCT demonstrated loss of foveal depression with distortion of the foveal architecture in the macula in all patients
|}

====2.5 Annotation tool====
A user-friendly web-based tool was developed to assist both the definition of the psychiatric disorders of interest and the curation of gene-disease associations. The tool was designed to support a multi-user environment through user and password assignment. Figure 2 shows a screenshot of the tool for the curation of GDCAs. The tool shows the GDCA to be evaluated (in this example the association between the ETNPPL gene and the Bipolar disorders class), one publication at a time. The curator has to review the publication, decide whether the association between the gene and the disease class holds, and select the association type using the drop-down menu. To aid the curator's task, the tool displays the terminology for the gene according to standard resources (NCBI Gene, UniProt and HGNC), and highlights the sentences in which BeFree identified an association between the gene and the disease under consideration. If required, the curator can access the full text article using the PubMed hyperlink. The curator is also asked to select a sentence that best represents their validation decision, if available. This was implemented in order to collect example sentences to improve the performance of the BeFree system. In addition, the tool provides a progress bar indicating the number of validations and associations performed by the expert, and allows the expert to review previous annotations. We use the term validation to refer to each publication supporting a particular GDCA. Note that each publication can support more than one GDCA.

Figure 2: Annotation web-based tool

====2.6 Curation workflow====
We put in place a curation workflow including a pilot phase and two curation and analysis phases (see Figure 1). During the pilot phase, the initial training of the curators was carried out, including how to use the curation tool. A set of 100 abstracts was validated and analyzed during the pilot phase. After this process both the curation tool and the annotation guidelines were improved and the first curation phase was launched (Curation Phase I) to evaluate 2,507 GDCAs identified by text mining and supported by 4,065 publications. The results of the curation were analyzed to estimate the inter-annotator agreement at the level of abstract. The validations for which an agreement was not found in Curation Phase I are then reviewed by a third expert during Curation Phase II (results not reported here). Four experts are participating in this phase. Only the validations for which agreement of at least 2 experts is found will be included in the database.

Figure 3: Psychiatric disease categories and the number of associated genes.
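The inclusion rule of the curation workflow (two curators per validation in Curation Phase I, a third expert for the disagreements in Curation Phase II, and inclusion only when at least two experts agree) can be sketched as follows. This is a minimal illustration and not the PsyGeNET implementation: the function name `resolve_validation` and the exact label strings are assumptions introduced here.

```python
from collections import Counter
from typing import Optional, Sequence

# Association types named in the annotation guidelines.
# The exact strings are an assumption made for this sketch.
LABELS = {"Association", "No Association", "False", "Error", "Not Clear"}


def resolve_validation(annotations: Sequence[str]) -> Optional[str]:
    """Return the label shared by at least two experts, or None.

    `annotations` holds one label per expert for a single validation
    (one publication supporting one GDCA): two labels after Curation
    Phase I, three once a third expert has reviewed a disagreement.
    """
    unknown = set(annotations) - LABELS
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    if not annotations:
        return None
    label, votes = Counter(annotations).most_common(1)[0]
    # Without two matching votes the validation stays unresolved.
    return label if votes >= 2 else None


if __name__ == "__main__":
    phase_one = ["Association", "False"]
    print(resolve_validation(phase_one))                     # no consensus yet
    print(resolve_validation(phase_one + ["Association"]))   # third expert decides
```

A validation that returns `None` after Phase I would be forwarded to a third expert; one that still returns `None` with three labels would be left out of the database under the rule stated above.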
===3 Results and discussion===
Firstly, three experts reviewed the terminology of 2,523 UMLS concepts related to psychiatric disorders of interest. As a result, 1,942 UMLS concepts were assigned to one of the 8 disease categories, with alcohol use disorder, depression and schizophrenia each defined by more than 300 concepts (321, 368 and 488, respectively). On the other hand, 581 UMLS concepts were excluded at this stage. Then, BeFree was used to identify gene-disease associations from the literature based on the above disease definition, and a subset of the associations focused on the disorders of interest was selected (see methods section 2.3). The 2,507 GDCAs identified by BeFree were submitted to expert curation. The associated genes were unevenly distributed across the disease categories, with schizophrenia being the category with the most associations, followed by depression and alcohol use disorders (see Figure 3).

Of note, most of the GDCAs were supported by only one publication (70.6%). We included up to the 5 most recent publications for each GDCA in the validation process. This led to 242-284 GDCAs to be validated by each curator, depending on the disease category. Since most GDCAs are supported by only one publication, the number of publications to be reviewed by each curator ranged between 322 and 491.

Before starting the curation of the 2,507 GDCAs, a Pilot curation phase was designed with the purpose of training the curators, testing the PsyGeNET annotation tool, and reviewing the PsyGeNET annotation guidelines. One hundred publications were reviewed during the Pilot phase, distributed as 10 publications per pair of experts. The average agreement between the expert pairs in the Pilot Phase was 60%. The main sources of discrepancies were the handling of speculations, the proper identification of text mining errors, in particular for genes, and the distinction between the False and Error association types. The annotation tool was modified to show the terminology of the genes, in order to help the curators find potential errors in the identification of genes, and its Review function was improved.

Then, the proper curation (Curation Phase I in the workflow in Figure 1) was launched and it was completed in 33 days. During Curation Phase I, 2,507 GDCAs supported by 4,065 publications were reviewed by the curators. Each expert was assigned a set of approx. 275 GDCAs (corresponding to 450 publications) according to their field of expertise (e.g. Major depression vs Schizophrenia). Some curators evaluated associations from all the disease categories, while others focused on a single category. The results of Curation Phase I were analyzed to identify agreements and disagreements between the experts. Table 2 shows the number of abstracts validated by each curator team (composed of two experts) and the agreement achieved. The average agreement between all the experts was 68.95%, higher than the one obtained in the Pilot Phase. For one curator team the agreement was higher (89%) than for the rest of the teams. We can attribute this higher agreement to the fact that there was some communication between the two experts to discuss the curation criteria during Curation Phase I.
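The selection of publications described above (the filters of section 2.3 followed by keeping up to the 5 most recent publications per GDCA) can be sketched as below. This is an illustrative sketch only: the `Evidence` record, its field names and the `select_for_curation` helper are hypothetical and do not come from BeFree or PsyGeNET. The thresholds (published after 2000, Impact Factor greater than 1, no reviews, up to 5 publications) are the ones stated in the text.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Evidence:
    """One publication supporting one gene-disease category association."""
    gene: str
    disease_category: str
    pmid: int
    year: int
    impact_factor: float
    is_review: bool


Gdca = Tuple[str, str]  # (gene, disease category)


def select_for_curation(
    evidence: Iterable[Evidence],
    already_curated: Set[Gdca],
    max_pubs: int = 5,
) -> Dict[Gdca, List[Evidence]]:
    """Apply the filters and keep the most recent publications per GDCA."""
    by_gdca: Dict[Gdca, List[Evidence]] = defaultdict(list)
    for ev in evidence:
        key = (ev.gene, ev.disease_category)
        if key in already_curated:
            continue  # association already present in a curated resource
        if ev.year <= 2000 or ev.impact_factor <= 1 or ev.is_review:
            continue  # too old, low Impact Factor, or a review
        by_gdca[key].append(ev)
    return {
        key: sorted(pubs, key=lambda e: e.year, reverse=True)[:max_pubs]
        for key, pubs in by_gdca.items()
    }
```

Grouping by the (gene, disease category) pair mirrors the definition of a validation as one publication supporting one GDCA, so the total number of validations is the sum of the list lengths in the returned dictionary.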
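{| class="wikitable"
|+ Table 2: Agreement for each expert pair.
! Team !! Validations !! Agreements !! Disagreements !! % Agreement
|-
| Team 1 || 494 || 325 || 169 || 65.79
|-
| Team 2 || 319 || 194 || 125 || 60.89
|-
| Team 3 || 489 || 342 || 147 || 69.94
|-
| Team 4 || 450 || 402 || 48 || 89.33
|-
| Team 5 || 492 || 308 || 184 || 62.60
|-
| Team 6 || 508 || 341 || 167 || 67.12
|-
| Team 7 || 463 || 317 || 146 || 68.46
|-
| Team 8 || 516 || 363 || 153 || 70.35
|-
| Team 9 || 334 || 221 || 113 || 66.17
|}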
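The agreement figures in Table 2 can be recomputed directly from the counts. The sketch below assumes that the reported average (68.95%) is the unweighted mean of the per-team percentages; the variable and function names are introduced here for illustration only.

```python
# (validations, agreements) per curator team, copied from Table 2.
TEAMS = {
    "Team 1": (494, 325),
    "Team 2": (319, 194),
    "Team 3": (489, 342),
    "Team 4": (450, 402),
    "Team 5": (492, 308),
    "Team 6": (508, 341),
    "Team 7": (463, 317),
    "Team 8": (516, 363),
    "Team 9": (334, 221),
}


def percent_agreement(validations: int, agreements: int) -> float:
    """Raw percent agreement: share of validations with identical labels."""
    return 100.0 * agreements / validations


per_team = {team: percent_agreement(v, a) for team, (v, a) in TEAMS.items()}

# Unweighted mean of the per-team percentages (assumed to be the
# "average agreement" reported in the text).
macro_average = sum(per_team.values()) / len(per_team)

# Agreement over all validations pooled together.
total_validations = sum(v for v, _ in TEAMS.values())
total_agreements = sum(a for _, a in TEAMS.values())
pooled = percent_agreement(total_validations, total_agreements)

if __name__ == "__main__":
    for team, pct in per_team.items():
        print(f"{team}: {pct:.2f}%")
    print(f"macro average: {macro_average:.2f}%")
    print(f"pooled: {pooled:.2f}% ({total_agreements}/{total_validations})")
```

Under that assumption the unweighted mean reproduces the reported 68.95%, while pooling all validations gives 69.20% (2,813 of 4,065). The remaining 1,252 validations are the disagreements passed to Curation Phase II. The recomputed value for Team 2 is about 60.8%, slightly below the 60.89% printed in the table. These are raw percent-agreement values, with no correction for chance agreement.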
From the validations in which agreement was found (2,813 validations), 1,880 were classified as Association or No Association; 901 were classified as False or Error; and in only 32 of them the evidence extracted from the publication was not enough to classify them within any of the previous categories, falling in the Not clear category (Figure 4). The set of 1,880 validations will be part of the next release of PsyGeNET. Notably, an important fraction of these associations (24.7%) are classified as No association, meaning that there is evidence reporting negative findings on the association between the gene and the disease. This highlights the importance of recording negative findings from the literature in knowledge resources. Moreover, collecting this information is relevant for the development of corpora for training text mining systems able to identify negative findings regarding gene-disease associations from the literature.

Figure 4: Summary of the agreement results. Each bar in the barplot represents the number of validations annotated as: Association, No association, False, Error and Not clear, respectively.

We observe that for 30% of the total GDCAs validated, agreement between curators was not found. A substantial fraction of the disagreements involved the annotation of an association as False by one of the experts (53.28%, see Figure 5). The results of Curation Phase I were discussed with the experts in order to identify the main difficulties during the annotation. The main sources of the discrepancies between curators were the following: i) difficulty in assessing whether studies using animal models capture the disease pathophysiology well, ii) studies focused on pharmacogenomics or response to drug treatments, iii) studies assessing disease phenotypes (e.g. low mood) in otherwise normal populations, and iv) the assessment of the validity of the statistical analysis in some studies (e.g. GWAS studies). In the first case, the decision on the association type depends on the expertise of the curator in animal model research in psychiatry, which was not the same across the team of experts. In the other three cases the experts expressed difficulties in correctly identifying whether an association has to be annotated or not. Overall, although the curation task was very focused on the domain of genetics of psychiatric diseases, the wide variety of studies covered by the publications (GWAS studies, sequencing studies, animal models, etc.) requires an equivalent diversity of expertise among the experts. We think that this complexity in the task is one of the main reasons for the inter-annotator agreement achieved. Ongoing work includes revisiting the annotation guidelines to further clarify the curation issues raised, in order to improve the agreement in the annotations.

Figure 5: Summary of the disagreement results at the abstract level. Each cell in the heatmap represents the number of abstracts in which disagreement was found for each pair of experts. The darker the blue, the higher the disagreement. For example, there were 100 abstracts that one expert annotated as Association while the paired expert annotated as No association.

In recent years, many efforts have been made to develop and contribute novel corpora in the biomedical domain. Nevertheless, the number of corpora annotated with information on gene-disease associations is particularly low (Neves, 2014). For example, the Craven corpus (Craven et al., 1999) contains annotations of gene-disease associations, but there is no information on data quality such as inter-annotator agreement in the original publication. The EU-ADR corpus (Van Mulligen et al., 2012) includes associations between genes and diseases from 100 MEDLINE abstracts, with an inter-annotator agreement of 86%. Wiegers et al. presented the manual curation of chemical-gene-disease networks for the Comparative Toxicogenomics Database (CTD) (Wiegers et al., 2009). For this study 112 articles were distributed between three curators (each one reviewed fewer than 60 articles), achieving an inter-annotator agreement of 77%. The CoMAGC corpus (Lee et al., 2013), focused on genes associated to prostate, breast and ovarian cancer, is based on 821 sentences. The authors report an agreement of 72%. In another study, agreement over 70% was reported in the development of a sentence-based corpus on prostate cancer-gene associations (Chun et al., 2006). In summary, compared to other corpus annotation initiatives, our inter-annotator agreement results are lower. As described in the paragraphs above, we think that the agreement obtained is due to the complexity of the annotation task. In addition, the large number of experts (for instance, 22 in our case vs 5 in the case of the EU-ADR corpus) and also the large size of our corpus (4,065 publications vs approx. 100 in the EU-ADR and CTD corpora) could also explain the lower agreement obtained compared to other curation initiatives.

Curation Phase II is aimed at reviewing the associations in which no agreement was found between the two experts in the first phase of curation. Currently, this involves 1,252 validations, which are being reviewed by a third expert (ongoing work at the time of writing). Finally, the information that will be included in PsyGeNET consists of the associations in which at least two experts agreed on the annotation.

===4 Conclusions===
In this communication we report the development of a protocol for the sustainable update of a knowledge resource on the genetics of psychiatric diseases, PsyGeNET. We combined state-of-the-art text mining, data filtering and curation by a community of domain experts for the release of a new version of the database. We designed a protocol that includes curators' training and the iterative improvement of both the tools and the annotation guidelines. The proposed approach allows the database to be updated in a timely manner with expert-validated information. Importantly, our curation protocol included the identification of negative findings from the literature. Note that 24.7% of the GDCAs were classified as No association, indicating the importance of properly annotating this information in a knowledge resource. This information will be taken into account for the ranking of the gene-disease associations in the next release of PsyGeNET. In addition, the corpus of annotated sentences and abstracts developed during the curation constitutes a valuable resource for the development and evaluation of relation extraction systems. In this era of biomedical big data, we present this approach involving the expert community for the curation of the information as a suitable approach for the development and maintenance of knowledge resources.

===5 Funding===
We received support from ISCIII-FEDER (PI13/00082, CP10/00524), IMI-JU under grant agreements no. 115002 (eTOX), no. 115191 (Open PHACTS), no. 115372 (EMIF) and no. 115735 (iPiE), resources of which are composed of financial contribution from the EU-FP7 (FP7/2007-2013) and EFPIA companies' in kind contribution, and the EU H2020 Programme 2014-2020 under grant agreements no. 634143 (MedBioinformatics) and no. 676559 (Elixir-Excelerate). The Research Programme on Biomedical Informatics (GRIB) is a node of the Spanish National Institute of Bioinformatics (INB).

===References===
[Baldacchino et al., 2009] A Baldacchino, N Groussard-Escaffre, C Clancy, C Lack, K Sieroslavrska, C-L Hodges, L-B Merinder, T Greacen, M Sorsa, H Laijarvi, et al. 2009. Epidemiological issues in comorbidity: lessons learnt from a pan-European ISADORA project. Mental Health and Substance Use: Dual Diagnosis, 2(2):88–100.

[Bravo et al., 2015] Àlex Bravo, Janet Piñero, Núria Queralt-Rosinach, Michael Rautschka, and Laura I Furlong. 2015. Extraction of relations between genes and diseases from text and large-scale data analysis: implications for translational research. BMC Bioinformatics, 16(1):1.

[Chun et al., 2006] Hong-Woo Chun, Yoshimasa Tsuruoka, Jin-Dong Kim, Rie Shiba, Naoki Nagata, Teruyoshi Hishiki, and Jun'ichi Tsujii. 2006. Automatic recognition of topic-classified relations between prostate cancer and genes using MEDLINE abstracts. BMC Bioinformatics, 7(3):1.

[Craven et al., 1999] Mark Craven, Johan Kumlien, et al. 1999. Constructing biological knowledge bases by extracting information from text sources. In ISMB, volume 1999, pages 77–86.

[Gutiérrez-Sacristán et al., 2015] Alba Gutiérrez-Sacristán, Solène Grosdidier, Olga Valverde, Marta Torrens, Àlex Bravo, Janet Piñero, Ferran Sanz, and Laura I Furlong. 2015. PsyGeNET: a knowledge platform on psychiatric disorders and their genes. Bioinformatics, page btv301.

[Kessler et al., 2005] Ronald C Kessler, Patricia Berglund, Olga Demler, Robert Jin, Kathleen R Merikangas, and Ellen E Walters. 2005. Lifetime prevalence and age-of-onset distributions of DSM-IV disorders in the National Comorbidity Survey Replication. Archives of General Psychiatry, 62(6):593–602.

[Lee et al., 2013] Hee-Jin Lee, Sang-Hyung Shim, Mi-Ryoung Song, Hyunju Lee, and Jong C Park. 2013. CoMAGC: a corpus with multi-faceted annotations of gene-cancer relations. BMC Bioinformatics, 14(1):1.

[Murray and Lopez, 2013] Christopher JL Murray and Alan D Lopez. 2013. Measuring the global burden of disease. New England Journal of Medicine, 369(5):448–457.

[Neves, 2014] Mariana Neves. 2014. An analysis on the entity annotations in biological corpora. F1000Research, 3.

[Piñero et al., 2015] Janet Piñero, Núria Queralt-Rosinach, Àlex Bravo, Jordi Deu-Pons, Anna Bauer-Mehren, Martin Baron, Ferran Sanz, and Laura I Furlong. 2015. DisGeNET: a discovery platform for the dynamical exploration of human diseases and their genes. Database, 2015:bav028.

[Sullivan et al., 2012] Patrick F Sullivan, Mark J Daly, and Michael O'Donovan. 2012. Genetic architectures of psychiatric disorders: the emerging picture and its implications. Nature Reviews Genetics, 13(8):537–551.

[Van Mulligen et al., 2012] Erik M Van Mulligen, Annie Fourrier-Reglat, David Gurwitz, Mariam Molokhia, Ainhoa Nieto, Gianluca Trifiro, Jan A Kors, and Laura I Furlong. 2012. The EU-ADR corpus: annotated drugs, diseases, targets, and their relationships. Journal of Biomedical Informatics, 45(5):879–884.

[Whiteford et al., 2013] Harvey A Whiteford, Louisa Degenhardt, Jürgen Rehm, Amanda J Baxter, Alize J Ferrari, Holly E Erskine, Fiona J Charlson, Rosana E Norman, Abraham D Flaxman, Nicole Johns, et al. 2013. Global burden of disease attributable to mental and substance use disorders: findings from the Global Burden of Disease Study 2010. The Lancet, 382(9904):1575–1586.

[Wiegers et al., 2009] Thomas C Wiegers, Allan P Davis, K Bretonnel Cohen, Lynette Hirschman, and Carolyn J Mattingly. 2009. Text mining and manual curation of chemical-gene-disease networks for the Comparative Toxicogenomics Database (CTD). BMC Bioinformatics, 10(1):326.