=Paper=
{{Paper
|id=Vol-230/paper-7
|storemode=property
|title=Integration of Hybrid Bio-Ontologies using Bayesian Networks for Knowledge Discovery
|pdfUrl=https://ceur-ws.org/Vol-230/07-mcgarry.pdf
|volume=Vol-230
|dblpUrl=https://dblp.org/rec/conf/ijcai/McGarryGMW07
}}
==Integration of Hybrid Bio-Ontologies using Bayesian Networks for Knowledge Discovery==
Ken McGarry∗, Sheila Garfield∗, Nick Morris† and Stefan Wermter∗
∗ Department of Computing and Technology, University of Sunderland, UK
† Institute for Cell and Molecular Biosciences, University of Newcastle, UK
{ken.mcgarry,sheila.garfield,stefan.wermter}@sunderland.ac.uk, n.j.morris@ncl.ac.uk
Abstract

This paper describes how high-level biological knowledge obtained from ontologies such as the Gene Ontology (GO) can be integrated with low-level information extracted from a Bayesian network trained on protein interaction data. We automatically generate a biological ontology by text mining the type II diabetes research literature. The ontology is populated with the entities and relationships from protein-to-protein interactions. New, previously unrelated information is extracted from the growing body of research literature and incorporated with knowledge already known on this subject from the Gene Ontology and databases such as BIND and BioGRID. We integrate the ontology within the probabilistic framework of Bayesian networks, which enables reasoning and prediction of protein function.

1 Introduction

The large amounts of genomic and proteomic data generated by biological experiments are now enabling deeper insights into cellular and molecular function. New technologies such as microarrays and electrophoresis gels are providing vast quantities of experimental data at unprecedented rates. All of this information needs to be stored and carefully annotated. With each new experiment providing details of new protein-to-protein interactions, new biological pathways and new genes, it is essential that these discoveries are made available to the scientific community. To this end, online scientific databases are now in place that disseminate these results. These databases, such as the popular Gene Ontology (GO), are updated at intervals to reflect the latest developments [Ashburner, 2000].

The updating is done by experts who manually revise each entry by reading the research literature and annotating the database collections accordingly. If necessary, they will contact the experimenters to resolve any ambiguities or problems. In terms of data quality, the databases are quite reliable and robust. Unfortunately, hand annotation is a slow process and the databases are lagging behind the experimental work by a considerable margin. This prevents researchers from immediately accessing the most recent discoveries. Unless the researchers are familiar with the journals where the new results are published, they would be unlikely to encounter this information. Given the fragmented and highly specialized nature of biological research, this may seldom occur. Therefore the need for automated extraction of knowledge from the literature is well motivated. Recent advances in text analytics combine techniques from information retrieval (IR) and information extraction (IE), which allow researchers to explore the relevant literature more effectively [Mack and Henenberger, 2002]. However, these techniques require knowledge discovery methods to uncover complex embedded structures, relationships and connections between seemingly unrelated facts that typically exist in the biomedical literature [Tiffin et al., 2005].

[Figure 1: flow diagram. Recoverable stage labels: PubMed biomedical text (abstracts and main text); keywords relating to insulin resistance; preprocessing and information extraction of text data; organising keyword data (entities and relations) into a hierarchical structure; biomedical ontologies and knowledge bases (GO, BIND and KEGG); validation with existing knowledge of pathways and interactions; relative frequency and probability calculations for the Bayesian network CPTs (conditional probability tables); transfer of knowledge into Bayesian network format; inference with the BN on proteins/genes without annotations; gaps in knowledge defined and experimental procedures to follow.]
Figure 1: Overview of methodology and information extraction process

Our particular research area is diabetes, in particular the effects of insulin resistance on protein expression and insulin-regulated protein trafficking in fat cells. In recent years there has been a dramatic worldwide increase in the number of people suffering from diabetes. In the year 2000 there were 171 million cases, and the World Health Organization (WHO) has predicted that by 2030 there will be 366 million people suffering from this condition (www.who.int/diabetes/facts/). The WHO data is for diagnosed cases; the undiagnosed cases are estimated by the WHO at 14.6 million for the US alone.

In this paper we present our results on how we automatically generate a viable ontology based on information extraction of keywords from the research literature. The keywords define the entities and relationships, that is, how important genes, gene relationships and protein-to-protein interactions operate and co-exist in biological processes related to insulin resistance. Furthermore, the ontology is cast within a probabilistic framework using Bayesian networks, which are used for inference and prediction of protein function. Figure 1 gives the overall methodology for the extraction of information and construction of the ontology.

The remainder of this paper is structured as follows: section two outlines our information extraction scheme for identifying the entities and relationships of interest; section three provides an overview of biological ontologies and gives details of how we use Bayesian networks for inference and reasoning; section four discusses our methodology and experimental results; section five reviews the related work and our claim for novelty; and finally section six presents the conclusions.

2 Information Extraction

Unstructured text is a very flexible and powerful means of communication; it allows us to describe quite complex concepts. The semantic meaning of a sentence can be expressed in many different ways, but it is this flexibility which is the cause of difficulty for algorithmic sentence analysis by computers. One technique for overcoming this problem is to use information extraction (IE) to seek out the important entities in the text and the relationships between them [Hearst, 1992; Rosario and Hearst, 2004]. The IE process can involve encoding patterns by hand, such as regular expressions, to search for the required entities and relations, or the use of semi-automated machine learning techniques [Nahm and Mooney, 2002; Krauthammer and Nenadic, 2004]. The algorithm we developed is shown in figure 2.

Inputs: Abstract file A, String str
Outputs: Keyword file B

Load file A
While unprocessed “abstracts” in A
    Remove end of line characters
    Read each line into str
    Search str for concept term
    If str contains phrase (the | a | an) + 2words(and |) + 2words
        write word preceding key phrase and string after key phrase to B
    elseif str contains phrase (the | a | an) + 1word(and |) + 2words
        write word preceding key phrase and string after key phrase to B
    elseif str contains phrase (the | a | an) + 2words
        write word preceding key phrase and string after key phrase to B
close A and B

Figure 2: Information extraction algorithm

The algorithm encodes, through regular expressions, templates for recognizing the types of “action” words that typically occur in biological texts. We discuss this process in more detail in section 4. However, the main problem that our algorithm considers is to discover in advance the kind of information that can be encountered. Rather than attempt to parse the entire corpus we exploit certain linguistic regularities and search for specific semantic relations that need only be defined once. The algorithm takes into account a variable distance between related terms, i.e. longer passages of text, and therefore provides a much more reliable identification of the relationships. Searching up to two words' distance has empirically been shown to be a reasonable trade-off between accuracy and computational complexity. Examples of relationships include:
• A inhibits B
• A activates B
• A interacts with B
• A suppresses B

3 Biological Ontologies and Bayesian Networks

In this section we briefly motivate the need for ontologies and define their limitations with respect to the biological field and for knowledge discovery. Ontologies describe the concepts and relationships that exist for a particular area of interest. They are very useful for the semantic labeling of concepts or definitions [Grivell, 2002; Bard and Rhee, 2004]. This process ensures that entities which are equivalent to other entities in separate databases are identified as referring to the same concepts. Even if these entities have different names or forms they can still be identified by semantic labeling. The role of semantics therefore is much deeper than matching the co-occurrence of a tag or label, since it defines the relationship that exists between concepts. Figure 3 shows the structure and elements of the gene ontology that are pertinent to our study. The first entry refers to GO:0008150 and is one of the three top level structures (biological process, physiological process and cellular process) in the gene ontology hierarchy; the last number (GO:0015758) defines the relationships for the glucose transport pathway. The numbers in brackets refer to the number of entries at that particular level.

The use of ontologies in biology for the semantic integration of heterogeneous data is receiving increased attention; however, problems occur because of the dynamic, changing nature of biological knowledge [McGarry et al., 2006]. These difficulties arise from the highly complex structures that are expensive and problematic to update and maintain [Blaschke and Valencia, 2002]. Another, related problem is that current ontologies have a rather limited vocabulary and cannot express the richness of biological information. Little attention has been paid to defining the relations; much of the research effort and complexity of structure has concentrated on defining the terms. Other important considerations are the spatial and temporal characteristics of the entities.

Furthermore, ontology languages such as DAML+OIL, OWL and RDF are based on crisp logic and have difficulty managing
uncertainty: the incomplete data and noisy information that are encountered in many domains, especially the bioinformatics field. Our research is concerned with Type 2 diabetes; in order to develop a suitable ontology it is necessary to identify the relevant entities within the domain, their attributes and the relationships that exist between these entities.

Accession: GO:0015758
Ontology: biological process
Synonyms: None
Definition: movement of the hexose monosaccharide glucose into, out of, within or between cells.

GO:0008150 : biological process ( 127987 )
  GO:0009987 : cellular process ( 78769 )
    GO:0050875 : cellular physiological process ( 71999 )
      GO:0006810 : transport ( 21084 )
        GO:0008643 : carbohydrate transport ( 498 )
          GO:0015749 : monosaccharide transport ( 206 )
            GO:0008645 : hexose transport ( 168 )
              GO:0015758 : glucose transport ( 115 )

Figure 3: GO structure for Glut4 protein within glucose transport pathway

3.1 Bayesian networks for Ontology Inference and Integration

The integration of sub-symbolic and symbolic computation has received considerable interest over the years [McGarry et al., 1999]. Within this framework the Bayesian approach can be seen as both a learning mechanism and a knowledge representation technique.

Bayes' theorem is shown in equation 1 and presents the probability of the hypothesis (H) conditionalised on evidence (E).

$$P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E \mid H)\,P(H) + P(E \mid \neg H)\,P(\neg H)} \qquad (1)$$

where P(H | E) defines the probability of a hypothesis conditioned on certain evidence, P(E | H) is the likelihood, P(H) is the probability of the hypothesis prior to obtaining any evidence, and P(E), the denominator of equation 1, is the evidence. Therefore, according to Bayesian theory we can update our beliefs regarding the hypothesis when provided with new evidence; this updating is called conditionalization.

The conditional probability distributions (CPD) are described by P(Xi | Ui), where Xi represents node i and Ui are its parent nodes. We must specify the prior probabilities of the root nodes and the conditional probabilities of the remaining nodes given all the combinations of their parent nodes. The joint distribution over the random variables X = {X1, ..., Xn} is determined by the CPD values and is given by:

$$P(X_1, \ldots, X_n) = \prod_i P(X_i \mid U_i) \qquad (2)$$

The CPD values are easy enough to calculate and to use for inference, but the number of parameters required depends on the number of parent nodes; they are usually represented in table format. The nodes are assumed to take discrete or categorical values; however, continuous values may be discretised [Korb and Nicholson, 2004].

$$P(X_1, \ldots, X_n) = \frac{1}{Z} \prod_j \pi_j[C_j] \qquad (3)$$

[Figure 4: two network panels, “Diagnostic reasoning” and “Predictive reasoning”, each over the nodes ACE, GLUT4, IR, ADRB3 and GLUT1, with nodes marked as Query or Evidence.]
Figure 4: The advantages of Bayesian networks include a graphical representation of the structure, i.e. the interconnection relationships between the variables of interest, and they allow for causal discovery or causal interpretation. The example shows the relationships between the insulin resistance problem and how it is affected by the proteins ACE and GLUT4 and the effects of insulin resistance upon other proteins such as GLUT1 and ADRB3.

In figure 4, the various possibilities for inferencing are shown within the insulin resistance domain. The first network shows the diagnostic reasoning approach, which enables the relationships between symptoms and causes to be evaluated; thus, when given some evidence regarding the presence of Glut4 we can update our beliefs about the likelihood of IR being present. When using predictive reasoning we can derive new information about effects given some new information regarding the causes.

4 Methods and Results

We reviewed the literature associated with Type 2 diabetes, with an initial focus on protein interaction in diabetes, and from this review a list of “events” indicative of protein interactions was identified, e.g. activate, inhibit and modulate. This list was used as the starting point to help identify which entities are involved in each type of action or relation. After identifying the names of possible event relations the focus moved to identifying potential entities involved in these relations. In order to complete this task a suitable dataset was required. A search of the PubMed database was conducted and 6113 abstracts related to Type 2 diabetes were used; this dataset is used throughout each subsequent stage of this work. Initially a count was made of the number of times each of the action words occurred in this sample dataset. Some of the words, e.g. “acetylate” and “destabilize”, occurred rarely or not at all, while other words such as “interaction” and “suppression” occurred more frequently.

We now explain how the various parts of our system function together. The information extraction technique synthesizes the entities and relationships from the literature abstracts and generates the structure for a specific ontology on
insulin resistance. We then use the ontology's structure to build a Bayesian network for the purposes of inference and prediction of new protein-to-protein interactions. The relative frequencies of the keywords (entities and relationships) are used to construct the conditional probability tables which define the parent/child node relationships.

Table 1: Biological keywords
Action word       Count   Action word       Count   Action word        Count
acetylate             0   inhibit             109   phosphorylates         5
acetylated            1   inhibited            95   phosphorylation      362
acetylates            0   inhibition          222   regulate              62
acetylation           0   inhibits             59   regulated             62
activate             47   interact             34   regulates             35
activated            69   interacted            0   regulation           333
activates            18   interacting          14   stabilization          6
activation          435   interaction         213   stabilize              3
bind                 31   interactions        101   stabilized             3
binding             914   interacts             7   stabilizes             3
binds                16   modulate             74   suppress              56
bound                31   modulated            23   suppressed           116
destabilization       0   modulates            25   suppresses            13
destabilize           1   modulation           59   suppression          386
destabilized          0   phosphorylate        13   target               235
destabilizes          0   phosphorylated       15

4.1 The Extracted Ontology and Bayesian Network Mapping

Initially, one of these action words, “interaction”, was selected to identify possible entities involved in a relation. The word “interaction”, however, generally forms part of a phrase such as “interaction between”, “interaction of”, and “interaction with”, and therefore each of these phrases would be used by the algorithm to search for potential entities. The first phrase used was “interaction between”. Examples of the resulting phrases extracted are provided in table 2.

Table 2: Biological keywords extracted for the ontology for the phrase “interaction between”
Preceding word    Following words
the               thyroid function and insulin sensitivity
the               dysregulated fat and glucose metabolism
strong            insulin resistance and serum
significant       obesity and insulin resistance
possible          BMI and the adiponectin gene

Ultimately, the successful application of Bayesian techniques is dependent on the use of prior knowledge to improve the estimation of the posterior. If a prior belief exists about a situation then we can use this information to pre-structure our BN. For example, if a particular gene (IPA) is known to regulate several target genes (GDH, GL4, HK2), we would then assign this relationship within the BN by setting the edges between these entities and setting the values in the conditional probability table to define the structural prior accordingly. This is a powerful strategy, but only when it makes sense to do so. The application of incorrect beliefs will produce unreliable estimates of the true posterior regardless of the abundance of the likelihood evidence. Equation 4 shows how we modify the BN with prior knowledge (causal intervention) from the extracted ontology [Chrisman et al., 2003].

$$P(X_{i,j} = z \mid par_M(x), M, \theta : X_{i,j} = Z, \ldots) = 1 \qquad (4)$$

where par_M are the parameters within the model, X_{i,j} are the known effects of the parents of a given node, and θ is the conditional probability and represents the causal conditions. The biological knowledge is incorporated into the BN by specifying the probability for the existence of each potential connection (edge) between the nodes. We assume independence between edges, and the variables in the BN are also assumed to be discrete; this ensures that the calculations are computationally tractable.

Figure 5 shows the structure of a section of our ontology. The nodes are the entities and the arcs determine the relationships between them. The numbers in brackets preceded by “GO:” are the probabilities of the term occurring in the GO ontology.

For example, the following abstract fragment captures knowledge about several proteins and their interactions:

“Overexpression of the cytosolic domain of syntaxin 6 did not affect insulin-stimulated glucose transport, but increased basal deGlc transport and cell surface Glut4 levels. Moreover, the syntaxin 6 cytosolic domain significantly reduced the rate of Glut4 reinternalization after insulin withdrawal and perturbed subendosomal Glut4 sorting; the corresponding domains of syntaxins 8 and 12 were without effect.”

We encountered difficulties with negative implications, i.e. the “did not” and “without effect” phrases negate the occurrence of the relationship but would be taken by the information extraction algorithm as a positive relationship. A more elaborate NLP technique or further crafting of specific regular expression templates would reduce this effect.
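The negation problem described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the template, the `ACTION_WORDS` and `NEGATION_CUES` lists and the function name `extract_relations` are all assumptions made for illustration, and only negation cues placed directly before the action word are handled (a trailing cue such as “without effect” would still need a separate template).

```python
import re

# Hypothetical word lists for illustration; Table 1 of the paper lists many more action words.
ACTION_WORDS = ("interacts with", "inhibits", "activates", "suppresses", "affect")
NEGATION_CUES = ("did not", "does not")

_ACTIONS = "|".join(re.escape(w) for w in ACTION_WORDS)
_NEGATIONS = "|".join(re.escape(c) for c in NEGATION_CUES)

# Template: <subject> [negation cue] <action word> <object>, one token per entity.
TEMPLATE = re.compile(
    rf"(?P<subject>[\w-]+)\s+(?P<negation>(?:{_NEGATIONS})\s+)?"
    rf"(?P<action>{_ACTIONS})\s+(?P<object>[\w-]+)",
    re.IGNORECASE,
)


def extract_relations(sentence):
    """Return (subject, action, object, negated) tuples found in a sentence."""
    return [
        (
            m.group("subject"),
            m.group("action").lower(),
            m.group("object"),
            m.group("negation") is not None,  # True when a cue precedes the action
        )
        for m in TEMPLATE.finditer(sentence)
    ]


if __name__ == "__main__":
    print(extract_relations("ACE inhibits GLUT4"))
    print(extract_relations("syntaxin6 did not affect transport"))
```

Relations flagged as negated could then be discarded, or kept as negative evidence, before the keyword frequencies are counted; which of the two is preferable is left open here.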
[Figure 5: graph of six nodes (node 1 to node 6) carrying identifiers TM:000145 to TM:000150, linked by “is_a”, “interacts with” and “reduces” relations.]
Figure 5: Fragment of the ontology (entities and relations) extracted from the literature

4.2 Validation against Existing Knowledge

We determined a baseline accuracy for our system by “rediscovering” known protein-to-protein interactions from the literature and validating the relationships through accessing a number of online database and ontology repositories. The most up to date and complete is the Gene Ontology (GO); we compare extracted relationships from our ontology with the GO structure. To determine the accuracy, we apply the well-known information retrieval measures of recall and precision. We define recall as the percentage of entity relations represented in the GO and correctly identified. We define precision as the percentage of relations returned by our system that are found in GO.

The recall and precision are calculated by:

$$\text{recall} = \frac{TP}{TP + TN}, \qquad \text{precision} = \frac{TP}{TP + FP}$$

where TP = true positives, FP = false positives, TN = true negatives and FN = false negatives.

Table 3: Recall and Precision of IE on protein-to-protein interaction data
Keyword     TP    TN    FP    FN    Recall (%)   Precision (%)
interact    100   171   20    32    37           83
bind        200   167   17    14    54           92
promote     240   188   17    15    56           93
inhibit     230   178   12    19    56           95

We should note that certain errors in GO have been identified, including inconsistencies and even spelling mistakes. We have also identified that certain GO terms are too general and a more specific term would have been more appropriate. Thus entries with low semantic similarity but high functional similarity can be identified. Figure 6 presents the results of a comparison of the semantic richness of GO and our extracted ontology. We define the semantic richness measure to be based on the correlations between functional similarity and semantic content; a detailed description of this approach can be found in [Lord et al., 2003].

[Figure 6: plot of semantic content (vertical axis, 0 to 1) against functional similarity (horizontal axis, 0 to 1) for the TextMine and GO ontologies.]
Figure 6: Comparison of the semantic richness of vocabulary of the GO and Text Mine ontologies.

The GO ontology structure is extremely limited, with total reliance on “is a” type links. This means that a large amount of semantic information that was originally available from the research articles is missing. We suspect that as ontologies such as GO increase in the number of entities, the relationships between them will take on increased value. However, without incorporating the semantic similarity of the entities, any increase in size will reduce the ontology to free text.

5 Related Work

Research into the automatic generation of ontologies from textual data has received limited attention to date; a notable exception is the work of Blaschke and Valencia, which used clustering techniques at a document level [Blaschke and Valencia, 2002]. The majority of the research attempts to alleviate partial gaps in the knowledge or to repair incorrect annotations in existing ontologies [Missikoff et al., 2003; Wolstencroft et al., 2005]. Using probabilistic techniques to model ontologies is receiving increased attention but this is for manually curated ontologies [Mitra et al., 2005; Smith et al., 2005]. The modeling of biological networks with Bayesian networks using genomic data has seen considerable attention in recent years [Ong et al., 2002]. The initial work on integrating heterogeneous data within a Bayesian network framework was led by Friedman and Segal [Friedman et al., 2000; Segal et al., 2001]. This work demonstrated that Bayesian networks could be trained on genomic data to reconstruct the relationships between genes. The work by Pan et al. is the most similar to ours; however, the authors used Bayesian networks to integrate two ontologies from similar problem domains [Pan et al., 2005]. Comparisons between the semantic similarity and genetic sequence similarity of ontologies have been conducted by Lord [Lord et al., 2003]. We found this work particularly useful as motivation for the development of a richer vocabulary to define entity relationships.

6 Conclusions

The fusion of low level information from sub-symbolic techniques with logic or higher order structures is critically dependent on the level of granularity used. The nodes of our
Bayesian networks are robust to semantic topic drift or catastrophic interference which typically occurs when MLP or other neural feed-forward techniques are trained in dynamic situations using heterogeneous data. In the case of our bioinformatics work we use Bayesian networks to learn from data but also to map existing ontological relations to new Bayesian network structures. Clearly, further work is needed, however, we have extended the current knowledge of automatically generating and integrating ontologies from low level data. The utilization of ontologies as a framework for guiding the knowledge discovery process has to date received little attention. The experimental results presented in this paper led us to conclude that a principled approach such as the Bayesian framework can successfully integrate and represent heterogeneous data and knowledge.

7 Acknowledgements

This work was part supported by a Research Development Fellowship funded by HEFCE and the Biosystems Informatics Institute (Bii).

References

[Ashburner, 2000] M. Ashburner. Gene ontology: tool for the unification of biology. Nature Genetics, 25:25–29, 2000.
[Bard and Rhee, 2004] J. Bard and S. Rhee. Ontologies in biology: design applications and future challenges. Nature Reviews Genetics, 5:213–222, 2004.
[Blaschke and Valencia, 2002] C. Blaschke and A. Valencia. Automatic ontology construction from the literature. Genome Informatics, 13:201–213, 2002.
[Chrisman et al., 2003] L. Chrisman, P. Langley, S. Bray, and A. Pohorille. Incorporating biological knowledge into evaluation of causal regulatory hypothesis. In Proceedings of the Pacific Symposium on Biocomputing, pages 128–139, Kauai, Hawaii, 2003.
[Friedman et al., 2000] N. Friedman, M. Linial, I. Nachman, and D. Pe’er. Using bayesian networks to analyze expression data. Journal of Computational Biology, 7(3-4):601–620, 2000.
[Grivell, 2002] L. Grivell. Mining the bibliome: searching for a needle in a haystack?: new computing tools are needed to effectively scan the growing amount of scientific literature for useful information. EMBO Reports, 3(31):200–203, 2002.
[Hearst, 1992] M. Hearst. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th conference on Computational linguistics, pages 539–545, 1992.
[Korb and Nicholson, 2004] K. Korb and A. Nicholson. Bayesian Artificial Intelligence. Chapman and Hall/CRC, 2004.
[Krauthammer and Nenadic, 2004] M. Krauthammer and G. Nenadic. Term identification in the biomedical literature. Journal of Biomedical Informatics, 37:512–526, 2004.
[Lord et al., 2003] P. Lord, R. Stevens, A. Brass, and C. Goble. Investigating semantic similarity measures across the gene ontology: the relationship between sequence and annotation. Bioinformatics, 19:1275–1283, 2003.
[Mack and Henenberger, 2002] R. Mack and M. Henenberger. Text-based knowledge discovery: search and mining of life-sciences documents. Drug Discovery Today, 7:11, 2002.
[McGarry et al., 1999] K. McGarry, S. Wermter, and J. MacIntyre. Hybrid neural systems: from simple coupling to fully integrated neural networks. Neural Computing Surveys, 2(1):62–93, 1999.
[McGarry et al., 2006] K. McGarry, S. Garfield, and N. Morris. Recent trends in knowledge and data integration for the life sciences. Expert Systems: the Journal of Knowledge Engineering, 23(5):337–348, 2006.
[Missikoff et al., 2003] M. Missikoff, P. Velardi, and P. Fabriani. Text mining techniques to automatically enrich a domain ontology. Applied Intelligence, 18:323–340, 2003.
[Mitra et al., 2005] P. Mitra, N. Noy, and A. Jaiswal. Ontology mapping discovery with uncertainty. In Fourth International Semantic Web Conference (ISWC), 2005.
[Nahm and Mooney, 2002] U. Nahm and R. Mooney. Text mining with information extraction. In Proceedings of the AAAI 2002 Spring Symposium on Mining Answers from Texts and Knowledge Bases, 2002.
[Ong et al., 2002] I. Ong, J. Glasner, and D. Page. Modelling regulatory pathways in E. coli from time series expression profiles. Bioinformatics, 18(1):241–248, 2002.
[Pan et al., 2005] R. Pan, Z. Ding, Y. Yu, and Y. Peng. A bayesian network approach to ontology mapping. In ISWC 2005 4th International Semantic Web Conference, pages 563–577, Galway, Ireland, 2005.
[Rosario and Hearst, 2004] B. Rosario and M. Hearst. Classifying semantic relations in bioscience texts. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL2004), pages 430–437, 2004.
[Segal et al., 2001] E. Segal, B. Tasker, A. Gasch, N. Friedman, and D. Koller. Rich probabilistic models for gene expression. Bioinformatics, 17(1):243–252, 2001.
[Smith et al., 2005] B. Smith, W. Ceusters, and J. Kohler. Relations in biomedical ontologies. Genome Biology, 6(5):46–58, 2005.
[Tiffin et al., 2005] N. Tiffin, J. Kelso, A. Powell, H. Pan, V. Bajic, and W. Hide. Integration of text and data-mining using ontologies successfully selects disease gene candidates. Nucleic Acids Research, 33(5):1544–1552, 2005.
[Wolstencroft et al., 2005] K. Wolstencroft, R. McEntire, R. Stevens, L. Tabernero, and A. Brass. Constructing ontology-driven protein family databases. Bioinformatics, 21(8):1685–1692, 2005.