=Paper= {{Paper |id=Vol-230/paper-7 |storemode=property |title=Integration of Hybrid Bio-Ontologies using Bayesian Networks for Knowledge Discovery |pdfUrl=https://ceur-ws.org/Vol-230/07-mcgarry.pdf |volume=Vol-230 |dblpUrl=https://dblp.org/rec/conf/ijcai/McGarryGMW07 }} ==Integration of Hybrid Bio-Ontologies using Bayesian Networks for Knowledge Discovery== https://ceur-ws.org/Vol-230/07-mcgarry.pdf
Integration of Hybrid Bio-Ontologies using Bayesian Networks for Knowledge Discovery
Ken McGarry∗, Sheila Garfield∗, Nick Morris† and Stefan Wermter∗
∗ Department of Computing and Technology, University of Sunderland, UK
† Institute for Cell and Molecular Biosciences, University of Newcastle, UK
{ken.mcgarry,sheila.garfield,stefan.wermter}@sunderland.ac.uk, n.j.morris@ncl.ac.uk

Abstract

This paper describes how high level biological knowledge obtained from ontologies such as the Gene Ontology (GO) can be integrated with low level information extracted from a Bayesian network trained on protein interaction data. We can automatically generate a biological ontology by text mining the type II diabetes research literature. The ontology is populated with the entities and relationships from protein-to-protein interactions. New, previously unrelated information is extracted from the growing body of research literature and incorporated with knowledge already known on this subject from the gene ontology and databases such as BIND and BioGRID. We integrate the ontology within the probabilistic framework of Bayesian networks which enables reasoning and prediction of protein function.

1   Introduction

The large amounts of genomic and proteomic data that are generated by biological experiments are now enabling deeper insights into cellular and molecular function. New technologies such as microarrays and electrophoresis gels are providing vast quantities of experimental data at unprecedented rates. All of this information needs to be stored and carefully annotated. With each new experiment providing details of new protein-to-protein interactions, new biological pathways and new genes it is essential that these discoveries are made available to the scientific community. To this end, online scientific databases are now in place that disseminate these results. These databases such as the popular Gene Ontology (GO) are updated at intervals to reflect the latest developments [Ashburner, 2000].
   The updating is done by experts who manually revise each entry by reading the research literature and annotating the database collections accordingly. If necessary, they will contact the experimenters to resolve any ambiguities or problems. In terms of data quality, the databases are quite reliable and robust. Unfortunately, hand annotation is a slow process and the databases are lagging behind the experimental work by a considerable margin. This prevents researchers from immediately accessing the most recent discoveries. Unless the researchers are familiar with the journals where the new results are published, they would be unlikely to encounter this information. Given the fragmented and highly specialized nature of biological research, this may seldom occur. Therefore the need for automated extraction of knowledge from the literature is well motivated. Recent advances in text analytics combine techniques from information retrieval (IR) and information extraction (IE) which allow researchers to explore the relevant literature more effectively [Mack and Henenberger, 2002]. However, these techniques require knowledge discovery methods to uncover complex embedded structures, relationships and connections between seemingly unrelated facts that typically exist in the biomedical literature [Tiffin et al., 2005].

[Figure 1 diagram. Box labels: keywords relating to insulin resistance; PubMed biomedical text; preprocessing and information extraction of text data; organising keyword data (entities and relations) into hierarchical structure; relative frequencies and probabilities calculations for Bayesian network CPTs (conditional probability tables); transfer of knowledge into Bayesian network format; inference with BN on proteins/genes without annotations; gaps in knowledge defined and experimental procedures to follow; validate with existing knowledge of pathways and interactions; biomedical ontologies, knowledge bases (GO, BIND and KEGG). Arrow labels: abstracts + main text; keywords; onto-structure; CPTs.]

Figure 1: Overview of methodology and information extraction process

   Our particular research area is that of diabetes, in particular the effects of insulin resistance on protein expression and insulin regulated protein trafficking in fat cells. In recent years there has been a dramatic worldwide increase of those suffering with diabetes. In the year 2000, there were 171 million cases and by 2030 the World Health Organization (WHO) has predicted there will be 366 million people suffering from this condition (www.who.int/diabetes/facts/). The WHO data is for diagnosed cases but the undiagnosed cases are estimated by the WHO at 14.6 million for the US alone.
   In this paper we present our results of how we automatically generate a viable ontology based on information extraction of keywords from the research literature. The keywords define the entities and relationships of important genes, gene relationships and protein-to-protein interactions that operate and coexist in biological processes related to insulin resistance. Furthermore, the ontology is cast within a probabilistic framework using Bayesian networks which are used for the inferencing and prediction of protein function. Figure 1 gives the overall methodology for the extraction of information and construction of the ontology.
   The remainder of this paper is structured as follows: section two outlines our information extraction scheme for identifying the entities and relationships of interest, section three provides an overview of biological ontologies and gives details of how we use Bayesian networks for inference and reasoning. Section four discusses our methodology and experimental results, section five reviews the related work and our claim for novelty and finally section six presents the conclusions.

2   Information Extraction

Unstructured text is a very flexible and powerful means of communication; it allows us to describe quite complex concepts. The semantic meaning of a sentence can be expressed in many different ways but it is this flexibility which is the cause of difficulty for algorithmic sentence analysis by computers. One technique of overcoming this problem is to use information extraction (IE) to seek out the important entities in the text and the relationships between them [Hearst, 1992; Rosario and Hearst, 2004]. The IE process can involve encoding patterns by hand, such as regular expressions to search for the required entities and relations, or using semi-automated machine learning techniques [Nahm and Mooney, 2002; Krauthammer and Nenadic, 2004]. The algorithm we developed is shown in figure 2.
   The algorithm encodes, through regular expressions, templates for recognizing the types of “action” words that typically occur in biological texts. We discuss this process in more detail in section 4. However, the main problem that our algorithm considers is to discover in advance the kind of information that can be encountered. Rather than attempt to parse the entire corpus we exploit certain linguistic regularities and search for specific semantic relations that need only be defined once. The algorithm takes into account a variable distance between related terms, i.e. longer passages of text, and therefore provides a much more reliable identification of the relationships. Searching up to a distance of two words has been shown empirically to be a reasonable trade-off of accuracy versus computational complexity. Examples of relationships include:
   • A inhibits B
   • A activates B
   • A interacts with B
   • A suppresses B

3   Biological Ontologies and Bayesian Networks

In this section we briefly motivate the need for ontologies and define their limitations with respect to the biological field and for knowledge discovery. Ontologies describe the concepts and relationships that exist for a particular area of interest. They are very useful for the semantic labeling of concepts or definitions [Grivell, 2002; Bard and Rhee, 2004]. This process ensures that entities which are equivalent to other entities in separate databases are identified as referring to the same concepts. Even if these entities have different names or forms they can still be identified by semantic labeling. The role of semantics therefore is much deeper than matching the co-occurrence of a tag or label, since it defines the relationship that exists between concepts. Figure 3 shows the structure and elements of the gene ontology that are pertinent to our study. The first entry refers to GO:0008150 and is one of the three top level structures (biological process, physiological process and cellular process) in the gene ontology hierarchy; the last number (GO:0015758) defines the relationships for the glucose transport pathway. The numbers in brackets
refer to the number of entries at that particular level.
   The use of ontologies in biology for the semantic integration of heterogeneous data is receiving increased attention; however, problems occur because of the dynamic, changing nature of biological knowledge [McGarry et al., 2006]. These difficulties arise from the highly complex structures that are expensive and problematic to update and maintain [Blaschke and Valencia, 2002]. Another, related problem is that current ontologies have a rather limited vocabulary and cannot express the richness of biological information. Little attention has been paid to defining the relations; much of the research effort and complexity of structure has concentrated on defining the terms. Other considerations that are important are the spatial and temporal characteristics of the entities.
   Furthermore, ontologies such as DAML+OIL, OWL and RDF are based on crisp logic and have difficulty managing

Inputs: Abstract file A, String str
Outputs: Keyword file B

Load file A
While unprocessed “abstracts” in A
   Remove end of line characters
   Read each line into str
   Search string for concept term
   if str contains phrase (the | a | an) + 2words(and |) + 2words
      write word preceding key phrase and string after key phrase to B
   elseif str contains phrase (the | a | an) + 1word(and |) + 2words
      write word preceding key phrase and string after key phrase to B
   elseif str contains phrase (the | a | an) + 2words
      write word preceding key phrase and string after key phrase to B
close A and B

Figure 2: Information extraction algorithm
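
The pseudocode of Figure 2 can be approximated in a few lines of Python. This is only a minimal sketch under stated assumptions, not the implementation used for the experiments: the names `extract_relations`, `TEMPLATES`, `WORD` and `ARTICLE` are hypothetical, the article is treated as optional, and results are returned as a list instead of being written to file B.

```python
import re

WORD = r"[A-Za-z0-9][\w-]*"           # assumed definition of a "word"
ARTICLE = r"(?:(?:the|a|an)\s+)?"     # optional article (an assumption)

# Templates for the text after the key phrase, most specific first,
# mirroring the three branches of Figure 2.
TEMPLATES = [
    rf"{ARTICLE}{WORD}\s+{WORD}\s+and\s+{WORD}\s+{WORD}",  # 2 words and 2 words
    rf"{ARTICLE}{WORD}\s+and\s+{WORD}\s+{WORD}",           # 1 word and 2 words
    rf"{ARTICLE}{WORD}\s+{WORD}",                          # 2 words
]


def extract_relations(abstract, key_phrase):
    """Return (preceding word, following words) pairs for a key phrase."""
    text = " ".join(abstract.split())  # remove end-of-line characters
    key = re.compile(
        rf"\b({WORD})\s+{re.escape(key_phrase)}\s+", re.IGNORECASE
    )
    results = []
    for match in key.finditer(text):
        rest = text[match.end():]
        # Try each template in turn and keep the first one that fits.
        for template in TEMPLATES:
            hit = re.match(template, rest, re.IGNORECASE)
            if hit:
                results.append((match.group(1), hit.group(0)))
                break
    return results


if __name__ == "__main__":
    sample = (
        "We studied the interaction between thyroid function and insulin\n"
        "sensitivity in adults. A significant interaction between obesity\n"
        "and insulin resistance was found."
    )
    for preceding, following in extract_relations(sample, "interaction between"):
        print(preceding, "|", following)
```

Trying the templates from most to least specific keeps the longest available context for each occurrence of the key phrase, which corresponds to the if/elseif ordering in the figure.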
uncertainty, incomplete data and noisy information that are encountered in many domains, especially the bioinformatics field. Our research is concerned with Type 2 diabetes; in order to develop a suitable ontology it is necessary to identify the relevant entities within the domain, their attributes and the relationships that exist between these entities.

Accession: GO:0015758
Ontology: biological process
Synonyms: None
Definition: movement of the hexose monosaccharide glucose into,
            out of, within or between cells.
GO:0008150 : biological process ( 127987 )
    GO:0009987 : cellular process ( 78769 )
         GO:0050875 : cellular physiological process ( 71999 )
             GO:0006810 : transport ( 21084 )
                 GO:0008643 : carbohydrate transport ( 498 )
                      GO:0015749 : monosaccharide transport ( 206 )
                          GO:0008645 : hexose transport ( 168 )
                               GO:0015758 : glucose transport ( 115 )

Figure 3: GO structure for Glut4 protein within glucose transport pathway

3.1   Bayesian networks for Ontology Inference and Integration

The integration of sub-symbolic and symbolic computation has received considerable interest over the years [McGarry et al., 1999]. Within this framework the Bayesian approach can be seen as both a learning mechanism and as a knowledge representation technique.
   Bayes' theorem is shown in equation 1 and presents the probability of the hypothesis (H) conditionalised on evidence (E).

$$P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E \mid H)\,P(H) + P(E \mid \neg H)\,P(\neg H)} \tag{1}$$

   where $P(H \mid E)$ is the probability of a hypothesis conditioned on certain evidence, $P(E \mid H)$ is the likelihood, $P(H)$ is the probability of the hypothesis prior to obtaining any evidence, and the denominator is the evidence $P(E)$. Therefore, according to Bayesian theory we can update our beliefs regarding the hypothesis when provided with new evidence; this probabilistic updating is called conditionalization.
   The conditional probability distributions (CPD) are described by $P(X_i \mid U_i)$, where $X_i$ represents node $i$ and $U_i$ are its parent nodes. We must specify the prior probabilities of the nodes and the conditional probabilities of the nodes given all the combinations of their parent nodes. The joint distribution over the random variables $X = \{X_1, \ldots, X_n\}$ is obtained from the CPD values and is given by:

$$P(X_1, \ldots, X_n) = \prod_{i} P(X_i \mid U_i) \tag{2}$$

   The CPD values are easy enough to calculate and to use for inference, but the number of parameters they require depends on the number of parent nodes; they are usually represented in table format. The nodes are assumed to take discrete or categorical values; however, continuous values may be discretised [Korb and Nicholson, 2004].

$$P(X_1, \ldots, X_n) = \frac{1}{Z} \prod_{j} \pi_j[C_j] \tag{3}$$

[Figure 4 diagram. Two panels, “Diagnostic reasoning” and “Predictive reasoning”, each showing the nodes ACE, GLUT4, IR, ADRB3 and GLUT1 with individual nodes marked as Query or Evidence.]

Figure 4: The advantages of Bayesian networks include a graphical representation of the structure, i.e. the interconnection relationships between the variables of interest, and they allow for causal discovery or causal interpretation. The example shows the relationships between the insulin resistance problem and how it is affected by the proteins ACE and GLUT4 and the effects of insulin resistance upon other proteins such as GLUT1 and ADRB3.

   In figure 4, the various possibilities for inferencing are shown within the insulin resistance domain. The first network shows the diagnostic reasoning approach which enables the relationships between symptoms and causes to be evaluated; thus when given some evidence regarding the presence of Glut4 we can update our beliefs about the likelihood of IR being present. When using predictive reasoning we can derive new information about effects given some new information regarding the causes.

4   Methods and Results

We reviewed the literature associated with Type 2 diabetes, with the initial focus on protein interaction in diabetes, and from this review a list of “events” indicative of protein interactions was identified, e.g. activate, inhibit and modulate. This list was used as the starting point to help identify which entities are involved in each type of action or relation. After identifying the names of possible event relations the focus moved to identifying potential entities involved in these relations. In order to complete this task a suitable dataset was required. A search of the PubMed database was conducted and 6113 abstracts related to Type 2 diabetes were used; this dataset is used throughout each subsequent stage of this work. Initially a count was made of the number of times each of the action words occurred in this sample dataset. Some of the words, e.g. “acetylate” and “destabilize”, did not occur at all, while other words such as “interaction” and “suppression” occurred more frequently.
   We now explain how the various parts of our system function together. The information extraction technique synthesizes the entities and relationships from the literature abstracts and generates the structure for a specific ontology on
Table 1: Biological keywords
Action word       Count   Action word       Count   Action word       Count
acetylate             0   inhibit             109   phosphorylates        5
acetylated            1   inhibited            95   phosphorylation     362
acetylates            0   inhibition          222   regulate             62
acetylation           0   inhibits             59   regulated            62
activate             47   interact             34   regulates            35
activated            69   interacted            0   regulation          333
activates            18   interacting          14   stabilization         6
activation          435   interaction         213   stabilize             3
bind                 31   interactions        101   stabilized            3
binding             914   interacts             7   stabilizes            3
binds                16   modulate             74   suppress             56
bound                31   modulated            23   suppressed          116
destabilization       0   modulates            25   suppresses           13
destabilize           1   modulation           59   suppression         386
destabilized          0   phosphorylate        13   target              235
destabilizes          0   phosphorylated       15
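
Counts of the kind reported in Table 1 could be produced along the following lines. This is a minimal sketch and not the procedure used to build the table: the function name `count_action_words` is hypothetical, matching is assumed to be whole-word and case-insensitive, and the two sample abstracts are invented for illustration.

```python
import re


def count_action_words(abstracts, action_words):
    """Count whole-word, case-insensitive occurrences of each action word."""
    counts = {word.lower(): 0 for word in action_words}
    for abstract in abstracts:
        # Split the abstract into alphabetic tokens and tally the wanted ones.
        for token in re.findall(r"[A-Za-z]+", abstract.lower()):
            if token in counts:
                counts[token] += 1
    return counts


if __name__ == "__main__":
    # Invented sample text, used only to exercise the function.
    abstracts = [
        "Insulin activates Akt, and Akt activation inhibits GSK3.",
        "Binding of insulin inhibits lipolysis.",
    ]
    words = ["activates", "activation", "inhibits", "binding", "acetylate"]
    for word, count in count_action_words(abstracts, words).items():
        print(f"{word:<12}{count}")
```

Words that never occur keep a count of zero, so the output has one row per action word, as in the table.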


insulin resistance. We then use the ontology's structure to build a Bayesian network for the purposes of inference and prediction of new protein-to-protein interactions. The relative frequencies of the keywords (entities and relationships) are used to construct the conditional probability tables which define the parent/child node relationships.

4.1   The Extracted Ontology and Bayesian network Mapping

Initially, one of these action words, “interaction”, was selected to identify possible entities involved in a relation. The word “interaction” however generally forms part of a phrase such as “interaction between”, “interaction of”, and “interaction with”, and therefore each of these phrases would be used by the algorithm to search for potential entities. The first phrase used was “interaction between”. Examples of the resulting phrases extracted are provided in table 2.

Table 2: Biological keywords extracted for the ontology for the phrase “interaction between”
  Preceding word   Following words
  the              thyroid function and insulin sensitivity
  the              dysregulated fat and glucose metabolism
  strong           insulin resistance and serum
  significant      obesity and insulin resistance
  possible         BMI and the adiponectin gene

   Ultimately, the successful application of Bayesian techniques is dependent on the use of prior knowledge to improve the estimation of the posterior. If a prior belief exists about a situation then we can use this information to pre-structure our BN. For example, if a particular gene (IPA) is known to regulate several target genes (GDH, GL4, HK2), we would then assign this relationship within the BN by setting the edges between these entities and setting the values in the conditional probability table to define the structural prior accordingly. This is a powerful strategy, but only when it makes sense to do so. The application of incorrect beliefs will produce unreliable estimates of the true posterior regardless of the abundance of the likelihood evidence. Equation 4 shows how we modify the BN with prior knowledge (causal intervention) from the extracted ontology [Chrisman et al., 2003].

$$P(X_{i,j} = z \mid \mathrm{par}_M(x), M, \theta : X_{i,j} = Z, \ldots) = 1 \tag{4}$$

   where $\mathrm{par}_M$ are the parameters within the model, $X_{i,j}$ are the known effects of the parents of a given node, and $\theta$ is the conditionalized conditional probability, which represents the causal conditions. The biological knowledge is incorporated into the BN by specifying the probability for the existence of each potential connection (edge) between them. We assume independence between edges and the variables in the BN are also assumed to be discrete; this ensures that the calculations are computationally tractable.
   Figure 5 shows the structure of a section of our ontology. The nodes are the entities and the arcs determine the relationships between them. The numbers in brackets preceded by “GO:” are the probabilities of the term occurring in the GO ontology, the numbers.
   For example the following abstract fragment captures knowledge about several proteins and their interactions:
   “Overexpression of the cytosolic domain of syntaxin 6 did not affect insulin-stimulated glucose transport, but increased basal deGlc transport and cell surface Glut4 levels. Moreover, the syntaxin 6 cytosolic domain significantly reduced the rate of Glut4 reinternalization after insulin withdrawal and perturbed subendosomal Glut4 sorting; the corresponding domains of syntaxins 8 and 12 were without effect.”
   We encountered difficulties with negative implications, i.e. the “did not” and “without effect” phrases negate the occurrence of the relationship but would be taken by the information extraction algorithm as a positive relationship. A more elaborate NLP technique or further crafting of specific regular expression templates would reduce this effect.
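
One way the extra regular expression templates mentioned above might look is sketched below. It is an illustration only, not part of the system described in this paper: the cue list in `NEGATION_CUES`, the clause boundaries in `CLAUSE_BREAK` and the function names are all assumptions. The idea is to split the text into clauses and flag any clause that contains a negation cue, so that relations found in it are not recorded as positive.

```python
import re

# Assumed negation cues; a real system would need a longer, curated list.
NEGATION_CUES = re.compile(
    r"\b(?:did not|does not|do not|not|no|without effect|failed to)\b",
    re.IGNORECASE,
)
# Assumed clause boundaries: sentence ends, semicolons and ", but".
CLAUSE_BREAK = re.compile(r"[.;]\s+|,\s+but\s+", re.IGNORECASE)


def split_clauses(text):
    """Split text into rough clauses at the assumed boundaries."""
    return [part.strip() for part in CLAUSE_BREAK.split(text) if part.strip()]


def tag_clauses(text):
    """Return (clause, is_negated) pairs for every clause in the text."""
    return [
        (clause, bool(NEGATION_CUES.search(clause)))
        for clause in split_clauses(text)
    ]


if __name__ == "__main__":
    fragment = (
        "Overexpression of the cytosolic domain of syntaxin 6 did not affect "
        "insulin-stimulated glucose transport, but increased basal deGlc "
        "transport and cell surface Glut4 levels. Moreover, the syntaxin 6 "
        "cytosolic domain significantly reduced the rate of Glut4 "
        "reinternalization after insulin withdrawal and perturbed "
        "subendosomal Glut4 sorting; the corresponding domains of syntaxins "
        "8 and 12 were without effect."
    )
    for clause, negated in tag_clauses(fragment):
        print("NEG" if negated else "POS", "|", clause)
```

A clause-level test of this kind is deliberately crude: it cannot tell which relation inside a clause a cue applies to, which is where a fuller NLP technique would be needed.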
[Figure 5 diagram. Nodes: node 1 (TM:000145), node 2 (TM:000146), node 3 (TM:000147), node 4 (TM:000148), node 5 (TM:000149), node 6 (TM:000150). Edge labels: interacts with, is_a, reduces.]

Figure 5: Fragment of the ontology (entities and relations) extracted from the literature

[Figure 6 plot. y-axis: Semantic content (0 to 1); x-axis: Functional Similarity (0 to 1); series: TextMine Ontology, GO Ontology.]
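
To make the mapping of section 4.1 concrete, the sketch below builds a small discrete Bayesian network with the structure described for Figure 4 (ACE and GLUT4 as parents of IR, IR as parent of GLUT1 and ADRB3), computes the joint probability as the product in equation 2, and answers diagnostic and predictive queries by enumeration. It is a toy illustration under stated assumptions: every probability in `CPT` is invented, all variables are assumed binary, and the names `PARENTS`, `CPT`, `joint` and `posterior` are hypothetical.

```python
from itertools import product

# Parent sets following the structure described for Figure 4.
PARENTS = {
    "ACE": (),
    "GLUT4": (),
    "IR": ("ACE", "GLUT4"),
    "GLUT1": ("IR",),
    "ADRB3": ("IR",),
}

# CPT[node][parent values] = P(node is True | parents).
# All numbers are invented for illustration only.
CPT = {
    "ACE": {(): 0.3},
    "GLUT4": {(): 0.6},
    "IR": {
        (True, True): 0.5,
        (True, False): 0.9,
        (False, True): 0.1,
        (False, False): 0.6,
    },
    "GLUT1": {(True,): 0.7, (False,): 0.2},
    "ADRB3": {(True,): 0.4, (False,): 0.1},
}


def joint(assignment):
    """Equation 2: product over all nodes of P(X_i | U_i)."""
    probability = 1.0
    for node, parents in PARENTS.items():
        p_true = CPT[node][tuple(assignment[u] for u in parents)]
        probability *= p_true if assignment[node] else 1.0 - p_true
    return probability


def posterior(query, evidence):
    """P(query is True | evidence), summing the joint over all assignments."""
    nodes = list(PARENTS)
    numerator = denominator = 0.0
    for values in product([True, False], repeat=len(nodes)):
        assignment = dict(zip(nodes, values))
        if any(assignment[k] != v for k, v in evidence.items()):
            continue  # inconsistent with the evidence
        p = joint(assignment)
        denominator += p
        if assignment[query]:
            numerator += p
    return numerator / denominator


if __name__ == "__main__":
    print("P(IR)                =", round(posterior("IR", {}), 4))
    print("P(IR | GLUT1=True)   =", round(posterior("IR", {"GLUT1": True}), 4))
    print("P(GLUT1 | ACE=True)  =", round(posterior("GLUT1", {"ACE": True}), 4))
```

Enumeration is exponential in the number of nodes, so it only suits a toy network like this one; it is used here because it shows equation 2 and the belief update directly, without relying on any particular Bayesian network library.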




Figure 6: Comparison of the semantic richness of vocabulary of the GO and Text Mine ontologies.

4.2   Validation against Existing Knowledge
We determined a baseline accuracy for our system by “rediscovering” known protein-to-protein interactions from the literature and validating the relationships by accessing a number of online database and ontology repositories. The most up to date and complete of these is the Gene Ontology (GO), so we compare the relationships extracted from our ontology with the GO structure. To determine the accuracy, we apply the well-known information retrieval measures of recall and precision. We define recall as the percentage of entity relations represented in the GO and correctly identified. We define precision as the percentage of relations found in GO and returned by our system.

   The recall and precision are calculated by:
recall = TP/(TP + TN),
precision = TP/(TP + FP),
where TP = true positives, FP = false positives, TN = true negatives and FN = false negatives.

Table 3: Recall and Precision of IE on protein-to-protein interaction data
  Keyword     TP    TN    FP    FN    Recall (%)    Precision (%)
  interact    100   171   20    32    37            83
  bind        200   167   17    14    54            92
  promote     240   188   17    15    56            93
  inhibit     230   178   12    19    56            95

   We should note that certain errors in GO have been identified, including inconsistencies and even spelling mistakes. We have also identified that certain GO terms are too general and that a more specific term would have been more appropriate. Thus entries with low semantic similarity but high functional similarity can be identified. Figure 6 presents the results of a comparison of the semantic richness of GO and our extracted ontology. We define the semantic richness measure to be based on the correlations between functional similarity and semantic content; a detailed description of this approach can be found in [Lord et al., 2003].

   The GO ontology structure is extremely limited, with total reliance on “is a” type links. This means that a large amount of semantic information that was originally available from the research articles is missing. We suspect that as ontologies such as GO increase in the number of entities, the relationships between them will take on increased value. However, without incorporating the semantic similarity of the entities, any increase in size will reduce the ontology to free text.

5   Related Work
Research into the automatic generation of ontologies from textual data has received limited attention to date; a notable exception is the work of Blaschke and Valencia, which used clustering techniques at a document level [Blaschke and Valencia, 2002]. The majority of the research attempts to alleviate partial gaps in the knowledge or to repair incorrect annotations in existing ontologies [Missikoff et al., 2003; Wolstencroft et al., 2005]. Using probabilistic techniques to model ontologies is receiving increased attention, but this is for manually curated ontologies [Mitra et al., 2005; Smith et al., 2005]. The modeling of biological networks with Bayesian networks using genomic data has seen considerable attention in recent years [Ong et al., 2002]. The initial work on integrating heterogeneous data within a Bayesian network framework was led by Friedman and Segal [Friedman et al., 2000; Segal et al., 2001]. This work showed that Bayesian networks could be trained on genomic data to reconstruct the relationships between genes. The work by Pan et al. is the most similar to ours; however, the authors used Bayesian networks to integrate two ontologies from similar problem domains [Pan et al., 2005]. Comparisons between the semantic similarity and genetic sequence similarity of ontologies have been conducted by Lord [Lord et al., 2003]. We found this work particularly useful as motivation for the development of a richer vocabulary to define entity relationships.

6   Conclusions
The fusion of low level information from sub-symbolic techniques with logic or higher order structures is critically dependent on the level of granularity used. The nodes of our
Bayesian networks are robust to semantic topic drift or catastrophic interference, which typically occurs when MLP or other neural feed-forward techniques are trained in dynamic situations using heterogeneous data. In the case of our bioinformatics work we use Bayesian networks to learn from data but also to map existing ontological relations to new Bayesian network structures. Clearly, further work is needed; however, we have extended the current knowledge of automatically generating and integrating ontologies from low level data. The utilization of ontologies as a framework for guiding the knowledge discovery process has to date received little attention. The experimental results presented in this paper led us to conclude that a principled approach such as the Bayesian framework can successfully integrate and represent heterogeneous data and knowledge.

7   Acknowledgements
This work was part supported by a Research Development Fellowship funded by HEFCE and the Biosystems Informatics Institute (Bii).

References
[Ashburner, 2000] M. Ashburner. Gene ontology: tool for the unification of biology. Nature Genetics, 25:25–29, 2000.
[Bard and Rhee, 2004] J. Bard and S. Rhee. Ontologies in biology: design applications and future challenges. Nature Reviews Genetics, 5:213–222, 2004.
[Blaschke and Valencia, 2002] C. Blaschke and A. Valencia. Automatic ontology construction from the literature. Genome Informatics, 13:201–213, 2002.
[Chrisman et al., 2003] L. Chrisman, P. Langley, S. Bray, and A. Pohorille. Incorporating biological knowledge into evaluation of causal regulatory hypothesis. In Proceedings of the Pacific Symposium on Biocomputing, pages 128–139, Kauai, Hawaii, 2003.
[Friedman et al., 2000] N. Friedman, M. Linial, I. Nachman, and D. Pe’er. Using bayesian networks to analyze expression data. Journal of Computational Biology, 7(3-4):601–620, 2000.
[Grivell, 2002] L. Grivell. Mining the bibliome: searching for a needle in a haystack?: new computing tools are needed to effectively scan the growing amount of scientific literature for useful information. EMBO Reports, 3(31):200–203, 2002.
[Hearst, 1992] M. Hearst. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th conference on Computational linguistics, pages 539–545, 1992.
[Korb and Nicholson, 2004] K. Korb and A. Nicholson. Bayesian Artificial Intelligence. Chapman and Hall/CRC, 2004.
[Krauthammer and Nenadic, 2004] M. Krauthammer and G. Nenadic. Term identification in the biomedical literature. Journal of Biomedical Informatics, 37:512–526, 2004.
[Lord et al., 2003] P. Lord, R. Stevens, A. Brass, and C. Goble. Investigating semantic similarity measures across the gene ontology: the relationship between sequence and annotation. Bioinformatics, 19:1275–1283, 2003.
[Mack and Henenberger, 2002] R. Mack and M. Henenberger. Text-based knowledge discovery: search and mining of life-sciences documents. Drug Discovery Today, 7:11, 2002.
[McGarry et al., 1999] K. McGarry, S. Wermter, and J. MacIntyre. Hybrid neural systems: from simple coupling to fully integrated neural networks. Neural Computing Surveys, 2(1):62–93, 1999.
[McGarry et al., 2006] K. McGarry, S. Garfield, and N. Morris. Recent trends in knowledge and data integration for the life sciences. Expert Systems: the Journal of Knowledge Engineering, 23(5):337–348, 2006.
[Missikoff et al., 2003] M. Missikoff, P. Velardi, and P. Fabriani. Text mining techniques to automatically enrich a domain ontology. Applied Intelligence, 18:323–340, 2003.
[Mitra et al., 2005] P. Mitra, N. Noy, and A. Jaiswal. Ontology mapping discovery with uncertainty. In Fourth International Semantic Web Conference (ISWC), 2005.
[Nahm and Mooney, 2002] U. Nahm and R. Mooney. Text mining with information extraction. In Proceedings of the AAAI 2002 Spring Symposium on Mining Answers from Texts and Knowledge Bases, 2002.
[Ong et al., 2002] I. Ong, J. Glasner, and D. Page. Modelling regulatory pathways in E. coli from time series expression profiles. Bioinformatics, 18(1):241–248, 2002.
[Pan et al., 2005] R. Pan, Z. Ding, Y. Yu, and Y. Peng. A bayesian network approach to ontology mapping. In ISWC 2005 4th International Semantic Web Conference, pages 563–577, Galway, Ireland, 2005.
[Rosario and Hearst, 2004] B. Rosario and M. Hearst. Classifying semantic relations in bioscience texts. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL2004), pages 430–437, 2004.
[Segal et al., 2001] E. Segal, B. Tasker, A. Gasch, N. Friedman, and D. Koller. Rich probabilistic models for gene expression. Bioinformatics, 17(1):243–252, 2001.
[Smith et al., 2005] B. Smith, W. Ceusters, and J. Kohler. Relations in biomedical ontologies. Genome Biology, 6(5):46–58, 2005.
[Tiffin et al., 2005] N. Tiffin, J. Kelso, A. Powell, H. Pan, V. Bajic, and W. Hide. Integration of text and data-mining using ontologies successfully selects disease gene candidates. Nucleic Acids Research, 33(5):1544–1552, 2005.
[Wolstencroft et al., 2005] K. Wolstencroft, R. McEntire, R. Stevens, L. Tabernero, and A. Brass. Constructing ontology-driven protein family databases. Bioinformatics, 21(8):1685–1692, 2005.
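
Worked check of Table 3 (illustrative). The recall and precision columns of Table 3 in Section 4.2 can be recomputed from the reported counts. The sketch below is not the authors' code; the names TABLE_3, recall_as_stated, recall_standard, precision and summarise are hypothetical and introduced only for illustration. It applies the two formulas exactly as they are written in Section 4.2, and rounding the results to whole percentages gives the values printed in the table. For comparison only, it also evaluates the conventional information retrieval form of recall, TP/(TP + FN), which is an assumption on our part about what a reader may wish to compare against and is not a result reported in the paper.

```python
# Counts copied from Table 3: keyword -> (TP, TN, FP, FN).
TABLE_3 = {
    "interact": (100, 171, 20, 32),
    "bind": (200, 167, 17, 14),
    "promote": (240, 188, 17, 15),
    "inhibit": (230, 178, 12, 19),
}


def precision(tp, fp):
    """Precision as a percentage: TP / (TP + FP)."""
    return 100.0 * tp / (tp + fp)


def recall_as_stated(tp, tn):
    """Recall as a percentage, using the formula written in Section 4.2: TP / (TP + TN)."""
    return 100.0 * tp / (tp + tn)


def recall_standard(tp, fn):
    """Conventional IR recall as a percentage: TP / (TP + FN). Comparison only."""
    return 100.0 * tp / (tp + fn)


def summarise(table):
    """Return whole-percentage scores for every keyword in the table."""
    rows = {}
    for keyword, (tp, tn, fp, fn) in table.items():
        rows[keyword] = {
            "recall_as_stated": round(recall_as_stated(tp, tn)),
            "recall_standard": round(recall_standard(tp, fn)),
            "precision": round(precision(tp, fp)),
        }
    return rows


if __name__ == "__main__":
    for keyword, scores in summarise(TABLE_3).items():
        print(keyword, scores)
```

Keeping the two recall variants as separate functions makes it explicit which definition produced a given number, so the tabulated figures and the conventional measure cannot be confused.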