
Integration of Hybrid Bio-Ontologies using Bayesian Networks for Knowledge Discovery

Ken McGarry*

Sheila Garfield*

Nick Morris†

Stefan Wermter*

stefan.wermter@sunderland.ac.uk

[Table caption: Relative frequencies and probability calculations for the Bayesian network conditional probability tables (CPTs) relating to insulin resistance]

1 Introduction

The large amounts of genomic and proteomic data that are generated by biological experiments are now enabling deeper insights into cellular and molecular function. New technologies such as microarrays and electrophoresis gels are providing vast quantities of experimental data at unprecedented rates. All of this information needs to be stored and carefully annotated. With each new experiment providing details of new protein-to-protein interactions, new biological pathways and new genes, it is essential that these discoveries are made available to the scientific community. To this end, online scientific databases are now in place that disseminate these results. These databases, such as the popular Gene Ontology (GO), are updated at intervals to reflect the latest developments [Ashburner, 2000].

The updating is done by experts who manually revise each entry by reading the research literature and annotating the database collections accordingly. If necessary, they will contact the experimenters to resolve any ambiguities or problems. In terms of data quality, the databases are quite reliable and robust. Unfortunately, hand annotation is a slow process and the databases are lagging behind the experimental work by a considerable margin. This prevents researchers from immediately accessing the most recent discoveries.

Unless the researchers are familiar with the journals where the new results are published, they would be unlikely to encounter this information. Given the fragmented and highly specialized nature of biological research, this may seldom occur. Therefore the need for automated extraction of knowledge from the literature is well motivated. Recent advances in text analytics combine techniques from information retrieval (IR) and information extraction (IE), allowing researchers to explore the relevant literature more effectively [Mack and Henenberger, 2002]. However, these techniques require knowledge discovery methods to uncover the complex embedded structures, relationships and connections between seemingly unrelated facts that typically exist in the biomedical literature [Tiffin et al., 2005].

[Figure 1: Overall methodology for the extraction of information and construction of the ontology. Diagram labels: PUBMED biomedical text; transfer of knowledge into Bayesian network format (CPTs); inference with BN on proteins/genes without annotations; validate with existing knowledge of pathways and interactions; biomedical ontologies, knowledge bases (GO, BIND & KEGG); gaps in knowledge defined and experimental procedures to follow.]

Our particular research area is diabetes, specifically the effects of insulin resistance on protein expression and insulin-regulated protein trafficking in fat cells. In recent years there has been a dramatic worldwide increase in the number of people with diabetes. In the year 2000 there were 171 million cases, and the World Health Organization (WHO) has predicted that by 2030 there will be 366 million people suffering from this condition (www.who.int/diabetes/facts/). The WHO data are for diagnosed cases; the WHO estimates the undiagnosed cases at 14.6 million for the US alone.

In this paper we present our results on automatically generating a viable ontology based on information extraction of keywords from the research literature. The keywords define the entities and relationships that describe how important genes, gene relationships and protein-to-protein interactions operate and coexist in biological processes related to insulin resistance. Furthermore, the ontology is cast within a probabilistic framework using Bayesian networks, which are used for inference and the prediction of protein function. Figure 1 gives the overall methodology for the extraction of information and construction of the ontology.

The remainder of this paper is structured as follows. Section two outlines our information extraction scheme for identifying the entities and relationships of interest. Section three provides an overview of biological ontologies and gives details of how we use Bayesian networks for inference and reasoning. Section four discusses our methodology and experimental results, section five reviews the related work and our claim for novelty, and section six presents the conclusions.

2 Information Extraction

Unstructured text is a very flexible and powerful means of communication; it allows us to describe quite complex concepts. The semantic meaning of a sentence can be expressed in many different ways, but it is this flexibility which causes difficulty for algorithmic sentence analysis by computers. One technique for overcoming this problem is to use information extraction (IE) to seek out the important entities in the text and the relationships between them [Hearst, 1992; Rosario and Hearst, 2004]. The IE process can involve encoding patterns by hand, such as regular expressions to search for the required entities and relations, or using semi-automated machine learning techniques [Nahm and Mooney, 2002; Krauthammer and Nenadic, 2004]. The algorithm we developed is shown in figure 2.

Inputs: Abstract file A, String str
Outputs: Keyword file B

Load file A
Remove end of line characters
While unprocessed “abstracts” in A
    Read each line into str
    Search string for concept term
    If str contains phrase (the | a | an) + 2words (and | ) + 2words
        write word preceding key phrase and string after key phrase to B
    elseif str contains phrase (the | a | an) + 1word (and | ) + 2words
        write word preceding key phrase and string after key phrase to B
    elseif str contains phrase (the | a | an) + 2words
        write word preceding key phrase and string after key phrase to B
close A and B

Figure 2: Information extraction algorithm

The algorithm encodes, through regular expressions, templates for recognizing the types of “action” words that typically occur in biological texts. We discuss this process in more detail in section 4. The main problem that our algorithm faces is knowing in advance the kind of information that can be encountered. Rather than attempt to parse the entire corpus, we exploit certain linguistic regularities and search for specific semantic relations that need only be defined once. The algorithm takes into account a variable distance between related terms, i.e. longer passages of text, and therefore provides a much more reliable identification of the relationships. Searching up to two words apart has been shown empirically to be a reasonable trade-off between accuracy and computational complexity. Examples of relationships include:

• A inhibits B
• A activates B
• A interacts with B
• A suppresses B

3 Biological Ontologies and Bayesian Networks

In this section we briefly motivate the need for ontologies and define their limitations with respect to the biological field and for knowledge discovery. Ontologies describe the concepts and relationships that exist for a particular area of interest. They are very useful for the semantic labeling of concepts or definitions [Grivell, 2002; Bard and Rhee, 2004]. This process ensures that entities which are equivalent to other entities in separate databases are identified as referring to the same concepts. Even if these entities have different names or forms they can still be identified by semantic labeling. The role of semantics is therefore much deeper than matching the co-occurrence of a tag or label, since it defines the relationship that exists between concepts. Figure 3 shows the structure and elements of the gene ontology that are pertinent to our study. The first entry refers to GO:0008150 and is one of the three top level structures (biological process, physiological process and cellular process) in the gene ontology hierarchy; the last number (GO:0015758) defines the relationships for the glucose transport pathway. The numbers in brackets refer to the number of entries at that particular level.

The use of ontologies in biology for the semantic integration of heterogeneous data is receiving increased attention; however, problems occur because of the dynamic, changing nature of biological knowledge [McGarry et al., 2006]. These difficulties arise from the highly complex structures that are expensive and problematic to update and maintain [Blaschke and Valencia, 2002]. Another, related problem is that current ontologies have a rather limited vocabulary and cannot express the richness of biological information. Little attention has been paid to defining the relations; much of the research effort and complexity of structure has concentrated on defining the terms. Other important considerations are the spatial and temporal characteristics of the entities.

Furthermore, ontology languages such as DAML+OIL, OWL and RDF are based on crisp logic and have difficulty managing the uncertainty, incomplete data and noisy information encountered in many domains, especially the bioinformatic field. Our research is concerned with Type 2 diabetes; in order to develop a suitable ontology it is necessary to identify the relevant entities within the domain, their attributes and the relationships that exist between these entities.

3.1 Bayesian networks for Ontology Inference and Integration

The integration of sub-symbolic and symbolic computation has received considerable interest over the years [McGarry et al., 1999]. Within this framework the Bayesian approach can be seen as both a learning mechanism and as a knowledge representation technique.

Bayes' theorem is shown in equation 1 and presents the probability of the hypothesis (H) conditionalised on evidence (E).

\[ P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E \mid H)\,P(H) + P(E \mid \neg H)\,P(\neg H)} \tag{1} \]

where $P(H \mid E)$ is the probability of a hypothesis conditioned on certain evidence, $P(E \mid H)$ is the likelihood, $P(H)$ is the probability of the hypothesis prior to obtaining any evidence, and $P(E)$, the denominator, is the probability of the evidence. Therefore, according to Bayesian theory, we can update our beliefs regarding the hypothesis when provided with new evidence; this updating of probabilities is called conditionalization.
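As a worked illustration of equation 1, the following minimal Python sketch computes a posterior for a single hypothesis and a single piece of evidence. The function name `posterior` and all of the probability values are hypothetical and chosen only to make the arithmetic easy to follow; they are not taken from our data.

```python
def posterior(prior_h: float, p_e_given_h: float, p_e_given_not_h: float) -> float:
    """Return P(H | E) as defined by Bayes' theorem (equation 1)."""
    # Denominator: P(E) = P(E|H)P(H) + P(E|not H)P(not H)
    evidence = p_e_given_h * prior_h + p_e_given_not_h * (1.0 - prior_h)
    if evidence == 0.0:
        raise ValueError("P(E) is zero; the posterior is undefined")
    return p_e_given_h * prior_h / evidence


# Hypothetical values for illustration only:
# H = "insulin resistance present", E = "reduced GLUT4 observed".
p_ir = posterior(prior_h=0.2, p_e_given_h=0.8, p_e_given_not_h=0.1)
print(round(p_ir, 3))
```

With these assumed values the evidence raises the belief in H from the prior of 0.2 to a posterior of about 0.67, which is the kind of belief update the conditionalization step performs.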

The conditional probability distributions (CPDs) are described by $P(X_i \mid U_i)$, where $X_i$ represents node $i$ and $U_i$ are its parent nodes. We must specify the prior probabilities of the root nodes and the conditional probabilities of the remaining nodes given all combinations of their parent nodes. The joint distribution over the random variables $X = \{X_1, \ldots, X_n\}$ is determined by the CPD values and is given by:

\[ P(X_1, \ldots, X_n) = \prod_i P(X_i \mid U_i) \tag{2} \]

The CPD values are easy enough to calculate and to use for inference, but the number of parameters they require depends on the number of parent nodes; they are usually represented in table format. The nodes are assumed to take discrete or categorical values; however, continuous values may be discretised [Korb and Nicholson, 2004].

\[ P(X_1, \ldots, X_n) = \frac{1}{Z} \prod_j \pi_j[C_j] \tag{3} \]

[Figure 4: Inference possibilities within the insulin resistance domain. Panels: Diagnostic reasoning; Predictive reasoning. Node labels: ACE, ADRB3, GLUT4, GLUT1, IR, each marked as Evidence or Query.]

In figure 4, the various possibilities for inferencing are shown within the insulin resistance domain. The first network shows the diagnostic reasoning approach, which enables the relationships between symptoms and causes to be evaluated; thus, when given some evidence regarding the presence of Glut4, we can update our beliefs about the likelihood of insulin resistance (IR) being present. When using predictive reasoning we can derive new information about effects given some new information regarding the causes.

4 Methods and Results

We reviewed the literature associated with Type 2 diabetes, with an initial focus on protein interaction in diabetes, and from this review a list of “events” indicative of protein interactions was identified, e.g. activate, inhibit and modulate. This list was used as the starting point to help identify which entities are involved in each type of action or relation. After identifying the names of possible event relations, the focus moved to identifying potential entities involved in these relations. In order to complete this task a suitable dataset was required. A search of the PubMed database was conducted and 6113 abstracts related to Type 2 diabetes were used; this dataset is used throughout each subsequent stage of this work. Initially a count was made of the number of times each of the action words occurred in this sample dataset. Some of the words, e.g. “acetylate” and “destabilize”, did not occur at all, while other words such as “interaction” and “suppression” occurred more frequently.
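The counting step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary `ACTION_WORDS` contains a subset of action-word forms grouped by a base form of our choosing, the function name `count_action_words` is hypothetical, and the two sample sentences are invented rather than drawn from the PubMed abstracts.

```python
import re
from collections import Counter

# A subset of action-word forms, grouped under an illustrative base form.
ACTION_WORDS = {
    "acetylate": ["acetylate", "acetylated", "acetylates", "acetylation"],
    "activate": ["activate", "activated", "activates", "activation"],
    "bind": ["bind", "binding", "binds", "bound"],
    "destabilize": ["destabilization", "destabilize", "destabilized", "destabilizes"],
}


def count_action_words(abstracts):
    """Count whole-word, case-insensitive occurrences of each group of forms."""
    counts = Counter({base: 0 for base in ACTION_WORDS})
    for base, forms in ACTION_WORDS.items():
        # \b on both sides restricts matches to complete words
        pattern = re.compile(r"\b(?:" + "|".join(forms) + r")\b", re.IGNORECASE)
        for text in abstracts:
            counts[base] += len(pattern.findall(text))
    return counts


# Invented sample text, for illustration only.
sample = [
    "Insulin activates Akt, and Akt activation promotes Glut4 translocation.",
    "Syntaxin 6 binds Glut4 at the cell surface.",
]
counts = count_action_words(sample)
print(counts)
```

Groups that never occur keep a count of zero, which makes words that are absent from the corpus easy to spot.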

We now explain how the various parts of our system function together. The information extraction technique synthesizes the entities and relationships from the literature abstracts and generates the structure for a specific ontology on insulin resistance. We then use the ontology's structure to build a Bayesian network for the purposes of inference and prediction of new protein-to-protein interactions. The relative frequencies of the keywords (entities and relationships) are used to construct the conditional probability tables which define the parent/child node relationships.

[Table residue, Action Word column (the accompanying counts are not recoverable): acetylate, acetylated, acetylates, acetylation, activate, activated, activates, activation, bind, binding, binds, bound, destabilization, destabilize, destabilized, destabilizes]
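To make the relative-frequency construction of a conditional probability table concrete, the sketch below normalises counts of (parent state, child state) pairs into $P(\text{child} \mid \text{parent})$. It is a minimal illustration, not a description of our implementation: the function name `cpt_from_counts`, the state labels and the observation counts are all hypothetical, and no smoothing is applied to unseen combinations.

```python
from collections import defaultdict


def cpt_from_counts(observations):
    """Estimate P(child | parent) as relative frequencies of (parent, child) pairs."""
    counts = defaultdict(lambda: defaultdict(int))
    for parent_state, child_state in observations:
        counts[parent_state][child_state] += 1
    cpt = {}
    for parent_state, row in counts.items():
        total = sum(row.values())
        # Each row is normalised separately so that it sums to one
        cpt[parent_state] = {child: n / total for child, n in row.items()}
    return cpt


# Hypothetical observations, for illustration only.
observations = (
    [("IR=present", "GLUT4=low")] * 6
    + [("IR=present", "GLUT4=normal")] * 2
    + [("IR=absent", "GLUT4=low")] * 1
    + [("IR=absent", "GLUT4=normal")] * 3
)
cpt = cpt_from_counts(observations)
for parent_state, row in cpt.items():
    print(parent_state, row)
```

Each row of the resulting table corresponds to one configuration of the parent node, so the number of rows grows with the number of parent states.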

4.1 The Extracted Ontology and Bayesian network Mapping

Initially, one of these action words, “interaction”, was selected to identify possible entities involved in a relation. The word “interaction”, however, generally forms part of a phrase such as “interaction between”, “interaction of” or “interaction with”, and therefore each of these phrases was used by the algorithm to search for potential entities. The first phrase used was “interaction between”. Examples of the resulting extracted phrases are provided in table 2.

Ultimately, the successful application of Bayesian techniques is dependent on the use of prior knowledge to improve the estimation of the posterior. If a prior belief exists about a situation then we can use this information to pre-structure our BN. For example, if a particular gene (IPA) is known to regulate several target genes (GDH, GL4, HK2), we would assign this relationship within the BN by setting the edges between these entities and setting the values in the conditional probability table to define the structural prior accordingly. This is a powerful strategy, but only when it makes sense to do so. The application of incorrect beliefs will produce unreliable estimates of the true posterior regardless of the abundance of the likelihood evidence. Equation 4 shows how we modify the BN with prior knowledge (causal intervention) from the extracted ontology [Chrisman et al., 2003].

\[ P(X_{i,j} = z \mid \mathrm{par}_M(x), M, \theta : X_{i,j} = Z, \ldots) = 1 \tag{4} \]

where $\mathrm{par}_M$ are the parameters within the model, $X_{i,j}$ are the known effects of the parents of a given node, and $\theta$ is the conditional probability, which is conditionalized and represents the causal conditions. The biological knowledge is incorporated into the BN by specifying the probability for the existence of each potential connection (edge) between the entities. We assume independence between edges, and the variables in the BN are also assumed to be discrete; this ensures that the calculations are computationally tractable.
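The following sketch illustrates the factorised joint distribution of equation 2 on the IPA example, together with one possible reading of equation 4 as a hard causal intervention in which a clamped node takes its forced value with probability 1. Everything numerical here is hypothetical: the CPD values, the binary encoding of the genes, and the function names `joint` and `marginal` are assumptions made for illustration and are not our estimated parameters.

```python
from itertools import product

# Structure from the example: IPA regulates GDH, GL4 and HK2.
PARENTS = {"IPA": (), "GDH": ("IPA",), "GL4": ("IPA",), "HK2": ("IPA",)}

# P(node = True | parent values); hypothetical numbers for illustration.
P_TRUE = {
    "IPA": {(): 0.3},
    "GDH": {(True,): 0.9, (False,): 0.2},
    "GL4": {(True,): 0.8, (False,): 0.1},
    "HK2": {(True,): 0.7, (False,): 0.3},
}


def joint(assignment, do=None):
    """Product of P(X_i | U_i) over all nodes; nodes listed in `do` are clamped."""
    do = do or {}
    prob = 1.0
    for node, parents in PARENTS.items():
        if node in do:
            # Intervened node: probability 1 for the forced value, 0 otherwise
            prob *= 1.0 if assignment[node] == do[node] else 0.0
            continue
        p_true = P_TRUE[node][tuple(assignment[p] for p in parents)]
        prob *= p_true if assignment[node] else 1.0 - p_true
    return prob


def marginal(node, value, do=None):
    """P(node = value), obtained by summing the joint over all assignments."""
    names = list(PARENTS)
    total = 0.0
    for values in product([True, False], repeat=len(names)):
        assignment = dict(zip(names, values))
        if assignment[node] == value:
            total += joint(assignment, do)
    return total


p_gl4 = marginal("GL4", True)                       # no intervention
p_gl4_do = marginal("GL4", True, do={"IPA": True})  # IPA forced to be active
print(round(p_gl4, 3), round(p_gl4_do, 3))
```

Exhaustive enumeration is used only because the example has four binary nodes; it scales exponentially and would be replaced by a proper inference algorithm for networks of realistic size.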

Figure 5 shows the structure of a section of our ontology. The nodes are the entities and the arcs determine the relationships between them. The numbers in brackets preceded by “GO:” are the probabilities of the term occurring in the GO ontology, the numbers.

For example the following abstract fragment captures knowledge about several proteins and their interactions: “Overexpression of the cytosolic domain of syntaxin 6 did not affect insulin-stimulated glucose transport, but increased basal deGlc transport and cell surface Glut4 levels. Moreover, the syntaxin 6 cytosolic domain significantly reduced the rate of Glut4 reinternalization after insulin withdrawal and perturbed subendosomal Glut4 sorting; the corresponding domains of syntaxins 8 and 12 were without effect.”

We encountered difficulties with negative implications, i.e. the “did not” and “without effect” phrases negate the occurrence of the relationship but would be taken by the information extraction algorithm as a positive relationship. A more elaborate NLP technique or further crafting of specific regular expression templates would reduce this effect.
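As an illustration of how a regular expression template might be extended to flag simple negations, consider the minimal Python sketch below. It is not the template set used in our system: the pattern `RELATION`, the function `extract_relations`, the choice of verbs and the single-word treatment of entities are all simplifying assumptions, and only the “did/does/do not” form of negation is detected (phrases such as “without effect” are not handled).

```python
import re

RELATION = re.compile(
    r"(?P<a>\b[\w-]+)\s+"                    # entity A: the word before the verb
    r"(?P<neg>(?:did|does|do)\s+not\s+)?"    # optional negation marker
    r"(?P<verb>inhibits?|activates?|interacts?\s+with|suppress(?:es)?)\s+"
    r"(?P<b>[\w-]+)",                        # entity B: the word after the verb
    re.IGNORECASE,
)


def extract_relations(sentence):
    """Return (A, verb, B, negated) tuples for each relation pattern found."""
    results = []
    for match in RELATION.finditer(sentence):
        results.append((
            match.group("a"),
            match.group("verb").lower(),
            match.group("b"),
            match.group("neg") is not None,  # True when a negation marker was seen
        ))
    return results


print(extract_relations("Insulin activates Akt"))
print(extract_relations("Syntaxin6 did not inhibit Glut4"))
```

Carrying the negation flag alongside each extracted tuple allows negated relations to be excluded, or counted separately, before the frequencies are passed on to the Bayesian network.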

[Figure 5: Structure of a section of the extracted ontology. Recoverable labels: TM:000146; node 3 (TM:000147); node 4 (TM:000148); link types: is_a, interacts with.]

We determined a baseline accuracy for our system by “rediscovering” known protein-to-protein interactions from the literature and validating the relationships against a number of online databases and ontology repositories. The most up to date and complete of these is the Gene Ontology (GO), so we compare the relationships extracted into our ontology with the GO structure. To determine the accuracy, we apply the well known information retrieval measures of recall and precision. We define recall as the percentage of entity relations represented in GO that are correctly identified by our system. We define precision as the percentage of relations returned by our system that are found in GO.

The recall and precision are calculated by: recall = TP/(TP + FN), precision = TP/(TP + FP), where TP = true positives, FP = false positives, TN = true negatives and FN = false negatives. We should note that certain errors in GO have been identified, including inconsistencies and even spelling mistakes. We have also identified that certain GO terms are too general and that a more specific term would have been more appropriate. Thus entries with low semantic similarity but high functional similarity can be identified. Figure 6 presents a comparison of the semantic richness of GO and our extracted ontology. We define the semantic richness measure to be based on the correlations between functional similarity and semantic content; a detailed description of this approach can be found in [Lord et al., 2003].
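The evaluation can be sketched as a comparison of two sets of relations, using the standard definitions precision = TP/(TP + FP) and recall = TP/(TP + FN). The function name `precision_recall` and the relation triples below are hypothetical placeholders, not results from our experiments.

```python
def precision_recall(extracted, reference):
    """Compare extracted relations with a reference set (e.g. relations found in GO)."""
    extracted, reference = set(extracted), set(reference)
    tp = len(extracted & reference)  # returned by the system and in the reference
    fp = len(extracted - reference)  # returned by the system but not in the reference
    fn = len(reference - extracted)  # in the reference but missed by the system
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Placeholder relation triples, for illustration only.
extracted = {
    ("A", "activates", "B"),
    ("C", "interacts with", "D"),
    ("E", "inhibits", "F"),
}
reference = {
    ("A", "activates", "B"),
    ("C", "interacts with", "D"),
    ("G", "suppresses", "H"),
    ("I", "activates", "J"),
}
precision, recall = precision_recall(extracted, reference)
print(round(precision, 3), round(recall, 3))
```

True negatives do not appear in either measure, which is convenient here because the set of relations that are absent from both the text and the reference ontology is not well defined.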

The GO ontology structure is extremely limited, with total reliance on “is a” type links. This means that a large amount of semantic information that was originally available from the research articles is missing. We suspect that as ontologies such as GO increase in the number of entities, the relationships between them will take on increased value. However, without incorporating the semantic similarity of the entities, any increase in size will reduce the ontology to free text.

5 Related Work

Research into the automatic generation of ontologies from textual data has received limited attention to date; a notable exception is the work of Blaschke and Valencia, which used clustering techniques at a document level [Blaschke and Valencia, 2002]. The majority of the research attempts to alleviate partial gaps in the knowledge or to repair incorrect annotations in existing ontologies [Missikoff et al., 2003; Wolstencroft et al., 2005]. Using probabilistic techniques to model ontologies is receiving increased attention, but this is for manually curated ontologies [Mitra et al., 2005; Smith et al., 2005]. The modeling of biological networks with Bayesian networks using genomic data has seen considerable attention in recent years [Ong et al., 2002]. The initial work on integrating heterogeneous data within a Bayesian network framework was led by Friedman and Segal [Friedman et al., 2000; Segal et al., 2001]. This work showed that Bayesian networks could be trained on genomic data to reconstruct the relationships between genes. The work by Pan et al. is the most similar to ours; however, the authors used Bayesian networks to integrate two ontologies from similar problem domains [Pan et al., 2005]. Comparisons between the semantic similarity and genetic sequence similarity of ontologies have been conducted by Lord [Lord et al., 2003]. We found this work particularly useful as motivation for the development of a richer vocabulary to define entity relationships.

6 Conclusions

The fusion of low level information from sub-symbolic techniques with logic or higher order structures is critically dependent on the level of granularity used. The nodes of our Bayesian networks are robust to the semantic topic drift or catastrophic interference which typically occurs when MLP or other neural feed-forward techniques are trained in dynamic situations using heterogeneous data. In the case of our bioinformatics work we use Bayesian networks to learn from data but also to map existing ontological relations to new Bayesian network structures. Clearly, further work is needed; however, we have extended the current knowledge of automatically generating and integrating ontologies from low level data. The utilization of ontologies as a framework for guiding the knowledge discovery process has to date received little attention. The experimental results presented in this paper lead us to conclude that a principled approach such as the Bayesian framework can successfully integrate and represent heterogeneous data and knowledge.

7 Acknowledgements

This work was supported in part by a Research Development Fellowship funded by HEFCE and the Biosystems Informatics Institute (Bii).

References

[Ashburner, 2000] Ashburner. Gene ontology: tool for the unification of biology. Nature Genetics, 25:25-29, 2000.

[Bard and Rhee, 2004] Bard and Rhee. Ontologies in biology: design applications and future challenges. Nature Reviews Genetics, 5:213-222, 2004.

[Blaschke and Valencia, 2002] Blaschke and Valencia. Automatic ontology construction from the literature. Genome Informatics, 13:201-213, 2002.

[Chrisman et al., 2003] Chrisman, Langley, Bray, and Pohorille. Incorporating biological knowledge into evaluation of causal regulatory hypothesis. In Proceedings of the Pacific Symposium on Biocomputing, pages 128-139, Kauai, Hawaii, 2003.

[Friedman et al., 2000] Friedman, Linial, I. Nachman, and D. Pe'er. Using bayesian networks to analyze expression data. Journal of Computational Biology, 7(3-4):601-620, 2000.

[Grivell, 2002] Grivell. Mining the bibliome: searching for a needle in a haystack?: new computing tools are needed to effectively scan the growing amount of scientific literature for useful information. EMBO Reports, 3(31):200-203, 2002.

[Hearst, 1992] Hearst. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th conference on Computational linguistics, pages 539-545, 1992.

[Korb and Nicholson, 2004] Korb and Nicholson. Bayesian Artificial Intelligence. Chapman and Hall/CRC, 2004.

[Krauthammer and Nenadic, 2004] Krauthammer and Nenadic. Term identification in the biomedical literature. Journal of Biomedical Informatics, 37:512-526, 2004.

[Lord et al., 2003] Lord, Stevens, Brass, and Goble. Investigating semantic similarity measures across the gene ontology: the relationship between sequence and annotation. Bioinformatics, 19:1275-1283, 2003.

[Mack and Henenberger, 2002] Mack and Henenberger. Text-based knowledge discovery: search and mining of life-sciences documents. Drug Discovery Today, 7:11, 2002.

[McGarry et al., 1999] McGarry, Wermter, and J. MacIntyre. Hybrid neural systems: from simple coupling to fully integrated neural networks. Neural Computing Surveys, 2(1):62-93, 1999.

[McGarry et al., 2006] McGarry, Garfield, and Morris. Recent trends in knowledge and data integration for the life sciences. Expert Systems: the Journal of Knowledge Engineering, 23(5):337-348, 2006.

[Missikoff et al., 2003] Missikoff, Velardi, and Fabriani. Text mining techniques to automatically enrich a domain ontology. Applied Intelligence, 18:323-340, 2003.

[Mitra et al., 2005] Mitra, Noy, and Jaiswal. Ontology mapping discovery with uncertainty. In Fourth International Semantic Web Conference (ISWC), 2005.

[Nahm and Mooney, 2002] Nahm and Mooney. Text mining with information extraction. In Proceedings of the AAAI 2002 Spring Symposium on Mining Answers from Texts and Knowledge Bases, 2002.

[Ong et al., 2002] Ong, Glasner, and Page. Modelling regulatory pathways in E. coli from time series expression profiles. Bioinformatics, 18(1):241-248, 2002.

[Pan et al., 2005] Pan, Ding, Yu, and Peng. A bayesian network approach to ontology mapping. In ISWC 2005 4th International Semantic Web Conference, pages 563-577, Galway, Ireland, 2005.

[Rosario and Hearst, 2004] Rosario and Hearst. Classifying semantic relations in bioscience texts. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL2004), pages 430-437, 2004.

[Segal et al., 2001] Segal, Tasker, Gasch, Friedman, and Koller. Rich probabilistic models for gene expression. Bioinformatics, 17(1):243-252, 2001.

[Smith et al., 2005] Smith, Ceusters, and Kohler. Relations in biomedical ontologies. Genome Biology, 6(5):46-58, 2005.

[Tiffin et al., 2005] Tiffin, Kelso, Powell, Pan, Bajic, and Hide. Integration of text and data-mining using ontologies successfully selects disease gene candidates. Nucleic Acids Research, 33(5):1544-1552, 2005.

[Wolstencroft et al., 2005] Wolstencroft, R. McEntire, Stevens, Tabernero, and Brass. Constructing ontology-driven protein family databases. Bioinformatics, 21(8):1685-1692, 2005.