=Paper=
{{Paper
|id=None
|storemode=property
|title=Towards an integrated knowledge system for capturing gene expression events
|pdfUrl=https://ceur-ws.org/Vol-897/session4-paper17.pdf
|volume=Vol-897
|dblpUrl=https://dblp.org/rec/conf/icbo/VenkatesanMK12
}}
==Towards an integrated knowledge system for capturing gene expression events==
Aravind Venkatesan†, Vladimir Mironov†* and Martin Kuiper
Department of Biology, NTNU, 7491 Trondheim, Norway
* To whom correspondence should be addressed: mironov@nt.ntnu.no
† These authors contributed equally

ABSTRACT

Transcriptional regulation of gene expression is an important mechanism in many biological processes. Aberrations in this mechanism have been implicated in cancer and other diseases. Effective investigation of gene expression mechanisms requires a system-wide integration and assessment of all available knowledge of the underlying molecular networks. This calls for a method that effectively manages and integrates the available data. We have built a Semantic Web-based knowledge system that constitutes a significant step in this direction: the Gene Expression Knowledge Base (GeXKB). GeXKB encompasses three application ontologies: the Gene Expression Ontology (GeXO), the Regulation of Gene Expression Ontology (ReXO), and the Regulation of Transcription Ontology (ReTO). These three ontologies, respectively, integrate gene expression information that is increasingly more specific, yet decreasing in coverage, from a variety of sources. The system is capable of answering complex biological questions with respect to gene expression and in this way facilitates the formulation or assessment of new hypotheses. Here we discuss the architecture of these ontologies and the data integration process and provide examples demonstrating the utility thereof. The knowledge base is freely available for download and can be queried through a SPARQL endpoint (http://www.semantic-systems-biology.org/apo/).

1 INTRODUCTION

Research in the Life Sciences is supported by a plethora of databases (see overview at www.pathguide.org). Moreover, the continuing advancements in functional genomics technologies make it possible to create an overwhelming amount of data in a single experiment. The many hypotheses that can be derived from such experiments must be assessed against a multitude of information and knowledge bases, often represented in a variety of formats. Scientists therefore become increasingly dependent on sophisticated computer technologies to integrate and manage all the available information. Furthermore, the drastic increase in the available information and a lack of adherence to accepted formal representations across all disparate knowledge bases allow only a fraction of the knowledge to be easily considered in the analysis of new data, or cause a user to query many databases individually, sometimes even without the support of ontology terms that would warrant a common semantics of queries in different databases. As discussed by Antezana et al. (2009), application ontologies can facilitate the query process itself as the ontology ensures a uniform semantics across all data.

1.1 Need for an integrated resource that captures gene expression knowledge

Transcriptional gene expression and its regulation depend on a large variety of cellular processes that control the timing and level of transcription of an individual gene, often in a cell- or condition-specific manner. Regulation of the expression of protein-coding genes is extensively studied. Gene expression falls into two main phases, i.e. transcription and translation. During the process of transcription, proteins called transcription factors bind to specific DNA sequence motifs (binding sites) of a gene, playing a key role in initiating or inhibiting the formation of an active RNA Polymerase II transcription complex. Active transcription produces pre-mRNAs which are subsequently processed (removal of introns, and polyadenylation of the transcript), upon which mature mRNAs are transported from the nucleus to the cytoplasm where the mRNA is translated into a protein. Regulatory processes of gene expression occur at different levels, enabling the cell to adapt to different conditions by controlling its structure and function. Furthermore, the process of gene expression may also be influenced at the epigenetic level, where nucleotide or protein modifications can cause heritable changes in expression of otherwise identical gene sequences. Abnormalities in the regulation of gene expression can cause diseases such as the occurrence of malignant cell proliferation.

The knowledge required to decipher the various processes involved in gene expression continues to grow. However,
Venkatesan et al.
for a systems-wide understanding of gene regulation, there is a need for efficiently capturing knowledge of this domain in its entirety and to further facilitate efficient querying of this data. For instance, the complex one-to-many relationships of a transcription factor like Myc include thousands of target genes, representing a wide variety of functions and processes. An ontology-driven approach would best solve the issue of knowledge querying, representation and management. Previously, attempts have been made to model the gene regulation process, resulting in the Gene Regulation Ontology (GRO) (Beisswanger et al., 2008). GRO provides a conceptual model to represent common knowledge about the gene regulation domain. However, it was primarily built as a scaffold for knowledge-intensive natural language processing (NLP) tasks and lacks the granularity in concepts much needed for advanced querying and hypothesis generation.

We have built a system that integrates existing ontologies relevant for the domain of gene expression to support the discovery of new scientific knowledge. We have named this knowledge system the Gene Expression Knowledge Base (GeXKB). This system is conceived as part of the Semantic Systems Biology (SSB) (http://www.semantic-systems-biology.org) initiative and comprises at the current stage three application ontologies that capture the knowledge about gene expression, namely the Gene Expression Ontology (GeXO), the Regulation of Gene Expression Ontology (ReXO) and the Regulation of Transcription Ontology (ReTO).

2 GEXKB OBJECTIVES AND DESIGN PRINCIPLES

GeXKB is designed to provide the molecular biologist with a knowledge system that captures knowledge on a variety of aspects of the gene expression process. To this end it should be able to provide answers to questions like:
• ‘Which are the proteins that act as chromatin remodeling proteins and as modulators of transcription factor activity?’
• ‘Which are the proteins that participate in two successive regulatory pathways?’
• ‘Which are the transcription factors (Human) that are located in the cytoplasm?’

The following design principles were followed in the process of GeXKB development:
• 'is a' completeness
• 'all-some' semantics
• only classes used for modelling of the domain of discourse (see Table 1)
• maximal flexibility both for users and for future extensions

3 GEXKB ARCHITECTURE AND CONSTRUCTION

The core of the three ontologies is built of terms from a number of well established biomedical ontologies, first of all GO (Ashburner et al., 2000) and the Molecular Interactions ontology (Kerrien et al., 2007). The core is used to integrate data from GOA (Barrell et al., 2009), the IntAct database (Kerrien et al., 2007), KEGG (Kanehisa and Goto, 2000), UniProtKB (Magrane and the UniProt Consortium, 2011) and NCBI Gene (Wheeler et al., 2005). In the subsequent sections we describe the architecture and the main features of the ontologies.

3.1 Data integration pipeline

The ontologies were built using an automated pipeline implemented with the use of the library ONTO-PERL (Antezana et al., 2008).
Figure 1: The figure illustrates the seed ontology of GeXO.
3.1.1 Building seed ontologies:

GeXO, ReXO and ReTO share a common Upper Level Ontology (ULO), which provides a general scaffold for data integration. It was developed on the basis of the Semanticscience Integrated Ontology (SIO) (http://code.google.com/p/semanticscience/wiki/SIO) with the addition of a few terms from other ontologies. The origin of the terms is preserved in external references. The ULO is generated on the fly by the pipeline and does not exist as an individual artifact. The upper level term IDs are of the form ‘SSB:nnnnnnn’.

The ULO is then merged with GO (domain specific fragments of Biological Process, complete Cellular Component, complete Molecular Function), MI ('interaction type' branch), and the Biorel ontology (Blondé et al., 2011). This yields three ontologies referred to as seed ontologies. To be more specific, in order to build the seed ontology for GeXO, the term ‘gene expression’ (GO:0010467) and all its descendants are imported. For ReXO and ReTO the corresponding GO terms are ‘regulation of gene expression’ (GO:0010468) and ‘regulation of transcription, DNA dependent’ (GO:0006355). We refer to these three terms as sub-roots. Each of them is connected to the ULO as a subclass of 'biological process'. To ensure 'is a' completeness, each of the ontologies is complemented with an auxiliary term (‘gene expression process’ (GeXO:0000001), ‘process of regulation of gene expression’ (ReXO:0000001), ‘process of regulation of DNA-dependent transcription’ (ReTO:0000001)), which becomes the parent of all the terms that did not have an 'is a' path to the sub-root. Apart from this, the three seed ontologies are structurally identical (Figure 1).

3.1.2 Building species specific intermediate ontologies:

The GeXKB ontologies support three model organisms: Homo sapiens, Mus musculus and Rattus norvegicus. The corresponding three species-specific intermediate ontologies were developed in the following steps:

(1) For each species GOA annotations are used to extract all the associations involving domain specific Biological Process terms incorporated in the previous phase. The corresponding proteins are added as child terms to the upper level term ‘protein’ (SSB:0001211) and referred to as 'core proteins' hereafter.
(2) From the IntAct database all the interactions involving at least one of the core proteins are retrieved and incorporated into the knowledge base along with their pertinent information. This results in a further extension of the set of proteins in the KB.

3.1.3 Building the complete ontologies:

This is the final phase in the generation of the ontologies, which proceeds as follows:

(1) The species specific ontologies (from the previous step) are merged together.
(2) From the KEGG database all the pathways involving at least one of the core proteins are extracted and incorporated in the KB along with the pertinent information. The pathway terms become children of the term 'SSB:0011221' ('pathway', 'BioPAX:Pathway'). The corresponding KEGG orthology groups are incorporated as children of the term 'protein cluster' (SSB:0001122). This step results in a second extension of the set of proteins.
(3) Putative orthology relationships were computed with the use of the high-performance library TurboOrtho (Ekseth et al., 2010), a multi-threaded C++ implementation of the OrthoMCL algorithm (Li et al., 2003). The relations including core proteins are added to the KB, leading to the final extension of the set of proteins.
(4) The set of proteins in the GeXKB was finally augmented with:
• GOA annotations for Cellular Components and Molecular Functions,
• Additional information (e.g. protein modifications) from UniProtKB,
• The corresponding genes along with the pertinent information from NCBI.

The final result is the three ontologies in the OBO (Smith et al., 2007) format.

3.1.4 Enhancing the utility of the ontologies:

(1) Transitive closures were constructed with the use of the library ONTO-PERL for the following relation types: 'is a', 'part of', ‘regulates’.
(2) The ontologies were exported in a number of formats: RDF, OWL, XML, and DOT.
(3) The RDF exports were used to populate a triple store (OpenLink Virtuoso; see Table 2).
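The transitive-closure step in 3.1.4 can be illustrated with a minimal Python sketch. It is not the ONTO-PERL implementation: the function name `transitive_closure` and the toy terms A, B and C are hypothetical, and the sketch assumes that the direct links of a single relation type (for example 'is a') are given as (child, parent) pairs.

```python
from collections import defaultdict


def transitive_closure(edges):
    """Return every (term, ancestor) pair implied by chaining direct links.

    `edges` holds the direct (child, parent) pairs of ONE relation type.
    The result contains the direct pairs plus all pairs reachable by
    following the relation repeatedly.
    """
    parents = defaultdict(set)
    for child, parent in edges:
        parents[child].add(parent)

    closure = set()
    for start in list(parents):
        # Walk upwards from each term that has at least one parent.
        stack = list(parents[start])
        seen = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue  # also protects against cycles
            seen.add(node)
            closure.add((start, node))
            stack.extend(parents.get(node, ()))
    return closure


# Hypothetical toy hierarchy: C is_a B, B is_a A.
is_a = [("C", "B"), ("B", "A")]
closed = transitive_closure(is_a)
# closed now also contains the implied pair ("C", "A").
```

Storing the implied pairs explicitly is what allows a plain triple-pattern query to find all descendants of a term without reasoning at query time, at the cost of a much larger graph.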
Ontology   No. of classes   No. of relations   No. of instances
GeXO       168417           15                 0
ReXO       152962           15                 0
ReTO       141095           15                 0

Table 1: An overview of the ontologies in GeXKB

3.2 GeXKB and the Semantic Web

The Semantic Web (Berners-Lee and Hendler, 2001) is an extension of the WWW which aims at building a web of data accessible both by computers and human beings. This new technology is increasingly gaining momentum, in particular in the domain of Life Sciences (Antezana et al., 2009).

In order to make use of these new technologies, the RDF versions of the ontologies have been loaded into OpenLink Virtuoso (http://virtuoso.openlinksw.com) and can be accessed via a SPARQL query page (http://www.semantic-systems-biology.org/apo/queryingcco/sparql). In contrast to other Semantic Web formalisms, such as OWL, RDF enables handling of large amounts of knowledge due to its simple and flexible syntax, making querying tractable. However, on the downside the low expressivity of RDF/RDFS imposes limitations on the inferencing over the knowledge base. To overcome this limitation, Blondé et al. (2011) have developed a novel approach for semi-automated reasoning on RDF stores with the use of the SPARUL update language (http://www.w3.org/TR/sparql11-update/). This allows for pre-computing the inferences supported by the store, thus making implicit knowledge explicit and available for querying. In order to provide maximum flexibility for querying, two graphs are available for each of the ontologies, with or without closures (e.g. GeXO-tc and GeXO, 'tc' standing for 'total closure').

The most convincing evidence of the success of the Semantic Web is the quick expansion of the Linked Data cloud (Heath and Bizer, 2011). In the course of the design of GeXKB a number of decisions were made to facilitate the eventual migration of GeXKB to the Linked Data cloud. For instance, we have re-used original IDs as much as possible. If the original IDs include a name-space (e.g. GO, MI) they were adopted without any modifications, otherwise the IDs were prepended with a name-space (for example UPKB for UniProtKB or NCBIgn for NCBI Gene), separated by a colon from the original ID (the colons are replaced with underscores in the RDF renderings). The re-use of the IDs also benefits the users through faster query execution and the familiarity of the IDs. Furthermore, in compliance with the Linked Data recommendations we minted the URIs in our own common name-space, http://www.semantic-systems-biology.org/, and have consistently used rdfs:label properties to aid human readability of the results.

RDF graph        GeXO           GeXO-tc       ReXO         ReXO-tc         ReTO           ReTO-tc
No. of triples   ~3.3 million   ~23 million   ~3 million   ~19.9 million   ~2.8 million   ~19.1 million

Table 2: Number of triples in the individual graphs of GeXKB

4 QUERYING GEXKB

In this section we demonstrate the utility of GeXKB with the help of a few example SPARQL queries. These queries are available as a part of a list of sample queries provided on the query page (http://www.semantic-systems-biology.org/apo/queryingcco/sparql). To query GeXKB, the base URI and the prefixes are set and the SELECT block specifies the variables to be part of the solution. The RDF triple pattern queried is defined in the WHERE block. The queries are as follows:

Q1: (see Table 3)
Biological question: Which proteins can act as chromatin remodeling proteins and as modulators of transcription factor activity?
SPARQL query:

BASE
PREFIX rdfs:
PREFIX ssb:
PREFIX taxon:
PREFIX graph1:
PREFIX graph2:

SELECT distinct ?protein_id ?protein_name
WHERE {
    GRAPH graph1: {
        ?protein_id ssb:is_a ssb:SSB_0001211 .
        ?b_process ssb:is_a ssb:GO_0040029 .
        ?b_process ssb:has_participant ?protein_id .
        ?protein_id ssb:has_source taxon: .
    }
    GRAPH graph2: {
        ssb:GO_0034401 ssb:has_participant ?protein_id .
        ?protein_id rdfs:label ?protein_name .
    }
}
LIMIT 4
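The protein URIs returned by queries such as Q1 follow the ID convention described in Section 3.2: a name-space prefix, a colon replaced by an underscore, all placed under the common GeXKB name-space. The following Python sketch shows one way such a mapping could look. The function name `to_rdf_uri` and its `default_prefix` parameter are hypothetical, and the `SSB#` fragment of the name-space is an assumption taken from the URIs listed in Table 3, not from a documented API.

```python
# Assumed from the result URIs in Table 3; not an official constant.
SSB_NAMESPACE = "http://www.semantic-systems-biology.org/SSB#"


def to_rdf_uri(term_id, default_prefix=None):
    """Map an ontology or database ID to its URI in the RDF rendering.

    IDs that already carry a name-space (e.g. 'GO:0040029') are used as is.
    Bare IDs get `default_prefix` prepended, separated by a colon.
    The colon is then replaced by an underscore.
    """
    if ":" not in term_id:
        if default_prefix is None:
            raise ValueError("a bare ID needs a name-space prefix")
        term_id = f"{default_prefix}:{term_id}"
    return SSB_NAMESPACE + term_id.replace(":", "_")


# A UniProtKB accession has no name-space of its own, so UPKB is prepended.
print(to_rdf_uri("Q9NS37", default_prefix="UPKB"))
# → http://www.semantic-systems-biology.org/SSB#UPKB_Q9NS37

# A GO ID already includes its name-space and is adopted unchanged.
print(to_rdf_uri("GO:0040029"))
# → http://www.semantic-systems-biology.org/SSB#GO_0040029
```

Keeping the original accession visible in the URI is what lets a user recognise UPKB_Q9NS37 as the UniProtKB entry Q9NS37 directly in the query results.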
Q2:
Biological question: Which proteins participate in both the JAK/STAT signaling pathway and Apoptosis?
SPARQL query:

BASE
PREFIX rdfs:
PREFIX ssb:
PREFIX taxon:
PREFIX pathway1:
PREFIX pathway2:
PREFIX graph:

SELECT distinct ?protein
WHERE {
    GRAPH graph: {
        ?prot_id ssb:is_a ssb:SSB_0001211 .
        ?prot_id ssb:is_member_of ?cluster .
        pathway1: ssb:has_agent ?cluster .
        ?prot_id ssb:has_source taxon: .
    }
    GRAPH graph: {
        ?prot_id ssb:is_member_of ?cluster .
        pathway2: ssb:has_agent ?cluster .
        ?prot_id rdfs:label ?protein .
    }
}

Q3:
Biological question: Which are the transcription factors (Human) that are located in the cytoplasm?
SPARQL query:

BASE
PREFIX rdfs:
PREFIX ssb:
PREFIX taxon:
PREFIX location:
PREFIX graph:

SELECT distinct ?protein ?protein_name
WHERE {
    GRAPH graph: {
        ?protein ssb:is_a ssb:SSB_0001211 .
        ?protein rdfs:label ?protein_name .
        ssb:GO_0006355 ssb:has_participant ?protein .
        ?protein ssb:has_function ?function .
        ?function ssb:is_a ssb:GO_0003700 .
        location: ssb:contains ?protein .
        ?protein ssb:has_source taxon: .
    }
}

These queries offer just a glimpse of the repertoire of biological questions that can be addressed to the knowledge system. In addition, users could also query the knowledge base in combination with other complementary semantic web resources to formulate advanced queries for hypothesis generation. This could be performed through the query federation features that are included in the latest version of SPARQL (ver. 1.1) and will be explored in the future.

Protein ID                                                  Protein Name
http://www.semantic-systems-biology.org/SSB#UPKB_Q9NS37     ZHANG_HUMAN
http://www.semantic-systems-biology.org/SSB#UPKB_P14373     TRI27_HUMAN
http://www.semantic-systems-biology.org/SSB#UPKB_Q62158     TRI27_MOUSE
http://www.semantic-systems-biology.org/SSB#UPKB_P17947     SPI1_HUMAN

Table 3: Results for Q1

5 CONCLUSION

The drastic increase in the amount of data generated in the field of molecular biology and biomedicine requires efficient knowledge management practices. Ontologies certainly provide a robust method to integrate data and efficiently represent specific (sub) domain knowledge. With the creation of GeXKB, we have built a knowledge system that specifically supports researchers focusing on various aspects of gene expression. The three ontologies provide the user with the flexibility of choosing an ontology depending on the breadth and specificity of information needed. Further flexibility is afforded by a range of available formats for knowledge representation (OBO, RDF, OWL), data exchange (XML), and visualisation (DOT).

The presented examples demonstrate the utility of our knowledge base with respect to answering realistic domain-specific questions, and this utility is expected to grow with its further development. The primary goal will be to augment the knowledge base with additional high quality, curated sources of information with documented transcription factor function and relations between transcription factors and their target genes.

ACKNOWLEDGEMENTS

This work is funded by the Norwegian University of Science and Technology (NTNU), Norway. AV was funded by the Faculty of Natural Science and Technology and VM was funded by FUGE Mid-Norway.

REFERENCES

Antezana, E., Egaña, M., De Baets, B., Kuiper, M., and Mironov, V. (2008). ONTO-PERL: an API for supporting the development and analysis of bio-ontologies. Bioinformatics. Mar 15;24(6):885-7.
Antezana, E., Kuiper, M., and Mironov, V. (2009). Biological knowledge management: the emerging role of the Semantic Web technologies. Brief Bioinform., 10(4): 392-407.
Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K. et al. (2000). Gene ontology: tool for the unification of biology. Nature Genetics, 25, 25-9.
Barrell, D., Dimmer, E., Huntley, R.P., Binns, D., O'Donovan, C., and Apweiler, R. (2009). The GOA database in 2009--an integrated Gene Ontology Annotation resource. Nucleic Acids Research 37: D396-D403.
Beisswanger, E., Lee, V., Kim, J. J., Rebholz-Schuhmann, D., Splendiani, A., Dameron, O., Schulz, S., Hahn, U. (2008). Gene Regulation Ontology (GRO): design principles and use cases. Stud Health Technol Inform. 2008;136:9-14.
Berners-Lee, T. and Hendler, J. (2001). 'Publishing on the semantic web'. Nature, 410, 1023-4.
Blondé, W., Mironov, V., Venkatesan, A., Antezana, E., De Baets, B., and Kuiper M. (2011). Reasoning with bio-ontologies: using relational closure rules to enable practical querying. Bioinformatics, Jun 1;27(11):1562-8.
Ekseth, O., Lindi, B., Kuiper, M., and Mironov, V. (2010). TurboOrtho – a high performance alternative to OrthoMCL. European Conference on Computational Biology: September 2010; Ghent.
Heath, T., and Bizer, C. (2011). Linked Data: Evolving the Web into a Global Data Space (1st edition). Synthesis Lectures on the Semantic Web: Theory and Technology, 1:1, 1-136. Morgan & Claypool.
Kanehisa, M., and Goto, S. (2000). KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res, 28:27-30.
Kerrien, S., Alam-Faruque, Y., Aranda, B., Bancarz, I., Bridge, A., Derow, C., Dimmer, E., Feuermann, M. et al. (2007). IntAct - open source resource for molecular interaction data. Nucleic Acids Res, 35:D561-565.
Kerrien, S., Orchard, S., Montecchi-Palazzi, L., Aranda, B., Quinn, A. F., Vinod N, Bader, G. D., Xenarios, I. et al. (2007). Broadening the horizon--level 2.5 of the HUPO-PSI format for molecular interactions. BMC Biol. 9;5:44.
Li, L., Stoeckert, C. J. Jr., Roos, D. S. (2003). OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res, 13:2178-2189.
Magrane, M. and the UniProt Consortium (2011). UniProt Knowledgebase: a hub of integrated protein data. Database, 2011.
Smith, B., Ashburner, M., Rosse, C., Bard, J., Bug, W., Ceusters, W., Goldberg, L. J., Eilbeck, K. et al. (2007). The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nat Biotechnol, 25(11), 1251–1255.
Wheeler, D. L., Barrett, T., Benson, D. A., Bryant, S. H., Canese, K., Church, D.M., DiCuccio, M. et al. (2005). Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 2005, 33:D39-45.