=Paper=
{{Paper
|id=None
|storemode=property
|title=Towards an integrated knowledge system for capturing gene expression events
|pdfUrl=https://ceur-ws.org/Vol-897/session4-paper17.pdf
|volume=Vol-897
|dblpUrl=https://dblp.org/rec/conf/icbo/VenkatesanMK12
}}
==Towards an integrated knowledge system for capturing gene expression events==
<pdf width="1500px">https://ceur-ws.org/Vol-897/session4-paper17.pdf</pdf>
<pre>
     Towards an integrated knowledge system for capturing gene
                         expression events


                          Aravind Venkatesan† Vladimir Mironov†* and Martin Kuiper
                                      Department of Biology, NTNU, 7491 Trondheim, Norway


* To whom correspondence should be addressed: mironov@nt.ntnu.no
† These authors contributed equally


ABSTRACT
Transcriptional regulation of gene expression is an important mechanism in many biological
processes. Aberrations in this mechanism have been implicated in cancer and other diseases.
Effective investigation of gene expression mechanisms requires a system-wide integration and
assessment of all available knowledge of the underlying molecular networks. This calls for a
method that effectively manages and integrates the available data. We have built a semantic
web based knowledge system that constitutes a significant step in this direction: the Gene
Expression Knowledge Base (GeXKB). The GeXKB encompasses three application ontologies: the
Gene Expression Ontology (GeXO), the Regulation of Gene Expression Ontology (ReXO), and the
Regulation of Transcription Ontology (ReTO). These three ontologies, respectively, integrate
gene expression information that is increasingly more specific, yet decreasing in coverage,
from a variety of sources. The system is capable of answering complex biological questions
with respect to gene expression and in this way facilitates the formulation or assessment of
new hypotheses. Here we discuss the architecture of these ontologies and the data integration
process and provide examples demonstrating the utility thereof. The knowledge base is freely
available for download and can be queried through a SPARQL endpoint
(http://www.semantic-systems-biology.org/apo/).

1    INTRODUCTION
Research in the Life Sciences is supported by a plethora of databases (see overview at
www.pathguide.org). Moreover, the continuing advancements in functional genomics technologies
make it possible to create an overwhelming amount of data in a single experiment. The many
hypotheses that can be derived from such experiments must be assessed against a multitude of
information and knowledge bases, often represented in a variety of formats. Scientists
therefore become increasingly dependent on sophisticated computer technologies to integrate
and manage all the available information. Furthermore, the drastic increase in the available
information and a lack of adherence to accepted formal representations across the disparate
knowledge bases allow only a fraction of the knowledge to be easily considered in the analysis
of new data, or force a user to query many databases individually, sometimes even without the
support of ontology terms that would warrant a common semantics of queries in different
databases. As discussed by Antezana et al. (2009), application ontologies can facilitate the
query process itself as the ontology ensures a uniform semantics across all data.

1.1    Need for an integrated resource that captures gene expression knowledge
Transcriptional gene expression and its regulation depend on a large variety of cellular
processes that control the timing and level of transcription of an individual gene, often in
a cell- or condition-specific manner. Regulation of the expression of protein coding genes is
extensively studied. Gene expression falls into two main phases, i.e. transcription and
translation. During the process of transcription, proteins called transcription factors bind
to specific DNA sequence motifs (binding sites) of a gene, playing a key role in initiating or
inhibiting the formation of an active RNA Polymerase II transcription complex. Active
transcription produces pre-mRNAs which are subsequently processed (removal of introns, and
polyadenylation of the transcript), upon which mature mRNAs are transported from the nucleus
to the cytoplasm where the mRNA is translated into a protein. Regulatory processes of gene
expression occur at different levels, enabling the cell to adapt to different conditions by
controlling its structure and function. Furthermore, the process of gene expression may also
be influenced at the epigenetic level, where nucleotide or protein modifications can cause
heritable changes in expression of otherwise identical gene sequences. Abnormalities in the
regulation of gene expression can cause diseases, for example those involving malignant cell
proliferation.

The knowledge required to decipher the various processes involved in gene expression continues
to grow. However,


Venkatesan et al.


for a systems-wide understanding of gene regulation, there is a need to capture knowledge of
this domain efficiently and in its entirety, and to facilitate efficient querying of these
data. For instance, the complex one-to-many relationships of a transcription factor like Myc
include thousands of target genes, representing a wide variety of functions and processes. An
ontology-driven approach would best solve the issue of knowledge querying, representation and
management. Previously, attempts have been made to model the gene regulation process,
resulting in the Gene Regulation Ontology (GRO) (Beisswanger et al., 2008). GRO provides a
conceptual model to represent common knowledge about the gene regulation domain. However, it
was primarily built as a scaffold for knowledge-intensive natural language processing (NLP)
tasks and lacks the granularity in concepts needed for advanced querying and hypothesis
generation.

We have built a system that integrates existing ontologies relevant for the domain of gene
expression to support the discovery of new scientific knowledge. We have named this knowledge
system the Gene Expression Knowledge Base (GeXKB). This system is conceived as part of the
Semantic Systems Biology (SSB) (http://www.semantic-systems-biology.org) initiative and
comprises at the current stage three application ontologies that capture the knowledge about
gene expression, namely the Gene Expression Ontology (GeXO), the Regulation of Gene Expression
Ontology (ReXO) and the Regulation of Transcription Ontology (ReTO).

2    GEXKB OBJECTIVES AND DESIGN PRINCIPLES
GeXKB is designed to provide the molecular biologist with a knowledge system that captures
knowledge on a variety of aspects of the gene expression process. To this end it should be
able to provide answers to questions like:
       • 'Which are the proteins that act as chromatin remodeling proteins and as modulators
         of transcription factor activity?'
       • 'Which are the proteins that participate in two successive regulatory pathways?'
       • 'Which are the transcription factors (Human) that are located in the cytoplasm?'
The following design principles were followed in the process of GeXKB development:
       • 'is a' completeness
       • 'all-some' semantics
       • only classes used for modelling of the domain of discourse (see Table 1)
       • maximal flexibility both for users and for future extensions

3    GEXKB ARCHITECTURE AND CONSTRUCTION
The core of the three ontologies is built of terms from a number of well-established
biomedical ontologies, first of all GO (Ashburner et al., 2000) and the Molecular Interactions
ontology (Kerrien et al., 2007). The core is used to integrate data from GOA (Barrell et al.,
2009), the IntAct database (Kerrien et al., 2007), KEGG (Kanehisa and Goto, 2000), UniProtKB
(Magrane and UniProt consortium, 2011) and NCBI Gene (Wheeler et al., 2005). In the subsequent
sections we describe the architecture and the main features of the ontologies.

3.1    Data integration pipeline
The ontologies were built using an automated pipeline implemented with the use of the library
ONTO-PERL (Antezana et al., 2008).


                                    Figure 1: The figure illustrates the seed ontology of GeXO.
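
The 'is a' completeness principle listed in Section 2 can be illustrated with a small check.
The sketch below is not part of the GeXKB pipeline (which is built on ONTO-PERL); it is a
minimal Python illustration under stated assumptions. The term IDs ('T:1' and so on) and both
function names are hypothetical. It finds terms that have no 'is a' path to a chosen root and
then attaches them under a chosen parent term so that the hierarchy becomes complete.

```python
def terms_without_is_a_path(is_a_edges, root):
    """Return, sorted, the terms that cannot reach `root` via 'is a' links.

    `is_a_edges` is an iterable of (child, parent) pairs.
    """
    parents = {}
    terms = {root}
    for child, parent in is_a_edges:
        parents.setdefault(child, set()).add(parent)
        terms.update((child, parent))

    def reaches_root(term):
        # Depth-first walk upwards along the 'is a' links.
        stack, seen = [term], set()
        while stack:
            current = stack.pop()
            if current == root:
                return True
            if current in seen:
                continue
            seen.add(current)
            stack.extend(parents.get(current, ()))
        return False

    return sorted(t for t in terms if not reaches_root(t))


def attach_orphans(is_a_edges, root, parent):
    """Return a new edge list in which every orphan term 'is a' `parent`.

    Assumption of this sketch: `parent` itself reaches `root` (it may be the root).
    """
    edges = list(is_a_edges)
    for orphan in terms_without_is_a_path(edges, root):
        edges.append((orphan, parent))
    return edges


# Hypothetical toy hierarchy: T:4 and T:5 are not connected to the root T:1.
toy_edges = [("T:2", "T:1"), ("T:3", "T:2"), ("T:5", "T:4")]
orphans = terms_without_is_a_path(toy_edges, "T:1")
completed = attach_orphans(toy_edges, "T:1", "T:1")
```

In this toy example the orphan terms are attached directly to the root; an ontology builder
could equally attach them to a dedicated auxiliary term, provided that term has its own
'is a' path to the root.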


3.1.1 Building seed ontologies:
GeXO, ReXO and ReTO share a common Upper Level Ontology (ULO), which provides a general
scaffold for data integration. It was developed on the basis of the Semanticscience Integrated
Ontology (SIO) (http://code.google.com/p/semanticscience/wiki/SIO) with the addition of a few
terms from other ontologies. The origin of the terms is preserved in external references. The
ULO is generated on the fly by the pipeline and does not exist as an individual artifact. The
upper level term IDs are of the form 'SSB:nnnnnnn'.

The ULO is then merged with GO (domain-specific fragments of Biological Process, complete
Cellular Component, complete Molecular Function), MI ('interaction type' branch), and the
Biorel ontology (Blondé et al., 2011). This yields three ontologies referred to as seed
ontologies. To be more specific, in order to build the seed ontology for GeXO, the term 'gene
expression' (GO:0010467) and all its descendants are imported. For ReXO and ReTO the
corresponding GO terms are 'regulation of gene expression' (GO:0010468) and 'regulation of
transcription, DNA-dependent' (GO:0006355). We refer to these three terms as sub-roots. Each
of them is connected to the ULO as a subclass of 'biological process'. To ensure 'is a'
completeness, each of the ontologies is complemented with an auxiliary term ('gene expression
process' (GeXO:0000001), 'process of regulation of gene expression' (ReXO:0000001) and
'process of regulation of DNA-dependent transcription' (ReTO:0000001), respectively), which
becomes the parent of all the terms that did not have an 'is a' path to the sub-root. Apart
from this, the three seed ontologies are structurally identical (Figure 1).

3.1.2 Building species-specific intermediate ontologies:
The GeXKB ontologies support three model organisms: Homo sapiens, Mus musculus and Rattus
norvegicus. The corresponding three species-specific intermediate ontologies were developed in
the following steps:

    (1) For each species GOA annotations are used to extract all the associations involving
        domain-specific Biological Process terms incorporated in the previous phase. The
        corresponding proteins are added as child terms to the upper level term 'protein'
        (SSB:0001211) and referred to as 'core proteins' hereafter.
    (2) From the IntAct database all the interactions involving at least one of the core
        proteins are retrieved and incorporated into the knowledge base along with their
        pertinent information. This results in a further extension of the set of proteins in
        the KB.

3.1.3 Building the complete ontologies:
This is the final phase in the generation of the ontologies, which proceeds as follows:

    (1) The species-specific ontologies (from the previous step) are merged together.
    (2) From the KEGG database all the pathways involving at least one of the core proteins
        are extracted and incorporated in the KB along with the pertinent information. The
        pathway terms become children of the term 'SSB:0011221' ('pathway',
        'BioPAX:Pathway'). The corresponding KEGG orthology groups are incorporated as
        children of the term 'protein cluster' (SSB:0001122). This step results in a second
        extension of the set of proteins.
    (3) Putative orthology relationships were computed with the use of the high-performance
        library TurboOrtho (Ekseth et al., 2010), a multi-threaded C++ implementation of the
        OrthoMCL algorithm (Li et al., 2003). The relations including core proteins are added
        to the KB, leading to the final extension of the set of proteins.
    (4) The set of proteins in the GeXKB was finally augmented with:
          • GOA annotations for Cellular Components and Molecular Functions,
          • Additional information (e.g. protein modifications) from UniProtKB,
          • The corresponding genes along with the pertinent information from NCBI.

The final result is the three ontologies in the OBO (Smith et al., 2007) format.

3.1.4 Enhancing the utility of the ontologies:
    (1) Transitive closures were constructed with the use of the library ONTO-PERL for the
        following relation types: 'is a', 'part of', 'regulates'.
    (2) The ontologies were exported in a number of formats: RDF, OWL, XML, and DOT.
    (3) The RDF exports were used to populate a triple store (OpenLink Virtuoso), see Table 2.
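
The transitive closures mentioned in Section 3.1.4 can be illustrated with a short sketch. It
is written in plain Python for illustration only and does not reproduce the ONTO-PERL
implementation used by the authors. It assumes triples are given as (subject, relation,
object) tuples and closes one relation type at a time; composition rules across different
relation types are deliberately left out. The function name and the toy terms are
hypothetical.

```python
from collections import defaultdict


def transitive_closure(triples, relation):
    """Return all (subject, relation, object) triples implied by transitivity.

    Only triples whose predicate equals `relation` are considered; the result
    contains the asserted triples plus every triple reachable by chaining them.
    """
    successors = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == relation:
            successors[subject].add(obj)

    closed = set()
    for start in list(successors):
        # Depth-first traversal collecting every node reachable from `start`.
        stack = list(successors[start])
        reachable = set()
        while stack:
            node = stack.pop()
            if node in reachable:
                continue
            reachable.add(node)
            stack.extend(successors.get(node, ()))
        closed.update((start, relation, target) for target in reachable)
    return closed


# Hypothetical toy graph: A is_a B, B is_a C, C part_of D.
toy_triples = [("A", "is_a", "B"), ("B", "is_a", "C"), ("C", "part_of", "D")]
is_a_closure = transitive_closure(toy_triples, "is_a")
```

Storing the inferred triples next to the asserted ones is what allows a plain triple-pattern
query to find indirect ancestors without any reasoning at query time, at the cost of a larger
graph (compare the triple counts of the graphs with and without closures in Table 2).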




Ontology    No. of classes    No. of relations    No. of instances
GeXO        168417            15                  0
ReXO        152962            15                  0
ReTO        141095            15                  0

       Table 1: An overview of the ontologies in GeXKB


3.2    GeXKB and the Semantic Web
The Semantic Web (Berners-Lee and Hendler, 2001) is an extension of the WWW which aims at
building a web of data accessible both by computers and human beings. This new technology is
increasingly gaining momentum, in particular in the domain of Life Sciences (Antezana et al.,
2009).

In order to make use of these new technologies, the RDF versions of the ontologies have been
loaded into OpenLink Virtuoso (http://virtuoso.openlinksw.com) and can be accessed via a
SPARQL query page (http://www.semantic-systems-biology.org/apo/queryingcco/sparql). In
contrast to other Semantic Web formalisms, such as OWL, RDF enables handling of large amounts
of knowledge due to its simple and flexible syntax, making querying tractable. However, on the
downside the low expressivity of RDF/RDFS imposes limitations on the inferencing over the
knowledge base. To overcome this limitation, Blondé et al. (2011) have developed a novel
approach for semi-automated reasoning on RDF stores with the use of the SPARUL update language
(http://www.w3.org/TR/sparql11-update/). This allows for pre-computing the inferences
supported by the store, thus making implicit knowledge explicit and available for querying. In
order to provide maximum flexibility for querying, two graphs are available for each of the
ontologies, with or without closures (e.g. GeXO-tc and GeXO, 'tc' standing for 'total
closure').

The most convincing evidence of the success of the Semantic Web is the quick expansion of the
Linked Data cloud (Heath and Bizer, 2011). In the course of the design of GeXKB a number of
decisions were made to facilitate the eventual migration of GeXKB to the Linked Data cloud.
For instance, we have re-used original IDs as much as possible. If the original IDs include a
name-space (e.g. GO, MI) they were adopted without any modifications, otherwise the IDs were
prepended with a name-space (for example UPKB for UniProtKB or NCBIgn for NCBI Gene),
separated by a colon from the original ID (the colons are replaced with underscores in the RDF
renderings). The re-use of the IDs also benefits the users through faster query execution and
the familiarity of the IDs. Furthermore, in compliance with the Linked Data recommendations we
minted the URIs in our own common name-space, http://www.semantic-systems-biology.org/, and
have consistently used rdfs:label properties to aid human readability of the results.

RDF graphs       GeXO            GeXO-tc        ReXO          ReXO-tc          ReTO            ReTO-tc
No. of triples   ~3.3 million    ~23 million    ~3 million    ~19.9 million    ~2.8 million    ~19.1 million

Table 2: The number of triples in the individual graphs of GeXKB

4    QUERYING GEXKB
In this section we demonstrate the utility of GeXKB with the help of a few example SPARQL
queries. These queries are available as a part of a list of sample queries provided on the
query page (http://www.semantic-systems-biology.org/apo/queryingcco/sparql). To query GeXKB,
the base URI and the prefixes are set and the SELECT block specifies the variables to be part
of the solution. The RDF triple pattern queried is defined in the WHERE block. The queries are
as follows:

Q1: (see Table 3)
Biological question: Which proteins can act as chromatin remodeling proteins and as modulators
of transcription factor activity?
SPARQL query:

BASE <http://www.semantic-systems-biology.org/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ssb: <SSB#>
PREFIX taxon: <SSB#NCBItx_9606>
PREFIX graph1: <ReXO>
PREFIX graph2: <ReTO-tc>
SELECT distinct ?protein_id ?protein_name
WHERE {
 GRAPH graph1: {
  ?protein_id ssb:is_a ssb:SSB_0001211 .
  ?b_process ssb:is_a ssb:GO_0040029 .
  ?b_process ssb:has_participant ?protein_id .
  ?protein_id ssb:has_source taxon: .
 }
 GRAPH graph2: {
  ssb:GO_0034401 ssb:has_participant ?protein_id .
  ?protein_id rdfs:label ?protein_name .
 }
}
LIMIT 4
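
The terms used in the queries (for example ssb:GO_0040029, or the protein URIs in Table 3)
follow the identifier convention described in Section 3.2. The sketch below shows one way this
convention could be expressed in Python; it is an illustration, not the code used to build
GeXKB. The function names are hypothetical, and the 'SSB#' path segment is taken from the
prefixes and result URIs shown in this paper.

```python
BASE_URI = "http://www.semantic-systems-biology.org/"


def qualify_id(original_id, namespace=None):
    """Return a name-spaced ID of the form 'NS:local'.

    IDs that already carry a name-space (e.g. 'GO:0010467') are kept as-is;
    otherwise the supplied name-space (e.g. 'UPKB') is prepended with a colon.
    """
    if ":" in original_id:
        return original_id
    if namespace is None:
        raise ValueError(f"no name-space given for bare ID {original_id!r}")
    return f"{namespace}:{original_id}"


def to_uri(qualified_id):
    """Render a name-spaced ID as a URI, replacing colons with underscores."""
    return f"{BASE_URI}SSB#{qualified_id.replace(':', '_')}"


# A GO term keeps its own name-space; a UniProtKB accession receives 'UPKB'.
go_uri = to_uri(qualify_id("GO:0040029"))
protein_uri = to_uri(qualify_id("Q9NS37", "UPKB"))
```

With this convention a user who knows the original accession of a protein or GO term can
write the corresponding URI directly into a query, which is the familiarity benefit the
section refers to.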


Q2:
Biological question: Which proteins participate in both the JAK/STAT signaling pathway and
Apoptosis?
SPARQL query:

BASE <http://www.semantic-systems-biology.org/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ssb: <SSB#>
PREFIX taxon: <SSB#NCBItx_9606>
PREFIX pathway1: <SSB#KEGG_ko04630>
PREFIX pathway2: <SSB#KEGG_ko04210>
PREFIX graph: <GeXO>
SELECT distinct ?protein
WHERE {
 GRAPH graph: {
  ?prot_id ssb:is_a ssb:SSB_0001211 .
  ?prot_id ssb:is_member_of ?cluster .
  pathway1: ssb:has_agent ?cluster .
  ?prot_id ssb:has_source taxon: .
 }
 GRAPH graph: {
  ?prot_id ssb:is_member_of ?cluster .
  pathway2: ssb:has_agent ?cluster .
  ?prot_id rdfs:label ?protein .
 }
}

Q3:
Biological question: Which are the transcription factors (Human) that are located in the
cytoplasm?
SPARQL query:

BASE <http://www.semantic-systems-biology.org/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ssb: <SSB#>
PREFIX taxon: <SSB#NCBItx_9606>
PREFIX location: <SSB#GO_0005737>
PREFIX graph: <ReTO-tc>
SELECT distinct ?protein ?protein_name
WHERE {
 GRAPH graph: {
  ?protein ssb:is_a ssb:SSB_0001211 .
  ?protein rdfs:label ?protein_name .
  ssb:GO_0006355 ssb:has_participant ?protein .
  ?protein ssb:has_function ?function .
  ?function ssb:is_a ssb:GO_0003700 .
  location: ssb:contains ?protein .
  ?protein ssb:has_source taxon: .
 }
}

These queries offer just a glimpse of the repertoire of biological questions that can be
addressed to the knowledge system. In addition, users could also query the knowledge base in
combination with other complementary semantic web resources to formulate advanced queries for
hypothesis generation. This could be performed through the query federation features that are
included in the latest version of SPARQL (ver. 1.1) and will be explored in the future.

Protein ID                                                   Protein Name
http://www.semantic-systems-biology.org/SSB#UPKB_Q9NS37      ZHANG_HUMAN
http://www.semantic-systems-biology.org/SSB#UPKB_P14373      TRI27_HUMAN
http://www.semantic-systems-biology.org/SSB#UPKB_Q62158      TRI27_MOUSE
http://www.semantic-systems-biology.org/SSB#UPKB_P17947      SPI1_HUMAN

Table 3: The table shows the results for Q1

5    CONCLUSION
The drastic increase in the amount of data generated in the fields of molecular biology and
biomedicine requires efficient knowledge management practices. Ontologies certainly provide a
robust method to integrate data and efficiently represent specific (sub)domain knowledge. With
the creation of GeXKB, we have built a knowledge system that specifically supports researchers
focusing on various aspects of gene expression. The three ontologies provide the user with the
flexibility of choosing an ontology depending on the breadth and specificity of information
needed. Further flexibility is afforded by a range of available formats for knowledge
representation (OBO, RDF, OWL), data exchange (XML), and visualisation (DOT).

The presented examples demonstrate the utility of our knowledge base with respect to answering
realistic domain-specific questions, and this utility is expected to grow with its further
development. The primary goal will be to augment the knowledge base with additional
high-quality, curated sources of information with documented transcription factor function and
relations between transcription factors and their target genes.

ACKNOWLEDGEMENTS
This work is funded by the Norwegian University of Science and Technology (NTNU), Norway. AV
was funded by the Faculty of Natural Science and Technology and VM was funded by FUGE
Mid-Norway.

REFERENCES
Antezana, E., Egaña, M., De Baets, B., Kuiper, M., and Mironov, V. (2008). ONTO-PERL: an API
    for supporting the development and analysis of bio-ontologies. Bioinformatics,
    24(6):885-7.




Antezana, E., Kuiper, M., and Mironov, V. (2009). Biological knowledge management: the
    emerging role of the Semantic Web technologies. Brief Bioinform, 10(4):392-407.
Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A.
    P., Dolinski, K. et al. (2000). Gene ontology: tool for the unification of biology. Nature
    Genetics, 25:25-9.
Barrell, D., Dimmer, E., Huntley, R. P., Binns, D., O'Donovan, C., and Apweiler, R. (2009).
    The GOA database in 2009--an integrated Gene Ontology Annotation resource. Nucleic Acids
    Res, 37:D396-D403.
Beisswanger, E., Lee, V., Kim, J. J., Rebholz-Schuhmann, D., Splendiani, A., Dameron, O.,
    Schulz, S., and Hahn, U. (2008). Gene Regulation Ontology (GRO): design principles and use
    cases. Stud Health Technol Inform, 136:9-14.
Berners-Lee, T. and Hendler, J. (2001). Publishing on the semantic web. Nature, 410:1023-4.
Blondé, W., Mironov, V., Venkatesan, A., Antezana, E., De Baets, B., and Kuiper, M. (2011).
    Reasoning with bio-ontologies: using relational closure rules to enable practical
    querying. Bioinformatics, 27(11):1562-8.
Ekseth, O., Lindi, B., Kuiper, M., and Mironov, V. (2010). TurboOrtho – a high performance
    alternative to OrthoMCL. European Conference on Computational Biology, September 2010,
    Ghent.
Heath, T., and Bizer, C. (2011). Linked Data: Evolving the Web into a Global Data Space (1st
    edition). Synthesis Lectures on the Semantic Web: Theory and Technology, 1:1, 1-136.
    Morgan & Claypool.
Kanehisa, M., and Goto, S. (2000). KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic
    Acids Res, 28:27-30.
Kerrien, S., Orchard, S., Montecchi-Palazzi, L., Aranda, B., Quinn, A. F., Vinod, N., Bader,
    G. D., Xenarios, I. et al. (2007). Broadening the horizon--level 2.5 of the HUPO-PSI
    format for molecular interactions. BMC Biol, 5:44.
Kerrien, S., Alam-Faruque, Y., Aranda, B., Bancarz, I., Bridge, A., Derow, C., Dimmer, E.,
    Feuermann, M. et al. (2007). IntAct - open source resource for molecular interaction data.
    Nucleic Acids Res, 35:D561-565.
Li, L., Stoeckert, C. J. Jr., and Roos, D. S. (2003). OrthoMCL: identification of ortholog
    groups for eukaryotic genomes. Genome Res, 13:2178-2189.
Magrane, M. and the UniProt consortium (2011). UniProt Knowledgebase: a hub of integrated
    protein data. Database, 2011.
Smith, B., Ashburner, M., Rosse, C., Bard, J., Bug, W., Ceusters, W., Goldberg, L. J.,
    Eilbeck, K. et al. (2007). The OBO Foundry: coordinated evolution of ontologies to support
    biomedical data integration. Nat Biotechnol, 25(11):1251–1255.
Wheeler, D. L., Barrett, T., Benson, D. A., Bryant, S. H., Canese, K., Church, D. M.,
    DiCuccio, M. et al. (2005). Database resources of the National Center for Biotechnology
    Information. Nucleic Acids Res, 33:D39-45.



</pre>