=Paper= {{Paper |id=Vol-513/paper-8 |storemode=property |title=Sys-Bio Gateway: a Framework of Bioinformatics Database Resources Oriented to Systems Biology |pdfUrl=https://ceur-ws.org/Vol-513/paper08.pdf |volume=Vol-513 |dblpUrl=https://dblp.org/rec/conf/iwsg/MilanesiAMVDM09 }} ==Sys-Bio Gateway: a Framework of Bioinformatics Database Resources Oriented to Systems Biology== https://ceur-ws.org/Vol-513/paper08.pdf
    IWPLS'09

    Sys-Bio Gateway: a framework of bioinformatics database
    resources oriented to systems biology
    Luciano Milanesi (1,*), Roberta Alfieri (1), Ettore Mosca (1), Federica Viti (1),
    Pasqualina D'Ursi (1), Ivan Merelli (1)
    (1) Institute for Biomedical Technologies-CNR, via Fratelli Cervi 93, 20090 Segrate (Mi)
    (*) Corresponding author


    Associate Editors: Sandra Gesing and Jano van Hemert

ABSTRACT

In this paper we present SysBio-Gateway, a framework of standard solutions for data integration
in bioinformatics and Systems Biology. In the context of several research projects, we developed
a set of databases and related web interfaces that cover many levels of biological system
complexity. Furthermore, several analysis tools have been developed and integrated in the web
resources in order to cope with problems of data integration and mining in a systems
perspective. The SysBio-Gateway, freely accessible at the URL
http://www.itb.cnr.it/sysbio-gateway, offers access to all the presented resources, which
concern different levels of organization of living systems, from genes to organs, passing
through proteins, protein families, cellular processes, tissues, and pathologies.

1    INTRODUCTION

        Data integration is nowadays an essential task for achieving a view of biological
knowledge that is as complete as possible. This is very important in a discipline such as
bioinformatics, in which data are growing at lightning speed thanks to novel bio-molecular
high-throughput techniques. In systems biology in particular, the integration of biological
knowledge related to different levels (such as genomics, transcriptomics, proteomics, and
network interactions) is crucial to support the mathematical modelling and the computer
simulation of biological pathways.
        Data integration can be defined as the process of combining information residing at
different sources to provide the user with a unified view of these data, making it possible to
achieve real knowledge. Through data integration, experimental researchers and computer
scientists can discover new and interesting relationships that allow better and faster
experimental decisions, for example about protein targets and drug molecules. Moreover,
achieving interesting results in most bioinformatics and systems biology-related activities,
from the functional characterization of genomic and proteomic data to the development of
mathematical models of biological processes, requires an integrated view of all the information
relevant to those tasks.
        The data integration problem can be described by three levels of complexity. The first
level is the integration of information from heterogeneous resources, collecting data from
different databases to allow a unified query schema. The second level consists in identifying
correlative associations across different datasets, generally with the support of an ontology,
to provide a comprehensive view of the same objects in light of different data sources. The
third level is mapping the information gained about interacting objects into networks and
pathways that may be used as basic models for the underlying cellular systems.
        In this paper we give an overview of solutions for data integration, developed by our
bioinformatics laboratory in the context of bioinformatics and Systems Biology, that cover many
levels of complexity: from proteins to cellular processes, from tissues to diseases and organs.
A set of databases and web resources oriented to different biological topics are presented here
in a unified framework, to provide some practical examples of our experience in this field. In
particular we present resources related to specific a) biological processes, such as cell cycle,
b) pathologies, such as breast cancer, c) organs, such as brain, d) protein families, such as
protein kinases, e) protein mutations, and f) tissues. The solutions proposed here are based on
a common data integration methodology that we adopted to support different biological projects.
Several tools have been developed ad hoc to cope with different problems arising both from the
increasing amount of experimental data and from the need to improve knowledge in a systems
perspective.
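To make the first of the three integration levels more concrete, the following minimal sketch loads records from two differently structured sources into one table and answers a single query over both. It is an illustration only, not the SysBio-Gateway implementation: it assumes Python with an in-memory SQLite database, and the source layouts, the `annotation` table, and all gene and term identifiers are hypothetical placeholders.

```python
import sqlite3

# Hypothetical source A: dictionaries with mixed-case gene names.
SOURCE_A = [
    {"gene_symbol": "Gene1", "term": "term_a1"},
    {"gene_symbol": "Gene2", "term": "term_a2"},
]
# Hypothetical source B: (gene, term) tuples with lower-case gene names.
SOURCE_B = [("gene1", "term_b1"), ("gene3", "term_b3")]


def build_warehouse(source_a, source_b):
    """Transform both sources into one schema and load them into SQLite."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE annotation (gene TEXT, source TEXT, term TEXT)")
    # Normalise gene names to upper case so the two sources can be joined.
    rows = [(r["gene_symbol"].upper(), "A", r["term"]) for r in source_a]
    rows += [(gene.upper(), "B", term) for gene, term in source_b]
    conn.executemany("INSERT INTO annotation VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def annotations_for(conn, gene):
    """Unified query: every annotation of a gene, whatever its source."""
    cursor = conn.execute(
        "SELECT source, term FROM annotation WHERE gene = ? ORDER BY source, term",
        (gene.upper(),),
    )
    return cursor.fetchall()


warehouse = build_warehouse(SOURCE_A, SOURCE_B)
print(annotations_for(warehouse, "gene1"))
```

The point of the sketch is the transformation step: once both sources are mapped onto the same columns and the same naming convention, one query returns the annotations contributed by each of them.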
2    METHODS

        All the presented databases rely on a data warehouse approach, which requires the
collection and transformation of heterogeneous data coming from different sources to make them
accessible to the scientific community through a unified query schema. This model is typical of
data integration and differs from the normalized databases, designed to support data integrity,
which are widely used to maintain primary resources. While many data warehouse solutions used in
bioinformatics provide generic query interfaces applicable to all the data they contain, our
system allows the construction of queries that extract and filter information derived from the
original resources. This integration improves the usability of the information, but data must be
fitted into a single format that takes into account the relationships between the different
sources.
        Our relational databases are always managed by a MySQL server. The primary data are
collected with a series of Perl scripts which retrieve data from external resources, transform
them into a compliant format and load them into the warehouse data model. The developed
resources are all freely accessible through web interfaces made up of a set of HTML pages
dynamically generated by PHP scripts, which provide information in reports tailored to the
specific requirements of the biological problem of interest. Besides the integration problem,
the analysis of large quantities of experimental data in a Systems Biology perspective has been
tackled by developing a number of specific tools integrated in the web resources. Some examples
available through the SysBio-Gateway are:

      •    a high performance tool for the simulation of Ordinary Differential Equations,
           relying on mathematical models based on Xppaut
      •    a tool for the visualization of Protein Data Bank protein structures and the
           correlated Connolly surfaces
      •    a set of tools for the analysis of protein-protein interaction networks (search for
           the first neighbourhood, search for the shortest path and common annotations)
      •    a tool for modelling protein mutants starting from Single Nucleotide Polymorphism
           data, which relies on Modeller
      •    a tool for pathological image processing oriented to support tissue microarray
           analysis.

        In order to guarantee data sharing and structuring, the developed databases are enriched
by a crucial underlying feature: ontological support. The ontologies exploited concern all
levels of molecular biology, from genes to proteins to pathways, and also cover tissue and
disease aspects. As an example, we used Gene Ontology for gene annotation and KEGG Pathway
Ontology (derived from the hierarchical organization of KEGG pathways) for biological networks.
Because their intrinsic structure is more complex than that of a simple recognized vocabulary,
ontologies are used in the developed resources to enrich the informative content. Ontologies not
only provide a commonly accepted vocabulary, which facilitates data sharing and information
querying, but also increase the performance of statistical and analytical studies. The
hierarchical graphs, which represent the backbone of ontologies, can in fact support the
generation of novel knowledge, leading to the creation of new relationships among biological
entities or to the deduction of new associations.
        In the context of Systems Biology, another service provided by our infrastructure is a
high performance framework for the numerical simulation of molecular models, with an associated
tool for estimating parameter values. Uncertainty in parameter values is one of the greatest
problems in the development of new cellular models, and in silico parameter estimation is one of
the most commonly adopted solutions. Our system provides a global optimization algorithm, based
on an evolution strategy, that can be employed on top of the simulation engine to accomplish
parameter estimation for complex mathematical models. Due to the high computational load needed
for the parameter estimation of large networks, we implemented this system using a distributed
paradigm which can also be used in the context of grid computing technologies. Relying on a data
parallel approach, it is possible to handle the parameter estimation of complex models using
large computational facilities, which is very useful, for example, in the case of cell cycle
related models.

3    RESULTS

        We propose here a standard solution for the implementation of bioinformatics frameworks,
collecting our experiences in the SysBio-Gateway, which embraces different levels of
organization of living systems by linking, integrating and analysing heterogeneous data that
lead vertically from genes to organs, passing through proteins, protein families, cellular
processes, tissues, and pathologies. This approach has been tested on different database
implementations, such as:

      •    Cell Cycle
      •    G2S Breast Cancer
      •    Gene Nerve Cell
      •    Kinweb
      •    ProCMD
      •    TMA Rep

        SysBio-Gateway is freely accessible at the URL http://www.itb.cnr.it/sysbio-gateway.
From this page (Fig. 1) the user can directly access the resources held in the gateway, and
handle and analyse data according to specific demands coming from the bioinformatics and Systems
Biology communities.

3.1    Cell Cycle Database

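The Methods describe an ODE simulation service with parameter estimation based on an evolution strategy, and note its usefulness for cell cycle related models. The sketch below is a toy, single-process illustration of that idea, not the authors' engine: a hand-written fixed-step Runge-Kutta integrator stands in for the Xppaut-based simulator, a (1+1) evolution strategy stands in for whichever strategy variant the real system uses, and the one-variable synthesis/degradation model, its parameter names `k_syn` and `k_deg`, and the synthetic "measurements" are all assumptions made for illustration.

```python
import random


def simulate(k_syn, k_deg, x0=0.0, t_end=10.0, dt=0.01):
    """Integrate dx/dt = k_syn - k_deg * x with RK4; return x at every step."""
    def rate(x):
        return k_syn - k_deg * x

    values = [x0]
    x = x0
    for _ in range(int(round(t_end / dt))):
        k1 = rate(x)
        k2 = rate(x + 0.5 * dt * k1)
        k3 = rate(x + 0.5 * dt * k2)
        k4 = rate(x + dt * k3)
        x += dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
        values.append(x)
    return values


def squared_error(params, observed):
    """Sum of squared differences at the observed step indices."""
    values = simulate(params[0], params[1])
    return sum((values[i] - target) ** 2 for i, target in observed)


def estimate(observed, start, sigma=0.2, generations=200, seed=0):
    """(1+1) evolution strategy: keep a mutated candidate only if it is no worse."""
    rng = random.Random(seed)
    best, best_error = list(start), squared_error(start, observed)
    for _ in range(generations):
        # abs() keeps the mutated rate constants non-negative.
        child = [abs(p + rng.gauss(0.0, sigma)) for p in best]
        child_error = squared_error(child, observed)
        if child_error <= best_error:
            best, best_error = child, child_error
    return best, best_error


# Synthetic "measurements" generated from known parameters (k_syn=1.0, k_deg=0.5).
reference = simulate(1.0, 0.5)
observed = [(i, reference[i]) for i in range(0, len(reference), 100)]
start = [2.0, 2.0]
fitted, fitted_error = estimate(observed, start)
print(fitted, fitted_error)
```

Each candidate requires a full simulation, which is why the cost grows quickly with model size; evaluating many candidates independently is also what makes the problem a natural fit for the data parallel, distributed setting mentioned in the Methods.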
        The cell cycle is one of the biological processes most frequently investigated in
systems biology studies, and it involves the knowledge of a large number of genes and networks
of protein interactions. A deep knowledge of the molecular aspects of this biological process
can contribute to making cancer research more accurate and innovative. In this context the
mathematical modelling of the cell cycle plays a relevant role in quantifying the behaviour of
each component of the system. The mathematical modelling of a biological process such as the
cell cycle allows a systemic description that helps to highlight features, such as emergent
properties, which could remain hidden when the analysis is performed only from a reductionist
point of view. Moreover, in modelling complex systems, a complete annotation of all the
components is equally important to understand the interaction mechanisms inside the network: for
this reason data integration of the model components is highly relevant in systems biology
studies.
        The Cell Cycle Database [1] is intended to support systems biology analysis of the cell
cycle process, starting from two groups of organisms, yeast and mammals, that show high
evolutionary molecular conservation. The database integrates information about genes and
proteins involved in the cell cycle process, stores complete models of the interaction networks
and allows the mathematical simulation over time of the quantitative behaviour of each
component. To accomplish this task we developed, on top of the database, a web interface for
browsing information related to cell cycle genes, proteins and mathematical models. In this
framework we implemented a pipeline which allows users to deal with the mathematical part of the
models, in order to solve under different conditions the Ordinary Differential Equation systems
that describe the biological process. In this way the resource is useful both to retrieve
information about cell cycle model components and to analyze their dynamical properties.
        This integrated system aims to become a useful resource for collecting all the
information related to current and future models of this network. The flexibility of the
database allows the addition of mathematical data which are used for simulating the behaviour of
the cell cycle components in the different models. By coupling structural and dynamical
information about models, the Cell Cycle Database makes it possible to investigate system-level
properties, such as stable steady states and oscillations.

3.2    G2S Breast Cancer

        The study of breast cancer and its development involves the knowledge of a large number
of genes and molecular interactions, and thus the systems biology approach is essential to
describe the processes related to the pathology and to perform useful predictions. For the
effective application of the systemic approach it is essential to organize information about
genes, cellular pathways and the interactions they undertake. These annotations are publicly
available in bioinformatics resources and their integration produces an information enrichment,
allowing, for example, the clustering of breast cancer genes by common signatures or the
suggestion of possible annotations for not yet annotated genes. The Genes-to-Systems Breast
Cancer (G2SBC) Database [2] is a bioinformatics resource that collects information about breast
cancer genes, proteins and mathematical models and provides a number of tools to analyse the
integrated data. Protein-protein interaction data are used to suggest new possible annotations,
and the link with the Cell Cycle Database allows the simulation of cell cycle mathematical
models starting from breast cancer molecular alterations. By taking advantage of the multi-level
approach, which considers both the “building-blocks” level (genes and proteins) and the systems
level (molecular and cellular systems), the G2SBC Database enables predictions and the
formulation of new hypotheses.

3.3    Gene Nerve Cell Database

        In the past few years the new research field of neuroinformatics has strongly emerged.
Two main aspects must be highlighted in this context: the interplay of structural, chemical and
electrical signals in nervous tissue, and the importance of modelling such signals. The great
amount of qualitative experimental data in neuroscience represents the starting point to expand
nerve cell modelling in new directions, especially in the development of gene and protein
interaction networks. The aim of this new discipline is thus to bring together the application
of computational models and analytical tools with the improvement of neuroscience knowledge.
Moreover, in the nervous system great importance must be given to the study of the molecular
processes, and the knowledge about the key players (genes and proteins) involved in such
processes should be as complete as possible. The study of nerve cells, neurons and, more
generally, the brain and its development involves information regarding a large number of genes
and molecular interactions. Thus the systems biology approach is essential. Starting from the
integration of gene and protein interaction data with experimental data, it is possible to
develop new discovery strategies in brain studies.
        In this context we propose a new data integration system, the “Gene Nerve Cell Database”
[3], a resource to support neuroinformatics research which contains up-to-date information
regarding the mouse genes which have brain-specific gene expression patterns. The list of genes
specifically expressed in the nervous system was built starting from the Mousebrain Gene
Expression Map (BGEM) and the Allen Brain Atlas. Other genes were taken from the literature,
including the available web resources.

3.4    Kinweb

        Protein kinases are a well defined family of proteins, characterized by the presence of
a common kinase catalytic domain and playing a significant role in many important cellular
processes, such as proliferation, maintenance of cell shape and apoptosis. In many members of
the family, additional non-kinase domains contribute to further specialization, resulting in
subcellular localization, protein binding and regulation of activity, among others.
        About 500 genes encode members of the kinase family in the human genome, and although
many of them represent well known genes, a larger number of genes code for proteins of more
recent identification, or for unknown proteins identified as kinases only after computational
studies.
        A systematic in silico study performed on the human genome led to the identification of
5 genes, on chromosomes 1, 11, 13, 15 and 16 respectively, and 1 pseudogene on chromosome X;
some of these genes are reported as kinases by NCBI but are absent in other databases, such as
KinBase. Comparative analysis of 483 gene regions and subsequent computational analysis, aimed
at identifying unannotated exons, indicate that a large number of kinases may code for
alternately spliced forms or be incorrectly annotated. An automated InterProScan analysis was
performed to study the distribution and combination of domains in the various families. At the
same time, other structural features were also added to the annotation process, including the
putative presence of transmembrane alpha helices and the propensity of cysteines to participate
in a disulfide bridge.
        The predicted human kinome was extended by identifying both additional genes and
potential splice variants, resulting in a varied panorama where functionality may be searched at
the gene and protein level. Structural analysis of kinase protein domains as defined in multiple
sources, together with transmembrane alpha helix and signal peptide prediction, provides hints
for function assignment. The results of the human kinome analysis are collected in the KinWeb
database [4], available for browsing and searching over the internet, where all results from the
comparative analysis and the gene structure annotation are made available, alongside the domain
information. The site provides a comprehensive analysis of the functional domains of each gene
product. For each kinase, GenBank RefSeq and SwissProt entry names are available along with
information about kinase classification (Hanks and Hunter classification). Kinases may be
searched by domain combinations, and the related genes may be viewed in a graphic browser at
various levels of magnification, up to gene organization on the full chromosome set.

3.5    ProCMD

        Activated Protein C (ProC) is an anticoagulant plasma serine protease which also plays
an important role in controlling inflammation and cell proliferation. Several mutations of the
gene are associated with phenotypic functional deficiency of protein C, and with the risk of
developing venous thrombosis. Structure prediction and computational analysis of the mutants
have proven to be a valuable aid in understanding the molecular aspects of clinical
thrombophilia. A specialized relational database and a search tool for natural mutants of
protein C have been built. The database contains 195 entries that include 182 missense and 13
stop mutations. A menu driven search engine allows the user to retrieve the stored information
for each variant, which includes genetic as well as structural data and a multiple alignment
highlighting the substituted position. Molecular models of variants can be visualized with
interactive tools; PDB coordinates of the models are also available for further analysis.
Furthermore, an automatic modelling interface allows the user to generate multiple alignments
and 3D models of new variants.
        ProCMD [5] is an up-to-date interactive mutant database that integrates phenotypical
descriptions with functional and structural data obtained by computational approaches. It will
be useful in the research and clinical fields to help elucidate the chain of events leading from
a molecular defect to the related disease.

3.6    TMA Rep

        The Tissue MicroArray technique is becoming increasingly important in pathology for the
validation of experimental data from transcriptomics analysis. This approach produces many
images which need to be properly managed, if possible exploiting an infrastructure able to
support tissue sharing between institutes. Moreover, the currently available frameworks oriented
to Tissue MicroArray provide good storage for information on clinical patients, sample
treatments and block construction, but their utility is limited by the lack of data integration
with bioinformatic approaches.
        We propose a web oriented Tissue MicroArray system [6] that supports researchers in
managing bio-samples and that, through the use of ontologies, enables tissue sharing in order to
promote the design of TMA experiments and the evaluation of their results. Our system provides
ontological descriptions both for pre-analysis tissue images and for post-process image results,
which is a crucial feature for promoting information exchange. Working on well-defined terms
makes it possible to query web resources for literature articles, in order to integrate
pathology and bioinformatics data.
        Through this system, users associate an ontology-based description with each image
uploaded into the database and also integrate results with the ontological descriptions of the
biosequences identified in each tissue. It is even possible to integrate the ontological
description provided by the user with a fully compliant Gene Ontology definition, enabling
statistical studies of the correlation between the analyzed pathology and the most commonly
related biological processes. Finally, the web site embeds a tool oriented to pre-array tissue
image analysis, specific for tissues affected by tubular breast cancer.

4    CONCLUSION

        In this paper we present an integrated solution to explore part of the information
gained in the field of life science oriented to systems biology. In order to achieve a systemic
perspective on a set of topics of interest to our group, we arrived at this integrated portal,
SysBio-Gateway, which combines a bioinformatics approach (data integration based on a data
warehouse, application of data analysis tools, and study of structural modifications of both
genome and proteins) with a systems biology approach, that is, the study of protein-protein
interaction networks, molecular mathematical models and pathological states from a systemic
point of view.
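As a closing illustration of the protein-protein interaction network analysis referred to above (first neighbourhood, shortest path and common annotations), the following minimal Python sketch shows one plausible way to compute the three operations on an undirected network. It is not the code behind the SysBio-Gateway tools: the breadth-first search, the function names, and the protein and annotation labels (`P1`, `term_x`, ...) are hypothetical placeholders chosen for illustration.

```python
from collections import deque


def build_network(interactions):
    """Build an undirected adjacency map from (protein, protein) pairs."""
    network = {}
    for a, b in interactions:
        network.setdefault(a, set()).add(b)
        network.setdefault(b, set()).add(a)
    return network


def first_neighbourhood(network, protein):
    """Proteins that interact directly with the given protein."""
    return set(network.get(protein, ()))


def shortest_path(network, source, target):
    """Breadth-first search; returns one shortest path or None if unreachable."""
    if source == target:
        return [source]
    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in network.get(node, ()):
            if neighbour in previous:
                continue
            previous[neighbour] = node
            if neighbour == target:
                # Walk back through the predecessors to rebuild the path.
                path = [target]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbour)
    return None


def common_annotations(annotations, a, b):
    """Annotation terms shared by two proteins."""
    return annotations.get(a, set()) & annotations.get(b, set())


# Placeholder data, not taken from any of the databases described here.
network = build_network([("P1", "P2"), ("P2", "P3"), ("P3", "P4"), ("P1", "P5")])
annotations = {"P1": {"term_x", "term_y"}, "P4": {"term_y", "term_z"}}

print(first_neighbourhood(network, "P1"))
print(shortest_path(network, "P1", "P4"))
print(common_annotations(annotations, "P1", "P4"))
```

Shared annotations between the two ends of a short path are the kind of evidence that can be used to suggest candidate annotations for proteins that are not yet annotated.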


ACKNOWLEDGMENTS
This work has been supported by the NET2DRUG, EGEE-III, BBMRI and EDGE European projects and by
the MIUR FIRB LITBIO (RBLA0332RH), ITALBIONET (RBPR05ZK2Z), BIOPOPGEN (RBIN064YAT) and
CNR-BIOINFORMATICS initiatives. We also acknowledge the support of the e-Science Institute in
Edinburgh.

REFERENCES
1. Alfieri R, Merelli I, Mosca E, Milanesi L. (2008) The cell cycle DB: a systems
    biology approach to cell cycle analysis. Nucleic Acids Res. 36(Database issue):
    D641–D645.
2. Mosca E, Alfieri R, Milanesi L, Genes-to-Systems Breast Cancer (G2SBC)
    Database: a data integration approach for breast cancer research oriented to
    systems biology, Sysbiohealth Simposium 2008,
3. Alfieri R, Mosca E, Milanesi L, Gene Nerve Cell DB: a data integration approach
    for neuroinformatics research oriented to systems biology, Sysbiohealth
    Simposium 2008, Bologna, 24-25 November 2008
4. Milanesi L, Petrillo M, Sepe L, Boccia A, D'Agostino N, Passamano M, Di Nardo
    S, Tasco G, Casadio R, Paolella G. (2005) Systematic analysis of human kinase
    genes: a large number of genes and alternative splicing events result in functional
    and structural diversity. BMC Bioinformatics. 1;6 Suppl 4:S20.
5. D'Ursi P, Marino F, Caprera A, Milanesi L, Faioni EM, Rovida E. (2007) ProCMD:
    a database and 3D web resource for protein C mutants. BMC Bioinformatics. 8;8
    Suppl 1:S11
6. Viti F, Merelli I, Caprera A, Lazzari B, Stella A, Milanesi L. (2008) Ontology-
    based, Tissue MicroArray oriented, image centered tissue bank. BMC
    Bioinformatics. 25;9 Suppl 4:S4
Figure 1: The SysBio-Gateway access page. A short description of, and a direct link to, each
database held in the gateway are provided.