A Semantic Grid-based Data Access and Integration Service for Bioinformatics
               Giovanni Aloisio, Massimo Cafaro, Italo Epicoco, Sandro Fiore, Maria Mirto
                     ISUFI/CACT, University of Lecce and NNL/INFM&CNR, Italy
                    {giovanni.aloisio, massimo.cafaro, italo.epicoco, sandro.fiore, maria.mirto}@unile.it


Abstract

Given the heterogeneous nature of biological data and their intensive use in many tools,
in this paper we propose a semantic data access and integration (DAI) service, based on
the Grid paradigm, for the bioinformatics domain. This service uses ontologies for
correlating different data sets. The DAI proposed in this work is a fundamental component
of the ProGenGrid system, a grid-enabled platform, which aims at the design and
implementation of a virtual laboratory where e-scientists can simulate complex “in silico”
experiments by composing popular analysis and visualization tools (e.g. Blast and Rasmol),
available as Web Services, into a workflow. The main goal of the DAI is to provide
bioinformatics tools with advanced functionalities and data integration services for
heterogeneous biological data banks, such as PDB and Swiss-Prot. A case study of our
specialized data access service for locating similar protein sequences is presented.

Keywords: Bioinformatics, DAI, Ontologies, Web Services, Computational Grid, Grid Portal,
Globus Toolkit.

1. Introduction

Complete genome sequences and protein-coding gene sets are becoming available for a
growing number of organisms. While these are proving highly informative and invaluable for
studying those and related organisms, at the same time they make it clear how far we still
have to go before reaching an in-depth understanding of how a genome determines the
lifestyle of an organism.

The increasing amount and complexity of biological data makes it increasingly difficult to
access and analyse the data. These data, stored in different geographically spread
repositories, are heterogeneous when we consider genomic, cellular, structure, phenotype
and other types of biologically relevant information [1], and often describe the same
objects using different representations: in Swiss-Prot [2] a protein is mapped just as an
amino acid sequence, whereas the Protein Data Bank (PDB) [3] contains its 3D structure.

The semantic relation among these data repositories is a key factor for integration in
bioinformatics, since it could allow a unique front end for accessing them, as required by
many biological applications. An ontology could help here to localise the right type of
concept to be searched for, as opposed to identifying a mere label naming a search table.
It includes definitions of basic concepts in the domain and relations among them, which
should be interpretable both by machines and humans.

Moreover, biological repositories are often quite large and need to be updated for
annotations or when new entries are added. To date, many tools exist for simulating complex
“in silico” experiments, that is, simulations carried out using biological data, as opposed
to “in vitro” or “in vivo” ones that are conducted respectively outside or inside a living
organism or cell. These tools need to access heterogeneous data banks, distributed over a
wide area, and in particular need a supporting infrastructure for successfully obtaining a
result [4]. Many of these tools are freely available on the Internet, and there is plenty
of software, such as EMBOSS [5] and SRS [6], for accessing different data banks.

SRS is the most widely used data integration system for biological, biochemical and
biomedical databases. It enables users of all backgrounds to intuitively access data and
permits internal data to be merged with data from the public domain. The most prominent
public server at EBI (http://srs.ebi.ac.uk) currently holds more than 130 biological
databases. A key problem with the current structure of SRS is that it is designed only for
accessing local databases. This requires the SRS administrators to provide local copies of
all the databases and to keep these local copies continuously up to date.

This approach interconnects heterogeneous databases via web hypertext links at the level of
individual data items. Data retrieval in such a system takes place by using the results of
one query to link and jump to a particular entry in the same or another data source.
However, most of the potential links among data in digital form are not readily available
because the relevant data, when they exist, are in different databases. In addition, each
database is typically based on different and incompatible database technologies and uses
different languages and vocabularies to access data. These incompatibilities are especially
significant when non-
textual data, such as 3D images of protein structures, accessed by author-specified
keywords, need to be linked with nucleotide sequences in other databases. Because each
database is typically created as a standalone application to support one functionality,
linking among databases is most often an afterthought. Using an integrated approach which
considers the semantic meaning of data, it is possible to create links dynamically, as a
search engine does.

To date, a de facto specialized data access service for bioinformatics, able to provide
access to data and distributed tools, does not yet exist.

A data access service is involved in many biological experiments where workflow techniques
are needed to assist scientists in their design, execution and monitoring. Workflow
Management Systems (WFMSs) support the enactment of processes by coordinating the temporal
and logical order of the elementary process activities and supplying the data, resources
and application systems necessary for the execution [7].

The Grid [8] framework is an optimal candidate for executing bioinformatics workflows
because it offers the computational power for high throughput applications and basic
services such as efficient mechanisms for transferring huge amounts of data and exchanging
them over a secure channel.

Bioinformatics platforms therefore need to offer powerful and high level modelling
techniques to ease the work of e-scientists, for instance by exploiting Computational Grids
transparently and efficiently.

ProGenGrid (Proteomics and Genomics Grid) [9] is a software platform which integrates
biological databases and analysis and visualization tools, available as Web Services, for
supporting complex “in silico” experiments. The choice to couple Web Services [10] and Grid
technologies produces components that are independent of programming language and platform
and that exploit a grid infrastructure. ProGenGrid is based on the following key
approaches: web/grid services, workflow, ontologies and data integration through the Grid.

In this paper we focus on the functions and architecture of a Data Access and Integration
(DAI) service and its use inside the ProGenGrid platform. The use of the proposed DAI
service in an experiment searching for similarity matches among proteins is presented. The
outline of this paper is as follows: in Section 2, we describe the features of a
bioinformatics DAI. In Section 3 we describe our DAI solution, whilst in Section 4 we show
the role of the DAI in the ProGenGrid system. We conclude the paper in Section 5.

2. Why Bioinformatics Grids and Web Services?

2.1. Bioinformatics Grids

The interconnection of computers using Grid middleware enables the user to utilize
computing power and retrieve information from heterogeneous and distributed sources
transparently and efficiently. A Computational Grid could be a solution to many
bioinformatics issues because it allows the deployment, distribution and management of
needed biological software components, the harmonized standard integration of various
software layers and services, a powerful, flexible policy definition, and control and
negotiation mechanisms for a collaborative grid environment. This could reveal useful
information for understanding the complex interrelation between genetic information and
hereditary diseases and hence can lead to important discoveries in life science.

Bioinformatics Grids are environments built for the specific domain of biology, including
the hardware and software resources needed for solving issues related to biological
experiments and simulations. Some examples of Bioinformatics Grids are Asia Pacific BioGRID
[11] and myGrid [12]; the former integrates selected biomolecular applications with the
Unicore infrastructure, the latter provides high-level grid services for data and
application integration in bioinformatics applications. These projects are very useful for
the scientific community because they allow new techniques for solving various
bioinformatics issues to be designed and tested.

2.2. Web Services

Web services describe an emerging XML-based distributed computing paradigm that differs
from other approaches such as CORBA and Java RMI. The basic idea is to build a system out
of existing Internet-based standards. Web services define the description of how to invoke
service components, a protocol for conveying remote procedure calls (RPC, although Document
style Web services can also be used), and the discovery mechanism for locating the service
definition of relevant service providers. Web Services technology allows independence from
platforms and programming languages, and reusability of the code.

2.3. Integrating Grid and Web Services technologies to enable DAI service

A data access and integration service includes key steps in the data life cycle process,
such as data creation and
acquisition, use, modification, archiving and disposal. This process involves many data
banks (data providers) and the users/applications which use the data. Coupling the Grid
framework and Web Services makes it possible to build a bioinformatics DAI service
satisfying the following features:

    Accessibility: ease of use, support for multiple data models and database abstractions;
using a Grid framework it is possible to access a large set of resources and data
efficiently. Through easy to use user interfaces that hide the complexity of accessing the
Grid (the so called Grid Portals), the user can access a variety of grid services.

    Capacity and archiving support: local and remote data storage capacity for the archival
process, including space for expansion and annotation of the database; a Grid offers a huge
amount of data storage capacity and efficient mechanisms to move the data between grid
nodes.

    Intellectual property, privacy and security: the first regards ownership of sequence
data, images, and other data stored in and communicated through the database, the second is
the provision for preserving confidentiality of data and the last is the limit on user
access. Each user is recognized in a grid infrastructure through proper credentials to
access her own data or run applications on the grid. Through a single sign-on the user
first authenticates herself and then uses the resources for which she has permission rights
(authorization process).

    Interfaces: connectivity with other databases and applications; these represent the Web
service interface to databases and application tools and are used either by the user or by
another service to send a query, to insert the parameters needed for the execution of a
specific application and to obtain the results.

    Portability on multiple platforms: using Web services technology it is possible to
build platform independent components.

    Performance: access time and data throughput; in particular, using the GridFTP [13]
protocol it is possible to transfer huge amounts of data efficiently through parallel
streams.

However, there are other important issues of bioinformatics DAI that Grid and Web Services
do not support, such as:

    Metadata Management: it includes the design, implementation, and maintenance of the
metadata associated with different data sets, whose semantic meaning is described through a
data dictionary or ontology;

    Multiple data formats: support for various data formats such as flat file, FastA and
XML;

    Data input support: hardware, software, and processes involved in feeding data into the
database, from keyboard and voice recognition to direct instrument feed and the Internet;

    Export/Import capabilities: provisions for importing and exporting data to and from
different file formats;

    Indexing: indexing methodology, including selection and use of the most appropriate
controlled vocabulary;

    Query Language: proprietary or standard query language for supporting complex queries.

In the next Section, we will discuss our solution for an efficient DAI.

3. The ProGenGrid Data Access and Integration (DAI) Service

Our DAI has been designed to support the integration of biological data sources and high
throughput applications such as Blast or drug design applications. It is also responsible
for mapping high level requests (user requests) to low level queries, specific to each data
source. Such sources are in general not structured. In the following we describe this
service in detail.

3.1. Data Integration

The main goal of data integration is to develop the technology to grant a user access to
multiple information systems, to retrieve information and to perform computations
transparently as if they were a single source. The first complexity in achieving this goal
is that the information sources are often independent and autonomous: they have completely
different scheme structures and use different data formats. To provide uniform access, an
integration system must therefore face the problem of data heterogeneity at the system,
syntax and structural level. Moreover, there is a significant degree of semantic
heterogeneity among different information sources. Unfortunately, the semantics of
different data sources is hidden or unclear. The integration system [14] must provide a
mechanism to bridge across this semantic difference. Current solutions involve a
link-integrated database system and hence provide only partial, high-level integration with
the growing number of rapidly expanding molecular biology databases. In Figure 1, we show
an example of how Swiss-Prot and PDB are cross-referenced: Swiss-Prot identifies a protein
with a proprietary identifier (P12544), but also contains the identifier used by PDB to
identify the same protein (1HF1).

Another approach involves a data warehouse which combines data from a variety of databases
in one physical location. It is very powerful for running queries against high volumes of
data but it requires complex procedures for designing a global scheme and updating data.

The model that we propose is an extension of the middleware mediator approach [15], based
on two-part
middleware and on clients which formulate queries. The first part (called wrapper) sits on
top of each data source and often performs two different functions: i) it translates the
data into a common data model and ii) it takes a query fragment from the mediator and
transforms it into an equivalent query in the query language of the source. The second part
(called mediator engine), built on top of all of the wrappers, first decomposes a query
into a set of sub-queries for each wrapper, then takes the partial results from the
wrappers and constructs the final result.

   SwissProt.AC=P12544
   Homo Sapiens Human
   DR PDB; 1HF1; 06-DEC-98

   PDB.ID=1HF1
   MOL_ID:1;
   MOLECULE: HANNUKA FACTOR
   (THEORETICAL MODEL) SERINE PROTEINASE

Fig. 1. Cross-referenced link between Swiss-Prot and PDB.

There are mediator systems that provide a semantic bridge across information sources in
complex application domains such as biology, for example TAMBIS [16] and BioDataServer
[17], but these do not consider the integration of distributed data sources in a grid
environment.

In this paper, we present an information integration system that follows the mediator
architecture but extends it by incorporating domain specific bioinformatics knowledge in a
grid environment.

As can be seen in Figure 2, our system is made of:
     • Semantic Wrapper (SW), built on top of a data source; it includes
         i. Scheme, i.e. the (ER, Entity/Relation, or UML) data model of a source;
        ii. Ontology, that describes a specific data source;
       iii. Relations/associations, between the local ontology and the scheme;
        iv. APIs, for retrieving a specific attribute or field.
     • Mapper, a catalog that gathers the schemes and their descriptions coming from each
       SW; it is used to identify the data source of a query and to select the appropriate
       wrapper;
     • Data Source Ontology (DSO): it virtualises data sources and maps the semantic links
       between them;
     • Mediator, which i) given a user query, searches semantic relations in the DSO and
       ii) consults the Mapper, reformulating the query and splitting it into sub-queries,
       each one specific to a data source.

Regarding the Scheme (point i.), we have analysed the Swiss-Prot database (Figure 3 shows
an entry) and we have built its E/R model. In particular, some entities (Figure 4) involved
in the scheme are:
     • Entry: composed of the fields ID (corresponds to the ID, IDentification, tag of
       Swiss-Prot), length (sequence length, which is the last field of the ID tag, 262 in
       the example of Fig. 3), seq (SQ involves the sequence i.e. TTCCP …), Descr (DE tag,
       description), AC (AC tag, accession number), CodGen (GN tag, codifying gene) and
       Keyw (KW tag, keywords);
     • Taxonomy: involves the fields ID, Name (OC tag, organism taxonomy) and Synonymous
       (OX tag, taxonomy through cross reference);
     • Reference: comprises the fields ID, Title, Year, Volume and Journal (the
       RN, RP, RC, RX, RA, RT, RL tags contain the bibliographic reference).

With regard to the ontology related to each data source (point ii.), it contains semantic
relations between concepts described in the data source. In particular, Figure 5 shows a
fragment of the ontology for Swiss-Prot, where some features of each protein (e.g.
taxonomy, function etc.) are mapped. It is worth noting here that in this database some
information is correlated, so using the E/R scheme and the ontology it is possible to trace
all of the relations among data.

A possible relation among data obtained from scheme and ontology ties together entry and
taxonomy through the associated IDentry and IDTaxonomy (point iii.). IDTaxonomy thus
corresponds to the organism terms in the ontology.

We would like to integrate the following databases:
     • Structure: PDB and CATH [18];
     • Sequence: Swiss-Prot;
     • Function: ENZYME databases [19].

To build the SW component, we need to model each data source using an ER model and an
ontology. In particular, we plan to use Gene Ontology [20] for collecting the ontologies
needed for modelling the data of interest. The APIs indicated in point iv. (see the
Semantic Wrapper description) are simple functions that allow binding and unbinding to/from
the physical database, searching for a given attribute and moving between entries of the
database. Moreover, these are needed for populating the relational scheme automatically.
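
As a rough illustration of how the components listed above could fit together, the following Python sketch models a wrapper that parses a Swiss-Prot flat-file entry into a few Entry fields, a Mapper that records which wrapper serves which attribute, and a Mediator that splits a request into one sub-query per source. This is not the ProGenGrid implementation: all class names, method names and the in-memory stand-in for PDB are hypothetical, and only the record lines are taken from the Swiss-Prot and PDB examples in this paper.

```python
# Record lines taken from the paper's example entry (AC P12544).
SWISSPROT_FLAT = """\
ID GRAA_HUMAN STANDARD; PRT; 262 AA.
AC P12544;
DE Granzyme A precursor (EC 3.4.21.78)
GN Name=GZMA; Synonyms=CTLA3, HFSP;
DR PDB; 1HF1; 06-DEC-98
//
"""


class SwissProtWrapper:
    """Hypothetical Semantic Wrapper: flat file -> common record model."""
    name = "Swiss-Prot"
    attributes = {"ID", "Length", "Descr", "AC", "CodGen", "PDB_ID"}

    def __init__(self, flat_text):
        self.entries = {}  # accession number -> record
        record = {}
        for line in flat_text.splitlines():
            if line.startswith("//"):  # end-of-entry marker
                self.entries[record["AC"]] = record
                record = {}
                continue
            tag, _, value = line.partition(" ")
            value = value.strip()
            if tag == "ID":
                parts = value.split()
                record["ID"] = parts[0]
                record["Length"] = int(parts[-2])  # "... 262 AA."
            elif tag == "AC":
                record["AC"] = value.rstrip(";")
            elif tag == "DE":
                record["Descr"] = value
            elif tag == "GN":
                record["CodGen"] = value
            elif tag == "DR" and value.startswith("PDB;"):
                record["PDB_ID"] = value.split(";")[1].strip()

    def get(self, key, attribute):
        return self.entries[key][attribute]


class PdbWrapper:
    """Hypothetical wrapper over an in-memory stand-in for PDB."""
    name = "PDB"
    attributes = {"Molecule"}

    def __init__(self, table):
        self.table = table

    def get(self, key, attribute):
        return self.table[key][attribute]


class Mapper:
    """Catalogue mapping each attribute to the wrapper that serves it."""

    def __init__(self, wrappers):
        self.catalog = {a: w for w in wrappers for a in w.attributes}

    def wrapper_for(self, attribute):
        return self.catalog[attribute]


class Mediator:
    """Splits a request into per-source sub-queries and merges the results."""

    def __init__(self, mapper, swissprot):
        self.mapper = mapper
        self.swissprot = swissprot

    def query(self, accession, attributes):
        sub_queries = {}  # source name -> (wrapper, requested attributes)
        for attr in attributes:
            wrapper = self.mapper.wrapper_for(attr)
            sub_queries.setdefault(wrapper.name, (wrapper, []))[1].append(attr)
        result = {}
        for source, (wrapper, attrs) in sub_queries.items():
            # The Swiss-Prot -> PDB cross-reference stands in for a DSO link.
            key = accession if source == "Swiss-Prot" else self.swissprot.get(accession, "PDB_ID")
            for attr in attrs:
                result[attr] = wrapper.get(key, attr)
        return result


sp = SwissProtWrapper(SWISSPROT_FLAT)
pdb = PdbWrapper({"1HF1": {"Molecule": "HANNUKA FACTOR"}})
mediator = Mediator(Mapper([sp, pdb]), sp)
answer = mediator.query("P12544", ["Descr", "Length", "Molecule"])
print(answer)
```

In this sketch the semantic link between sources is reduced to a single cross-reference lookup; a Data Source Ontology as described in the text would generalise that step to arbitrary relations between concepts.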
                                              Fig. 2. ProGenGrid DAI Architecture.

Indeed, for each analysed wrapper we have implemented in C some functions that translate
the data source into an XML scheme and carry out the ingestion of the entire database into
our relational data model. These features have been provided jointly with the GRelC library
[21].

  ID GRAA_HUMAN STANDARD; PRT; 262 AA.
  AC P12544;
  DT 01-OCT-1989 (Rel. 12, Created)
  DT 01-OCT-1989 (Rel. 12, Last sequence update)
  DT 01-OCT-2004 (Rel. 45, Last annotation update)
  DE Granzyme A precursor (EC 3.4.21.78)
  GN Name=GZMA; Synonyms=CTLA3, HFSP;
  OS Homo sapiens (Human).
  OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata.
  OX NCBI_TaxID=9606;
  RN [1]
  RP SEQUENCE FROM N.A.
  RC TISSUE=T-cell;
  RX MEDLINE=88125000; PubMed=3257574;
  RA Gershenfeld H.K., Hershberger R.J., Shows T.B., Weissman I.L.;
  RT "Cloning and chromosomal assignment of a human cDNA"
  RL Proc. Natl. Acad. Sci. U.S.A. 85:1184-1188(1988).
  RL Proteins 4:190-204(1988).
  CC -!- FUNCTION: This enzyme is necessary for target cell
  CC lysis in cell-mediated immune responses. It cleaves after

           Fig. 3. An entry of Swiss-Prot database.

The Mapper contains a catalogue of data source schemes and a brief description of each. It
is worth noting here that it contains the logical file name of the scheme, associated with
one or more physical file names (for instance, the EMBL databank has relational, flat-file
and XML versions, each corresponding to a Mapper entry).

The Data Source Ontology (DSO) classifies the data sources with respect to some features,
providing a unified conceptual level representation of its registered component resources.

In the following we show how concepts in different ontologies are linked. As an example,
the relation “polypeptide_chain(is_composed, SwissProt.sequence, PDB.sequence)” expresses
the fact that a polypeptide_chain can be either a sequence in Swiss-Prot or one in PDB. For
the databases cited above we could consider the following classification for protein, where
the first field is the relation and the other ones are the related attributes:

  protein(has, name, polypeptide_chain, function);
  polypeptide_chain(is_composed, SwissProt.sequence, PDB.sequence);
  PDB.sequence(has, PDB.3Dstructure);
  Cath.code(has, Cath.domain_def);
  PDB.3Dstructure(is_composed, Cath.domain_def);
  SwissProt.sequence(has, SwissProt.description, SwissProt.keywords);
  protein.function(is_composed, SwissProt.keywords);
  protein.function(is_composed, Enzyme.class);
  Enzyme.ECnumber(has, Enzyme.catal., Enzyme.class).

  Entry:      ID | Length | Seq | Descr | AC | CodGen | Keyw
  Taxonomy:   ID | Name | Synonymous
  Reference:  ID | Title | Year | Volume | Journal
  Relation:   IDentry | IDTaxonomy

    Fig. 4. Subset of the scheme built for Swiss-Prot.

  [Figure: ontology graph with the concepts gene, sequence, organism, publications, protein
  and function, connected by the relations encodes, reported, from, belongs and has.]

              Fig. 5. Ontology for Swiss-Prot.

The Mediator accepts requests from the user and, by exploring the DSO, retrieves the
information if the searched data are semantically correlated. It is worth noting here that
the Mediator should implement a logic in which queries can be defined at different
abstraction levels (initially, we planned to use the SQL standard language, but we are now
considering other hypotheses that provide a request virtualisation layer). The Mediator
engine coordinates the temporal activities of all of the components, selecting those
available on some nodes of a Computational Grid.

3.2. Implementation

The Mediator component provides some methods through a Web Services interface. The Web
service server has been implemented in C, exploiting the gSOAP Toolkit [22], because it is
well suited for the conversion of legacy applications using SOAP and its main feature is a
transparent SOAP API. To guarantee a secure channel for moving biological data, we also
used the Globus Security Infrastructure (GSI) support, available through our gSOAP plug-in
[23]. The Mediator Web Service (server) and clients can thus establish a SOAP connection
over a secure GSI channel, exchanging X.509v3 certificates for mutual
authentication/authorization and delegation. The Workflow editor has been implemented in
Java, so in this system the client of the Web Service has been realized using Apache Axis
and the GSS API.

Moreover, we are finishing the Wrapper APIs for the data banks cited above, to provide a
set of primitives to access and interact transparently with different data sources.
Finally, for high throughput applications we are investigating an approach based on our
mechanism called SplitQuery, which provides an efficient fragmentation of the biological
data set and a protocol for retrieving the fragments, as described in [24].

Currently, we are exploiting the Globus Toolkit 3.2 pre-OGSI [25] as Grid middleware in our
project.

4. Case study: using DAI in a Workflow for searching sequence similarity

Recently, many workflow languages have been defined, such as Web Services Flow Language
(WSFL) [26], Business Process Execution Language (BPEL) [27], and UML extensions. We use
UML (Unified Modeling Language, [28]) activity diagrams as a workflow specification
language. UML, as well as all of its extensions, is the most widely accepted notation for
designing and understanding complex systems; it has an intuitive graphical notation, and
UML activity diagrams support [29] most of the control flow constructs and are suitable for
modelling workflow execution.

As an application of ProGenGrid, we present a workflow modelling the process of searching
for similarity matches among proteins. Figure 6 shows an activity diagram specification of
the similarity search process. This process starts by supplying a target protein
<IDProtein> or its FASTA format (in this example, the target protein is 1LYN); the search
procedure then accesses the database and all of the information about the target protein is
recovered from the Swiss-Prot database.

To date, we are using an SQL-like language for our experiment. In particular, given the
input protein X (1LYN), and indicating with Yi, i ∈ (1, …, 200000), a set of sequences
extracted from Swiss-Prot, the following query first selects all sequences from Swiss-Prot
whose alignment score is greater than a threshold value score, and then, using the sequence
Accession Number, it selects from PDB the structural information related to such sequences:
                           Fig. 6. Workflow of a bioinformatics experiment of sequence comparison.
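
The two-step selection performed by the query can also be read procedurally. The following is a minimal Python sketch, not the ProGenGrid implementation: it assumes that the alignment scores against the input protein have already been computed (for instance by blastp) and that both data banks are reduced to small in-memory tables. The function name `semantic_join`, the record layout, the accession numbers and the scores are all hypothetical and serve only as illustration.

```python
from typing import Dict, List


def semantic_join(
    scores: Dict[str, float],  # Swiss-Prot AC -> alignment score against X (assumed precomputed)
    pdb: Dict[str, str],       # PDB table keyed by the same AC -> structure identifier
    threshold: float,
) -> List[str]:
    """Return the structures of all sequences whose score exceeds the threshold."""
    # Inner selection: ACs from Swiss-Prot whose alignment score is above the threshold.
    similar_acs = [ac for ac, score in scores.items() if score > threshold]
    # Outer selection: structures from PDB whose AC belongs to that set.
    return [pdb[ac] for ac in similar_acs if ac in pdb]


# Toy data, invented for illustration only.
scores = {"AC0001": 812.0, "AC0002": 95.5, "AC0003": 430.0}
pdb = {"AC0001": "structure-A", "AC0003": "structure-C"}

print(semantic_join(scores, pdb, threshold=100.0))
```

In this sketch the accession number is the only key shared by the two tables, which is what lets the Swiss-Prot result drive the PDB lookup; sequences without a PDB entry are simply skipped.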


select Y.Structure from PDB where Y.AC in (select Y.AC
from Swiss-Prot where align[blastP(X, Yi)] > score)

   We have searched for all of the ACs (Accession Numbers) of the Swiss-Prot sequences that are similar to the input protein, that is, those whose alignment score computed with the blastp tool [30] exceeds the given threshold. Since the Swiss-Prot ACs are also present in PDB, we have selected the corresponding structures and visualized them with the Rasmol tool [31]. Given a protein, its graphical representation can thus be compared with that of each similar protein produced by Blast.
   Two considerations are in order:
     1. All of the tools used in the experiment run on Grid nodes; for instance, for the visualization we have used the GRB library [32] to redirect the output of Rasmol to our desktop, retaining all of the features of this tool;
     2. In the above query we have used a semantic join to characterize the relation between Swiss-Prot and PDB.
   Even in a simple experiment such as the one described above, our data access service is fundamental for accessing the Swiss-Prot and PDB data banks and retrieving the data. In particular, an added value of our DAI service is that, as protein sequences are retrieved from Swiss-Prot, their corresponding PDB entries (protein structures) can be recovered by using the information stored in the DAI schemes and ontologies, allowing PDB to be queried efficiently.

     5. Conclusions

   The large number of data sets available today from geographically distributed storage sources is making data integration increasingly important. Integration of data demands significant advances in middleware; distributed infrastructures such as Grids and Web Services can be used for data integration. In particular, coupling these with ontologies is a promising approach to modelling bioinformatics sources. In this paper we presented the architecture of a semantics-enriched Data Access and Integration service for biological databases. The proposed system extends the classical mediator approach to data integration by introducing domain ontologies in the description of data sources and by exposing services through the Web Services approach. Compared to other approaches, our system uses Grid protocols such as GridFTP and GSI for fast and secure exchange of data.
   In our architecture, wrappers are created manually and added to the mediator by modifying its source code. We are now focusing our efforts on building a dynamic mediator through semantic mediation. It will allow the use of semantic information about data sources, such as query capabilities, data provenance, data scheme, etc. The main goal is to provide a method to add wrappers without
source code modifications. A secondary goal is a tool for automatic wrapper generation.
   Future work will address the full implementation of the system and its use inside ProGenGrid, a Grid-based, service-oriented software environment for bioinformatics applications.

     6. References

[1] Fasman, K. H., Letovsky, S. I., Cottingham, R. W. and Kingsbury, D. T. (1996). Improvements to the GDB Human Genome Data Base. Nucleic Acids Res. 24, 57-63.
[2] B. Boeckmann, A. Bairoch, R. Apweiler, M. Blatter, A. Estreicher, E. Gasteiger, M. J. Martin, K. Michoud, C. O'Donovan, I. Phan, S. Pilbout, and M. Schneider. The Swiss-Prot protein knowledge base and its supplement TrEMBL. Nucleic Acids Research 31: 365-370 (2003). Site address: http://www.ebi.ac.uk/swissprot/.
[3] Bernstein, F. C., Koetzle, T. F., Williams, G. J. B., Meyer, E. F., Brice, M. D., Rodgers, J. R., Kennard, O., Shimanouchi, T. and Tasumi, M. (1977). The Protein Data Bank: a computer-based archival file for macromolecular structures. J. Mol. Biol. 112, 535-542.
[4] Özsu, M.T., & Valduriez, P. (1999). Principles of Distributed Database Systems, 2nd edition, Prentice Hall (Ed.), Upper Saddle River, NJ, USA.
[5] Rice, P., Longden, I. and Bleasby, A. "EMBOSS: The European Molecular Biology Open Software Suite". Trends in Genetics, June 2000, Vol. 16, No. 6, pp. 276-277. Site address: http://www.ch.embnet.org/EMBOSS/.
[6] SRS Network Browser. Site address: http://www.ebi.ac.uk/srs/srsc/.
[7] WfMC. Workflow management coalition reference model. Site address: http://www.wfmc.org/.
[8] I. Foster, C. Kesselman: The Grid: Blueprint for a New Computing Infrastructure, Published by Morgan Kaufmann (1998).
[9] G. Aloisio, M. Cafaro, S. Fiore, M. Mirto, “ProGenGrid: A Grid Framework for Bioinformatics”. Proceedings of International Meeting on Computational Intelligence Methods for Bioinformatics and Biostatistics (CIBB 2004), September 14-15 2004, Perugia, Italy.
[10] Kreger, H. “Web Services Conceptual Architecture”, WSCA 1.0. IBM, 2001.
[11] T.T. Wee, M.D. Silva, L.K. Siong, O.G. Sin, R. Buyya, and R. Godhia, “Asia Pacific BioGRID Initiative”. Site address: http://www.apbionet.org/grid/docs/. Presentation slides at APGrid Core Meeting, Phuket, 2002.
[12] myGrid Project, University of Manchester. Site address: http://mygrid.man.ac.uk/.
[13] GridFTP Protocol. Site address: http://www-fp.mcs.anl.gov/dsl/GridFTP-Protocol-RFC-Draft.pdf.
[14] M. Lenzerini. Data Integration: A Theoretical Perspective. In Proceedings of the 21st ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS), pp. 233-246. ACM Press, 2002.
[15] Wiederhold G. “Mediators in the Architecture of Future Information Systems”. IEEE Computer 1992; 25:38-49.
[16] Stevens et al. (2000). “TAMBIS: Transparent Access to Multiple Bioinformatics Information Sources”. Bioinformatics, 16(2), pp. 184-186.
[17] Lange et al. (2001). “A Computational Support for Access to Integrated Molecular Biology Data”. Site address: http://www.bioinfo.de/isb/gcb01/poster/lange.html#img-1.
[18] Orengo C.A., Michie A.D., Jones S., Jones D.T., Swindells M.B., Thornton J.M. “CATH – A Hierarchic Classification of Protein Domain Structures”. Structure 1997; 5: 1093-1108.
[19] Bairoch, A. (1993). The ENZYME data bank. Nucleic Acids Res. 21, 3155-3156.
[20] The Gene Ontology Consortium. Gene Ontology: tool for the unification of biology. Nature Genet. 25: 25-29 (2000).
[21] Aloisio, G., Cafaro, M., Fiore, S., Mirto, M.: The GRelC Project: Towards GRID-DBMS, Proceedings of Parallel and Distributed Computing and Networks (PDCN) IASTED, pp. 1-7, Innsbruck (Austria), February 17-19 (2004). Site address: http://gandalf.unile.it.
[22] Van Engelen, R.A., Gallivan, K.A. “The gSOAP Toolkit for Web Services and Peer-To-Peer Computing Networks”, Proceedings of IEEE CCGrid Conference, May 2002, Berlin, pp. 128-135.
[23] Aloisio, G., Cafaro, M., Lezzi, D., Van Engelen, R.A. "Secure Web Services with Globus GSI and gSOAP", Proceedings of Euro-Par 2003, 26th - 29th August 2003, Klagenfurt, Austria, Lecture Notes in Computer Science, Springer-Verlag, N. 2790, 421-426, 2003. Site address: http://sara.unile.it/~cafaro/gsi-plugin.html.
[24] G. Aloisio, M. Cafaro, S. Fiore, M. Mirto, “Bioinformatics Data Access Service in the ProGenGrid System”. Proceedings of the First International Workshop on Grid Computing and its Application to Data Analysis (GADA 2004), October 25-29, Larnaca, Cyprus, Greece, OTM Workshop 2004, LNCS 3292, pp. 211-221, R. Meersman et al. (Eds.), 2004.
[25] I. Foster, C. Kesselman: Globus: A Metacomputing Infrastructure Toolkit, Intl J. Supercomputer Applications, Vol. 11, 1997, No. 2, pp. 115-128.
[26] IBM. Web services flow language - wsfl. Site address: http://www-306.ibm.com/software/solutions/webservices/pdf/WSFL.pdf
[27] IBM. Business process execution language for web services - bpel4ws. Site address: http://www-106.ibm.com/developerworks/webservices/library/ws-bpel/.
[28] OMG. UML - unified modeling language: Extensions for workflow process definition. Site address: http://www.omg.org/uml/.
[29] R. Eshuis and R. Wieringa. Verification support for workflow design with UML activity graphs. In CSE02. Springer Verlag, 2002.
[30] Altschul, Stephen F., Warren Gish, Webb Miller, Eugene W. Myers, and David J. Lipman (1990). Basic local alignment search tool. J. Mol. Biol. 215:403-410.
[31] Roger A. Sayle and E. J. Milner-White, "RasMol: Biomolecular graphics for all", Trends in Biochemical Science (TIBS), September 1995, Vol. 20, No. 9, p. 374. Site address: http://www.umass.edu/microbio/rasmol/.
[32] Aloisio, G., Blasi, E., Cafaro, M., Epicoco, I. “The GRB library: Grid Computing with Globus in C”, Proceedings HPCN Europe 2001, Amsterdam, Netherlands, Lecture Notes in Computer Science, Springer-Verlag, N. 2110, 133-140, 2001.