A Semantic Grid-based Data Access and Integration Service for Bioinformatics

Giovanni Aloisio, Massimo Cafaro, Italo Epicoco, Sandro Fiore, Maria Mirto
ISUFI/CACT, University of Lecce and NNL/INFM&CNR, Italy
{giovanni.aloisio, massimo.cafaro, italo.epicoco, sandro.fiore, maria.mirto}@unile.it

Abstract

Given the heterogeneous nature of biological data and their intensive use in many tools, in this paper we propose a semantic data access and integration (DAI) service, based on the Grid paradigm, for the bioinformatics domain. This service uses ontologies for correlating different data sets. The DAI proposed in this work is a fundamental component of the ProGenGrid system, a grid-enabled platform which aims at the design and implementation of a virtual laboratory where e-scientists can simulate complex “in silico” experiments by composing popular analysis and visualization tools (e.g. Blast and Rasmol), available as Web Services, into a workflow. The main goal of the DAI is to provide bioinformatics tools with advanced functionalities and data integration services for heterogeneous biological data banks, such as PDB and Swiss-Prot. A case study of our specialized data access service for locating similar protein sequences is presented.

Keywords: Bioinformatics, DAI, Ontologies, Web Services, Computational Grid, Grid Portal, Globus Toolkit.

1. Introduction

Complete genome sequences and protein-coding gene sets are becoming available for a growing number of organisms. While these are proving highly informative and invaluable for studying those and related organisms, at the same time they make it clear how far we still have to go before reaching an in-depth understanding of how a genome determines the lifestyle of an organism.

The increasing amount and complexity of biological data makes it increasingly difficult to access and analyse the data. These data, stored in different geographically spread repositories, are heterogeneous when we consider genomic, cellular, structure, phenotype and other types of biologically relevant information [1], and often describe the same objects using different representations: Swiss-Prot [2], for example, maps a protein just as an amino acid sequence, whereas the Protein Data Bank (PDB) [3] contains its 3D structure.

The semantic relation among these data repositories is a key factor for integration in bioinformatics, since it could allow a unique front end for accessing them, as required by many biological applications. An ontology could help here to localise the right type of concept to be searched for, as opposed to identifying a mere label naming a search table. It includes definitions of basic concepts in the domain and relations among them, which should be interpretable both by machines and humans. Moreover, biological repositories are often quite large and need to be updated for annotations or when new entries are added. To date, many tools exist for simulating complex “in silico” experiments, that is, simulations carried out using biological data, as opposed to “in vitro” or “in vivo” ones that are conducted respectively outside or inside a living organism or cell. These tools need to access heterogeneous data banks, distributed over a wide area, and in particular need a supporting infrastructure for successfully obtaining a result [4]. Many of these tools are freely available on the Internet, and there is plenty of software, such as EMBOSS [5] and SRS [6], for accessing different data banks.

SRS is the most widely used data integration system for biological, biochemical and biomedical databases. It enables users of all backgrounds to intuitively access data and permits internal data to be merged with data from the public domain. The most prominent public server at EBI (http://srs.ebi.ac.uk) currently holds more than 130 biological databases. A key problem with the current structure of SRS is that it is designed only for accessing local databases. This requires the SRS administrators to
• provide local copies of all the databases and
• keep these local copies continuously up to date.

This approach uses interconnected heterogeneous databases via web hypertext links at the level of individual data items. Data retrieval in such a system takes place by using the results of one query to link and jump to a particular entry in the same or another data source. However, most of the potential links among data in digital form are not readily available because the relevant data, when they exist, are in different databases. In addition, each database is typically based on different and incompatible database technologies and uses different languages and vocabularies to access data. These incompatibilities are especially significant when non-textual data, such as 3D images of protein structures, accessed by author-specified keywords, need to be linked with nucleotide sequences in other databases. Because each database is typically created as a standalone application to support one functionality, linking among databases is most often an afterthought. Using an integrated approach which considers the semantic meaning of data, it is possible to create links dynamically, as a search engine does.

To date, a de facto specialized data access service for bioinformatics, able to provide access to data and distributed tools, does not yet exist.

A data access service is involved in many biological experiments where workflow techniques are needed to assist the scientists in their design, execution and monitoring. Workflow Management Systems (WFMSs) support the enactment of processes by coordinating the temporal and logical order of the elementary process activities and supplying the data, resources and application systems necessary for the execution [7].

The Grid [8] framework is an optimal candidate for executing bioinformatics workflows because it offers the computational power for high throughput applications and basic services such as efficient mechanisms for transferring huge amounts of data and exchanging them over a secure channel.

So, bioinformatics platforms need to offer powerful and high level modelling techniques to ease the work of e-scientists, for instance by exploiting Computational Grids transparently and efficiently.

ProGenGrid (Proteomics and Genomics Grid) [9] is a software platform which integrates biological databases and analysis and visualization tools, available as Web Services, for supporting complex “in silico” experiments. The choice to couple Web Services [10] and Grid technologies produces components that are independent of programming language and platform and that exploit a grid infrastructure. ProGenGrid is based on the following key approaches: web/grid services, workflow, ontologies and data integration through the Grid.

In this paper we focus on the functions and architecture of a Data Access and Integration (DAI) service and its use inside the ProGenGrid platform. The use of the proposed DAI service in an experiment of searching similarity matching among proteins is presented. The outline of this paper is as follows: in Section 2, we describe the features of a bioinformatics DAI. In Section 3 we describe our DAI solution, whilst in Section 4 we show the role of the DAI in the ProGenGrid system. We conclude the paper in Section 5.

2. Why Bioinformatics Grids and Web Services?

2.1. Bioinformatics Grids

The interconnection of computers using Grid middleware enables the user to utilize computing power and retrieve information from heterogeneous and distributed sources transparently and efficiently. A Computational Grid could be a solution to many bioinformatics issues because it allows the deployment, distribution and management of needed biological software components, the harmonized standard integration of various software layers and services, a powerful, flexible policy definition, and control and negotiation mechanisms for a collaborative grid environment. This could reveal useful information for understanding the complex interrelation between genetic information and hereditary diseases and hence can lead to important discoveries in life science.

Bioinformatics Grids are environments built for the specific domain of biology, including the hardware and software resources needed for solving issues related to biological experiments and simulations. Some examples of Bioinformatics Grids are Asia Pacific BioGRID [11] and myGrid [12]; the former integrates selected biomolecular applications with the Unicore infrastructure, the latter provides high-level grid services for bioinformatics applications for data and application integration. These projects are very useful for the scientific community because new techniques for solving various bioinformatics issues are designed and tested in them.

2.2. Web Services

Web services describe an emerging XML-based distributed computing paradigm that differs from other approaches such as CORBA and Java RMI. The basic idea is to build a system out of existing Internet-based standards. Web services define the description of how to invoke service components, a protocol for conveying remote procedure calls (RPC, although Document style Web services can also be used), and the discovery mechanism for locating the service definition of relevant service providers. Web Services technology allows independence from platforms and programming languages, and reusability of the code.

2.3. Integrating Grid and Web Services technologies to enable DAI service

A data access and integration service includes key steps in the data life cycle process, such as data creation and acquisition, use, modification, archiving and disposal. This process involves many data banks (data providers) and the users/applications which use the data. Coupling the Grid framework and Web Services makes it possible to build a bioinformatics DAI service satisfying the following features:

Accessibility: ease of use, support for multiple data models and database abstractions; using a Grid framework it is possible to access a large set of resources and data efficiently. Through easy to use user interfaces that hide the complexity of accessing the Grid (the so called Grid Portals), the user can access a variety of grid services.

Capacity and archiving support: local and remote data storage capacity for the archival process, including space for expansion and annotation of the database; a Grid offers a huge amount of data storage capacity and efficient mechanisms to move the data between grid nodes.

Intellectual property, privacy and security: the first regards ownership of sequence data, images, and other data stored in and communicated through the database, the second is the provision for preserving confidentiality of data and the last is the limit on user access. Each user is recognized in a grid infrastructure through proper credentials to access her own data or run applications on the grid. Through a single sign-on the user first authenticates herself and then uses the resources for which she has permission rights (authorization process).

Interfaces: connectivity with other databases and applications; these represent the Web service interface to databases and application tools and are used either by the user or another service to send a query, to insert the parameters needed for the execution of a specific application and to obtain the results.

Portability on multiple platforms: using Web services technology it is possible to build platform independent components.

Performance: access time and data throughput; in particular, using the GridFTP [13] protocol it is possible to transfer huge amounts of data efficiently (through parallel streams).

However, there are other important issues of bioinformatics DAI that Grid and Web Services do not support, such as:

Metadata Management: it includes the design, implementation, and maintenance of the metadata associated with different data sets whose semantic meaning is described through a data dictionary or ontology;

Multiple data formats: support for various data formats such as flat file, FastA and XML;

Data input support: hardware, software, and processes involved in feeding data into the database, from keyboard and voice recognition to direct instrument feed and the Internet;

Export/Import capabilities: provisions for importing and exporting data to and from different file formats;

Indexing: indexing methodology, including selection and use of the most appropriate controlled vocabulary;

Query Language: proprietary or standard query language for supporting complex queries.

In the next Section, we will discuss our solution for an efficient DAI.

3. The ProGenGrid Data Access and Integration (DAI) Service

Our DAI has been designed to support the integration of biological data sources and high throughput applications such as Blast or drug design applications. It is also responsible for mapping high level requests (user requests) to low level queries, specific to each data source. These sources are in general not structured. In the following we describe this service in detail.

3.1. Data Integration

The main goal of data integration is to develop the technology to grant a user access to multiple information systems, to retrieve information and to perform computations transparently, as if they were a single source. The first complexity in achieving this goal is that the information sources are often independent and autonomous; they have completely different scheme structures and use different data formats. To provide uniform access, an integration system must therefore face the problem of data heterogeneity at the system, syntax and structural level. Moreover, there is a significant degree of semantic heterogeneity among different information sources. Unfortunately, the semantics of different data sources is hidden or unclear. The integration system [14] must provide a mechanism to bridge across this semantic difference. Current solutions involve a link-integrated database system and hence provide only partial, high-level integration with the growing number of rapidly expanding molecular biology databases. In Figure 1, we show an example of how Swiss-Prot and PDB are cross-referenced: Swiss-Prot identifies a protein with a proprietary identifier (P12544), but also contains the identifier used by PDB to identify the same protein (1HF1).

    Swiss-Prot entry (SwissProt.AC=P12544), Homo Sapiens (Human):
        DR PDB; 1HF1; 06-DEC-98
    PDB entry (PDB.ID=1HF1):
        MOL_ID: 1; MOLECULE: HANNUKA FACTOR (THEORETICAL MODEL) SERINE PROTEINASE

Fig. 1. Cross-referenced link between Swiss-Prot and PDB.

Another approach involves a data warehouse which combines data from a variety of databases in one physical location. It is very powerful for running queries against high volumes of data but it requires complex procedures for designing a global scheme and updating data.

The model that we propose is an extension of the middleware mediator approach [15], based on two-part middleware and on clients which formulate queries. The first part (called wrapper) sits on top of each data source and often performs two different functions: i) it translates the data into a common data model and ii) it takes a query fragment from the mediator and transforms it into an equivalent query in the query language of the source. The second part (called mediator engine), built on top of all of the wrappers, first decomposes a query into a set of sub-queries, one for each wrapper, then takes the partial results from the wrappers and constructs the final result.

There are mediator systems that provide a semantic bridge across information sources in complex application domains such as biology, for example TAMBIS [16] or BioDataServer [17], but these do not consider the integration of distributed data sources in a grid environment.

In this paper, we present an information integration system that follows the mediator architecture but extends it by incorporating domain specific bioinformatics knowledge in a grid environment.

As can be seen in Figure 2, our system is made of:
• Semantic Wrapper (SW): built on top of a data source, it includes
    i. Scheme, i.e. the (ER, Entity/Relation, or UML) data model of a source;
    ii. Ontology, that describes a specific data source;
    iii. Relations/associations, between the local ontology and the scheme;
    iv. APIs, for retrieving a specific attribute or field.
• Mapper: a catalog that gathers the schemes and their descriptions coming from each SW; it is used to identify the data source of a query and to select the appropriate wrapper;
• Data Source Ontology (DSO): it virtualises data sources and maps the semantic links between them;
• Mediator: i) given a user query, it searches semantic relations in the DSO and ii) consults the Mapper, reformulating the query and splitting it into sub-queries, each one specific to a data source.

Fig. 2. ProGenGrid DAI Architecture.

Regarding the Scheme (point i.), we have analysed the Swiss-Prot database (Figure 3 shows an entry) and we have built its E/R model. In particular, some entities (Figure 4) involved in the scheme are:
• Entry: composed of the ID (corresponds to the ID, IDentification, tag of Swiss-Prot), length (sequence length, which is the last field of the ID tag, 262 in the example of Fig. 3), seq (the SQ tag holds the sequence, i.e. TTCCP …), Descr (DE tag, description), AC (AC tag, accession number), CodGen (GN tag, codifying gene) and Keyw (KW tag, keywords) fields;
• Taxonomy: involves the ID, Name (OC tag, organism taxonomy) and Synonymous (OX tag, taxonomy through cross reference) fields;
• Reference: comprises the ID, Title, Year, Volume and Journal fields (the RN, RP, RC, RX, RA, RT, RL tags contain the bibliographic reference).

    ID GRAA_HUMAN STANDARD; PRT; 262 AA.
    AC P12544;
    DT 01-OCT-1989 (Rel. 12, Created)
    DT 01-OCT-1989 (Rel. 12, Last sequence update)
    DT 01-OCT-2004 (Rel. 45, Last annotation update)
    DE Granzyme A precursor (EC 3.4.21.78)
    GN Name=GZMA; Synonyms=CTLA3, HFSP;
    OS Homo sapiens (Human).
    OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata.
    OX NCBI_TaxID=9606;
    RN [1]
    RP SEQUENCE FROM N.A.
    RC TISSUE=T-cell;
    RX MEDLINE=88125000; PubMed=3257574;
    RA Gershenfeld H.K., Hershberger R.J., Shows T.B., Weissman I.L.;
    RT "Cloning and chromosomal assignment of a human cDNA"
    RL Proc. Natl. Acad. Sci. U.S.A. 85:1184-1188(1988).
    RL Proteins 4:190-204(1988).
    CC -!- FUNCTION: This enzyme is necessary for target cell
    CC lysis in cell-mediated immune responses. It cleaves after

Fig. 3. An entry of Swiss-Prot database.

    Entry:     ID, Length, Seq, Descr, AC, CodGen, Keyw
    Taxonomy:  ID, Name, Synonymous
    Reference: ID, Title, Year, Volume, Journal
    Relation:  IDentry, IDTaxonomy

Fig. 4. Subset of the scheme built for Swiss-Prot.

With regard to the ontology related to each data source (point ii.), it contains semantic relations between concepts described in the data source. In particular, Figure 5 shows a fragment of the ontology for Swiss-Prot, where some features of each protein (e.g. taxonomy, function etc.) are mapped. It is worth noting here that in this database some information is correlated, so using the E/R scheme and the ontology it is possible to explore all of the relations among data.

[Figure 5: ontology graph. Labels recoverable from the figure: gene, publications, sequence, organism, protein, function, linked by the relations reported, encodes, from, belongs, has.]

Fig. 5. Ontology for Swiss-Prot.

A possible relation among data obtained from scheme and ontology ties together entry and taxonomy through the associated IDentry and IDTaxonomy (point iii.). So IDTaxonomy corresponds to the organism terms in the ontology.

We would like to integrate the following databases:
• Structure: PDB and CATH [18];
• Sequence: Swiss-Prot;
• Function: ENZYME databases [19].

To build the SW component, we need to model each data source using an ER model and an ontology. In particular, we plan to use Gene Ontology [20] for collecting the ontologies needed for modelling the data of interest. The APIs indicated in point iv. (see the Semantic Wrapper description) are simple functions that allow binding and unbinding to/from the physical database, searching for a given attribute or moving between entries of the database. Moreover, these are needed for populating the relational scheme automatically. Indeed, for each analysed wrapper we have implemented in the C language some functions that translate the data source into an XML scheme and carry out the ingestion of the entire database into our relational data model. These features have been provided jointly with the GRelC library [21].

The Mapper contains a catalogue of data source schemes and a brief description. It is worth noting here that it contains the logical file name of the scheme associated with one or more physical file names (for instance, the EMBL databank has a relational, a flat file and an XML version, each corresponding to a Mapper entry).

The Data Source Ontology (DSO) classifies the data sources w.r.t. some features, providing a unified conceptual level representation of its registered component resources.

In the following we show how concepts in different ontologies are linked. As an example, the relation “polypeptide_chain(is_composed, SwissProt.sequence, PDB.sequence)” expresses the fact that polypeptide_chain is a sequence both in Swiss-Prot and in PDB. For the databases cited above we could consider the following classification for protein, where the first field is the relation and the other ones are the related attributes:

    protein(has, name, polypeptide_chain, function);
    polypeptide_chain(is_composed, SwissProt.sequence, PDB.sequence);
    PDB.sequence(has, PDB.3Dstructure);
    Cath.code(has, Cath.domain_def);
    PDB.3Dstructure(is_composed, Cath.domain_def);
    SwissProt.sequence(has, SwissProt.description, SwissProt.keywords);
    protein.function(is_composed, SwissProt.keywords);
    protein.function(is_composed, Enzyme.class);
    Enzyme.ECnumber(has, Enzyme.catal., Enzyme.class).

The Mediator accepts requests from the user and retrieves the information if the searched data (exploring the DSO) are semantically correlated. It is worth noting here that the Mediator should implement a logic in which queries can be defined at different abstraction levels (initially, we planned to use the standard SQL language, but now we are considering other hypotheses, providing a request virtualisation layer). The Mediator engine coordinates the temporal activities of all of the components, selecting those available on some nodes of a Computational Grid.

3.2. Implementation

The Mediator component provides some methods through a Web Services interface. The Web service server has been implemented in C, exploiting the gSOAP Toolkit [22], because it is well suited for the conversion of legacy applications using SOAP and its main feature is a transparent SOAP API. To guarantee a secure channel to move biological data, we also used the Globus Security Infrastructure (GSI) support, available through our gSOAP plug-in [23]. So, the Mediator Web Service (server) and clients can establish a SOAP connection over a secure GSI channel, exchanging X.509v3 certificates for mutual authentication/authorization and delegation. The Workflow editor has been implemented in Java, so in this system the client to the Web Service has been realized using Apache Axis and the GSS API.

Moreover, we are finishing the Wrapper APIs for the data banks cited above, to provide a set of primitives to get access to and interact transparently with different data sources. Finally, for high throughput applications we are investigating an approach based on our mechanism called SplitQuery, which provides an efficient fragmentation of the biological data set and a protocol for retrieving the fragments, as described in [24].

Currently, we are exploiting the Globus Toolkit 3.2 pre-OGSI [25] as Grid middleware in our project.

4. Case study: using DAI in a Workflow for searching sequence similarity

Recently, many workflow languages have been defined, such as Web Services Flow Language (WSFL) [26], Business Process Execution Language (BPEL) [27], and UML extensions. We use UML (Unified Modeling Language, [28]) activity diagrams as a workflow language specification. UML, as well as all of its extensions, is the most widely accepted notation for designing and understanding complex systems; it has an intuitive graphical notation, and UML activity diagrams support [29] most of the control flow constructs and are suitable to model workflow execution.

As an application of ProGenGrid, we present a workflow modelling the process of searching similarity matching among proteins. Figure 6 shows an activity diagram specification of the similarity search process. This process starts by supplying a target protein < IDProtein > or its FASTA format (in this example, the target protein is 1LYN); the search procedure accesses the database and all of the information about the target protein is recovered from the Swiss-Prot database.

Fig. 6. Workflow of a bioinformatics experiment of sequence comparison.

To date, we are using an SQL-like language for our experiment. In particular, given the input protein X (1LYN), and indicating with Yi, i ∈ {1, …, 200000}, a set of sequences extracted from Swiss-Prot, the following query first selects all sequences from Swiss-Prot whose alignment score is greater than a threshold value score, and then, using the sequence Accession Number, it selects from PDB the structural information related to such sequences:

    select Y.Structure from PDB where Y.AC in
        (select Y.AC from Swiss-Prot where align[blastP(X, Yi)] > score)

We have searched all of the ACs (Accession numbers) of the Swiss-Prot sequences that are similar to the input protein and hence satisfy a given score (applying the blastp tool [30]). Since the ACs of Swiss-Prot are present in PDB, we have selected the corresponding structure for visualizing it with the Rasmol tool [31]. Given a protein, its graphical representation can be compared with that of each similar protein produced by Blast.

We should express two considerations:
1. All of the tools used in the experiment are run on Grid nodes; for instance, for the visualization we have used the GRB library [32] to redirect the output of Rasmol to our desktop, using all of the features of this tool;
2. In the above query we have used a semantic join for characterizing the relation between Swiss-Prot and PDB.

In a simple experiment such as that described above, our data access service is fundamental to access the Swiss-Prot and PDB data banks to retrieve the data. In particular, an added value of our DAI service is related to the fact that, as protein sequences are retrieved from Swiss-Prot, their corresponding PDB versions (protein structures) can be recovered by using the information stored in the DAI schemes and ontologies, allowing PDB to be queried efficiently.

5. Conclusions

The large amount of data sets available today from geographically distributed storage sources is making data integration increasingly important. Integration of data demands significant advances in middleware; distributed infrastructures such as Grids and Web Services can be used for data integration.

In particular, coupling these with ontologies is a promising approach to model bioinformatics sources. In this paper we presented the architecture of a semantics-enriched Data Access and Integration service for biological databases. The proposed system extends the classical mediator approach in data integration by introducing domain ontologies in the description of data sources and exposing services through the Web Services approach. Compared to other approaches, our system uses Grid protocols such as GridFTP and GSI for fast and secure exchange of data.

In our architecture wrappers are created manually and added to the mediator by modifying its source code. We are now focusing our efforts on building a dynamic mediator through semantic mediation. It will allow using semantic information about data sources, such as query capabilities, data provenance, data scheme, etc. The main goal is to provide a method to add wrappers without source code modifications. A secondary goal is a tool for automatic wrapper generation.

Future work will regard the full implementation of the system and its use inside ProGenGrid, a grid-based, service-oriented software environment for bioinformatics applications.

6. References

[1] Fasman, K. H., Letovsky, S. I., Cottingham, R. W. and Kingsbury, D. T. (1996). Improvements to the GDB Human Genome Data Base. Nucleic Acids Res. 24, 57-63.
[2] B. Boeckmann, A. Bairoch, R. Apweiler, M. Blatter, A. Estreicher, E. Gasteiger, M. J. Martin, K. Michoud, C. O'Donovan, I. Phan, S. Pilbout, and M. Schneider. The Swiss-Prot protein knowledge base and its supplement TrEMBL. Nucleic Acids Research 31: 365-370 (2003). Site address: http://www.ebi.ac.uk/swissprot/.
[3] Bernstein, F. C., Koetzle, T. F., Williams, G. J. B., Meyer, E. F., Brice, M. D., Rodgers, J. R., Kennard, O., Shimanouchi, T. and Tasumi, M. (1977). The Protein Data Bank: a computer-based archival file for macromolecular structures. J. Mol. Biol. 112, 535-542.
[4] Özsu, M.T., & Valduriez, P. (1999). Principles of Distributed Database Systems, 2nd edition, Prentice Hall (Ed.), Upper Saddle River, NJ, USA.
[5] Rice, P., Longden, I. and Bleasby, A. "EMBOSS: The European Molecular Biology Open Software Suite". Trends in Genetics, June 2000, vol. 16, No. 6, pp. 276-277. Site address: http://www.ch.embnet.org/EMBOSS/.
[6] SRS Network Browser. Site address: http://www.ebi.ac.uk/srs/srsc/.
[7] WfMC. Workflow management coalition reference model. Site address: http://www.wfmc.org/.
[8] I. Foster, C. Kesselman: The Grid: Blueprint for a New Computing Infrastructure, Morgan Kaufmann (1998).
[9] G. Aloisio, M. Cafaro, S. Fiore, M. Mirto, "ProGenGrid: A Grid Framework for Bioinformatics". Proceedings of International Meeting on Computational Intelligence Methods for Bioinformatics and Biostatistics (CIBB 2004), September 14-15 2004, Perugia, Italy.
[10] Kreger, H. "Web Services Conceptual Architecture", WSCA 1.0. IBM, 2001.
[11] T.T. Wee, M.D. Silva, L.K. Siong, O.G. Sin, R. Buyya, and R. Godhia, "Asia Pacific BioGRID Initiative". Site address: http://www.apbionet.org/grid/docs/. Presentation Slides at APGrid Core Meeting, Phuket, 2002.
[12] myGrid Project, University of Manchester. Site address: http://mygrid.man.ac.uk/.
[13] GridFTP Protocol. Site address: http://www-fp.mcs.anl.gov/dsl/GridFTP-Protocol-RFC-Draft.pdf.
[14] M. Lenzerini. Data Integration: A Theoretical Perspective. In Proceedings of the 21st ACM SIGMOD-SIGACT-SIGART symposium of Principles of database systems (PODS), pp. 233-246. ACM Press, 2002.
[15] Wiederhold, G. "Mediators in the Architecture of Future Information Systems". IEEE Computer 1992; 25:38-49.
[16] Stevens et al. (2000). "TAMBIS: Transparent Access to Multiple Bioinformatics Information Sources". Bioinformatics, 16:2, pp. 184-186.
[17] Lange et al. (2001). "A Computational Support for Access to Integrated Molecular Biology Data". Site address: http://www.bioinfo.de/isb/gcb01/poster/lange.html#img-1.
[18] Orengo C.A., Michie A.D., Jones S., Jones D.T., Swindells M.B., Thornton J.M. "CATH - A Hierarchic Classification of Protein Domain Structures". Structure 1997; 5: 1093-1108.
[19] Bairoch, A. (1993). The ENZYME data bank. Nucleic Acids Res. 21, 3155-3156.
[20] The Gene Ontology Consortium. Gene Ontology: tool for the unification of biology. Nature Genet. 25: 25-29 (2000).
[21] Aloisio, G., Cafaro, M., Fiore, S., Mirto, M.: The GRelC Project: Towards GRID-DBMS, Proceedings of Parallel and Distributed Computing and Networks (PDCN) IASTED, pp. 1-7, Innsbruck (Austria), February 17-19 (2004). Site address: http://gandalf.unile.it.
[22] Van Engelen, R.A., Gallivan, K.A. "The gSOAP Toolkit for Web Services and Peer-To-Peer Computing Networks", Proceedings of IEEE CCGrid Conference, May 2002, Berlin, pp. 128-135.
[23] Aloisio, G., Cafaro, M., Lezzi, D., Van Engelen, R.A. "Secure Web Services with Globus GSI and gSOAP", Proceedings of Euro-Par 2003, 26th - 29th August 2003, Klagenfurt, Austria, Lecture Notes in Computer Science, Springer-Verlag, N. 2790, 421-426, 2003. Site address: http://sara.unile.it/~cafaro/gsi-plugin.html.
[24] G. Aloisio, M. Cafaro, S. Fiore, M. Mirto, "Bioinformatics Data Access Service in the ProGenGrid System". Proceedings of the First International Workshop on Grid Computing and its Application to Data Analysis (GADA 2004), October 25-29, Larnaca, Cyprus, Greece, OTM Workshop 2004, LNCS 3292, pp. 211-221, R. Meersman et al. (Eds.), 2004.
[25] I. Foster, C. Kesselman: Globus: A Metacomputing Infrastructure Toolkit, Intl J. Supercomputer Applications, Vol. 11, 1997, No. 2, pp. 115-128.
[26] IBM. Web services flow language - wsfl. Site address: http://www-306.ibm.com/software/solutions/webservices/pdf/WSFL.pdf
[27] IBM. Business process execution language for web services - bpel4ws. Site address: http://www-106.ibm.com/developerworks/webservices/library/ws-bpel/.
[28] OMG. Uml - unified modeling language: Extensions for workflow process definition. Site address: http://www.omg.org/uml/.
[29] R. Eshuis and R. Wieringa. Verification support for workflow design with UML activity graphs. In CSE02. Springer Verlag, 2002.
[30] Altschul, Stephen F., Gish Warren, Webb Miller, Eugene W. Myers, and David J. Lipman (1990). Basic local alignment search tool. J. Mol. Biol. 215:403-410.
[31] Roger A. Sayle and E. J. Milner-White, "RasMol: Biomolecular graphics for all", Trends in Biochemical Science (TIBS), September 1995, Vol. 20, No. 9, p. 374. Site address: http://www.umass.edu/microbio/rasmol/.
[32] Aloisio, G., Blasi, E., Cafaro, M., Epicoco, I. "The GRB library: Grid Computing with Globus in C", Proceedings HPCN Europe 2001, Amsterdam, Netherlands, Lecture Notes in Computer Science, Springer-Verlag, N. 2110, 133-140, 2001.
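
As an illustration of the semantic join used in the Section 4 query (select the Swiss-Prot accession numbers whose alignment score against the target exceeds a threshold, then follow their cross-references into PDB), the following minimal Python sketch may help. It is not the ProGenGrid implementation, which the paper describes as C code behind a gSOAP Web service. Everything named here is an assumption made for illustration: `SwissProtEntry`, `toy_score` and `semantic_join` are hypothetical names, `toy_score` is a crude stand-in for blastp (fraction of identical positions, no gaps), and all sequences, the accession numbers X00001 and X00002, and the PDB code 9ZZZ are invented toy data. Only the P12544 to 1HF1 cross-reference is taken from Fig. 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SwissProtEntry:
    ac: str          # accession number (AC tag)
    sequence: str    # amino acid sequence (SQ tag); toy values below
    pdb_ids: tuple   # PDB identifiers taken from "DR PDB; ..." cross-references


def toy_score(x: str, y: str) -> float:
    """Stand-in for blastp (assumption): fraction of identical positions."""
    if not x or not y:
        return 0.0
    matches = sum(a == b for a, b in zip(x, y))
    return matches / max(len(x), len(y))


def semantic_join(target: str, swissprot, pdb: dict, threshold: float) -> dict:
    """Return {AC: [structure, ...]} for entries scoring above the threshold."""
    result = {}
    for entry in swissprot:
        # Inner select: Swiss-Prot sequences similar enough to the target.
        if toy_score(target, entry.sequence) > threshold:
            # Outer select: follow the cross-references into PDB.
            structures = [pdb[i] for i in entry.pdb_ids if i in pdb]
            if structures:
                result[entry.ac] = structures
    return result


# Invented toy data; only the P12544 -> 1HF1 link comes from Fig. 1.
SWISSPROT = [
    SwissProtEntry("P12544", "MKLVTTCCP", ("1HF1",)),
    SwissProtEntry("X00001", "MKLVATCCP", ("9ZZZ",)),
    SwissProtEntry("X00002", "GGGGGGGGG", ()),
]
PDB = {
    "1HF1": "structure record for 1HF1",
    "9ZZZ": "structure record for 9ZZZ",
}

hits = semantic_join("MKLVTTCCP", SWISSPROT, PDB, threshold=0.8)
for ac, structures in hits.items():
    print(ac, structures)
```

In this sketch the join key is the cross-reference stored with each Swiss-Prot entry, which mirrors the role the paper assigns to the DAI schemes and ontologies; a real deployment would replace `toy_score` with an actual blastp run on a Grid node and the dictionaries with wrapper calls.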