IWPLS’09 Pages 1–8 Towards Realization of Scientific Dataspaces for the Breath Gas Analysis Research Community Ibrahim Elsayed 1,∗, Thomas Ludescher 2 , Konrad Schwarz 3 , Thomas Feilhauer 2 , Anton Amann 3 , and Peter Brezany 1 1 Department of Scientific Computing, Faculty of Computer Science, University of Vienna, Nordbergstrasse 15/C/3, 1090 Vienna, Austria 2 Research Center for Process and Product Engineering, Vorarlberg University of Applied Sciences, Hochschulstrasse 1, 6850 Dornbirn, Austria 3 Department of Operative Medicine, Innsbruck Medical University, Innsbruck, Austria and Breath Research Institute of the Austrian Academy of Sciences, Innsbruck, Austria September, 2009. Associate Editors: Sandra Gesing and Jano van Hemert ABSTRACT and Research1 . The Austrian Grid consortium combines Austria’s Motivation: Scientific dataspaces aim at providing associated leading researchers in advanced computing technologies with mechanisms for managing semantically rich relationships among well-recognized partners in grid-dependant application areas. An scientific data sources (primary data) and its corresponding findings overview of the Austrian Grid project is provided in 4. In 2007, three (derived data). The latter result from a set of activities defining partners of the Austrian Grid consortium have started a research concrete preprocessing and analysis methods (background data) collaboration with the intention to realize a Secure Infrastructure that were applied to a source dataset. To keep track of scientific for Scientific Data Life Cycle Management on top of applications in experiments that are being conducted by members of a scientific the field of breath gas analysis. This project is described in 6. That community these experiments should be linked with user information paper presents motivation, an overview of the architecture, and the i.e. institutional affiliation, email address, working field, etc. 
of the mechanisms involved in providing such a secure infrastructure. It scientist who conducted the experiment. This paper deals with the also introduces use case scenarios depicting the current sequence development of a scientific dataspace for the breath gas analysis of events in conducting breath gas analysis experiments. Our community. Breath gas analysis is an emerging new scientific field data management approach is based on the Grid and dataspace with a growing international scientific community addressing many concepts and a new positioning of the MATLAB R language and different breath gas studies in terms of investigating and screening computing environment. The Grid is an infrastructure that enables for hundreds of compounds in exhaled breath gas. The purpose of flexible, secure, and coordinated resource sharing among dynamic breath gas analysis scientific dataspaces is to enable collaborating collections of individuals and institutions 11. The idea of a future scientists and institutions several important activities, which include: data management paradigm called dataspace was introduced by (a) access to distributed breath gas data and analytical resources Franklin et al. 12 and also addressed by authors of this paper in collected and developed at different research institutions around the 8, 9. The goal is to manage a dataspace, rather than a database or world and (b) to easily contribute to and leverage the resources of an other dataset type. Dataspaces are modeled as participants (datasets) international- and national-scale, multi-institutional environment. and relationships. The concepts of a scientific dataspace paradigm Results: This paper describes the conception and prototypical are described in 9. It introduces a specific model of the e-Science implementation of the scientific dataspace paradigm for breath gas life cycle, which the authors defined as: analysis. 
The scientific dataspace is evaluated on top of applications from breath gas analysis. ... a domain independent ontology-based iterative metamodel, Availability: We discuss the scientific dataspace paradigm in the tracing semantics about procedures in e-Science applications. context of applications from breath gas analysis, however the Iterations of the model - so called e-Science life cycles - concepts introduced can also be deployed in other research domains. organized as instances of the e-Science life cycle ontology, Contact: elsayed@par.univie.ac.at are feeding a dataspace, allowing the dataspace to evolve and grow into a valuable, intelligent, and semantically rich space of scientific data. The major role of the e-Science life cycle ontology 15 is 1 INTRODUCTION to describe and semantically enrich the existing relationship The project Austrian Grid 2 is the second phase of the national among primary, background and derived datasets in e-Science grid initiative funded by the Austrian Federal Ministry of Science 1 BMWF (Federal Ministry of Science and Research) Contract: GZ BMWF- ∗ elsayed@par.univie.ac.at 10.220/0002-II/10/2007 www.austriangrid.at 2009 1 Elsayed et al. applications. Scientific experiments described by the e-Science life days or weeks, depending on computational and human resources cycle ontology are referred to as Life Cycle Resources (LCR). available. However, the resulting derived data, that have arisen This paper deals with an implementation of a scientific dataspace from the research task represent valuable information not only to paradigm on top of the e-Science life cycle ontology for the breath the acting research group, but also to other groups with respect to gas analysis research community. other main focuses. 
Breath gas analysis is an emerging new scientific field with Breath gas research specific dataspaces will be set up to serve a large scientific community spread all over the world and with a special subject, which is on one hand the relationship of source a promising significant impact on many application domains. data (exhaled breath gas measurement data) and its derived data Recent results suggest that detection of different kinds of cancer (e.g. specific cancer markers) in breath gas analysis experiments is possible by means of breath gas analysis beyond the scope of and on the other hand to integrate scientific knowledge into these available diagnostic methods. There is strong evidence that specific experiments. cancers can be detected using the concentration pattern of volatile This paper discusses the design of a scientific dataspace paradigm compounds in exhaled air 17. for the breath gas analysis scientific community. The rest of this The growing international community of breath gas researchers document is set out as follows: Section 2 introduces selected use is addressing many different studies including endogenously- case scenarios, lists the currently known uses of the scientific derived volatile compounds such as emitted by exhaled breath, dataspace and describes the information a breath gas analysis from skin, urine, faeces, and flatulence. They are currently scientific dataspace should contain and how this information is at the stage of developing new analytical methods, collecting organized; Section 3 describes the operations the dataspace must pilot data for cancer and other diseases and identifying marker support with an indication of how breath gas researcher will interact compounds. Breath gas researchers are investigating and screening and take advantage of the services the scientific dataspace provides. for hundreds of compounds in the exhaled breath gas. 
The In Section 4 we discuss the iterative approach of conducting breath analytical instruments and techniques used include GC-MS (Gas gas experiments on top of the e-Science life cycle model and in Chromatography Mass Spectrometric), PTR-MS (Proton-Transfer- Section 5 we conclude the paper. Reaction Mass Spectrometry), SIFT-MS (Selected-Ion-Flow-Tube Mass Spectrometry), IMS (Ion Mobility Spectrometry), laser spectrometry - as well as various statistical and data mining 2 APPROACH techniques supporting identification of specific markers. Currently, In our previous work we have introduced the e-Science life cycle during the investigations new different sampling and analytical ontology 9, whose major goal was to semantically enrich the techniques for breath gas measurements are being developed. More existing relationship among primary, derived, and background information about the breath gas research community and their datasets that emerge during the life cycle of scientific data. The research work is available in 3 and online at the website of the goal is to implement a data management solution based on the International Association for Breath Research (IABR - www.iabr.li), concepts of dataspaces for large-scale and long-term management of which was founded in May 2005 in Vienna. scientific data. Our approach is to preserve both, relationships and The purpose of breath gas analysis scientific dataspaces is to data together within a dataspace to be reused by owners and others. enable collaborating scientists and institutions several important To enable their reuse, data must be well preserved. The effects of activities, which include: (a) access to distributed breath gas data loss can be economic, because the experiments have to be data and analytical resources collected and developed at different re-run, but in some cases data loss represents an opportunity lost research institutions around the world, (b) to easily contribute forever 18. 
To look on the bright side of things, well preserved data to and leverage the resources of an international- and national- represent valuable information, which can lead to fruitful scientific scale, multi-institutional environment. This will strongly support findings in the acting and also in related research areas. Preservation global collaborations of scientists, improve decisions and increase of scientific data is therefore a major requirement, which can best the chance and scope of discoveries in the breath gas research be established if the full life cycle of data is addressed. This is domain. In this context there is a need for a supporting information achieved in our approach by the e-Science life cycle ontology. In the infrastructure providing advanced data management and analytical following sections we describe how breath gas analysis experiments methods and their composition into scientific experiments allowing are consolidated, what actions are involved in terms of data access, the scientist to keep track of their e-Science activities and to publish analysis, and publication. We also show how the full life cycle corresponding results of breath gas analysis linked together with of data of breath gas experiments is organized by the breath gas their source data and semantics about the purpose of the experiment. analysis scientific dataspace. Source data obtained from the previously mentioned analytical methods are referred to as breath gas measurement data and are saved, together with corresponding patient data, locally at each 2.1 Breath Gas Analysis Dataspace Usage research center. These data are the fundament for simulation and The use case scenario depicting the current sequence of events in modeling by the acting research group, e.g. observation of the conducting breath gas analysis experiments is illustrated in Fig. 1, correlation between isoprene breath content and cholesterol level which is reproduced from the position paper 6. 
It presents an in blood. Such breath gas experiments, if evaluated on large amount overview, with the Server being an isle system gathering the data. of real data, allow a more detailed analysis including e.g. gender- Import and export activities on this system are protected by smart specific relation with respect to age-dependency 14. The output of card security measures. these analyses aims at defining a large number of predictions and The implementation of a secure infrastructure, which provides the might provoke further experimentation, which in turn may take needed services for breath gas researchers to efficiently and securely 2 Breath Gas Analysis Scientific Dataspace 2.2 Breath Gas Analysis Dataspace Content The breath gas analysis scientific dataspace consists of a set of Proband databases set up by a dataspace designer and administered by a 1. Take sample dataspace manager. In particular, there are: (breath air and compartment airt) • primary databases for storing the final input dataset. This kind 2. Collect data (e.g., smoker) of datasets are being created after steps 1 to 7 of the use case 3. Add additional described above. They can be retrieved by submitting a query data to one of the breath gas research source databases available Doctor Questionnaire- software to the acting researcher. However, in order to handle the issue that data and probably also structure of the data might change 4. Transfer Sample bag 5. Attach sample bag completed questionnaire Proband's Laboratory over time, we store this created final input dataset into a Values separate database called primary database, which is designed 6. store particularly for this purpose. A typical final input data set is less raw data 8. Analyse data than 1 MB in its MATLAB structure (.mat file), which is a 7. Calculate binary data container format used by MATLAB. It may include Database substances and concentrations arrays, variables, functions, and other types of data. 
It is Different Mass Spectrometers Server with BGA Source DB Workstation organized in three blocks as follows (1) patient data - includes 9. Transfer identified all collected data of different test persons such as proband value substances with concentrations (e.g. height, weight), burden (e.g. smoker/nonsmoker), labor value (e.g. blood parameter), etc. (2) the system information Hospital information system block manages all system settings for the two databases like all users with their corresponding user groups, studies with their questionnaires, different mass spectrometers with status, container types with status, and (3) the analysis data part Fig. 1. The use case depicting the current sequence of events 6 - Steps 2 includes all information on a specific measurement of a sample to 4 present the collection and subsequent transfer (i.e. manual import) of such as mass spectrometer type, used container, collection date, personal data to the server hosting a breath gas analysis database. Steps 5 measurement date, data (substances with concentration and to 7 involve the collection and preprocessing of the probands analysis data. Step 8 is the actual analysis done using a workstation employing Matlab. additional information), etc. • background databases for storing the analytical methods used to analyze the final input dataset i.e. MATLAB functions in M-files (ASCII-textfiles containing MATLAB commands and functions). Typical size of a single M-file is less that 500 KB. • derived databases for storing the results of analyses tasks. Once the breath gas researcher has accomplished his perform steps 1 to 7 is part of the project described in 6. However, experiment, he can publish the results of his analysis. Therefore the present document focuses on the realization of a scientific we take advantage of MATLAB’s publishing function, which dataspace for the breath gas scientific community. We assume lets you export results as plots or as complete reports. 
Using that a secure and isolated database storing masspectrometer and the MATLAB editor, researchers can automatically export their patient data is already set up and administered by the corresponding MATLAB results, including the code into XML and various Regional Head Service as described in 6. In the following we refer other file formats, e.g. HTML or LaTeX. Since typical breath to this database as the source database. gas experiments include plotting functions, this dataset usually We are aware that in order to successfully establish a large- includes jpg images. Typical size is less that 2 MB. scale scientific dataspace for the breath gas analysis community • other databases i.e. volatomics databases, which contains data with a large amount of well described experiments of exhaled from studies of exhaled breath gas as well as other sources of breath, we rely on active participation of members of the scientific endogenously-derived gases such as skin, urine, faeces, and community. Therefore we have - in cooperation with leading breath flatulence. There is a small number of mandatory data fields, gas researchers - defined a number of actions that a researcher is which will record basic metadata of studies of exhaled breath conducting during the process of performing breath gas studies. We gas. Records in this database will be based on a specific report have then mapped the specific actions to activities of the e-Science from some published source, such as a journal, conference life cycle. We also indicate where (on the community portal or proceeding, or on-line publication. This means that records in within MATLAB) these actions should be taken. The actions, its this database will be made only after the study has shown some corresponding e-Science life cycle activities, and the“place” where significant research results that were already scientifically they are taken are listed in Table 1. published. 
Based on this common understanding we are designing and implementing the tools that support the breath gas researchers in In addition there are special databases set up for storing the conducting their breath gas studies according to actions described instances of the e-Science life cycle ontology, which are defined above. 3 Elsayed et al. Table 1. Definition of breath gas analysis actions and their mapping to e-Science life cycle activities action description e-Science life cycle activity place 1 Login to the system. GoalSpecification Portal 2 Definition of the goals of the study. GoalSpecification Portal 3 Collection of the probands analysis data (masspectrometer and patient data; this action DataPreparation Lab/Questionnaire covers steps 1-7 of the current sequence of events depicted in Fig. 1). Software 4 Formulation and submission of a query to the source database. This action generates DataPreparation Portal the final input dataset, which will be included into the dataspace as participant marked with type “primary data”. 5 Selection/development of the analytical method for analyzing the prepared dataset. This TaskSelection MATLAB action generates the analytical methods, which will be included into the dataspace as participant marked with type “background data”. 6 Execution of the analytical methods. TaskExecution MATLAB 7 Process the results and export them using MATLAB’s publication function into XML. ResultPublishing MATLAB This action generates the results report, which will be included into the dataspace as participant marked with type ”derived data”. 8 Set publication mode to conducted experiment. ResultPublishing Portal in RDF 2. This is based on using OGSA-DAI 19 to present an OGSA-DAI resources to specific users. Details about the access RDF store. OGSA-DAI is the de facto standard for data access and control mechanisms and the security considerations in general are integration for relational and XML data as well as file resources. described in 5, 6. 
There are many different tools available for In contrast to data warehouse systems, data owners don’t lose analyzing breath gas data (e.g. MATLAB, Octave, GridMiner, etc.). control over their data stored in above introduced participants of Breath gas researchers are already used to some tools and therefore the scientific dataspace. This is due to the publication concept might not want to change their analysis software. Therefore, we that is provided by the e-Science life cycle ontology (see Section were considering in our architecture to make the interface to 3). Breath gas researcher can limit access to their recources by the scientific dataspace independent of any analysis tool. Users assinging different publication modes. Also, single breath gas access breath gas source data from their source database via the experiments and even whole studies can still be removed from community portal, which transforms the source dataset, using the dataspace by the corresponding data owner. Another important wrappers into the format, which the corresponding tool can parse. distinction to data warehousing is in user management. The We provide interfaces for accessing data from and publishing data scientific dataspace provides multiple research groups and types of into the dataspace by following the strict access control mechanisms users to share their scientific resources without the time consuming introduced above. The aim of the user environment is to guide preparation of separate data marts as required in data warehousing. breath gas researcher through their experimentation and their data Furthermore, in data warehousing systems data regularly need to publication by considering the e-Science life cycle ontology. 
Once be extracted, cleaned, and loaded into the system; even though this the scientific dataspace has grown into a large scientific resource, process is fully automated, it still consumes time as does regular we suspect it will be widely used by analysis, diagnostic and maintenance. This time commitment required is not the case with visualization tools. the scientific dataspace paradigm because the data is loaded into the Each instance of the above listed types of databases is defined as corresponding databases, during execution of the experiment. dataspace-participant. On a lower level of abstraction the contents Since patient data is involved and due to legal requirements on of the databases itself are distinguished as dataspace-participants, such highly sensitive personal data, security and privacy issues i.e the final input dataset or a single analysis function (M-file) are of utmost importance. Thus within applications in the breath used within a specific breath gas analysis experiment. The different gas analysis research domain, all participating databases including levels of abstraction are illustrated in Fig. 3. On a higher level of the RDF store are isolated, monitored, and restricted to a single abstraction there might be whole dataspaces that act as participants point of access using the OGSA-DAI interface, hence implement of an interconnected large-scale hyper-dataspace. Such scenarios strict access control. An instance of the breath gas scientific are important when multiple research organizations, each having dataspace and its user environment is illustrated in Fig. 2. The deployed their own dataspace, engage collaborations and agree figure clearly shows that the dataspace consists of five databases on sharing their scientific dataspaces and life cycle resources, of types described above. Each is deployed as an OGSA-DAI respectively. resource running on a secure virtual host in a virtual machine resource. 
The recently developed logging framework inside OGSA- DAI is used to detect data abuse. The OGSA-DAI services itself are protected by standard security mechanisms supported by the 3 METHODS underlying container, which is in our case the Globus Toolkit Breath gas researchers will interact with the scientific dataspace via the container 1 . We use mapping files in order to restrict access to community Web portal as depicted in Fig. 2. A Web portal is a website that 4 Breath Gas Analysis Scientific Dataspace Single Point of Access Anlaysis Tool 1 Anlaysis Tool 2 Anlaysis Tool 3 Anlaysis Tool n VM Resource Secure Virtual Host … ??? OGSA‐DAI + Logging MATLAB® Octave GridMiner IABR_DBk User Environment – Community Protal SDB VM Resource source database Google Web Toolkit RDB Instance of a Breath Gas Analysis Scientific Dataspace Single Point of Access VM Resource VM Resource VM Resource VM Resource VM Resource Secure Virtual Host Secure Virtual Host Secure Virtual Host Secure Virtual Host Secure Virtual Host OGSA‐DAI + Logging OGSA‐DAI + Logging OGSA‐DAI + Logging OGSA‐DAI + Logging OGSA‐DAI + Logging IABR_DB1 IABR_DB2 IABR_DB3 … IABR_DBn IABR_DB SDB VM Resource SDB VM Resource SDB VM Resource SDB VM Resource SDB VM Resource primary database derived database background db volatomics db special database RDB XML RDB RDB RDF Fig. 2. An instance of the breath gas analysis scientific dataspace and its user environment - An instance is consolidated of at least five separate databases, which might be geographically distributed. Each database stores dataspace-participants of corresponding types (primary, derived, background, and other) as described above. The special database stores the relationships among participants in the form of individuals of the e-Science life cycle ontology. Analysis tools are independent of the user environment, where data is accessed from and loaded into the dataspace and corresponding external source databases hosting breath gas raw data. 
acts as a point of access for a wide variety of information. The community defines the concepts of the e-Science life cycle model as OWL-classes and portal consists of various portlets, which are pluggable components that are properties. managed by the portal. The portlets of the Web portal are categorized into The e-Science life cycle manager guides the breath gas researcher through three major categories: (1) Infrastructure - Resource Management Portlets, the above mentioned five e-Science life cycle activities. It creates new (2) Monitoring - Semantic Logging Portlets, and (3) Action - Scientific instances of the e-Science life cycle and corresponding activities and allows Dataspace Portlets. The first two categories are primarily for administrators the researcher to refine them throughout his study. Already defined instances and therefore not further described in this paper. The latter category is of e-Science life cycle activities can be reused in other iterations or even in designed for the breath gas researcher. It consist of two portlets (1) the e- other studies. For instance, a created final input dataset, which represents the Science life cycle manager and (2) the dataspace browser. In the following, output of the activity DataPreparation might be reused in different iterations the purpose and functionality of these two portlets are described and an of the same experiment or even in different studies. In order to find the overview of the user interaction is given. The community portal is built corresponding instance or other related instances for reuse we provide an using the Google Web Toolkit (GWT) 13, which allows to build and maintain integrated search and query interface as described below. complex yet highly performant JavaScript front-end applications in the Java programming language, especially when combined with the Google Plugin Search & Query the dataspace - Dataspace systems must enable users to for Eclipse. 
interact with dataspaces through a search and query interface. However, we should keep in mind that much of the interaction with such a system is of exploratory nature. The implemented scientific dataspace model is 3.1 The e-Science life cycle manager based on the e-Science life cycle ontology. Data about conducted breath gas experiments is semantically rich described within instances of the ontology The e-Science life cycle manager is a tool that enables the breath and is stored in RDF format. Using SPARQL query language for RDF, the gas researcher to describe the experiment he is currently conducting scientific dataspace is able to provide answers to specific questions, such as according to the activities of the e-Science life cycle model. The the following: model consists of five phases representing typical activities a researcher is carrying out when performing scientific experiments. Thus the A ”I have detected a model error and want to know which derived data most important steps in performing scientific experiments in e-Science products need to be recomputed.” applications are classified into the five activities, which we named e-Science life cycle activities (GoalSpecification, DataPreparation, TaskSelection, B ”I want to check if insopiration is different to expiration of breath gas TaskExecution, ResultPublishing) 9. The e-Science life cycle ontology dataset x. If the results already exist, I’ll save hours of computation.” 5 Elsayed et al. C ”Is there any experiment done on the volatile organic compound isoprene on exhaled breath gas in the context of cholesterol level in high blood?” P P SDS2 SPARQL is a powerful query language for RDF data. However, strict P mapping P P and complex grammar is a common characteristic of any query language SDS3 being proposed for structured data including SPARQL, SQL, XQuery, etc. P We are aware that it is not easy for breath gas researcher to accept these P languages. 
Keyword search is widely used in Web search engines, but it P SDS1 cannot efficiently support semantic query. Therefore we implement a search P and query interface providing the breath gas researcher an easy way how to express queries that will be transformed into a SPARQL query behind and submitted to the RDF store. SDS(Pi,Pj,Pk) DB2 DB1 relation ship BD Pj Neighborhood keyword queries - as introduced in 7, they have also the PD Pi goal to explore associations between data items, can be provided by the life ip n sh Level of Abstraction re la cycle ontology. For example searching for “endogenously-derived gases” tio t io rela ns hi p returns not only the goalSpecification instances available in the dataspace DB3 Pk that mention “endogenously-derived gases”, but also the instances of its DD neighbor activities informing the user what proband data was used and where it resides, which analytical methods were applied, and what results of the corresponding experiment were achieved. Information about the scientist Pi1 who conducted the experiment is also returned. This will strongly enforce Pj1 new research collaboration within the breath gas research community. relationship final input dataset masspecttrometer and Publication Modes - Different research groups are more or less collaborating patient data analytical method within the breath gas scientific community. A research group belongs to relational data Matlab Script re l at M‐file a Regional Head Service (RHS). The RHS is responsible for managing a io p ns hi LCR(Pi1,Pj1,Pk1) hi p l at io n country’s secure databases, which are under its ultimate control. The RHS re again belongs to a Project Head Service (PHS), which implements the access Pk1 decisions services, authentication, user management, and the dataspace services implementing the functions described above 6. 
Users can publish their conducted e-Science life cycle experiments using five different publication modes:

• researcher - the life cycle instance can only be accessed by a specific researcher, i.e. the researcher who conducted the experiment.
• investigator - the life cycle instance is accessible to investigators, i.e. the supervisor of the researcher.
• research group - the life cycle instance is available to members of the research group the publisher works with.
• project member - the life cycle instance is available to members of the project the publisher is involved in.
• IABR member - the life cycle instance is available to all members of the IABR. In this case an entry is made in the volatomics database.

Fig. 3. Levels of abstraction of dataspace participants - At the lowest abstraction level, participants of the scientific dataspace represent concrete datasets that were used within a breath gas experiment. Together they form a Life Cycle Resource (LCR). In a particular breath gas experiment, these participants are Pi1 - the final input dataset (primary data), Pj1 - the analytical method as MATLAB commands and functions organized in an M-file (background data), and Pk1 - the resulting analysis report generated using MATLAB’s publish function (derived data). These participants are stored in corresponding databases, as depicted in the figure. The relationship among these participants is semantically rich and is described by individuals of the e-Science life cycle ontology. To simplify matters, the figure does not show the database that stores the relationships (RDF). At the next abstraction level, the databases DB1, DB2, and DB3 represent participants forming the scientific dataspace SDS(Pi, Pj, Pk), which is an instance of a breath gas analysis dataspace as it was deployed as an experimental framework for the Breath Research Institute of the Austrian Academy of Sciences. At the top level we illustrate a large-scale hyper-dataspace, which arises when multiple dataspaces deployed for different organizations are interconnected, e.g. when multiple breath gas analysis research institutions engage in research collaborations, each running its own breath gas dataspace. In this scenario the dataspaces themselves represent participants of the large-scale hyper-dataspace. Legend: P = Dataspace Participant, SDS = Scientific Dataspace, LCR = Life Cycle Resource.

3.2 The Dataspace Browser

The dataspace browser is a tool that allows the user to navigate visually through the e-Science life cycle resources available in the dataspace. It is implemented as a portlet for easy integration into a community portal. It submits SPARQL queries, annotated with the role of the requesting user, to the RDF store. Depending on that role, the user sees more or fewer e-Science life cycle resources. For instance, the scientific dataspace may contain life cycle resources to which the publication mode researcher was assigned. Such life cycle resources should be accessible only to the researcher who created them. Therefore, the request from the dataspace browser includes the role of the user. The response consists of RDF data and is used as input for the dataspace browser.

A number of tools are available that visualize RDF data. Example projects include Welkin [20], multiple plugin tools for the Protege environment [10], and Semantic Analytics Visualization [16]. These tools could also be used by breath gas researchers to browse the contents of the dataspaces.
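The role-dependent visibility of life cycle resources can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: the paper performs the filtering through SPARQL queries against the RDF store, whereas the classes `User` and `LifeCycleResource`, their fields, and the rule that the owner always sees their own resources are hypothetical and introduced here for illustration. Only the five publication mode names come from the text.

```python
from dataclasses import dataclass

# The five publication modes named in the text.
MODES = ("researcher", "investigator", "research group",
         "project member", "IABR member")


@dataclass(frozen=True)
class User:
    name: str
    group: str
    project: str
    supervises: frozenset = frozenset()  # names of supervised researchers
    iabr_member: bool = False


@dataclass(frozen=True)
class LifeCycleResource:
    lcr_id: str
    owner: str          # researcher who conducted the experiment
    owner_group: str
    owner_project: str
    mode: str           # one of MODES


def can_access(user: User, lcr: LifeCycleResource) -> bool:
    """Return True if `user` may see `lcr` under its publication mode."""
    if lcr.mode not in MODES:
        raise ValueError(f"unknown publication mode: {lcr.mode!r}")
    if user.name == lcr.owner:      # assumption: the owner always has access
        return True
    if lcr.mode == "researcher":
        return False
    if lcr.mode == "investigator":
        return lcr.owner in user.supervises
    if lcr.mode == "research group":
        return user.group == lcr.owner_group
    if lcr.mode == "project member":
        return user.project == lcr.owner_project
    return user.iabr_member         # remaining mode: "IABR member"


def visible_resources(user: User, resources) -> list:
    """IDs of all life cycle resources the dataspace browser would show."""
    return [lcr.lcr_id for lcr in resources if can_access(user, lcr)]
```

In a deployment along the lines described above, the same rule would be expressed as a filter in the SPARQL query that the browser sends together with the user's role, so that resources the user may not see never leave the RDF store.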
4 DISCUSSION

In data warehousing, data is typically extracted from multiple data sources, transformed, and loaded into a separate database, called a data warehouse. This is not the case with the scientific dataspace paradigm discussed in this paper, although we do load the final input dataset prepared for an analysis task into a separate primary database that participates in the dataspace. This is in fact one step of the whole data life cycle in e-Science, which we are trying to semantically enrich and preserve within the scientific dataspace.

Experiments on exhaled breath gas are successively refined by the acting researcher [iterations of action 4-6 in Table 1] until the study either shows a significant result (i.e. definition of accurate methods for estimation of blood gas levels of certain biomarker values from breath gas samples) [prepare action 7] or ends in a modification of the goal specification originally defined for that experiment [modify goals and restart action 2]. In both cases, several iterations of the e-Science life cycle model are performed. Some instances of e-Science life cycle activities might be reused in another iteration of the life cycle, for instance when a breath gas researcher applies a slightly refined analytical method to the same final input dataset. In this case the goals defined and the data prepared for that experiment did not change, so the corresponding instances of the e-Science life cycle activities are reused within a new iteration of the life cycle.

During this iterative process of refinement, the acting researcher might need to share the experiment with his supervisor, who might decide to share it further with the research group or even with members of a collaboration project. Finally, once the study has shown significant results that are published in a journal or in conference proceedings, these results represent valuable information beyond the acting research group, and it might be worthwhile to publish the experiment using the IABR member publication mode. It will then be accessible to all members of the breath research association.

4.1 Evaluating Studies

During our investigations of the e-Science life cycle model, we (computer scientists and breath gas researchers) have cooperatively conducted several sample breath gas experiments on top of the scientific dataspace. In the following we summarize the major activities of those experiments and describe their outputs and how the data is organized by the scientific dataspace. First, the breath gas study is described in textual form. Users can define their own attributes and assign values to them, e.g. the attribute Description contains a textual description of the goals of the study. This data is saved, together with information about the acting researcher (research group, department, publication mode), as individuals of the life cycle ontology in the RDF store. A snapshot of a concrete RDF graph of a sample breath gas experiment is illustrated in Fig. 4.

Data access is done using the OGSA-DAI client and a certificate. In most scenarios, the researcher first loads all values of the dataset he is investigating into a MATLAB structure. This is done using an implemented MATLAB data load function (loadData.m), which communicates with an OGSA-DAI client. Then the breath gas researcher selects the values he is interested in for the current experiment, e.g. selection of all ex-smokers. The outputs of this selection process, which is done within MATLAB, are twofold:

(1) A new MATLAB MAT-structure containing the selected data (fids.mat). This represents the final input dataset and is therefore stored in the primary database as a Binary Large Object (BLOB).
(2) The MATLAB function itself (selectData.m), which is responsible for selecting the required data. This M-file represents the background data and is therefore saved in the background database.

Several analysis functions might be implemented by the breath gas researcher and applied to the final input dataset. Such MATLAB functions also represent background data and are saved correspondingly. For easier handling we provide an empty MATLAB file named executeStudy.m, which is used in all experiments. Breath gas researchers may add their own implementations to it or import external analysis functions. These functions calculate the derived data, which is usually the input to a plot function. Again, an empty template (plotData.m) is prepared for use by the researcher. Finally, we provide a publish function (dsPublish.m), which generates an XML report of the conducted experiment, including plots if used, by taking advantage of MATLAB’s publishing feature. This report is then zipped and saved as a BLOB in the derived database of the scientific dataspace. Based on these investigations, we have created guidelines for breath gas researchers, defining concrete activities and documentation policies, which guide users through the e-Science life cycle. An empty template with named files has also been created for better comprehension.

[Figure 4: RDF graph. Recoverable labels: BGA Dataspace, Participant, LifeCycleResource (ID:LCR111111); participants fids.mat, selectData.m, executeStudy.m, plotData.m, report.zip; properties metaData, consistsOf, describes, participatesIn, isDescribedBy, describesRelationshipAmong, isUsedIn, hasAttributes, attributeName, attributeValue, correspondsTo; example attribute values include PTR-MS.]

Fig. 4. Snapshot of an RDF graph of a sample breath gas experiment - This snapshot shows a life cycle resource, which describes the relationship of the participants depicted in the figure. The participants represent primary, background, and derived data of the breath gas experiment. Participants are described with attributes and their values, which the acting breath gas researcher defines while conducting the experiment. In the breath gas analysis domain, there is a predefined set of attribute names, to be followed by scientists for a consistent description of breath gas analysis experiments.

4.2 Implementation Status

A first experimental framework has been developed in order to evaluate, at an early stage, the concepts of the e-Science life cycle ontology, which is the basis of the scientific dataspace paradigm discussed in this paper. This framework builds on top of Protege. We used the built-in individual editor to create several life cycle resources that represent breath gas analysis experiments, and the SPARQL query panel to retrieve those resources. Furthermore, we deployed an experimental breath gas source database, restricted to a single point of access using the OGSA-DAI interface, and implemented a MATLAB function that communicates with that database using a certificate. We then conducted several breath gas experiments using the MATLAB environment and manually saved their background and derived data into the corresponding databases. Even though this first experimental framework has no interfaces implemented, it allowed us to prove the concepts of the introduced breath gas analysis scientific dataspace and to elaborate a feasibility study. Based on these first experimental evaluations we have finalized and documented the architecture of the system, including the design of all necessary OGSA-DAI workflows, in a UML-based design document that is to be made available as a technical report.

As the next step we addressed the interface design and started to build a first prototype using the Google Web Toolkit (GWT) introduced above.

5 CONCLUSION

This paper describes the proposal and implementation of a scientific dataspace paradigm for the breath gas analysis scientific community. The authors, computer scientists and leading breath gas researchers, have investigated and evaluated a novel data management and scientific data preservation approach based on the concepts of dataspaces for the scientific community of breath gas researchers. First breath gas experiments were collaboratively conducted on top of the experimental dataspace framework, which allowed us to evaluate the e-Science life cycle ontology. This first evaluation represents the basis for an extension and deeper investigation of the scientific dataspace paradigm introduced.

ACKNOWLEDGEMENT

The Austrian BMWF (Federal Ministry for Science and Research) funding of the Austrian Grid 2 project (Contract: GZ BMWF-10.220/0002-II/10/2007) is key to bringing the partners together and to undertaking the research. The Department of Scientific Computing, University of Vienna, brought the e-Science life cycle ontology to the project, and the Research Center for Process and Product Engineering, University of Applied Sciences, Dornbirn, set up the meetings that led to the research collaboration with the Breath Research Institute of the Austrian Academy of Sciences, Innsbruck, forming the breath gas analysis application. The entire research team contributed to the discussions that led to this paper and provided the environment in which the ideas could be tested. Finally, we would like to thank the e-Science Institute for its support in hosting the IWPLS09 workshop.

REFERENCES

[1] The Globus Alliance. http://www.globus.org.
[2] Resource Description Framework (RDF). http://www.w3.org/RDF, February 2004.
[3] A. Amann and D. Smith. Breath analysis for clinical diagnosis and therapeutic monitoring. World Scientific, Singapore, 2005.
[4] M. Baumgartner, C. Glasner, and J. Volkert. An overview of the Austrian Grid infrastructure. Proceedings of the 1st Austrian Grid Symposium, 2005.
[5] M. Descher, P. Masser, T. Feilhauer, A. M. Tjoa, and D. Huemer. Retaining data control to the client in infrastructure clouds. In Proceedings of International Conference on Availability, Reliability and Security, IEEE, Fukuoka, Japan, 2009.
[6] M. Descher, P. Masser, T. Ludescher, B. Wenzel, T. Feilhauer, P. Brezany, I. Elsayed, A. Wöhrer, A. M. Tjoa, and D. Huemer. Position paper: Secure infrastructure for scientific data life cycle management. In Proceedings of International Conference on Availability, Reliability and Security, IEEE, Fukuoka, Japan, 2009.
[7] X. Dong and A. Halevy. Indexing dataspaces. In SIGMOD ’07: Proceedings of the 2007 ACM SIGMOD international conference on Management of data, pages 43–54, New York, NY, USA, 2007. ACM.
[8] I. Elsayed, P. Brezany, and A. M. Tjoa. Towards realization of dataspaces. In DEXA ’06: Proceedings of International Conference on Database and Expert Systems Applications, IEEE, 2006.
[9] I. Elsayed, A. Muslimovic, and P. Brezany. Intelligent dataspaces for e-science. In Proceedings of International Conference on Computational Intelligence, Man-Machine Systems and Cybernetics (CIMMACS ’08), World Scientific and Engineering Academy and Society Press, Cairo, Egypt, 2008.
[10] Stanford Center for Biomedical Informatics Research. Protege ontology editor and knowledge-base framework. http://protege.stanford.edu/, 2009.
[11] I. Foster, C. Kesselman, and S. Tuecke. The anatomy of the Grid: Enabling scalable virtual organizations. International Journal of Supercomputer Applications, 15(3), 2001.
[12] M. Franklin, A. Halevy, and D. Maier. From databases to dataspaces: A new abstraction for information management. ACM SIGMOD, December 2005.
[13] Google Code. Google Web Toolkit (GWT). http://code.google.com/intl/en/webtoolkit/, 2009.
[14] I. Kushch, B. Arendacká, S. Stolc, P. Mochalski, W. Filipiak, K. Schwarz, L. Schwentner, A. Schmid, A. Dzien, M. Lechleitner, V. Witkovský, W. Miekisch, J. Schubert, K. Unterkofler, and A. Amann. Breath isoprene - aspects of normal physiology related to age, gender and cholesterol profile as determined in a proton transfer reaction mass spectrometry study. Clinical Chemistry and Laboratory Medicine, 46(7):1011–1018, 2008.
[15] I. Elsayed et al. The e-Science Life Cycle Ontology. www.gridminer.org/e-sciencelifecycle, 2008.
[16] L. Deligiannidis, A. P. Sheth, and B. Aleman-Meza. Semantic analytics visualization. In Proceedings of the International Conference on Intelligence and Security Informatics (ISI-2006), IEEE, San Diego, CA, USA, 2006.
[17] M. Ligor, T. Ligor, A. Bajtarevic, C. Ager, M. Pienz, M. Klieber, H. Denz, M. Fiegl, W. Hilbe, W. Weiss, P. Lukas, H. Jamnig, M. Hackl, B. Buszewski, W. Miekisch, J. Schubert, and A. Amann. Determination of volatile organic compounds appearing in exhaled breath of lung cancer patients by solid phase microextraction and gas chromatography mass spectrometry. Clinical Chemistry and Laboratory Medicine, 47:550–560, 2009.
[18] C. Lynch. Big data: How do your data grow? Nature, 455(7209):28–29, September 2008.
[19] M. Antonioletti et al. OGSA-DAI 3.0 - the whats and the whys. Proceedings of the UK e-Science All Hands Meeting 2007, September 2007.
[20] S. Mazzocchi and P. Ciccarese. Welkin - a graph-based RDF visualizer. http://simile.mit.edu/welkin/, 2004.