9th International Workshop on Science Gateways (IWSG 2017), 19-21 June 2017

Managing, Preserving and Disseminating Research Objects in Earth Science with the ROHub Science Gateway

Raul Palma and Cezary Mazurek
Poznan Supercomputing and Networking Center
Poznan, Poland
{rpalma, mazurek}@man.poznan.pl

Jose Manuel Gomez-Perez and Andrés García
Expert System
Madrid, Spain
{jmgomez, agarcia}@expertsystem.com

Abstract: Research Objects (ROs) are semantically enriched information units encapsulating all the materials and methods relevant to a particular scientific investigation, their associated metadata and the context where such resources were produced and came into play. Their purpose is to enhance the sharing, preservation and communication of data-intensive science, facilitating validation, citation and reuse by the community. For such a mission, infrastructure and tools for RO governance are critical. ROHub is the platform of reference in the management of ROs and their lifecycle. It enables researchers to preserve their work and make it available to others, as well as to discover and reuse pre-existing scientific knowledge. In this paper, we introduce ROHub to the Science Gateways community and present new capabilities and extensions specific to Earth Sciences, beyond previous efforts in experimental disciplines.

Keywords: Research Objects, Earth Science

I. INTRODUCTION

Research in data-intensive disciplines is increasingly consuming and generating a variety of digital resources during the course of scientific investigations. This has steadily increased the need for means to systematically capture the lifecycle of scientific investigations, which at the same time provide a single entry point to all the related resources, including data, publications, computational resources, and the researchers involved in the investigation. In Earth Science, for example, the high-level research and information lifecycle involves tasks such as: access to data (e.g., raw data and/or a variety of added value products); sharing results (with colleagues and/or community); execution of data analytic methods and generation of models; validation and dissemination of findings; and collaboration with colleagues [1].

Research Objects (ROs) [2] provide the mechanisms to support researchers in these tasks. Originally conceived to support the scientific endeavour in experimental disciplines like Genomics or Astrophysics, ROs are rapidly being adopted in other fields, with special interest in Earth Sciences. With the necessary extensions and updates, research objects can also support earth scientists in managing the lifecycle of their scientific investigations, providing structured containers that aggregate all the resources related to a particular experiment/observation, and the means for sharing, validating and disseminating the research work as a single information unit, to be interpreted and reused by the community in the future.

Such capabilities require both an underlying (research object) model and the technological support implementing this model. The former, known as the RO model, specifies the semantic vocabulary and relations for capturing and describing ROs, their provenance and lifecycle. The latter is provided by ROHub, a holistic RO management platform implemented natively on top of the RO model. ROHub supports scientists throughout the research lifecycle to manage and to structure their resources as high-quality ROs, fostering collaboration within and across scientific communities with such ROs at the center.

In the following, we introduce the RO model with a concrete example, followed by a description of ROHub and the recently implemented extensions to both the model and the platform in support of Earth Sciences communities. Finally, we illustrate the usage of research objects and ROHub with a working example and conclude with a discussion on the ongoing work.

II. RESEARCH OBJECTS

A research object can aggregate an arbitrary number of heterogeneous resources, which can be internal or external (linked by reference) to the research object location, such as the data used or the results produced in an experiment study, the (computational) methods employed to produce and analyse that data, and the people involved in the investigation. Additionally, the resources in the research objects can be organised within folders (a special type of resource), to facilitate their inspection. Similarly, the research object can encapsulate any number of annotations associated to these resources (or the research object itself), enabling the understanding and interpretation of the scientific work, such as provenance and evolution information, descriptions of the computational methods, dependency information and settings about the experiment executions.

To represent the rich, and potentially complex, structure of a research object, the underlying model was implemented as a suite of lightweight ontologies building upon existing vocabularies: OAI ORE (Object Reuse and Exchange) for specifying aggregation of resources, the Annotation Ontology (AO) to support the annotations, and the PROV Ontology to represent provenance information. A complete specification of the model can be found in [3].

Figure 1 depicts a partial (and simplified) view of a research object structure that illustrates the RO model with a concrete example. This research object is the result of the study elaborated in Section V.

Figure 1 Partial view of an exemplary research object structure

III. ROHUB

ROHub (www.rohub.org) enables scientists to manage and preserve their research work through ROs, to make it available for publishing, to collaborate and to discover new knowledge (see [4,5] for a more detailed description of ROHub and its origin).

A. ROHub Implementation

ROHub comprises both a backend service and a frontend (client) application. The backend provides a set of REST APIs [6] implementing the RO model, which can be used to access ROHub programmatically. The two primary ones are the RO API and the RO Evolution API, which define the formats and links used to: i) create and maintain ROs, the resources aggregated and the associated annotations (metadata); ii) change the lifecycle stage of an RO, create an immutable copy (snapshot or archive) from a working (live) research object and fetch their evolution provenance. The backend also provides APIs for notifications, search, access control and user management, plus a SPARQL endpoint. The frontend exposes RO functionalities to the end-users through a web GUI. This is the main interface for researchers to interact with ROHub.

B. Key Features Overview

Create, manage and share ROs: ROHub provides different methods for creating ROs: from scratch, from a zip file or by importing resources from other repositories. It also supports different access modes for sharing ROs (open, public or private), allowing users to specify who can read/write to the RO.

Discover, explore and reuse ROs: ROs can be discovered using the faceted or keyword search interfaces, or directly through the SPARQL endpoint, and can then be inspected, downloaded, and reused to create new ones.

Assess RO quality: The RO overview panel shows a progress bar of the RO quality based on a set of basic RO requirements (Figure 2). Further quality information can be found in the quality panel, where ROs can be assessed against predefined checklist templates for specific domains or community needs.

Manage RO evolution: ROHub allows creating snapshots of the current state of the RO for sharing or release, keeping their versioning information and associated changes. RO evolution can be visualized from the History panel.

Figure 2 ROHub – RO overview panel

Nested ROs: An RO can aggregate any type of resource, including internal resources, links to external resources and other ROs. The latter allows aggregating RO bundles [7], which are self-contained ROs serialised as ZIP files and generated by 3rd party tools (e.g., workflow management systems).

Preserve and monitor ROs: Long-term preservation features include RO fixity checking and quality monitoring that generate notifications of changes. RO content and quality changes are shown in the notification panel, and an Atom feed is available to get automatic notifications. Additionally, the quality monitoring has an interface that can be reached from the quality panel to visualise the RO quality through time.

Semantic enrichment: An RO can be enriched automatically with structured metadata extracted from its textual content, including the main concepts, domains, lemmas and named entities, in order to facilitate its discovery via the faceted/keyword search interfaces. Such metadata complements the metadata provided explicitly by scientists, offering a richer, machine-readable description of the RO.

DOI and citation: Now a DataCite (www.datacite.org) DOI allocator, ROHub can assign a DOI to the released ROs, enabling citation and stimulating scholarly communication and sharing before actual paper publication. DOI assignment follows RO release after automatically checking that the RO follows DataCite's policies, through the checklist mechanism described above.

IV. EXTENSIONS FOR EARTH SCIENCE

ROHub is a domain-agnostic platform that has been tested in Experimental Sciences. Currently, we are extending its capabilities to support the specific needs of the Earth Science community as part of the EVER-EST project. The analysis of such needs led to new features in the model and in the platform, including support for:

Geospatial and time information. The most relevant metadata in Earth Sciences include the geographical location and the time period covered by or associated to the RO.

Data access policies to specify more detailed information about the possible use of digital content in publishing, distribution, and consumption of digital media across all sectors and communities.

Intellectual property rights to specify detailed information on the terms of use for a given resource.

RO Fork functionality to create a new RO from an existing one to start a new line of work or extend a previous one, automatically citing the source RO.

The resulting RO model extensions are publicly available (https://github.com/wf4ever/ro/tree/earth-science) and ROHub is currently being extended to support them and to provide related user interfaces. Such new capabilities include, amongst others: access and manipulation of geopositioned ROs through a map interface, definition and enforcement of data access policies and intellectual property rights, and the creation of new ROs by forking existing ones.

V. EXEMPLARY USE CASE

In this section, we introduce an excerpt of one real scenario provided by a virtual research community from the EVER-EST project, and then highlight the current limitations in the existing technologies and practices to illustrate how the use of the research object and ROHub can contribute to the preservation, sharing and reuse of research outputs. The RO associated to this scenario (depicted in Figure 1) is available at: http://sandbox.rohub.org/rodl/ROs/SeaMonitoring01/

Sea Monitoring Scenario: A researcher needs to define the habitat extent of the Cold Water Coral in the Bari Canyon and to provide this information to assess the good environmental status related to the descriptor D1 (Biodiversity, Indicator Habitat extent) within the Marine Strategy Framework Directive for the Italian waters. To this end, the researcher needs a habitat suitability model for the Cold Water Corals. The researcher needs to search for high resolution bathymetric data and Cold Water Coral occurrence data, and to run a good model to obtain a reliable map of habitat suitability for Cold Water Corals. The researcher needs to release the results to colleagues from different institutions working on the Marine Strategy Framework Directive, to share the model with them, to reuse the model in different locations, and to re-run the model after one year using new data from the same location. For this scenario, it is very important to share data and results within the community, to reuse the models coming from different scientists working on the same topic, to preserve the results and to publish methodologies and final maps.

Current limitations: Currently there is no reference site where a scientist can find publications on this specific topic, workflows executing the models, links to the data to be used and results (to mention a few). There are no specific repositories that are used to preserve and reuse all this information. Generally, there is no information about the quality of the models and the methodologies applied and described in the paper. Within the Marine Strategy Framework Directive (http://data.europa.eu/eli/dir/2008/56/oj) there is a big lack of communication and all the relevant information is dispersed in different repositories.

Overcoming the limitations with ROHub: ROHub allows the scientists to encapsulate the data, provenance of workflow executions, results, documentation and other resources related to the particular study, and to effectively preserve, share and reuse these resources through a single information unit. Moreover, ROHub allows the scientists to manage, track and visualize the complete scientific life cycle of the study, to collaborate throughout this process, and to disseminate the associated research object at different stages with colleagues or with the community (see Figure 3), so that other scientists can reuse the models in different locations and using different datasets. For monitoring purposes, the research object gives access to all the resources necessary to exactly re-run the same model in the same location at a different time, giving the opportunity to evaluate the differences in habitat extents applying the same methodologies. Some of the ROHub features used by the end-users in this scenario, in their words, include:

Semantic search: To reuse an existing research object about habitat suitability models the researcher will pose a query with appropriate concepts (e.g. ecology, habitat suitability, habitat extent) in order to easily find the most suitable ones. Semantic search gives the opportunity to retrieve concepts from all documents in the research object and it is really effective to find the research objects containing data and information that you need for your work. Without the semantic search a normal keyword search could be performed, however it was difficult to find the effective concepts.

Checklists: To have information about the research object quality, and to select the research objects that effectively work. It is important to reuse research objects with a running workflow and real link to data. The checklist is a good tool to evaluate a research object without losing time in manually verifying its content.

DOIs: After reusing a workflow with different data input and modifying the model parameters, a new research object will be created with the selected data and the most suitable parameters. This new research object will be released with a DOI that gives the opportunity to be properly cited by other scientists in the community. The use of DOIs encourages researchers to create research objects containing new data and research outcomes and especially to share them with the community, since they enable citation and credit. This mechanism adds incentives for scientists to share their work and stimulates reuse, accelerating the incremental development of science.

Scientific lifecycle management: The researcher can keep track of the evolution of the scientific study and release preliminary results after reaching intermediate milestones, in order to share the results with other colleagues and to keep a record of a particular state in the study. Such intermediate releases can then be compared to analyze the changes, or they can be used to start alternative lines of work.

Notifications: The researcher can receive notifications regarding changes in research object content but also about changes in the quality assessment. Updates on quality downfalls can be particularly useful (e.g., one of the services used is no longer available) in order to take corrective actions. Similarly, team collaborators can be notified about research object editing activity to keep track of the progress in the study, or to know when their input is required.

Figure 3 Research object lifecycle example

VI. CONCLUSION

The adoption of the Research Object paradigm by the scientific enterprise can accelerate science through a better management of the scientific information. Benefits of this approach can have an immediate impact on the validation, sharing, preservation and (eventually) reuse of scientific outcomes. However, appropriate tools and infrastructure need to be in place in order to provide the necessary functionalities to manage ROs throughout their entire lifecycle across the different scientific communities. ROHub is the first and main scientific gateway to provide holistic support for the management, sharing and communication of scientific knowledge in the form of ROs. In this paper, we recap its main features and report the recently implemented and ongoing extensions that enable a variety of scientific communities, and specifically earth scientists, to adopt ROs in their daily work. It is still early to measure the impact that this will have in terms of increased scientific productivity and scholarly communication and citation across the different scientific areas. In addition to further refinement of the methods and tools produced, future work involves piloting the approach in our scientific communities and beyond in order to collect data (e.g., bibliometrics and altmetrics, number of ROs, number of users) that allow assessing such impact.

ACKNOWLEDGMENT

This work is supported by the EVER-EST EU project (HORIZON2020-674907). Special thanks to Federica Foglini, from CNR-ISMAR, whose RO in sea monitoring has been used as an example across the paper.

REFERENCES

[1] EVER-EST project, D3.1 - Use Cases Description and User Needs Document. Project deliverable. 2016.

[2] K. Belhajjame, O. Corcho, D. Garijo, J. Zhao, P. Missier, D. Newman, R. Palma, S. Bechhofer, E. García-Cuesta, J.M. Gomez-Perez, G. Klyne, K. Page, M. Roos, J.E. Ruiz, S. Soiland-Reyes, L. Verdes-Montenegro, D. De Roure, and C.A. Goble. Workflow-centric research objects: First class citizens in scholarly discourse. In Proceedings of SePublica2012, pages 1–12, 2012.

[3] K. Belhajjame, J. Zhao, D. Garijo, M. Gamble, K. Hettne, R. Palma, E. Mina, O. Corcho, J. Gómez-Pérez, S. Bechhofer, G. Klyne, C. Goble, Using a suite of ontologies for preserving workflow-centric research objects, in Journal of Web Semantics: Science, Services and Agents on the World Wide Web, 2015. doi:10.1016/j.websem.2015.01.003

[4] R. Palma, O. Corcho, J. Gomez-Perez and C. Mazurek, ROHub – a digital library of research objects supporting scientists towards reproducible science, in: Presutti, V., et al. (eds.) SemWebEval 2014. CCIS, vol. 457, pp. 77–82, Springer, Heidelberg (2014), Crete, Greece, May 2014.

[5] R. Palma, O. Corcho, P. Hołubowicz, S. Pérez, K. Page, C. Mazurek, Digital libraries for the preservation of research methods and associated artefacts, in Proc. 1st International Workshop on the Digital Preservation of Research Methods and Artefacts (DPRMA 2013) at Joint Conference on Digital Libraries (JCDL 2013), pp. 8-15. Indianapolis, Indiana, USA, July 2013.

[6] R. Palma, P. Hołubowicz, K. Page, S. Soiland-Reyes, G. Klyne, C. Mazurek. A Suite of APIs for the Management of Research Objects, Proceedings of the Developers Workshop, ISWC. October 2014.

[7] Research Object Bundle 1.0 specification. November 2014. https://researchobject.github.io/specifications/bundle/
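
As an informal illustration of the concepts introduced in Sections II and III (aggregation of internal and external resources, annotations, snapshots of a live RO, and fixity checking), the following minimal Python sketch models a research object in memory. It is neither the RO model ontologies nor the ROHub REST API: the class and method names, the `isSnapshotOf` annotation key, the example.org URIs and the sample data are assumptions made for illustration, and SHA-256 is used only as one possible checksum.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Resource:
    """A resource aggregated by a research object (hypothetical model)."""
    uri: str
    content: bytes = b""      # left empty for external resources
    external: bool = False    # True when linked by reference only


@dataclass
class ResearchObject:
    """Toy research object: aggregated resources plus annotations."""
    uri: str
    state: str = "live"                               # "live" or "snapshot"
    resources: dict = field(default_factory=dict)     # uri -> Resource
    annotations: list = field(default_factory=list)   # (target, key, value)
    checksums: dict = field(default_factory=dict)     # uri -> SHA-256 digest

    def _require_live(self):
        if self.state != "live":
            raise ValueError("only a live research object can be modified")

    def aggregate(self, resource):
        self._require_live()
        self.resources[resource.uri] = resource

    def annotate(self, target_uri, key, value):
        self._require_live()
        self.annotations.append((target_uri, key, value))

    def snapshot(self, snapshot_uri):
        """Copy the current state; the copy rejects aggregate/annotate calls."""
        copy = ResearchObject(
            uri=snapshot_uri,
            state="snapshot",
            resources={u: Resource(r.uri, r.content, r.external)
                       for u, r in self.resources.items()},
            annotations=list(self.annotations),
        )
        # Record where the snapshot came from (assumed annotation key).
        copy.annotations.append((snapshot_uri, "isSnapshotOf", self.uri))
        # Checksums cover internal resources only; external ones hold no bytes.
        copy.checksums = {u: hashlib.sha256(r.content).hexdigest()
                          for u, r in copy.resources.items() if not r.external}
        return copy

    def check_fixity(self):
        """Return URIs of internal resources whose bytes no longer match."""
        return [u for u, digest in self.checksums.items()
                if hashlib.sha256(self.resources[u].content).hexdigest() != digest]


# Build a live RO with one internal and one external resource.
live = ResearchObject("http://example.org/ROs/demo/")
live.aggregate(Resource("data/bathymetry.csv", b"depth\n-420\n"))
live.aggregate(Resource("http://example.org/occurrences", external=True))
live.annotate("data/bathymetry.csv", "description", "made-up sample data")

# Release a snapshot, then simulate silent corruption of a stored file.
release = live.snapshot("http://example.org/ROs/demo-v1/")
unchanged = release.check_fixity()
release.resources["data/bathymetry.csv"].content = b"depth\n-421\n"
changed = release.check_fixity()
print(unchanged, changed)
```

In this sketch the first fixity check reports nothing, while the second lists the modified file. A monitoring service could turn such a result into a notification of the kind described in Section III.B.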