10th International Workshop on Science Gateways (IWSG 2018), 13-15 June 2018


     Mapping metadata from different research
infrastructures into a unified framework for use in a
             virtual research environment
                Paul Martin∗ , Laurent Remy† , Maria Theodoridou‡ , Keith Jeffery§ and Zhiming Zhao∗
                         ∗ Institute for Informatics, University of Amsterdam, Amsterdam, Netherlands
                                                     † euroCRIS / IS4RI, France
            ‡ Institute of Computer Science, Foundation for Research and Technology—Hellas, Heraklion, Greece
                                          § Keith G Jeffery Consultants, United Kingdom

  Emails: {p.w.martin, z.zhao}@uva.nl, lremy@is4ri.com, maria@ics.forth.gr, keith.jeffery@keithgjefferyconsultants.co.uk


   Abstract: Virtual Research Environments (VREs) augment research activities by integrating
tools for data discovery, data retrieval, workflow management and researcher collaboration,
often coupled with a specific computing infrastructure. However, the drive towards open data
science discourages ‘walled garden’ solutions, and has led to the creation of dedicated research
infrastructures (RIs) that gather data and provide services to particular research communities
without prejudice towards any particular science gateway or virtual laboratory technology.
   There is a need for generic VREs that can be easily customised to the needs of specific
communities and coupled with the services and resources of many different RIs, but the resource
metadata produced by these RIs rarely adheres perfectly to any particular standard or
vocabulary, making it difficult to search and discover resources independently of their
provider. Cross-RI search can be expedited by metadata mapping services that can harvest
metadata published under different standards to build unified resource catalogues, but such an
approach poses a number of challenges. In this paper we take the example of the VRE4EIC e-VRE
metadata service, which uses X3ML mappings to build a single CERIF catalogue for describing data
products and other resources provided by multiple RIs. We consider the extent to which it
addresses the challenge of cross-RI search, and we also discuss how it might take advantage of
semantic harmonisation efforts in the environmental science domain.
   Keywords: virtual research environment, research infrastructure, metadata catalogue, metadata
mapping.

                              I. INTRODUCTION

   Virtual Research Environments (VREs) [1], also known as virtual laboratories or science
gateways, are one of three types of science support environment developed to support researchers
in data science [2], focusing on supporting research activities at a holistic rather than
infrastructural or service level. VREs provide integrated environments that typically include
tools for activities such as data discovery and retrieval, collaboration, process scheduling and
workflow management, and many are coupled with a particular computational infrastructure, often
making use of public e-infrastructures or the Cloud. Data are brought into that infrastructure
and manipulated via a particular data processing platform or scientific workflow management
system [3]. However, this approach runs contrary to the recent drive towards open science and
open data, which discourages ‘walled garden’ solutions.
   Increasingly, what we observe instead is the creation of dedicated research infrastructures
(RIs) that aggregate and curate scientific data (including real-time observations) for a
particular research community, which then provide access to these data via unified services [4],
usually without prejudice towards any particular VRE. Complicating this matter, there is now a
substantive push to better integrate these efforts into a cohesive multidisciplinary commons for
open science and open research data, as embodied by initiatives such as the European Open
Science Cloud (EOSC) [5].
   Developing generic VREs that can be easily coupled with different RIs and customised for
specific communities is a goal of many recent research projects, including VRE4EIC¹ and
BlueBRIDGE², and is particularly challenging given the lack of conformity of standards and
vocabularies in environmental science and similar domains. Significant software engineering
effort is often required on the part of data scientists to build specific adaptors for such
couplings, but even then it remains crucial to provide the capability to search across different
RIs for similar data products or services to support integrative and transdisciplinary research.
This entails a complex interaction between a VRE and multiple RIs, distributing queries through
multiple adaptors and then aggregating the results, or else a prior harvesting of metadata from
all providers to allow preliminary queries to be conducted on a single logical catalogue.

  ¹ https://www.vre4eic.eu/
  ² http://www.bluebridge-vres.eu/

   In this paper we investigate how the use of a flexible metadata mapping and publication
service can expedite the coupling of a VRE with RI resources using different metadata schemes to
provide cross-RI metadata search and discovery. As a case study, we take the VRE4EIC metadata
service, developed as a building block for an RI-agnostic VRE, and we detail how X3ML mappings
[6] from standards such as ISO 19139 [7] and DCAT [8] to CERIF [9] are used to automatically
ingest metadata published by different RIs to
produce a single resource catalogue. We weigh the benefits of this approach and discuss some
ways in which such catalogues can be further augmented, for example to facilitate semantic
search based on the harmonisation of vocabularies used for describing ecosystem and biodiversity
data.

                              II. BACKGROUND

   Modern environmental research depends on the collection and analysis of large volumes of data
gathered via sensors, observations, simulations and experimentation. Researchers are called upon
to address societal challenges that are inextricably tied to the stability of our native
ecosystems, such as food security and climate management. These challenges are intrinsically
interdisciplinary in nature, requiring collaboration across traditional disciplinary boundaries.
The role of RIs in this context is to support researchers with data, platforms and tools, but no
single RI can hope to encompass the full research ecosystem. The challenge therefore is to help
researchers to freely and effectively interact with the full range of research assets
potentially available to them across many RIs, allowing them to collaborate and conduct their
research more effectively.
   Publishing metadata about resources online (indicating type, coverage, provenance, etc.)
allows RIs to advertise their facilities and researchers to browse and discover data and other
resources useful to their research. While there exist standards such as ISO 19115 [10] and
ISO 19139 [7] for geospatial metadata, the implementation of such standards by RIs can be
somewhat idiosyncratic. Resource catalogues themselves can be described using standards such as
DCAT [8] and harvested via CSW [11] or OAI-PMH [12], but many RIs also use Semantic Web [13]
technologies such as OWL [14] and SKOS [15] to describe their resources, adapting ontologies
such as OBOE [16] (for observations) and vocabularies such as EnvThes [17] (for ecology) to meet
their own community’s needs. Harmonisation of vocabulary and metadata between RIs thus remains a
concern, with cluster projects such as ENVRIplus³ working to promote common models.
Concurrently, initiatives like RDA⁴ address broader research data management issues such as
metadata standards cataloguing, standards for data collections and interoperability between
repositories, providing recommendations to such projects.

  ³ http://www.envriplus.eu/
  ⁴ https://rd-alliance.org/

   From the VRE perspective, it is necessary to be pragmatic when coupling with the services
provided by RIs, a process that can also be assisted by the use of standard models and
vocabularies. Jeffery et al. [18] define a reference architecture for enhanced VREs (‘e-VREs’)
able to work with many different RIs and e-infrastructures. In this architecture, microservices
are used to implement each of six key building blocks split across three tiers of operation, as
shown in Figure 1 for the case of metadata management. Meanwhile Nieva et al. [19] describe a
reference model (ENVRI RM) for environmental science RIs, defining their archetypical elements
in the context of the research data lifecycle. Being based on RM-ODP [20], it models RIs from
five viewpoints: science, information, computation, engineering and technology. Each view has
its own concerns that correspond to those of the other views, and is able to describe various
key RI activities (e.g. Figure 2). Open Information Linking for Environmental RIs (OIL-E) [21]
is a small set of OWL specifications based on ENVRI RM that provide an upper ontology for RI
descriptions and which can be used to contextualise different kinds of RI asset from an
architectural or interaction-based perspective, as opposed to being a general-purpose ontology
for describing scientific phenomena like BFO [22]. A conceptual model with a similar focus on
the products and tools of research rather than on scientific classification itself is CERIF [9],
a European standard for describing research information systems. CERIF provides a framework for
describing relationships between people, projects, tools and research products (and more), and
has been applied to describing solid earth science RIs [23].

[Figure 1: diagram of the application, interoperability and resource access tiers, with the
Metadata Manager, Data Model Mapper, e-VRE Web Service, Message Oriented Middleware, Adapter and
Metadata Service stacked above the Research Infrastructure resources.]
Fig. 1. Providing a metadata service: the recommended microservice stack to implement the
metadata manager in the e-VRE reference architecture.

[Figure 2: UML component diagram linking «Instrument controller», «Raw data collector», «Data
store controller», «Data transfer service», «PID service» and «Catalogue service» via the
interactions request data, deliver raw data, create, prepare storage, import data for curation,
acquire identifier and update catalogues.]
Fig. 2. A computational view of raw data acquisition: ENVRI RM specifies components and
activities using UML (in this case, a component diagram).

   These models provide the means to talk about research support environments such as VREs and
RIs in a standard way, and can also be leveraged as a means to better classify different kinds
of resource as part of a faceted search mechanism, as we shall discuss later in Section IV. For
now, we consider how VREs can be constructed that support rather than are hindered by the
heterogeneity of RI resources and resource metadata, and how a VRE can facilitate cross-RI
search and discovery.

                              III. METHODOLOGY AND CHALLENGES

   According to Jeffery et al. [18], VREs can retrieve descriptions of RIs’ resources either via
separate interfaces with each
RI’s own resource catalogue, or via a joint resource catalogue that already encompasses all of
the RIs’ resources. The former approach relies on the construction of separate discovery and
access interfaces with every RI, and makes it difficult to search over multiple RI resource
catalogues simultaneously, requiring the translation and distribution of queries over every
interface. Meanwhile, the latter approach simplifies search and discovery, but requires initial
harvesting of metadata from all separate RI catalogues, translation of all metadata into a
single common denominator standard, and careful management as the number of original data
sources scales upwards.
   In terms of the e-VRE reference architecture [18], a few steps are needed to harvest resource
metadata from an RI:
   1) A resource catalogue provided by an RI is identified for harvesting. Identification might
       be performed by a discovery service, or be part of the manual configuration of a
       customised VRE metadata catalogue.
   2) The VRE’s interoperability manager must provide an adaptor for the given resource
       catalogue: essentially, the VRE must have the means to interact with the catalogue via
       the correct protocol (e.g. OAI-PMH or SPARQL [24]), but also have a model for (at least
       partially) mapping metadata retrieved from the source scheme to the scheme used
       internally by the VRE.
   3) The adaptor can then be used to harvest metadata records from the source, mapping them
       into a format suitable for ingestion into the VRE’s own metadata catalogue.
   4) This ingested data is then made available to users of the VRE via its own search and query
       interface.
The main entities involved in this process are shown in Figure 3. In this example, the result is
that metadata can now be harvested by the VRE’s metadata manager using the adaptors provided by
the interoperability manager. This activity may be a one-off event, but more likely the metadata
harvested will need to be periodically updated.

[Figure 3: diagram in which a Virtual Research Environment is composed of a Metadata Manager
(holding the VRE Catalogue) and an Interoperability Manager (holding Adaptors A, B and C); each
adaptor accesses one of Catalogues A, B and C, to which Research Infrastructures A, B and C
respectively publish.]
Fig. 3. An e-VRE produces adaptors to harvest and convert metadata from different catalogues,
building a common metadata catalogue for its users.

   Whatever the chosen approach, however, any VRE cataloguing solution should try to address
certain challenges:
   1) How best to discover new resources: a VRE catalogue may be carefully curated for a given
       community, but even if automation is rejected, there should be a clear process for how to
       expand the catalogue.
   2) How to ensure the freshness of catalogue data: ensuring that updates to source catalogues
       are propagated to VRE catalogues in reasonable time.
   3) How to manage the underlying catalogue schema: given new vocabularies, standards or simply
       evolution in how standards are applied, how to update the model underlying a catalogue
       without losing existing data coherence.
   4) How to manage ever larger quantities of data: whether by relying on more capable database
       technologies, distribution of the catalogue, or dynamic construction of the catalogue ‘on
       demand’ based on prior queries.
In light of these challenges, we consider a particular implementation of the resource metadata
harvesting approach described above based on certain key technologies.

                              IV. IMPLEMENTATION

   The VRE4EIC Metadata Portal has been developed in accordance with the e-VRE reference
architecture, providing the necessary components to implement the metadata manager
functionality. The purpose of the portal is to provide faceted search over catalogue data
harvested from multiple RIs, aggregated within a single CERIF-based VRE catalogue. Search relies
on composing queries based on the context of the research data, filtering by organisations,
projects, sites, instruments, people, etc., for example as shown in Figure 4. The portal
supports map-based search, the export and storing of specific queries, and the export of results
in various formats. The CERIF catalogue itself is implemented in RDF (based on an OWL ontology)
as a Blazegraph⁵ triple store and is structured according to CERIF version 1.6⁶.

[Figure 4: screenshot of the metadata portal search interface.]
Fig. 4. The VRE4EIC metadata portal: searching for data publications published by Anna Artese
through CNR Pisa’s mass spectrometry analytical laboratory.

  ⁵ https://www.blazegraph.com/
  ⁶ https://www.eurocris.org/cerif/main-features-cerif

   Metadata harvested from external sources is converted to CERIF RDF using the X3ML mapping
framework [6]. The mapping process is as illustrated in Figure 5:
   1) Sample metadata, along with their corresponding metadata schemes, are retrieved for
       analysis.
   2) Mappings are defined that dictate the transformation of the selected RDF- and XML-based
       schemas to CERIF.
   3) Metadata is retrieved from different data sources in their native format, e.g. as
       ISO 19139 or CKAN⁷ data.
   4) These mappings are used to transform the source data into CERIF format.
   5) The transformed data are ingested into the CERIF metadata catalogue.
Once ingested, these data become available to users of the metadata portal, who can query and
browse data upon authentication by the front-end authentication/authorisation service.

[Figure 5: workflow diagram.]
Fig. 5. e-VRE metadata acquisition and retrieval workflow: metadata records are acquired from
multiple sources, mapped to CERIF RDF and stored in the VRE catalogue; authenticated VRE users
query data via the e-VRE.

  ⁷ https://ckan.org/

   X3ML mappings are defined using the 3M Mapping Memory Manager⁸. Mappings are described by
mapping rules relating subject-property-object triples from the source scheme to equivalent
structures in the target scheme, subject to various syntactic conditions, as illustrated in
Figure 6. 3M supports the specification of generators to produce identifiers for new concepts
constructed during translation of terms, and provides test and analytics facilities. Mappings
into CERIF RDF have been produced for Dublin Core, CKAN, DCAT-AP, and ISO 19139 metadata, as
well as RI architecture descriptions in OIL-E, as part of the technical output of the VRE4EIC
project⁹.

[Figure 6: screenshot of mapping rules in 3M.]
Fig. 6. Example of mapping rules generated in 3M: result metadata in CKAN is mapped to a CERIF
product with data properties corresponding to each possible attribute in the original CKAN XML
scheme.

  ⁸ https://github.com/isl/Mapping-Memory-Manager
  ⁹ Mappings are accessible at http://www.ics.forth.gr/isl/3M-VRE4EIC, username ‘vre4eicGuest’
    and password ‘vre4eic’.

   In summary, the Portal has many desirable characteristics: a flexible model in CERIF for
integrating heterogeneous metadata, a tool-assisted metadata mapping pipeline to easily create
new metadata mappings or refine existing ones, and a mature technology base for unified VRE
catalogues. Where we foresee more development being needed is in the discovery of new resources
and the acquisition of updates. In this respect, RI-side services for advertisement of new
resources or updates, to which a VRE can subscribe to trigger automated ingestion of new or
modified metadata, would be particularly useful.
   The VRE4EIC Metadata Portal has been provided as a demonstrator to the cluster of
environmental science RIs in Europe via the ENVRIplus project as well as directly to the
European Plate Observing System (EPOS)¹⁰, with sample data harvested from a subset of those RIs.
Evaluation of the demonstrator indicates a number of possible avenues of development,
particularly with regard to supporting richer cross-RI search, the two most noteworthy here
being:
   1) Further exploitation of CERIF’s semantic layer.
   2) Integration of semantic search facilities.

  ¹⁰ https://www.epos-ip.org/

   A notable feature of CERIF is how it separates its semantic layer from its primary
entity-relationship model. Most CERIF relations are semantically agnostic, lacking any
particular interpretation beyond identifying a link. Almost every entity and relation can,
however, be assigned a classification that indicates a particular semantic interpretation (e.g.
that the relationship between a Person and a Product is that of a creator), allowing a CERIF
database to be enriched with concepts from an external semantic model (or several linked
models).
   The vocabulary provided by OIL-E¹¹ has been identified within VRE4EIC as a means to further
classify objects in CERIF in terms of their role in an RI, e.g. classifying individuals and
facilities by the roles they play in research activities, datasets in terms of the research data
lifecycle, or computational services by the functions they enable. This provides additional
operational context for faceted search (e.g. identifying which processes generated a given data
product), but providing additional scientific context for data products (e.g. categorising the
experimental method applied or the branch of science to which it belongs) is also necessary.
Environmental science RIs such as AnaEE¹² and LTER-Europe¹³ are actively developing better
vocabularies for describing ecosystem and biodiversity research data, building upon existing
SKOS vocabularies. The AnaEE data vocabulary (anaeeThes) [25] and LTER’s environmental thesaurus
EnvThes [17] have mappings to other established domain vocabularies such as Agrovoc¹⁴ and
GEMET¹⁵. These RIs are now collaborating with other RIs involved in ENVRIplus to harmonise their
vocabularies in order to provide semantic linking between terms used in their respective
sub-domains.

  ¹¹ http://oil-e.net/ontology/
  ¹² https://www.anaee.com/
  ¹³ http://www.lter-europe.net/lter-europe
  ¹⁴ http://aims.fao.org/standards/agrovoc
  ¹⁵ http://www.eionet.europa.eu/gemet/


The identification of synonymous, subsuming and intersecting terms (and the publication of links
on the Semantic Web) provides the basis for better semantic search, whereby a greater range of
data products with similar characteristics can be retrieved on query without necessarily sharing
precisely the same controlled vocabulary for their metadata. Making use of such linked vocabulary
would simplify the task of integrating resource metadata from multiple catalogues as it would
reduce the need to map all metadata values into a single master vocabulary (with the likely
resulting loss of nuance), while still retaining the benefits of cross-RI search and discovery.

                              V. DISCUSSION
   The use of linked data [26] for describing resources (of all kinds) is already well-established,
with research now focusing on different approaches to generating linked data from various sources
and on how to navigate and query distributed information. For example, recent research includes
the generation of a navigable Graph of Things from an array of live IoT data sources [27] and the
use of crowdsourcing to provide real-time transport data in rural areas [28], both topics with
relevance to how RIs gather and expose field observations acquired via sensors or human experts.
On the topic of distributed query, various languages/frameworks have been proposed such as LDQL
[29] and LILAC [30], which may make linked data based search over distributed catalogues more
practical and efficient than is currently the case.
   The Semantic Web is plagued by many of the problems of knowledge representation in AI,
including computability, inconsistency and incompleteness, to which it adds data redundancy,
unreliability and limited performance versus more tightly integrated data models. Considerable
attention has been given to the openness, extensibility and computability of Semantic Web
standards, weighing different options (e.g. the use of SKOS over OWL [31], [32]). Most geospatial
technologies used by environmental science RIs today have been developed independently of the
Semantic Web, however, with recommendations such as INSPIRE16 being mostly disjoint from it,
though technologies such as OGC’s GeoSPARQL17 attempt to address this. This poses a barrier for
integration of geospatial catalogues published via CSW or OAI-PMH into the Semantic Web, and
adaptors are still needed to query such data sources and present responses in RDF format
(e.g. [33]).
   For mapping between a modest set of standards, manual mapping with tool support remains most
practical, but automation may help to accelerate the construction of new mappings. How to best map
between ontologies (or other kinds of schema) remains an open question, but mapping techniques can
be evaluated by comparing performance against ontology sets covering the same domain (e.g.
OntoFarm for conference organisation [34]). Multi-lingual support is also important in
collaboration; for example Bella et al. [35] address how to conduct mapping based on more than
just English syntax.
   It is not only resource metadata that can be usefully accessed via a VRE. Access to provenance
data (which might be structured according to a standard such as PROV-O [36]) for data products and
processes would also be useful to researchers, and VREs can also be contributors of provenance
data via their own workflow systems (e.g. for Kepler [37]). CERIF is able to represent
time-bounded role-based semantic relationships, but the source metadata provided by RIs still
often lacks this kind of information; the adoption of standardised and ubiquitous provenance by
RIs would address this either by enriching the basic metadata for resources, or by providing
additional sources of provenance data that could be integrated with the base metadata when
producing unified catalogues.
   The e-VRE reference architecture also addresses the need for a workflow manager component, for
composing processing tasks in series or parallel on available computational resources. Most
scientific investigations do follow a clear workflow, and there have been a number of workflow
management systems developed with different characteristics and target applications [38], several
of which have been applied to science [39]. The use of ontologies for verification and validation
of workflows has already been explored (e.g. [40]), and the ability to construct and validate such
workflow specifications using metadata from service catalogues demonstrates that the cataloguing
problem is not wholly centred on datasets.

  16 https://inspire.ec.europa.eu/
  17 http://www.opengeospatial.org/standards/geosparql

                              VI. CONCLUSION
   In this paper we linked the development of VREs (also science gateways and virtual
laboratories) to the outgrowth of dedicated RIs in Europe and beyond, and argued the need for new
VREs that can be freely coupled with different RI resources based on the requirements of
researchers and the evolving data research environment. We asserted that metadata mapping is
needed to facilitate cross-RI search and discovery due to the diversity of metadata schemes,
vocabularies and protocols used to access resource catalogue data published by different RIs, and
furthermore that it is useful to be able to aggregate distributed resource metadata into a single
logical catalogue. We outlined a methodology for building such a catalogue based on the e-VRE
reference architecture and the adoption of a robust metadata mapping pipeline for handling
heterogeneous data sources. We provided an example in the VRE4EIC Metadata Portal of how the
methodology is applied, using CERIF as a framework for aggregating resource metadata from
different metadata catalogues provided by EPOS and ENVRIplus. We described the application of X3ML
mappings, constructed using the 3M editor, to translate ISO 19139 XML, CKAN, Dublin Core, DCAT-AP
and OIL-E data into CERIF RDF for ingestion into a CERIF catalogue. We considered how the CERIF
semantic layer can be augmented with vocabulary from OIL-E to further contextualise research
entities, and how recent semantic harmonisation work in environmental science RIs can further
augment the capabilities of VREs as clients for semantic faceted search of RI resources. Finally,
we discussed the role that some of the technologies identified have in other research literature,
examined some related work, and suggested
future avenues of investigation for coupling VREs with other types of service provided by RIs,
e.g. provenance services.

                       ACKNOWLEDGEMENTS
  This work was supported by the European Union’s Horizon 2020 research and innovation programme
under grant agreements 654182 (ENVRIplus project), 676247 (VRE4EIC project) and 643963 (SWITCH
project).

                              REFERENCES
 [1] L. Candela, D. Castelli, and P. Pagano, “Virtual research environments: an overview and a
     research agenda,” Data Science Journal, vol. 12, pp. 75–81, 2013.
 [2] Z. Zhao, P. Martin, C. de Laat, K. Jeffery, A. Jones, I. Taylor, A. Hardisty, M. Atkinson,
     A. Zuiderwijk, Y. Yin, and Y. Chen, “Time critical requirements and technical considerations
     for advanced support environments for data-intensive research,” in 2nd International workshop
     on Interoperable infrastructures for interdisciplinary big data sciences (IT4RIs 16), in the
     context of IEEE Real-time System Symposium (RTSS), Porto, Portugal, 2016.
 [3] E. Deelman, D. Gannon, M. Shields, and I. Taylor, “Workflows and e-Science: An overview of
     workflow system features and capabilities,” Future Generation Computer Systems, vol. 25,
     no. 5, pp. 528–540, 2009.
 [4] P. Martin, Y. Chen, A. Hardisty, K. Jeffery, and Z. Zhao, “Computational challenges in global
     environmental research infrastructures,” in Terrestrial Ecosystem Research Infrastructures:
     Challenges and Opportunities, A. Chabbi and H. W. Loescher, Eds. CRC Press, 2017, ch. 12,
     pp. 305–340.
 [5] European Commission, “Realising the European Open Science Cloud,” 2016.
 [6] Y. Marketakis, N. Minadakis, H. Kondylakis, K. Konsolaki, G. Samaritakis, M. Theodoridou,
     G. Flouris, and M. Doerr, “X3ML mapping framework for information integration in cultural
     heritage and beyond,” International Journal on Digital Libraries, pp. 1–19, 2016.
 [7] ISO 19139:2007, “Geographic information—Metadata—XML schema implementation,” International
     Organization for Standardization, ISO/TS Standard, 2007.
 [8] J. Erickson and F. Maali, “Data catalog vocabulary (DCAT),” W3C, W3C Recommendation, 2014,
     http://www.w3.org/TR/2014/REC-vocab-dcat-20140116/.
 [9] B. Jörg, “CERIF: The Common European Research Information Format model,” Data Science
     Journal, vol. 9, pp. 24–31, 2010.
[10] ISO 19115-1:2014, “Geographic information—Metadata—Part 1: Fundamentals,” International
     Organization for Standardization, ISO Standard, 2014.
[11] D. Nebert, U. Voges, and L. Bigagli, “OGC catalogue services 3.0—general model,” Open
     Geospatial Consortium, OGC Implementation Standard, 2016,
     http://docs.opengeospatial.org/is/12-168r6/12-168r6.html.
[12] C. Lagoze and H. Van de Sompel, “The making of the open archives initiative protocol for
     metadata harvesting,” Library Hi Tech, vol. 21, no. 2, pp. 118–128, 2003.
[13] T. Berners-Lee, J. Hendler, O. Lassila et al., “The semantic web,” Scientific American,
     vol. 284, no. 5, pp. 28–37, 2001.
[14] W3C OWL Working Group, “OWL 2 web ontology language,” W3C, W3C Recommendation, 2012,
     https://www.w3.org/TR/2012/REC-owl2-overview-20121211/.
[15] S. Bechhofer and A. Miles, “SKOS simple knowledge organization system reference,” W3C, W3C
     Recommendation, 2009, http://www.w3.org/TR/2009/REC-skos-reference-20090818/.
[16] J. Madin, S. Bowers, M. Schildhauer, S. Krivov, D. Pennington, and F. Villa, “An ontology for
     describing and synthesizing ecological observation data,” Ecological Informatics, vol. 2,
     no. 3, pp. 279–296, 2007.
[17] H. Schentz, J. Peterseil, and N. Bertrand, “EnvThes - interlinked thesaurus for long term
     ecological research, monitoring, and experiments,” in EnviroInfo, 2013, pp. 824–832.
[18] K. G. Jeffery, C. Meghini, C. Concordia, T. Patkos, V. Brasse, J. v. Ossenbruck,
     Y. Marketakis, N. Minadakis, and E. Marchetti, “A reference architecture for virtual research
     environments,” in Proceedings of the 15th International Symposium of Information Science
     (ISI 2017). Verlag Werner Hulsbusch, 2017, pp. 76–88.
[19] A. Nieva de la Hidalga, B. Magagna, M. Stocker, A. Hardisty, P. Martin, Z. Zhao, M. Atkinson,
     and K. Jeffery, “The ENVRI Reference Model (ENVRI RM) version 2.2, 30th October 2017,”
     Nov. 2017. [Online]. Available: https://doi.org/10.5281/zenodo.1050349
[20] ISO 10746-1, “Information technology—Open Distributed Processing—Reference model: Overview,”
     International Organization for Standardization, ISO/IEC Standard, 1998.
[21] P. Martin, P. Grosso, B. Magagna, H. Schentz, Y. Chen, A. Hardisty, W. Los, K. Jeffery, C. de
     Laat, and Z. Zhao, “Open information linking for environmental research infrastructures,” in
     2015 IEEE 11th International Conference on e-Science (e-Science). IEEE, 2015, pp. 513–520.
[22] R. Arp, B. Smith, and A. D. Spear, Building ontologies with Basic Formal Ontology. The MIT
     Press, 2015.
[23] D. Bailo, D. Ulbricht, M. L. Nayembil, L. Trani, A. Spinuso, and K. G. Jeffery, “Mapping solid
     earth data and research infrastructures to CERIF,” Procedia Computer Science, vol. 106,
     pp. 112–121, 2017.
[24] W3C SPARQL Working Group, “SPARQL 1.1 overview,” W3C, W3C Recommendation, 2013,
     http://www.w3.org/TR/2013/REC-sparql11-overview-20130321/.
[25] Anaee-France semantic group, “AnaEE Thesaurus,” 2016. [Online]. Available:
     http://dx.doi.org/10.15454/1.4894016754286177E12
[26] T. Berners-Lee, “Linked data,” W3C Design Issues, 2006, accessed 26th February 2018. [Online].
     Available: https://www.w3.org/DesignIssues/LinkedData.html
[27] D. Le-Phuoc, H. N. M. Quoc, H. N. Quoc, T. T. Nhat, and M. Hauswirth, “The graph of things: A
     step towards the live knowledge graph of connected things,” Web Semantics: Science, Services
     and Agents on the World Wide Web, vol. 37, pp. 25–35, 2016.
[28] D. Corsar, P. Edwards, J. Nelson, C. Baillie, K. Papangelis, and N. Velaga, “Linking open data
     and the crowd for real-time passenger information,” Web Semantics: Science, Services and
     Agents on the World Wide Web, vol. 43, pp. 18–24, 2017.
[29] O. Hartig and J. Pérez, “LDQL: A query language for the web of linked data,” Web Semantics:
     Science, Services and Agents on the World Wide Web, vol. 41, pp. 9–29, 2016.
[30] G. Montoya, H. Skaf-Molli, P. Molli, and M.-E. Vidal, “Decomposing federated queries in
     presence of replicated fragments,” Web Semantics: Science, Services and Agents on the World
     Wide Web, vol. 42, pp. 1–18, 2017.
[31] A. Stellato, “Dictionary, thesaurus or ontology? Disentangling our choices in the semantic web
     jungle,” Journal of Integrative Agriculture, vol. 11, no. 5, pp. 710–719, 2012.
[32] T. Baker, S. Bechhofer, A. Isaac, A. Miles, G. Schreiber, and E. Summers, “Key choices in the
     design of simple knowledge organization system (SKOS),” Web Semantics: Science, Services and
     Agents on the World Wide Web, vol. 20, pp. 35–49, 2013.
[33] K. Patroumpas, N. Georgomanolis, T. Stratiotis, M. Alexakis, and S. Athanasiou, “Exposing
     INSPIRE on the semantic web,” Web Semantics: Science, Services and Agents on the World Wide
     Web, vol. 35, pp. 53–62, 2015.
[34] O. Zamazal and V. Svátek, “The ten-year OntoFarm and its fertilization within the
     onto-sphere,” Web Semantics: Science, Services and Agents on the World Wide Web, vol. 43,
     pp. 46–53, 2017.
[35] G. Bella, F. Giunchiglia, and F. McNeill, “Language and domain aware lightweight ontology
     matching,” Web Semantics: Science, Services and Agents on the World Wide Web, vol. 43,
     pp. 1–17, 2017.
[36] D. McGuinness, S. Sahoo, and T. Lebo, “PROV-O: The PROV ontology,” W3C, W3C Recommendation,
     2013, http://www.w3.org/TR/2013/REC-prov-o-20130430/.
[37] I. Altintas, O. Barney, and E. Jaeger-Frank, “Provenance collection support in the Kepler
     scientific workflow system,” Provenance and Annotation of Data, pp. 118–132, 2006.
[38] C. S. Liew, M. P. Atkinson, M. Galea, T. F. Ang, P. Martin, and J. I. V. Hemert, “Scientific
     workflows: Moving across paradigms,” ACM Comput. Surv., vol. 49, no. 4, pp. 66:1–66:39,
     Dec. 2016. [Online]. Available: http://doi.acm.org/10.1145/3012429
[39] R. Mork, P. Martin, and Z. Zhao, “Contemporary challenges for data-intensive scientific
     workflow management systems,” in Proceedings of the 10th Workshop on Workflows in Support of
     Large-Scale Science. ACM, 2015, p. 4.
[40] T. Miksa and A. Rauber, “Using ontologies for verification and validation of workflow-based
     experiments,” Web Semantics: Science, Services and Agents on the World Wide Web, vol. 43,
     pp. 25–45, 2017.

