=Paper= {{Paper |id=Vol-1752/paper26 |storemode=property |title= Sharing Research Facilities Data in Common Data Infrastructures |pdfUrl=https://ceur-ws.org/Vol-1752/paper26.pdf |volume=Vol-1752 |authors=Vasily Bunakov,Alistair Mills,Piotr Oramus |dblpUrl=https://dblp.org/rec/conf/rcdl/BunakovMO16 }} == Sharing Research Facilities Data in Common Data Infrastructures == https://ceur-ws.org/Vol-1752/paper26.pdf
                              Sharing research facilities data
                             in common data infrastructures
                      © Vasily Bunakov                                                © Alistair Mills
                           Science and Technology Facilities Council,
                                     Harwell, United Kingdom
         vasily.bunakov@stfc.ac.uk ,                        alistair.mills@btinternet.com
                                         © Piotr Oramus
                                 AGH University of Science and Technology,
                                            Kraków, Poland
                                      oramus@student.agh.edu.pl
Proceedings of the XVIII International Conference «Data Analytics and
Management in Data Intensive Domains» (DAMDID/RCDL’2016), Ershovo, Russia,
October 11 - 14, 2016

Abstract

The work describes the collaboration between a large experimental research
facility and emerging national and cross-national data infrastructures, with
the purpose of sharing experimental data and making it findable in common
multi-disciplinary data catalogues.

1 Introduction

    Many of the major centres of scientific research provide both the
instruments for the research and the infrastructure for storing and processing
data. This is typical for large research facilities such as synchrotrons,
neutron sources and powerful lasers, which grant time slots to visiting
scientists for their specific investigations and provide infrastructure for
data collection and preservation. Generally, scientists work on the science
and facility IT engineers work with the data; this leads to a requirement that
these two groups collaborate. Another requirement for collaboration comes from
the emerging e-infrastructures that transcend institutional and national
borders and research disciplines.
    Although research facilities make the data available, they do not provide
a large range of access methods. The purpose of our work was to provide an
industry-standard protocol for accessing the data so that a large number of
researchers can find the records about datasets produced by research
facilities and access them easily.
    New routes to existing data and metadata are important because, in the
last decade, the number of data sources in Europe has increased enormously. It
is no longer viable for most researchers to track all of the data which are
relevant to their investigations, so data discovery services provided by a
cross-discipline infrastructure are essential. Our work is an example of a
productive collaboration between a discipline-specific data centre – the ISIS
neutron and muon facility [3], which is part of a wider landscape of similar
neutron and photon facilities in Europe [9] – and the EUDAT e-infrastructure
[1], using popular metadata standards and protocols.

2 Use case description

    EUDAT has developed several services, namely:
    • B2SHARE – a data publishing service;
    • B2SAFE – a secure and reliable replication service;
    • B2FIND – a data discovery service (data catalogue);
    • B2STAGE – a data delivery service for the rapid delivery of large
      volumes of data towards high-performance computing;
    • B2ACCESS – a user authentication service used by some of the above
      services.
    EUDAT services are deployed centrally by organizations participating in
the project, with free registration and access for researchers, or the
services can be deployed by interested parties in their own environment, as
all the software in support of these services is open source. We have focused
on using the centrally deployed instance of EUDAT B2FIND [8], which consumes
records delivered by data providers using OAI-PMH [2], maps them to its own
metadata schema, and publishes them in a common data catalogue. The OAI-PMH
specification is straightforward and allows the use of different metadata
schemas; however, within a single metadata schema, quite different
interpretations of metadata elements are possible, so EUDAT always negotiates
the meanings of metadata elements with the data provider.
    The data provider in our case is the ISIS neutron and muon source [3],
which collects data during scientific investigations and catalogues the data
using the ICAT software platform [4]. ISIS has a data management policy [7]
that provides public access to most of its publicly funded data at the end of
an embargo period of three years. The ISIS policy requires that users of the
data register with ISIS, and ISIS records their activity. Registration is
free, but the management of ISIS wants to be aware of the use of its data when
assessing the impact of the facility.
    The work of providing ISIS data in EUDAT involved the following steps:
    • evaluation of the available technology;




    • building the metadata harvester;
    • mapping the domain-specific metadata to a more popular schema;
    • mapping the data provided by the service end point to the requirements
      of B2FIND;
    • provision of a service end point for publishing metadata;
    • liaison with EUDAT B2FIND for testing the end point and harvesting the
      data records.
    There were two main challenges to address during implementation. The
first challenge was the mapping of the metadata: from ISIS to OAI, then from
OAI to B2FIND. The second challenge was to avoid compromising the data policy
set by ISIS.
    The first challenge was technical and required careful programming as
well as discussions with specialists knowledgeable about the metadata models
of both the data provider and the data consumer.
    The second challenge required access to the data records so that the
harvester could collect them. In order to get this access, ISIS provides
suitable credentials, and it was decided to restrict harvesting to the data
records with persistent identifiers in DataCite [10], as this implies that the
records are not withheld by ISIS under its data embargo policy.

3 Technology stack and metadata mapping

    We chose the Qualified Dublin Core (QDC) metadata schema [6] to represent
the data from ISIS. This schema is well known, has a large user base and is
one of the schemas recognized by the EUDAT B2FIND metadata mapping interface.
The data from ISIS is well structured but it is in a schema that is not
supported by EUDAT B2FIND. The main purpose of B2FIND is data discovery rather
than the harmonization of metadata schemas. Table 1 presents the mapping from
the ICAT metadata schema to QDC and to the EUDAT B2FIND schema. This mapping
is essential for the semantics of the ISIS data records once they are
harvested by EUDAT.
    We then developed software that harvests the data records from the ISIS
data catalogue, maps them to the QDC schema and passes them to the OAI-PMH
server, which implements a popular standard for automatic data harvesting [2]
required by the EUDAT B2FIND ingest mechanism. We considered several
implementations of OAI-PMH, and chose a Java implementation called jOAI [5] as
it is mature, well documented and widely used. The data records acquisition
component is a Python wrapper around the ISIS ICAT API.
    The resultant technology stack is presented in Figure 1. The bottom layer
is a domain-specific data catalogue supported by the research facility (ISIS);
the top layer is a multidisciplinary data catalogue supported by a common data
infrastructure (EUDAT); the middle layers are components that enable a
transformation from a domain-specific implementation to a common data
discovery service.
    We have stored the software which was developed in this project in a
public repository, so that others can examine it for the details [12]. The
software is modest in size, and can be easily deployed on a small computer.
The computer has to execute a script once per hour to find new data, and it
has to run a jOAI server continuously.

Table 1 Mapping from ICAT metadata to Dublin Core and EUDAT B2FIND

ICAT field                          QDC term                      B2FIND field
----------------------------------  ----------------------------  --------------------------
Investigation->doi                  dc:identifier                 -
Investigation->title                dc:title                      title
Investigation->summary              dc:description                notes
Instrument->fullName,               dc:relation                   tags
  Investigation->name,
  InvestigationParameter->name
  (multiple)
"dx.doi.org/" + Investigation->doi  dcterms:references            URL
User->fullName                      dc:creator                    author
-                                   -                             spatial
Name of the organization            dc:contributor                maintainer
  (as a literal)
Description of a facility           dc:subject                    discipline
  (as a literal)
-                                   -                             PublicationYear
Investigation->releaseDate          dcterms:issued                PublicationTimestamp
en                                  dc:language                   Language
Facility->name,                     dc:publisher                  Origin
  Facility->fullName,
  Facility->url
DatafileFormat->name,               dc:format                     Format
  DatafileFormat->type,
  DatafileFormat->version,
  DatafileFormat->description
Facility title (as a literal)       dc:relation                   GeographicDescription
Web link (URL) to ISIS Data         dc:rights                     Rights
  Management Policy
-                                   dc:relation                   Project
Country code (as a literal)         dc:relation                   Country
                                      xsi:type="dcterms:ISO3166"
-                                   -                             GeographicCoverage
Investigation->startDate            dcterms:temporal              TemporalCoverage:BeginDate
Investigation->endDate                                            TemporalCoverage:EndDate
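
    The mapping in Table 1 can be illustrated with a short sketch. It is not
the code from the project repository [12]: the record layout (a plain
dictionary keyed by ICAT field paths), the function name, the container
element name and the sample DOI are assumptions made for illustration, and
only a few unambiguous rows of the table are covered.

```python
from xml.etree import ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
DCTERMS_NS = "http://purl.org/dc/terms/"

# Subset of Table 1: ICAT field path -> (namespace, QDC element name).
ICAT_TO_QDC = [
    ("Investigation->doi", DC_NS, "identifier"),
    ("Investigation->title", DC_NS, "title"),
    ("Investigation->summary", DC_NS, "description"),
    ("User->fullName", DC_NS, "creator"),
    ("Investigation->releaseDate", DCTERMS_NS, "issued"),
    ("Facility->name", DC_NS, "publisher"),
]


def icat_to_qdc(record):
    """Serialize a flat ICAT-like record (hypothetical layout) as QDC XML."""
    ET.register_namespace("dc", DC_NS)
    ET.register_namespace("dcterms", DCTERMS_NS)
    root = ET.Element("qualifieddc")  # container name is an assumption
    for icat_field, namespace, name in ICAT_TO_QDC:
        value = record.get(icat_field)
        if value:  # fields that are absent or empty are simply skipped
            ET.SubElement(root, "{%s}%s" % (namespace, name)).text = str(value)
    # Table 1: dcterms:references is "dx.doi.org/" + Investigation->doi.
    doi = record.get("Investigation->doi")
    if doi:
        ET.SubElement(root, "{%s}references" % DCTERMS_NS).text = "dx.doi.org/" + doi
    # Table 1: the language is the literal "en".
    ET.SubElement(root, "{%s}language" % DC_NS).text = "en"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = {
        "Investigation->doi": "10.0000/EXAMPLE.1",  # placeholder, not a real DOI
        "Investigation->title": "Sample investigation",
        "User->fullName": "A. Researcher",
    }
    print(icat_to_qdc(sample))
```

Keeping the mapping in a table-like list rather than in branching code means
that a change negotiated with the data consumer only touches one line.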




                                                                    experimental data regulated by a facility data
                                                                    management policy – which in the case of ISIS is a
                                                                    liberal policy which encourages research data reuse [7].
                                                                        Apart from its usage in EUDAT B2FIND, the OAI-
                                                                    PMH endpoint for ISIS ICAT and the appropriate
                                                                    metadata mapping are being tested for the new Research
                                                                    Data Discovery Service (RDDS) which is a national UK
                                                                    initiative similar to EUDAT B2FIND but with a different
                                                                    scope of research data records collected [11]. RDDS is
                                                                    going to become another public channel for the
                                                                    dissemination of experimental data collected by the ISIS
Figure 1 Technology stack for the facility-specific data
                                                                    facility, along with EUDAT, DataCite and research
discovery service
                                                                    papers that cite data DOIs. Figure 2 represents the flow
    For the published information to be visible, it is              of data records and data persistent identifiers between
necessary to register the jOAI server with a discovery              different services of a common data discovery
service, such as B2FIND. The operation of the discovery             ecosystem.
service is the responsibility of a third party such as
EUDAT.
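
    From the harvester's side, registration amounts to being given a base URL
to which OAI-PMH requests can be sent. A minimal sketch of building such
requests and reading a response follows. The base URL is a placeholder, the
metadata prefix `oai_dc` is the one every OAI-PMH repository must support (the
prefix under which the ISIS endpoint exposes QDC is not stated here), the
function names are hypothetical, and the sample response is hand-written for
illustration rather than captured from a real server.

```python
from urllib.parse import urlencode
from xml.etree import ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def oai_request_url(base_url, verb, arguments=None):
    """Build an OAI-PMH request URL, e.g. ...?verb=ListRecords&metadataPrefix=oai_dc."""
    params = {"verb": verb}
    params.update(arguments or {})
    return base_url + "?" + urlencode(params)


def record_identifiers(response_xml):
    """Return the OAI identifiers found in the record headers of a response."""
    root = ET.fromstring(response_xml)
    return [header.findtext(OAI_NS + "identifier")
            for header in root.iter(OAI_NS + "header")]


# Hand-written ListRecords response, trimmed to the record headers.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2016-10-11T00:00:00Z</responseDate>
  <request verb="ListRecords" metadataPrefix="oai_dc">http://example.org/oai/provider</request>
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:1</identifier>
        <datestamp>2016-09-01</datestamp>
      </header>
    </record>
    <record>
      <header>
        <identifier>oai:example.org:2</identifier>
        <datestamp>2016-09-02</datestamp>
      </header>
    </record>
  </ListRecords>
</OAI-PMH>
"""

if __name__ == "__main__":
    base = "http://example.org/oai/provider"  # placeholder endpoint
    print(oai_request_url(base, "Identify"))
    print(oai_request_url(base, "ListRecords",
                          {"metadataPrefix": "oai_dc", "from": "2016-09-01"}))
    print(record_identifiers(SAMPLE_RESPONSE))
```

The `from` argument lets a harvester ask only for records changed since its
previous visit, which keeps repeated harvesting cheap for both sides.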
    The essential flow of work of the software is the
following:
     Once per hour, the software connects to the
         ICAT and requests details of any new records to
         publish. A suitable record has a Digital Object
         Identifier and a Release Date since the last time
         the software was run;
     For each record identified, the software
         serializes the record as a QDC object and passes
         it to the jOAI publisher;
     Once per hour, the jOAI publisher checks for
         new objects and publishes them.
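
    The selection rule in the first step can be sketched as follows. This is
an illustration rather than the project's code [12]: the record layout, the
field names and the function name are hypothetical, the ICAT query and the
hand-over to jOAI are omitted, and the upper bound on the release date (records
whose release date is still in the future are skipped) is an assumption.

```python
from datetime import datetime, timedelta


def select_new_records(records, last_run, now):
    """Keep records that have a DOI and were released since the previous run."""
    selected = []
    for record in records:
        doi = record.get("doi")
        released = record.get("releaseDate")
        # A suitable record has a DOI and a release date after the last run;
        # the "<= now" check (an assumption) skips records not yet released.
        if doi and released is not None and last_run < released <= now:
            selected.append(record)
    return selected


if __name__ == "__main__":
    now = datetime(2016, 10, 11, 12, 0)
    last_run = now - timedelta(hours=1)  # the script runs once per hour
    catalogue = [
        {"doi": "10.0000/A", "releaseDate": datetime(2016, 10, 11, 11, 30)},
        {"doi": None, "releaseDate": datetime(2016, 10, 11, 11, 30)},
        {"doi": "10.0000/B", "releaseDate": datetime(2016, 10, 11, 10, 0)},
        {"doi": "10.0000/C", "releaseDate": datetime(2019, 1, 1)},
    ]
    for record in select_new_records(catalogue, last_run, now):
        print(record["doi"])
```
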
    In this way, new records created by the data owner
are generally available within two hours, with no manual            Figure 2 Data records and data DOIs flow
processing. No changes, other than configuration, are
                                                                        After a period of testing with a few harvesting e-
required to the ICAT server, the jOAI server or the
                                                                    infrastructures, the OAI-PMH stack has the potential to
discovery service. For the owner of the data, the
                                                                    become part of the ICAT software distribution [4] that is
additional processing required to provide this service is
                                                                    used by other neutron and photon facilities in Europe.
negligible. For the owner of the discovery service, the
                                                                    This should make it easier for other facilities to supply
additional processing is negligible.
                                                                    their data records to data discovery portals. It was not
                                                                    possible during the course of the project described in this
4 Data discovery use case                                           paper to assess the impact of this work on the various
    The services that we have developed in the course of this       stakeholders. However, the existence of projects such as
work support the following data discovery use case. In              EUDAT and RDDS and their active collaboration with
order to find data, the researcher uses a Google-style free         this project supports our belief in the need for such
string search in the B2FIND data catalogue [8], and                 projects. As we continue to work in this area, we will
locates candidate datasets of interest. This is similar to          learn more about the needs of the stakeholders, and
using any search engine, except that B2FIND is likely to            change our implementation to support those needs.
be more relevant as it has a harvesting policy which
ensures that it searches a known set of sources; many of            Conclusion
the sources known to B2FIND are of little general
interest, and are not harvested by general purpose search              We considered the effort to implement the OAI-PMH
engines.                                                            endpoint and supply data records in e-infrastructures
    Having received search results, the user selects one            worthwhile for the following reasons:
of the candidates located by B2FIND. B2FIND presents                    large research facilities such as ISIS have an
more information about the chosen candidate. In the case                   interest in sharing data; it may be a legal or
of an ISIS record, this information includes the DOI                       policy requirement that they publish this data,
assigned to the dataset by the DataCite service [10]. The                  especially data that is collected in a publicly
DOI link references a web landing page supplied by the                     funded investigation; many investigators
ISIS facility; the landing page contains an actionable link                consider that the provision of data enhances the
that allows the user to get the data collected during the                  value of their research and consider that data
experiment, with the user access to the actual                             citation is as valuable as publication citation,




      hence more routes to citable data are beneficial for researchers;
    • sharing data in multi-disciplinary catalogues like B2FIND and RDDS
      attracts new collaborators, facilitates data reuse within a discipline,
      and encourages cross-discipline research;
    • we are working within a community of European facilities which are
      adopting common standards for software and infrastructure [9]; the
      software developed in the course of this work and shared on GitHub [12]
      provides added value in the technology stack already adopted by similar
      research centres, which makes our solution organizationally scalable;
    • other e-infrastructures can use the ISIS ICAT OAI-PMH endpoint, which is
      now running as a beta service [13], to harvest data records for ISIS
      investigations with actionable links to publicly available data;
      metadata cross-walks need to be defined between the OAI-PMH metadata and
      the e-infrastructure metadata; this is similar to EUDAT, and aims to
      avoid semantic misinterpretation of metadata elements.
    This work provides foundational IT components and, from an organizational
point of view, may serve as a model for sharing data collected by large
research facilities in common cross-disciplinary data infrastructures. The
work is a contribution to the emerging European research data ecosystem
comprising traditional research centres, common national and transnational
e-infrastructures, research teams located in smaller labs in universities and
industry, as well as individual researchers willing to share data. The work
aims to increase the efficacy and efficiency of using the public funds
allocated for research and development, by providing new routes for data
publishing and data reuse.

Acknowledgements

    This work is supported in part by the Horizon 2020 EUDAT and the UK JISC
RDDS projects, although the views expressed are the views of the authors and
not necessarily of the projects.

References

[1] EUDAT: the collaborative Pan-European data infrastructure.
    http://www.eudat.eu
[2] Open Archives Initiative Protocol for Metadata Harvesting.
    https://www.openarchives.org/pmh
[3] ISIS neutron and muon research facility. http://www.isis.stfc.ac.uk
[4] ICAT project. http://icatproject.org
[5] jOAI. http://www.dlese.org/oai
[6] DCMI Metadata Terms. http://dublincore.org/documents/dcmi-terms
[7] ISIS data policy.
    http://www.isis.stfc.ac.uk/user-office/data-policy11204.html
[8] EUDAT B2FIND service. http://b2find.eudat.eu
[9] PaNdata initiative. http://pan-data.eu
[10] DataCite service. http://www.datacite.org
[11] UK Research Data Discovery Service.
    https://www.jisc.ac.uk/rd/projects/uk-research-data-discovery
[12] PMH component in ICAT GitHub repository.
    https://github.com/icatproject-contrib/pmh
[13] ISIS ICAT OAI-PMH endpoint (beta-service).
    http://oai.eudat.stfc.ac.uk/oai/provider?verb=Identify



