=Paper=
{{Paper
|id=Vol-1752/paper26
|storemode=property
|title=
Sharing Research Facilities Data in Common Data Infrastructures
|pdfUrl=https://ceur-ws.org/Vol-1752/paper26.pdf
|volume=Vol-1752
|authors=Vasily Bunakov,Alistair Mills,Piotr Oramus
|dblpUrl=https://dblp.org/rec/conf/rcdl/BunakovMO16
}}
==
Sharing Research Facilities Data in Common Data Infrastructures
==
Sharing research facilities data
in common data infrastructures
© Vasily Bunakov © Alistair Mills
Science and Technology Facilities Council,
Harwell, United Kingdom
vasily.bunakov@stfc.ac.uk, alistair.mills@btinternet.com
© Piotr Oramus
AGH University of Science and Technology,
Kraków, Poland
oramus@student.agh.edu.pl
Proceedings of the XVIII International Conference «Data Analytics and Management in Data Intensive Domains» (DAMDID/RCDL’2016), Ershovo, Russia, October 11 - 14, 2016

Abstract

The work describes the collaboration between a large experimental research facility and emerging national and cross-national data infrastructures, with the purpose of sharing experimental data and making it findable in common multi-disciplinary data catalogues.

1 Introduction

Many of the major centres of scientific research provide both the instruments for the research and the infrastructure for storing and processing data. This is typical for large research facilities such as synchrotrons, neutron sources and powerful lasers, which grant timeslots to visiting scientists for their specific investigations and provide infrastructure for data collection and preservation. Generally, scientists work on the science and facility IT engineers work with the data; this leads to a requirement that these two groups collaborate. Another requirement for collaboration comes from the emerging e-infrastructures that transcend institutional and national borders and research disciplines.

Although research facilities make the data available, they do not provide a large range of access methods. The purpose of our work was to provide an industry standard protocol for accessing the data so that a large number of researchers can find the records about datasets produced by research facilities and access them easily.

New routes to existing data and metadata are important, as in the last decade the number of data sources in Europe has increased enormously. It is no longer viable for most researchers to track all of the data which are relevant to their investigations, so data discovery services provided by a cross-discipline infrastructure are essential. Our work is an example of a productive collaboration between a discipline-specific data centre – the ISIS neutron and muon facility [3], which is part of a wider landscape of similar neutron and photon facilities in Europe [9] – and the EUDAT e-infrastructure [1], using popular metadata standards and protocols.

2 Use case description

EUDAT has developed several services, namely:
- B2SHARE – a data publishing service;
- B2SAFE – a secure and reliable replication service;
- B2FIND – a data discovery service (data catalogue);
- B2STAGE – a data delivery service for the rapid delivery of large volumes of data towards high-performance computing;
- B2ACCESS – a user authentication service used by some of the above services.

EUDAT services are deployed centrally by the organizations participating in the project, with free registration and access for researchers; alternatively, interested parties can deploy the services in their own environment, as all the software supporting these services is open source. We have focused on using the centrally deployed instance of EUDAT B2FIND [8], which consumes records delivered by data providers using OAI-PMH [2], maps them to its own metadata schema, and publishes them in a common data catalogue. The OAI-PMH specification is straightforward and allows the use of different metadata schemas; however, within a single metadata schema, quite different interpretations of metadata elements are possible; EUDAT always negotiates the meanings of metadata elements with the data provider.

The data provider in our case is the ISIS neutron and muon source [3], which collects data during scientific investigations and catalogues the data using the ICAT software platform [4]. ISIS has a data management policy [7] that provides public access to most of its publicly funded data at the end of an embargo period of three years. The ISIS policy requires that users of the data register with ISIS, and ISIS records their activity. Registration is free, but the management of ISIS wants to be aware of the use of its data when assessing the impact of the facility.

The work of providing ISIS data in EUDAT involved the following steps:
- evaluation of the available technology;
- building the metadata harvester;
- mapping the domain-specific metadata to a more popular schema;
- mapping the data provided by the service end point to the requirements of B2FIND;
- provision of a service end point for publishing metadata;
- liaison with EUDAT B2FIND for testing the end point and harvesting the data records.

There were two main challenges to address during implementation. The first challenge was the mapping of the metadata: from ISIS to OAI, then from OAI to B2FIND. The second challenge was to avoid compromising the data policy set by ISIS.

The first challenge was technical and required careful programming as well as discussions with specialists knowledgeable about the metadata models of both the data provider and the data consumer.

The second challenge required access to the data records so that the harvester could collect them. In order to get this access, ISIS provides suitable credentials, and it was decided to restrict harvesting to the data records with persistent identifiers in DataCite [10], as this implies that the records are not withheld by ISIS under its data embargo policy.

3 Technology stack and metadata mapping

We chose the Qualified Dublin Core (QDC) metadata schema [6] to represent the data from ISIS. This schema is well known, has a large user base and is one of the schemas recognized by the EUDAT B2FIND metadata mapping interface. The data from ISIS is well structured but it is in a schema that is not supported by EUDAT B2FIND. The main purpose of B2FIND is data discovery rather than the harmonization of metadata schemas. Table 1 presents the mapping from the ICAT metadata schema to QDC and to the EUDAT B2FIND schema. This mapping is essential for the semantics of the ISIS data records once they are harvested by EUDAT.

We then developed software that harvests the data records from the ISIS data catalogue, maps them to the QDC schema and passes them to the OAI-PMH server, which implements a popular standard for automatic data harvesting [2] required by the EUDAT B2FIND ingest mechanism. We considered several implementations of OAI-PMH, and chose a Java implementation called jOAI [5] as it is mature, well documented and widely used. The data records acquisition component is a Python wrapper to the ISIS ICAT API.

The resultant technology stack is presented in Figure 1. The bottom layer is a domain-specific data catalogue supported by the research facility (ISIS); the top layer is a multidisciplinary data catalogue supported by a common data infrastructure (EUDAT); the middle layers are components that enable a transformation from a domain-specific implementation to a common data discovery service.

We have stored the software which was developed in this project in a public repository, so that others can examine it for the details [12]. The software is modest in size, and can be easily deployed on a small computer. The computer has to execute a script once per hour to find new data, and it has to run a jOAI server continuously.

Table 1 Mapping from ICAT metadata to Dublin Core and EUDAT B2FIND

| ICAT field | QDC term | B2FIND field |
|---|---|---|
| Investigation->doi | dc:identifier | - |
| Investigation->title | dc:title | title |
| Investigation->summary | dc:description | notes |
| Instrument->fullName; Investigation->name; InvestigationParameter->name (multiple) | dc:relation | tags |
| "dx.doi.org/" + Investigation->doi | dcterms:references | URL |
| User->fullName | dc:creator | author |
| - | - | spatial |
| Name of the organization (as a literal) | dc:contributor | maintainer |
| Description of a facility (as a literal) | dc:subject | discipline |
| - | - | PublicationYear |
| Investigation->releaseDate | dcterms:issued | PublicationTimestamp |
| en | dc:language | Language |
| Facility->name; Facility->fullName; Facility->url | dc:publisher | Origin |
| DatafileFormat->name; DatafileFormat->type; DatafileFormat->version; DatafileFormat->description | dc:format | Format |
| Facility title (as a literal) | dc:relation | GeographicDescription |
| Web link (URL) to ISIS Data Management Policy | dc:rights | Rights |
| - | dc:relation | Project |
| Country code (as a literal) | dc:relation xsi:type="dcterms:ISO3166" | Country |
| - | - | GeographicCoverage |
| Investigation->startDate | dcterms:temporal | TemporalCoverage:BeginDate |
| Investigation->endDate | | TemporalCoverage:EndDate |
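
As an illustration of how part of the mapping in Table 1 could be applied in code, the following is a minimal sketch and not the code held in the repository [12]. It assumes that an investigation has already been fetched from ICAT and is available as a plain Python dictionary; the dictionary keys (`doi`, `title`, `summary`, `releaseDate`, `users`), the function name and the `qualifieddc` root element are hypothetical choices made for this example. Only the Dublin Core namespaces and the term names come from the QDC schema [6].

```python
import xml.etree.ElementTree as ET

# Standard Dublin Core namespaces (DCMI Metadata Terms).
DC = "http://purl.org/dc/elements/1.1/"
DCTERMS = "http://purl.org/dc/terms/"
ET.register_namespace("dc", DC)
ET.register_namespace("dcterms", DCTERMS)


def investigation_to_qdc(inv):
    """Serialize one investigation (hypothetical dict layout) as QDC XML."""
    # The root element name is an assumption made for this sketch.
    root = ET.Element("qualifieddc")

    def add(namespace, term, value):
        # Fields that are missing or empty in the source record are skipped.
        if value:
            ET.SubElement(root, f"{{{namespace}}}{term}").text = str(value)

    add(DC, "identifier", inv.get("doi"))            # Investigation->doi
    add(DC, "title", inv.get("title"))               # Investigation->title
    add(DC, "description", inv.get("summary"))       # Investigation->summary
    if inv.get("doi"):
        # "dx.doi.org/" + Investigation->doi, as in Table 1
        add(DCTERMS, "references", "dx.doi.org/" + inv["doi"])
    for full_name in inv.get("users", []):           # User->fullName
        add(DC, "creator", full_name)
    add(DCTERMS, "issued", inv.get("releaseDate"))   # Investigation->releaseDate
    add(DC, "language", "en")                        # fixed literal in Table 1
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Made-up record used only to exercise the function.
    example = {
        "doi": "10.0000/EXAMPLE.1",
        "title": "Example investigation",
        "summary": "Illustrative summary text",
        "releaseDate": "2016-01-01",
        "users": ["A. Researcher", "B. Researcher"],
    }
    print(investigation_to_qdc(example))
```

Keeping the mapping in one function makes it easier to review each row of Table 1 against the code when the meaning of a metadata element is renegotiated with the harvesting service.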
156
Figure 1 Technology stack for the facility-specific data discovery service

For the published information to be visible, it is necessary to register the jOAI server with a discovery service, such as B2FIND. The operation of the discovery service is the responsibility of a third party such as EUDAT.

The essential flow of work of the software is the following:
- Once per hour, the software connects to the ICAT and requests details of any new records to publish. A suitable record has a Digital Object Identifier and a Release Date since the last time the software was run;
- For each record identified, the software serializes the record as a QDC object and passes it to the jOAI publisher;
- Once per hour, the jOAI publisher checks for new objects and publishes them.

In this way, new records created by the data owner are generally available within two hours, with no manual processing. No changes, other than configuration, are required to the ICAT server, the jOAI server or the discovery service. For the owner of the data, the additional processing required to provide this service is negligible. For the owner of the discovery service, the additional processing is negligible.

4 Data discovery use case

The services that we have developed in the course of this work support the following data discovery use case. In order to find data, the researcher uses a Google-style free string search in the B2FIND data catalogue [8], and locates candidate datasets of interest. This is similar to using any search engine, except that B2FIND is likely to be more relevant as it has a harvesting policy which ensures that it searches a known set of sources; many of the sources known to B2FIND are of little general interest, and are not harvested by general purpose search engines.

Having received search results, the user selects one of the candidates located by B2FIND. B2FIND presents more information about the chosen candidate. In the case of an ISIS record, this information includes the DOI assigned to the dataset by the DataCite service [10]. The DOI link references a web landing page supplied by the ISIS facility; the landing page contains an actionable link that allows the user to get the data collected during the experiment, with user access to the actual experimental data regulated by the facility data management policy, which in the case of ISIS is a liberal policy that encourages research data reuse [7].

Apart from its usage in EUDAT B2FIND, the OAI-PMH endpoint for ISIS ICAT and the appropriate metadata mapping are being tested for the new Research Data Discovery Service (RDDS), which is a national UK initiative similar to EUDAT B2FIND but with a different scope of research data records collected [11]. RDDS is going to become another public channel for the dissemination of experimental data collected by the ISIS facility, along with EUDAT, DataCite and research papers that cite data DOIs. Figure 2 represents the flow of data records and data persistent identifiers between different services of a common data discovery ecosystem.

Figure 2 Data records and data DOIs flow

After a period of testing with a few harvesting e-infrastructures, the OAI-PMH stack has the potential to become part of the ICAT software distribution [4] that is used by other neutron and photon facilities in Europe. This should make it easier for other facilities to supply their data records to data discovery portals. It was not possible during the course of the project described in this paper to assess the impact of this work on the various stakeholders. However, the existence of projects such as EUDAT and RDDS and their active collaboration with this project supports our belief in the need for such projects. As we continue to work in this area, we will learn more about the needs of the stakeholders, and change our implementation to support those needs.

Conclusion

We considered the effort to implement the OAI-PMH endpoint and supply data records in e-infrastructures worthwhile for the following reasons:
- large research facilities such as ISIS have an interest in sharing data; it may be a legal or policy requirement that they publish this data, especially data that is collected in a publicly funded investigation; many investigators consider that the provision of data enhances the value of their research and consider that data citation is as valuable as publication citation,
hence more routes to citable data are beneficial for researchers;
- sharing data in multi-disciplinary catalogues like B2FIND and RDDS attracts new collaborators, facilitates data reuse within a discipline, and encourages cross-discipline research;
- we are working within a community of European facilities which are adopting common standards for software and infrastructure [9]; the software developed in the course of this work and shared in GitHub [12] provides added value in the technology stack already adopted by similar research centres, which makes our solution organizationally scalable;
- other e-infrastructures can use the ISIS ICAT OAI-PMH endpoint, which is now running as a beta-service [13], to harvest data records for ISIS investigations with actionable links to publicly available data; metadata cross-walks need to be defined between the OAI-PMH metadata and the e-infrastructure metadata; this is similar to EUDAT, and aims to avoid semantic misinterpretation of metadata elements.

This work provides foundation IT components and, from an organizational point of view, may serve as a model for sharing data collected by large research facilities in common cross-disciplinary data infrastructures. The work is a contribution to the emerging European research data ecosystem comprising traditional research centres, common national and transnational e-infrastructures, research teams located in smaller labs in universities and industry, as well as individual researchers willing to share data. The work aims to increase the efficacy and efficiency of using the public funds allocated for research and development, by providing new routes for data publishing and data reuse.

Acknowledgements

This work is supported in part by Horizon 2020 EUDAT and the UK JISC RDDS projects, although the views expressed are the views of the authors and not necessarily of the projects.

References

[1] EUDAT: the collaborative Pan-European data infrastructure. http://www.eudat.eu
[2] Open Archives Initiative Protocol for Metadata Harvesting. https://www.openarchives.org/pmh
[3] ISIS neutron and muon research facility. http://www.isis.stfc.ac.uk
[4] ICAT project. http://icatproject.org
[5] jOAI. http://www.dlese.org/oai
[6] DCMI Metadata Terms. http://dublincore.org/documents/dcmi-terms
[7] ISIS data policy. http://www.isis.stfc.ac.uk/user-office/data-policy11204.html
[8] EUDAT B2FIND service. http://b2find.eudat.eu
[9] PaNdata initiative. http://pan-data.eu
[10] DataCite service. http://www.datacite.org
[11] UK Research Data Discovery Service. https://www.jisc.ac.uk/rd/projects/uk-research-data-discovery
[12] PMH component in ICAT GitHub repository. https://github.com/icatproject-contrib/pmh
[13] ISIS ICAT OAI-PMH endpoint (beta-service). http://oai.eudat.stfc.ac.uk/oai/provider?verb=Identify