<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Specifying and Implementing Data Infrastructures Enabling Data Intensive Sciences</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Proceedings of the XVII International Conference «Data Analytics and Management in Data Intensive Domains» (DAMDID/RCDL'2015)</institution>
          ,
          <addr-line>Obninsk</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Peter Wittenburg, Herman Stehouwer Max Planck Data and Compute Center</institution>
          ,
          <addr-line>Garching/Munich</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Examples from Psycholinguistics - a humanities discipline - show that data intensive research is changing all scientific disciplines dramatically. Data intensive sciences pose unprecedented challenges in data management and processing. A survey in Europe showed clearly that most research departments are not prepared for this step and that the methods used to manage, curate and process data are inefficient and too costly. The Research Data Alliance, a bottom-up organized, global and cross-disciplinary initiative, has been established to accelerate the process of changing data practice. After only two years RDA produced its first concrete results, which now have to demonstrate their potential. In particular, the infrastructure builders are requested to act as early adopters of RDA results. The European Commission and its member states have taken serious steps to establish an eco-system of research infrastructures and e-Infrastructures, anticipating the challenges imposed by the data deluge, which will enable broad uptake of the paradigm of data intensive science. Research organisations have recognised these challenges as well and have taken first steps to adapt their structures. However, we need to understand that we are in a phase of gigantic changes, which implies that measures currently being taken need to be interpreted as tests on the way to new solid and sustainable structures.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Enabling Data Intensive Sciences</title>
      <p>
        Quite a number of scientific institutes have been
data oriented for a long time already. For instance,
most of the research of the experimental and
theoretical institutes of the Max Planck Society was
based on data. Even an institute that belongs to the
humanities section of the Max Planck Society such
as our former affiliation - the Institute for
Psycholinguistics [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] was oriented from the start
towards the analysis of speech, eye movement and
gesture recordings, detecting meaningful patterns,
and building models to simulate speech perception.
Physics institutes (fusion, astronomy, etc.) have of
course been processing much larger volumes of data,
and they can look back on a much longer
history of data oriented work.
      </p>
      <p>
        It was the book "The Fourth Paradigm – Data
Intensive Scientific Discovery" [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], edited by Tony
Hey and colleagues, that introduced “data intensive
science” as the fourth paradigm of scientific discovery,
referring to a talk given by J. Gray. It drew
much attention to the concept behind this new
paradigm. Gray distinguishes four paradigms that are
co-existing today: (1) Empirical Science describing
natural phenomena, (2) Theoretical Science using
models to achieve generalizations, (3)
Computational Science simulating complex
phenomena and (4) Data exploration by unifying
theory, experiment and simulation. Indeed, we can
observe that science is changing insofar as finding
meaningful patterns in data sets is becoming an
essential approach. Increasingly powerful and
numerous sensors, improved network connections,
more powerful and numerous computers and more
advanced algorithms are key pillars for this
development. The "Riding the Wave" [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] report
created by a High Level Expert Group of the
European Commission (EC) was one of the
documents that summarized the specific data
challenges and opportunities. It requested actions
by the EC to enable data intensive science for a
large number of researchers, not only for those that
have sufficient funding to curate and integrate all the data and
software they need.
      </p>
      <p>We see a number of trends which we can
summarize as follows:</p>
      <list list-type="bullet">
        <list-item><p>An increasing number of research disciplines adopted data intensive methods due to new technological and methodological possibilities. During the last decades these changes were extreme in biological and neurological disciplines.</p></list-item>
        <list-item><p>The amount of data and its complexity in terms of creation contexts, data types and relations are increasing extremely.</p></list-item>
        <list-item><p>The Internet allows us to offer data via the web to be re-used by others.</p></list-item>
        <list-item><p>This enables us to combine data sets in new ways across institutional, national and discipline borders.</p></list-item>
        <list-item><p>Mathematical methods have advanced to cope with heterogeneous data sets and we see large libraries with statistical, stochastic and machine learning methods becoming available.</p></list-item>
        <list-item><p>The total amount of available CPU and storage capacity allows researchers to do large amounts of computations on increasingly large data sets.</p></list-item>
      </list>
      <p>Despite the increase in compute capacity,
however, we can also observe a growing
analysis gap, i.e. the fraction of data that we are able to
process in a way that lets us extract knowledge is
getting smaller. The reasons for the analysis gap
are many and are not discussed in this
paper.</p>
      <p>Two examples taken from a humanities
discipline illustrate the fundamental shift towards
data intensive science; neither study could have been
carried out a few years ago. When studying, for
example, the evolution of human languages over
thousands of years, linguists until recently based
their theories on comparing colleagues' fragmentary
descriptions of several languages.
Currently, large feature matrices are extracted
describing characteristics of all languages in a
particular region, for example those spoken
in Austronesia, and these matrices are fed into
phylogenetic algorithms to calculate the most probable
dependency trees that indicate how languages may
have influenced each other over thousands of years.
For this research a large database is required, and
more powerful computers are needed than
linguists traditionally used, to let the
algorithms generate meaningful optima.</p>
      <p>The application of massive crowd sourcing
techniques in linguistics for example to understand
human communication including multimodal
interaction can be used as another example to
indicate the dramatic changes in research towards a
data centric perspective. These techniques generate
many parallel data streams originating from
smartphones that need to be annotated immediately
by machine processing tools to make them
available for scientific studies. This automatic
annotation requires smart pre-processing and smart
data management. In this setup an increasing
number of parallel operating detectors must be
trained to detect patterns in speech and video
streams in real time with the help of stochastic
machines. It is simply the sheer amount of data
that requires new ways of processing to enable this
type of research, which leads to better assumptions
about what guides our interactions.</p>
      <p>The methods described in the
examples above rely on the availability of large amounts
of data to estimate the many free parameters of the
models, and in both examples the data cannot come
from one project or institute, but must come from many
research labs. Researchers doing this kind of
research know how difficult it is to find, access and
combine the required data. Such research is very
cost intensive and raises the questions of whether we
can continue without serious changes, and whether
the available infrastructures are sufficient.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Human Brain Project</title>
      <p>
        An even more extreme example of the shift
towards the 4th paradigm is taken from the life
sciences. The recently started Human Brain Project
(HBP) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], an EC flagship project, has two visions:
(a) to simulate at the physiological level
first rat brains and, in a follow-up phase, human
brains (in silico experiments) and (b) to predict
brain diseases at an early stage from patterns found
in recorded data sets. The main goal of the latter
(the medical informatics sub-project in HBP) is
illustrated in figure 1. Researchers would like to
correlate observed phenomena such as specific
deficits due to brain diseases with all types of
recordings that can be found from corresponding
patients such as brain images of different types,
gene sequences, protein data and perhaps even
reaction time measurements. Even without having a
model of the human brain at hand, this correlation
would allow researchers to detect
co-occurring patterns in the data that seem to cause the
observed phenomena. Machine learning methods
are used to generate meaningful signatures from
physical features in the data that can then be used
to predict potential diseases in patients.
      </p>
      <p>Fig. 1: One example of this new paradigm as it is
used in the neurosciences (HBP). Phenomena such as
those created by specific brain
diseases can be observed, yet there is no chance to
model the complexity of the human brain to make
statements about their physiological origins. Data
from various sources are correlated with the
phenomena to find those patterns in the data that are
causing the observed deficits.</p>
      <p>No assumptions are made about the structure and
functioning of the brain, and no assumptions are made
about how genes may influence brain structure and
functioning, etc., since we don’t have sufficient
knowledge in these areas. Nevertheless, by using a
large database of aligned data it is assumed that
researchers can relate physical patterns to
phenomenological observations, first for early
prediction, but later also for improved medication.
Full brain simulations will typically cover spatial
scales from nanometers (proteins) to centimeters
(brain) and energy scales from 10 femtojoules at the
biological level (genome, transcriptome, proteome) up
to 1 joule at the complex brain level (cognition).</p>
      <p>To achieve its goals the HBP defined a total of 13
sub-projects, each the size of a large
project. Here we briefly describe the new
informatics-based platforms that are meant to offer
the research community the possibility to work on
human brain issues with the help of a set of strong
and highly integrated tools:</p>
      <list list-type="bullet">
        <list-item><p>Neuroinformatics (searchable atlases and analysis of brain data)</p></list-item>
        <list-item><p>Brain Simulation (building and simulating multi-level models of brain circuits and functions, incl. for example models of neural microcircuits of up to a million neurons)</p></list-item>
        <list-item><p>Medical Informatics (see figure 1)</p></list-item>
        <list-item><p>Neuromorphic Computing (brain-like functions implemented in hardware)</p></list-item>
        <list-item><p>Neurorobotics (testing brain models and simulations in virtual environments)</p></list-item>
        <list-item><p>High Performance Computing (providing the necessary computing power by architectures that allow memory intensive applications and new ways of visually interacting with simulations)</p></list-item>
      </list>
      <p>HPC facilities at four centers can be used for the
purposes of the HBP: Jülich (6 petaflops peak, 450
TB memory, 8 PB scratch file system), allowing
simulations of up to 100 million neurons (scale of a
mouse brain); the Swiss CSCS (836 teraflops peak, 64
TB, 4 PB), in particular for software development and
optimization; Barcelona SC (1 petaflops peak, 100
TB) for molecular-level simulations; and CINECA (2
petaflops, 200 TB, 5 PB), mainly for data analytics.
In addition KIT Karlsruhe provides 3 PB of
storage. All centers are linked at 10 Gbit/s. In the
neuromorphic area SpiNNaker chips are being used
that have 18 cores sharing 128 MB RAM,
allowing the simulation of 16,000 neurons with 8 million
plastic synapses within a 1 W energy budget.</p>
      <p>The goals are ambitious<sup>1</sup> and it is admitted that
the gap between physiological modeling and
cognition is still huge. However, the HBP indicates
how data intensive science is pushed to its
extremes in life sciences: (a) huge amounts of data
addressing many different levels of brain
organization are needed to feed the atlases, to
enable analyses needed to feed and test the validity
of the models and (b) much computer power will
be required to carry out the necessary computations
first within the project and afterwards by the
interested researchers.</p>
      <p>In addition to the problems described in the next
section, the HBP is confronted with difficult privacy
and ethical issues, making access to data even more
problematic. Distributed data mining solutions, for
example, are being investigated to overcome these
problems.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Data Practices</title>
      <p>
        A large survey about data practices [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], based on
some 120 interactions with data practitioners<sup>2</sup> from
various disciplines, and two RDA Europe
workshops with leading European scientists [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]
made very clear that the current data practices are
not adequate to support such data intensive science
in an efficient and cost-effective way.
      </p>
      <p>
        The major findings of this survey can be
summarized as:
      </p>
      <list list-type="bullet">
        <list-item><p>The ESFRI<sup>3</sup> [<xref ref-type="bibr" rid="ref6">6</xref>] discussion process and its project initiatives, as well as recent developments in e-Infrastructures, raised much awareness about data issues, the practices and the interaction processes around data management and access crossing discipline boundaries.</p></list-item>
        <list-item><p>Open Access [<xref ref-type="bibr" rid="ref7">7</xref>] to publications and now also to data is widely supported, but in practice there are so many hurdles that most data is still not available.</p></list-item>
        <list-item><p>Finding data re-usable for data intensive sciences using the web requires new mechanisms to establish trust. At this moment we are lacking such mechanisms.</p></list-item>
        <list-item><p>There is much legacy data out there, the integration of which in our re-usable data domain will cost an enormous amount of curation and thus funds. In addition, we are still creating legacy-style data despite all advancements, since it is not suitably organized and described, which is mainly due to a lack of trained experts and appropriate software.</p></list-item>
        <list-item><p>There is an increasing pressure for almost all departments to participate in data intensive sciences, but researchers see a lack of expertise in adequate data management and workflow creation/maintenance skills. Currently researchers need to spend a large fraction of their time (in some cases up to 75%) to find, access and curate data to make it fit for their needs. In addition, the practice of many researchers working with manual steps or with ad hoc scripts does not lead to reproducible science.</p></list-item>
        <list-item><p>Data management is still widely based on file systems, which do not allow capturing the increasing amount of “logical” information about the data (persistent identifiers, metadata, rights, relations, etc.). Ad hoc solutions are being used, amplifying the problem of "increasing data entropy" as W. Michener [<xref ref-type="bibr" rid="ref8">8</xref>] called it (see figure 2).</p></list-item>
        <list-item><p>The use of persistent identifiers and metadata, which would help in identifying, finding and re-using data, is still in its infancy. Ad hoc solutions such as handling spreadsheets only work for the duration of projects and leave chaos afterwards given the increasing amount of data.</p></list-item>
        <list-item><p>Despite some efforts for specific databases there is in general a lack of explicitness with respect to structure and semantic descriptions of the content of data, which creates inefficiencies in particular when users do not have direct relations with the creators.</p></list-item>
        <list-item><p>There is a clear trend towards using "trustful" centres which offer to host and manage researchers' data and to provide access to it. However, there are many hurdles for centres to offer cross-border services, although economy of scale factors indicate that much can be gained due to the available expertise. Existing certification methods such as those defined by the Data Seal of Approval [<xref ref-type="bibr" rid="ref9">9</xref>] need to be applied by the centres to raise the level of trust.</p></list-item>
        <list-item><p>It is widely agreed that there is a lack of expertise and knowledge about data issues (principles, organization, curation, etc.) and that we need to train a new generation of data practitioners. It is this lack of experts and expertise that hampers progress.</p></list-item>
      </list>
      <p>Fig. 2: The typical decrease of available
information about stored data over time, as
described by W. Michener, which results in great
problems in making use of data. There are various
factors and moments that lead to this decrease of
information, such as when PhDs leave an
institute without having documented their
data properly, which is a very well-known
phenomenon. Assigning persistent identifiers
and creating appropriate metadata would
help to reduce the speed of losing information.</p>
      <p><sup>1</sup> It should be mentioned that there is a broad debate
about the question whether the ambitions of the
HBP are realistic.</p>
      <p><sup>2</sup> The term "data practitioner" is used here as a term
describing skills of data scientists, data managers,
data stewards, data librarians, etc., since mostly
these terms are not well-defined yet.</p>
      <p><sup>3</sup> European Strategy Forum on Research
Infrastructures</p>
      <p>Senior scientists agree that changes in data
practices are urgently needed, but they hesitate to
take steps for mainly two reasons:</p>
      <list list-type="bullet">
        <list-item><p>they lack guidance towards certain agreed solutions which prevents investments,</p></list-item>
        <list-item><p>they lack the experts that would turn investments into appropriate solutions.</p></list-item>
      </list>
    </sec>
    <sec id="sec-4">
      <title>4. Achieving Changes through RDA</title>
      <p>
        This raises the questions of who can give guidance
in navigating the huge solution space with
respect to data issues, and how we can train the new
generation towards harmonized solutions that
guarantee more efficiency and cost-effectiveness,
which will ultimately boost data intensive sciences.
Here we would like to refer to the early phases of
the Internet, when many solutions were suggested
with different competing approaches. It took about
15 years until agreements on simple principles such
as TCP/IP [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] for global networks were accepted.
Basically these agreements led to the boost in
connectivity from which we now profit.
      </p>
      <p>
        Quite a number of policy level initiatives have
established rules and principles and there seems to
be wide agreement [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] about them. An increasing
number of funders are also requesting that
so-called data management plans be added to grant applications,
which certainly raises the level of awareness about
data issues for many researchers. But due to the
problems described above there is also great
uncertainty about how to create such plans that make
sense for the many data use cases [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. An
increasing conviction of some data practitioners
and some funders emerged that an acceleration of
the process to come to agreements that help
changing data practices is urgently required. The
Internet history seems to offer a possible approach:
complement the policy level efforts by an
essentially bottom-up driven initiative where data
practitioners work on urgent barriers that need to
be overcome. To this end a first international
workshop was organized at the ICRI conference
2012 [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] under the name "DAITF" which stands
for Data Access and Interoperability Task Force. A
joint effort from mainly European, US American
and Australian experts and funders led then to the
birth of the Research Data Alliance (RDA) [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] in
autumn 2012. We like to point to the similarity of some
characteristics with the Internet Engineering Task
Force; however, it is obvious that the data domain
has many more facets and challenges to deal with.
      </p>
      <p>
        We would like to cite Naoyuki Tsunematsu
(Senior Advisor of the Japanese Council for Science
and Technology), who pointed to two observations
that are relevant in this context and that motivated Japan
to join the Research Data Alliance [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]:
      </p>
      <list list-type="bullet">
        <list-item><p>The value proposition for publicly funded research is about "stimulating competitiveness", but a new strand needs to be added, namely "knowledge discovery on smart data collections", where professional infrastructures and human skills are the key factors for success.</p></list-item>
        <list-item><p>There seems to be a correlation between a lack of motivation to share data in the Japanese academic world, and thus a lack of openness, and a decrease in the number of top-level international collaborations and of top-level papers, which is a concern for policy makers in Japan<sup>4</sup>.</p></list-item>
      </list>
      <p>
        After the workshop at ICRI 2012 the European
Commission, NSF and NIST in the US and the
Australian Government accepted grant proposals
from key experts in their respective regions that
allowed the practitioners to start the RDA work, i.e.
funding is given to consortia in the three
regions. As one branch, the RDA Europe [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]
project was funded as a regular EC project. In
September 2015 the third RDA Europe
project will start, allowing us to continue the work,
and the EC’s new draft work programme 2016/17
indicates future perspectives for RDA. First, a
steering board was established between the three
funded initiatives to define a governance structure
and procedures for RDA, and it started stimulating
the practical work.
      </p>
      <p>
        RDA decided to have a very simple structure where
the key roles are given to the Working Groups and
Interest Groups<sup>5</sup> that meet at plenaries and other
meetings. Every RDA member can decide to
initiate such a group; to be successful, a case
statement needs to be submitted that must fulfil a
number of criteria [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. A Council was set up that
has an overseeing role to ensure balanced progress
and adherence to quality rules and processes. A
Technical Advisory Board that is elected by the
RDA members<sup>6</sup> will give advice to all actors on
content aspects, i.e. respond to questions such as
“do the intentions of the Working and Interest
Groups meet the scope of RDA, do they fulfil the
established requirements, do they involve existing
and relevant initiatives, do they intend to remove
practical barriers, etc.”. An Organisational
Advisory Board that represents all organizations
that are organizational members, and thus
contribute some funds to the success of RDA,
gives advice on organizational and administrative
issues. In addition RDA has a Secretariat that
needs to organise the plenaries, keep control of the
processes and carry out a variety of other
administrative and organisational tasks. A General
Secretary has been appointed to lead the
secretarial work and to take responsibility for
managing RDA Global.
      </p>
      <p><sup>4</sup> The recent G8 Open Data Report [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] indicates
that in the rating between G8 members Germany
and Russia are even behind with respect to
openness of data.</p>
      <p><sup>5</sup> It should be noted here that the major difference
between the two groups is that the WGs need to
come with tangible results after 18 months.</p>
      <p>
        While RDA Global is the platform where
agreements are being achieved in the form of
guidelines, procedures, interface and protocol
specifications to overcome barriers, the regional
branches such as RDA Europe have the task to
raise awareness about RDA in their region,
convince experts to participate, interact with many
stakeholders to understand the needs and priorities,
organize the adoption of RDA results, take care
of training and education and contribute to the
costs of RDA Global. RDA Europe for example
organises a number of meetings to meet the
requirements such as interacting with the EC and
member state ministries, European science
organisations, European leading scientists, large
scale European research infrastructures such as
ESFRI projects [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] and e-Infrastructures [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] such
as EUDAT [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] and many research communities.
The meetings with leading scientists [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] are of
great importance and have led to useful
recommendations for RDA, most of which will be
implemented by RDA Europe from September
2015 on. The interactions with policy stakeholders
led for example to the Data Harvest Report [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]
setting priorities.
      </p>
    </sec>
    <sec id="sec-5">
      <title>5. Early RDA Results</title>
      <p>Thus RDA's mission is about building the many
social and technical bridges that are required to
make data intensive work much more efficient and
thus to allow many researchers to participate in
extracting knowledge by processing virtual
collections consisting of data coming from various
providers, increasingly often across disciplines and
borders. Here we want to briefly indicate the major
results of the first working groups that finished
after roughly 20 months (or that will finish within
the coming few months) and their possible impact
on changing practices.</p>
      <p><sup>6</sup> Everyone who agrees with the basic rules of RDA
can become a member by registration.</p>
    </sec>
    <sec id="sec-6">
      <title>5.1 Data Foundation and Terminology (DFT)</title>
      <p>
        Based on many use cases from various
disciplines and countries the DFT Working group
[
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] came up with a simple core data model and a
terminology for registered data. It introduces the
notion of the Digital Object which is represented
by a bitstream, can be stored in various
repositories, is identified by a persistent identifier
and described by metadata. The model includes a
few further definitions, but it is important to note that
these definitions are fundamental and independent
of disciplines. If scientists worldwide adhered
to such a simple model we could much more easily
understand each other when talking about data and
would be able to build harmonized software
leading to much higher interoperability.
      </p>
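      <p>The core model described above can be pictured as a small data structure. The following is only a minimal sketch under our own assumptions: the class name, the field names and the example identifier are hypothetical illustrations and are not taken from the DFT specification.</p>
      <preformat>
```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class DigitalObject:
    """Illustrative stand-in for a registered Digital Object (names are assumptions)."""
    pid: str                                       # persistent identifier
    bitstream: bytes                               # the content itself
    metadata: dict = field(default_factory=dict)   # descriptive metadata
    repositories: tuple = ()                       # places that hold a copy

    def stored_in(self, repository: str) -> "DigitalObject":
        """Return a copy that records one more repository holding the bitstream."""
        return replace(self, repositories=self.repositories + (repository,))


# Hypothetical example values; the PID below is not a real identifier.
recording = DigitalObject(
    pid="hdl:12345/example-0001",
    bitstream=b"speech recording placeholder",
    metadata={"title": "Example recording"},
    repositories=("repo-a.example.org",),
)
replicated = recording.stored_in("repo-b.example.org")
```
      </preformat>
      <p>In this sketch the identifier, the bitstream and the metadata stay the same when a copy is placed in a second repository, which mirrors the idea that a Digital Object is independent of where it is stored.</p>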
    </sec>
    <sec id="sec-7">
      <title>5.2 Data Type Registries (DTR)</title>
      <p>
        The DTR group [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] created a specification for
data type registries that allow users to link data
types of various sorts with functions (executable
code). Data types can be simple types such as
semantic categories (temperature, noun, etc.) or
complex types such as scientific digital objects
(complex annotated images, time series, tables,
etc.). DTRs can be used, for example, to carry out
mappings automatically when simple types such as
“temperature” occur, or to start
visualization software when complex types are
found. Such DTRs would overcome the problems
we so often have when we receive unknown data types
and do not know how to
process and interpret them. Thus we see an
enormous impact for DTRs in daily practice.
      </p>
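      <p>The idea of linking a data type to executable code can be illustrated with a toy in-memory registry. This is a sketch under our own assumptions and not the DTR specification: the class, the type name "temperature/kelvin" and the conversion function are hypothetical.</p>
      <preformat>
```python
from typing import Callable, Dict


class DataTypeRegistry:
    """Toy in-memory registry linking data type names to handler functions."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable] = {}

    def register(self, type_name: str, handler: Callable) -> None:
        """Associate one handler function with a data type name."""
        self._handlers[type_name] = handler

    def process(self, type_name: str, value):
        """Apply the registered handler, or fail loudly for unknown types."""
        if type_name not in self._handlers:
            raise KeyError(f"no handler registered for data type {type_name!r}")
        return self._handlers[type_name](value)


def kelvin_to_celsius(kelvin: float) -> float:
    """Example mapping for a simple type."""
    return kelvin - 273.15


registry = DataTypeRegistry()
registry.register("temperature/kelvin", kelvin_to_celsius)
celsius = registry.process("temperature/kelvin", 300.0)
```
      </preformat>
      <p>A client that receives a value together with its type name can look up the handler instead of guessing how to interpret the value; an unknown type is reported explicitly rather than processed incorrectly.</p>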
    </sec>
    <sec id="sec-8">
      <title>5.3 PID Information Types (PIT)</title>
      <p>
        The PIT working group [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] produced a
common API (Application Programming Interface) to
unify access to Persistent Identifier (PID) service
providers. Currently there are different PID
systems (Handle/DOI<sup>7</sup>, ARK, etc.) and many
different service providers, all having their own
regulations, which makes it very cumbersome to get, for
example, the checksum of a Digital Object to check
its identity and integrity. Applying this unified API
together with some basic data types such as
“checksum” would allow application programmers
to simply provide one piece of software allowing
them to deal with all PID service providers in the
same way. Since PIDs will have such a central role
in data management and access the impact of a
unified API will be enormous.
      </p>
      <p><sup>7</sup> DOIs are Handles with a special prefix and are used
to refer to published collections. Handle/DOI
services are available worldwide.</p>
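      <p>What a unified API buys the application programmer can be sketched with a common interface and two stand-in providers. Everything below is a hypothetical illustration: the interface, the method name get_property, the provider classes and the identifiers are our own assumptions and do not reproduce the PIT API or any real PID service.</p>
      <preformat>
```python
from abc import ABC, abstractmethod


class PidProvider(ABC):
    """Common interface that every PID service adapter would implement."""

    @abstractmethod
    def get_property(self, pid: str, info_type: str) -> str:
        """Return one typed attribute (e.g. 'checksum') of a PID record."""


class DictRecordProvider(PidProvider):
    """Stand-in for a service that stores each record as a dictionary."""

    def __init__(self, records: dict) -> None:
        self._records = records

    def get_property(self, pid: str, info_type: str) -> str:
        return self._records[pid][info_type]


class PairRecordProvider(PidProvider):
    """Stand-in for a second service that stores (key, value) pairs."""

    def __init__(self, records: dict) -> None:
        self._records = records

    def get_property(self, pid: str, info_type: str) -> str:
        return dict(self._records[pid])[info_type]


def fetch_checksum(provider: PidProvider, pid: str) -> str:
    """Client code is identical whichever provider sits behind the interface."""
    return provider.get_property(pid, "checksum")


# Hypothetical identifiers and checksum values for illustration only.
first = DictRecordProvider({"pid-a/0001": {"checksum": "sha256:aa11"}})
second = PairRecordProvider({"pid-b/0002": [("checksum", "sha256:bb22")]})
checksums = [
    fetch_checksum(first, "pid-a/0001"),
    fetch_checksum(second, "pid-b/0002"),
]
```
      </preformat>
      <p>The two providers keep their records in different internal layouts, yet fetch_checksum does not need to know which one it is talking to.</p>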
    </sec>
    <sec id="sec-9">
      <title>5.4 Practical Policies (PP)</title>
      <p>
        In particular data management and curation are
guided by specific policies which are then turned
into executable procedures such as "replicating a
data collection" or "checking digital objects'
integrity" that are mostly used in federated
environments. The PP group [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] is collecting
many such practical policies from various
institutions and projects, analysing and evaluating
them and suggesting best practices which then can
be offered as templates for proven operations.
Thus, these templates have the potential to increase
the trust level. The work of the group will not end
since there are so many areas where best practices
can improve the quality and reproducibility of data
practices. In collaboration with the EUDAT project
the group is working on an open registry standard
for such best practice PPs.
      </p>
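      <p>A policy such as "checking digital objects' integrity" becomes executable once it is written as a procedure. The following is a minimal sketch under our own assumptions (function name, site names and in-memory replicas are hypothetical); it is not a template from the PP group.</p>
      <preformat>
```python
import hashlib


def check_integrity(replicas: dict, expected_sha256: str) -> dict:
    """Report, per replica site, whether its bytes still match the recorded checksum."""
    return {
        site: hashlib.sha256(content).hexdigest() == expected_sha256
        for site, content in replicas.items()
    }


# The checksum is recorded once, when the object is registered.
original = b"observation series v1"
recorded = hashlib.sha256(original).hexdigest()

# Later, every replica in the federation is checked against it.
report = check_integrity(
    {"site-a": original, "site-b": b"observation series v1 (corrupted)"},
    recorded,
)
```
      </preformat>
      <p>In a federated environment the replicas would be read from the participating sites rather than held in memory, but the comparison step stays the same.</p>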
    </sec>
    <sec id="sec-10">
      <title>5.5 Metadata Standard Registry (MDR)</title>
      <p>
        As has been described the usage of proper
metadata is still in its infancy and there are many
reasons for this. One reason certainly is that many
labs still do not know which metadata they should
use, where they can find suitable vocabularies and
tools, etc. The MDR group [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ] offers a registry
which allows researchers to look for the most suitable
metadata schemas. This MDR will therefore help
data practitioners who are looking for proper
metadata solutions. More work in the metadata area
is going on within RDA.
      </p>
    </sec>
    <sec id="sec-11">
      <title>5.6 Data Citation (DC)</title>
      <p>
        The Data Citation group [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ] worked out
suggestions of how to cite so-called dynamic data,
i.e. data that changes while people are already
working with it and referring to it. All data coming
in from seismological sensors for example will
immediately be used when it becomes available for
processing even if data samples in the sequences
are missing due to transmission delays for example.
How can researchers refer back to these incomplete
versions of data? This is a problem that many
disciplines have, and this group worked out a
suggestion for how to solve this citation problem so
that it could be implemented in all software and
procedures.
      </p>
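      <p>One way to make such a citation concrete is sketched below. It rests on assumptions that go beyond the text: samples are kept in an append-only store together with their arrival time, and a citation records the query plus the time at which it was executed, so that resolving the citation later returns exactly the incomplete version that was used. The names DynamicStore, Citation, cite and resolve and the sensor values are hypothetical and are not taken from the group's recommendation.</p>
      <preformat>
```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Sample:
    sensor: str          # hypothetical sensor identifier
    seq: int             # position in the measured sequence
    value: float
    arrived: int         # logical arrival time at the store


@dataclass(frozen=True)
class Citation:
    sensor: str          # the query: all samples of this sensor
    as_of: int           # store time at which the query was run


@dataclass
class DynamicStore:
    """Append-only store: samples are never overwritten or deleted."""
    samples: list = field(default_factory=list)
    clock: int = 0

    def ingest(self, sensor, seq, value):
        self.clock += 1
        self.samples.append(Sample(sensor, seq, value, self.clock))

    def cite(self, sensor):
        # A citation is the query plus the time it was executed.
        return Citation(sensor, self.clock)

    def resolve(self, citation):
        # Re-run the query, ignoring everything that arrived later.
        hits = [s for s in self.samples
                if s.sensor == citation.sensor and citation.as_of >= s.arrived]
        return sorted(hits, key=lambda s: s.seq)


store = DynamicStore()
store.ingest("station-A", 1, 0.12)
store.ingest("station-A", 3, 0.15)      # sample 2 is delayed in transmission
early = store.cite("station-A")         # a researcher starts working now
store.ingest("station-A", 2, 0.11)      # the delayed sample arrives later

cited_seqs = [s.seq for s in store.resolve(early)]
current_seqs = [s.seq for s in store.resolve(store.cite("station-A"))]
print(cited_seqs, current_seqs)
```
      </preformat>
      <p>In this sketch the early citation keeps resolving to the incomplete sequence even after the delayed sample has arrived, while a new citation sees the completed one. A real service would additionally assign a persistent identifier to each citation.</p>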
    </sec>
    <sec id="sec-12">
      <title>5.7 Repository Audit and Certification (RAC)</title>
      <p>
        As indicated above, quality assessment of
repositories (centres) is increasingly important to
raise the level of trust, and the RAC group [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ]
wants to come up with a unified standard. A few
suggestions have been made, such as by Data Seal
of Approval [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ] and World Data System [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ].
These two suggestions are already widely used and
so similar that the responsible initiatives decided to
join forces to make their guidelines compatible
with each other. It is widely agreed that the
resulting set of guidelines is a good basis to certify
trusted repositories worldwide<sup>8</sup>.
      </p>
    </sec>
    <sec id="sec-13">
      <title>5.8 New RDA Phase</title>
      <p>
        At the fifth plenary (P5) we had a first adoption
day [
        <xref ref-type="bibr" rid="ref33">33</xref>
        ] where experts from different disciplines
and institutions presented their way of making use
of these early results. The presentations showed
that the RDA results were not just an academic
enterprise, but indeed fulfil concrete needs of early
adopters, in particular since in some cases first
implementation versions are available and can be
used. Currently, RDA Europe is for example
looking for further adopters of these results by
offering funding for collaboration projects.
      </p>
      <p>We should add here that RDA is obviously
entering a new phase. While the first 5 working
groups were started at the first plenary in March
2013, each of them focusing on their specific topic
under high time pressure, the experts now
understand that they need to synchronise more to
achieve the needed coherence of all results. One
consequence was to set up the Data Fabric Interest
Group (DFIG) which is now bundling forces to
understand all components that are required to
come to efficient and reproducible data intensive
sciences. Figure 3 indicates briefly the topic being
addressed<sup>9</sup>. Data production and consumption in
the daily data driven work can be indicated by a
cycle where at a certain moment new raw data is
being created and in some form being
organised/registered and put into a store.
Researchers who want to make use of data define a
new (virtual) collection by selecting data from
repositories and then carry out some processing
steps on it which can be management or analytical
operations. The result is a new collection of data
which should be registered and stored again. The
questions addressed are now which components are
needed to run such a "fabric" efficiently and
self-documenting and how these components should
interact. Figure 3 also indicates how the finishing
working groups fit into this cycle.</p>
      <p>8 We note here that there are several further
certification schemes that go more in-depth on
specific aspects such as the “Security for
Collaborating Infrastructures Assessment and
Modification Record” (SCI) for security aspects, or
the NESTOR seal (based on DIN 31644) or ISO
16363 certification for general data repository
aspects. The DIN and ISO certifications are
extremely detailed and thorough, and thus fairly
costly to implement.</p>
      <p>
        Currently the DFIG is collecting many Use Cases
to build on what people are already doing and to
abstract from these Use Cases to "common
components" that are required. Such common
components would include for example a global
PID system<sup>10</sup> providing PID registration and
resolution mechanisms that can be used by
everyone. Everyone interested is encouraged
to contribute Use Cases that will influence the
discussions about common components. A first
paper to accelerate discussions has been made
available by a number of distinguished experts
from various regions [
        <xref ref-type="bibr" rid="ref35">35</xref>
        ].
      </p>
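      <p>The cycle described in this section (create and register data, select a virtual collection, process it, register the result) can be illustrated with a minimal sketch. It is not an RDA specification: the Registry class is an in-memory stand-in for a PID registration and resolution service, the prefix "demo" and all function and field names are hypothetical, and the provenance record is only one possible way to make a step self-documenting.</p>
      <preformat>
```python
import hashlib
import itertools


class Registry:
    """In-memory stand-in for a PID registration and resolution service."""

    def __init__(self, prefix="demo"):
        self.prefix = prefix              # hypothetical prefix, not a real one
        self.records = {}
        self.counter = itertools.count(1)

    def register(self, data, metadata):
        pid = f"{self.prefix}/{next(self.counter):04d}"
        checksum = hashlib.sha256(repr(data).encode()).hexdigest()
        self.records[pid] = {"data": data, "checksum": checksum,
                             "metadata": dict(metadata)}
        return pid

    def resolve(self, pid):
        return self.records[pid]


def process(registry, collection, operation, name):
    """Run one step on a virtual collection and register the result."""
    inputs = [registry.resolve(pid)["data"] for pid in collection]
    result = operation(inputs)
    # Self-documenting: the new object records what it was derived from.
    return registry.register(result, {"operation": name,
                                      "derived_from": list(collection)})


registry = Registry()
raw_a = registry.register([1.0, 2.0, 3.0], {"kind": "raw"})
raw_b = registry.register([4.0, 5.0], {"kind": "raw"})

collection = [raw_a, raw_b]               # a virtual collection is a list of PIDs
mean_pid = process(registry, collection,
                   lambda parts: sum(map(sum, parts)) / sum(map(len, parts)),
                   "mean")

record = registry.resolve(mean_pid)
print(mean_pid, record["data"], record["metadata"]["derived_from"])
```
      </preformat>
      <p>Because the result is registered like any other object, it can itself become part of a later collection, which closes the cycle.</p>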
    </sec>
    <sec id="sec-14">
      <title>5.9 RDA Summary</title>
      <p>RDA is still a very young initiative and its
success mainly depends on the willingness of data
practitioners to spend time on global and
cross-disciplinary<sup>11</sup> problem solving, on the quality of
their results, and on their uptake by scientific projects
worldwide. For TCP/IP in its early days, there was
nothing particular that distinguished it from other
suggestions. It was its layered approach and
robustly running code that finally convinced people
worldwide to adopt the standard. RDA still has
a lot to do to achieve similar success, and it needs strong
infrastructure pillars that provide and maintain
services.</p>
    </sec>
    <sec id="sec-15">
      <title>6. Infrastructure Pillars</title>
      <p>As described, RDA is only working on
specifications and it is neither providing services
nor maintaining code. It will rely on powerful
centres and federations to provide the
infrastructures that are finally required to transform
specifications into real services that enable efficient
data intensive sciences. In the same way we can
state that researchers in general are not so much
interested in specifications of interfaces, for
example, but in the services that will facilitate their
work. In a simplified way figure 4 indicates the
essential relationships between researchers as
consumers of facilitating services who would also
like to influence specification building to ensure
the emergence of useful services, infrastructures
that are built compliant to the specifications to
ensure interoperability of the services, and
initiatives such as RDA which establish the
specifications as a joint effort of data practitioners,
i.e. researchers and infrastructure providers.</p>
      <p>Fig. 4: It indicates schematically the essential
relationships between researchers,
infrastructures and the specification work such
as in RDA (diagram labels: researchers, influence,
specifications, facilitate, enable).</p>
      <p>10 The Handle System (http://www.handle.net/) is
such a global PID system supervised and managed
by the international DONA Foundation and it is
also the basis of the DOI and of other service providers
such as EPIC in Europe.</p>
      <p>11 RDA also includes some disciplinary groups
which are using the global nature of RDA to
achieve community agreements.</p>
      <p>
        Information infrastructures in our distributed
landscape of data and computational services get
very complex and involve several layers, which is
sketched in the diagram drawn by the High Level
Expert Group on Scientific Data (Figure 5) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
This diagram aims to work out the difference
between discipline specific and common services
that users (top layer) will probably use without
noticing who provides the services they are using.
Initiatives such as EUDAT were started to offer
common services (bottom layer) and thus to
complement the typical ESFRI layer (middle layer)
with many European research infrastructures in
various research disciplines.
      </p>
      <p>Fig. 5: It schematically indicates 3 layers of the
so-called Collaborative Data Infrastructure where
community based infrastructures offer community
specific services and e-Infrastructures offer common
cross-disciplinary services. This was seen by the EC as
a blueprint for funding programs.</p>
      <p>
        The first ESFRI roadmap from 2006 [
        <xref ref-type="bibr" rid="ref36">36</xref>
        ] led to
44 research infrastructures and thus to an intensive
and concerted European activity across many
disciplines. Most of these infrastructure initiatives
are heading towards building persistent distributed
information infrastructures.
      </p>
      <p>
        One example is the CLARIN initiative [
        <xref ref-type="bibr" rid="ref37">37</xref>
        ] in the
area of language resources and technology which
has recently achieved the status of an ERIC<sup>12</sup>.
CLARIN is based on strong and federated centres
in a variety of European countries that share the
effort in defining standards together with the
community, in aggregating digital language
resources, in offering joint services, in
managing and curating data with discipline specific
knowledge, and in other tasks. The services offered by
CLARIN include deposit possibilities, a joint
metadata catalogue called Virtual Language
Observatory [
        <xref ref-type="bibr" rid="ref38">38</xref>
        ], a distributed workflow tool
allowing users to analyse texts in various languages
and many smaller services. However, CLARIN
centres are not equipped to offer massive compute
power to all possible users from all over Europe
who may want to execute workflows or use large
storage systems to manage large data sets.
      </p>
      <p>Therefore research infrastructures such as CLARIN
liaise with e-Infrastructures such as
EUDAT and pay for such common services. All
research infrastructures from the different research
domains are looking for similar options if they are
data and compute oriented.</p>
      <p>
        The ESFRI organisation and the EC are still
actively starting new research infrastructures. To
come to an optimized eco-system of information
infrastructures all ESFRI projects and beyond (such
as Human Brain Project) are seeking collaborations
with e-Infrastructures such as PRACE [
        <xref ref-type="bibr" rid="ref39">39</xref>
        ] and
EUDAT to make use of the advanced services that
are offered by them.
      </p>
      <p>12 ERIC is a special organisational template
invented to allow ESFRI research infrastructures to
become European legal entities.</p>
    </sec>
    <sec id="sec-16">
      <title>6.1 EUDAT</title>
      <p>
        EUDAT is a federation of well-resourced and
partly national data and compute centres in various
countries as figure 6 indicates. Within its first three
years EUDAT invested all efforts in developing 5
basic services in collaboration with initially
5 communities<sup>13</sup> (climate modelling,
earth plate observation, human physiology,
biodiversity and language resources and
technology). B2SHARE, B2DROP and B2FIND
are services directed to the end users, meant for
dealing with long tail type data. B2SAFE is a
service that allows replicating large data sets
between a community centre and the EUDAT
centre network. The B2STAGE service is meant to
move data sets from the EUDAT store to the
workspaces of powerful computers of different
types (HPC, etc.) to carry out computations and to
return the results. All data in EUDAT are
registered, i.e. all digital objects have PIDs and are
associated with metadata to make them findable
and accessible.
      </p>
      <p>Fig. 6 shows the federation of centres across Europe that is the basis of EUDAT’s e-Infrastructure and the 5
basic user services it offers to the research community. In addition to the 5 user services it established
system services such as an authentication and authorisation infrastructure and a service to register and
resolve persistent identifiers.
      </p>
      <p>It should be added here that federating data
centres and their collections was and is a major
challenge and currently not scalable. The reason for
this can be found mainly in the data organisations
where each centre has chosen a different solution.</p>
      <p>This lack of interoperability leading to enormous
costs is one of the reasons why EUDAT is very
much interested in harmonised solutions being
worked out by RDA, for example in the DFT
group. Due to this close interest EUDAT declared
that it will try out RDA outputs where possible and
thus act as an RDA testbed in Europe.</p>
      <p>13 Currently EUDAT is closely interacting with 32
communities.</p>
      <p>EUDAT has just received its second funding grant, for 3
years, which needs to be used to stabilize and
improve the services being offered, work out a
sustainable funding model and look for
collaborations with other European
e-Infrastructures such as PRACE. This led to an
additional work item which is devoted to
improving the exchange of data between EUDAT
and PRACE and demonstrating this as an efficient
service with the help of concrete data and compute
bound projects. Future challenges are anticipated
by also strengthening the work on executing
automatic workflows. It is understood that data
science needs to turn increasingly often to
automatic and self-documenting workflows to
make its results reproducible. Yet the challenges to
let users quickly deploy and execute complex
software close to where the data is stored, i.e.
operate in a distributed environment, are huge and
severe barriers need to be removed. But EUDAT
needs to demonstrate that it finally can offer
services similar to Amazon and other companies
where users can execute their software in a virtual
machine environment and basically pay for the
cycles used.</p>
      <p>
        In the coming period EUDAT will also be faced
with a new initiative and request of the European
Commission in the realm of Open Science and
Innovation [
        <xref ref-type="bibr" rid="ref40">40</xref>
        ] called the European Open Science
Cloud. The EC wants to have a “cloud service” for
all European researchers without having defined its
exact specifications yet. A high level expert group
is being formed that will work out the
requirements. According to EC experts the term
“cloud service” is meant in the broad sense, i.e. it
needs to include the necessary structures for
persistent identifiers, metadata, relations, etc.
      </p>
    </sec>
    <sec id="sec-17">
      <title>6.2 National Data Service (NDS)</title>
      <p>
        In the USA, too, an attempt is being made under
the lead of NCSA [
        <xref ref-type="bibr" rid="ref41">41</xref>
        ] to set up a National Data
Service (NDS) [
        <xref ref-type="bibr" rid="ref42">42</xref>
        ] and to offer
cross-disciplinary data services similar to those of EUDAT in
Europe and ANDS [
        <xref ref-type="bibr" rid="ref43">43</xref>
        ] in Australia. The NDS is
an emerging vision for how scientists and
researchers across all disciplines can find, reuse,
and publish data. It wants to build on the data
archiving and sharing efforts already underway
within specific communities and to link them
together with a common set of tools.
      </p>
      <p>Currently the NDS is focusing on collaborations
with some communities to find out what kind of
services they are expecting. Yet the stakeholders
are still discussing which concept will be the best
to address the eminent challenges posed by the data
deluge and the need to optimize data sharing and
re-use in the USA. Recently the leading persons in
RDA US agreed to ask NDS to act as national
testbed center for RDA results.</p>
    </sec>
    <sec id="sec-18">
      <title>7. National Level Pillars</title>
      <p>Also at the national level in Europe new
organisational structures are being tested and
established to meet the challenges of data intensive
sciences.</p>
    </sec>
    <sec id="sec-19">
      <title>7.1 Max Planck Society</title>
      <p>In the Max Planck Society an IT Strategy
Committee was founded a few years ago to
advise on how to reshape the IT service
structure of the organisation to maintain the
competitiveness of its research. With the
introduction of parallel computers many years ago
the Computer Centre in Garching got the task to
provide not only high performance compute
capacity but also to provide expertise in
parallelising relevant domain specific software
codes for simulation and analytics. In collaboration
with domain experts such code was optimised
allowing optimal use of HPC architectures. The
optimal solution for such code parallelization was
thus found by bringing together expertise and
resources of each institute with central expertise
and resources such as storage capacity and compute
power. The strategy committee realized that the
huge increase of data and the challenges of data
intensive sciences require a new approach in so far
as it makes sense to also provide central expertise
and facilities in data management, curation and
analytics.</p>
      <p>As a consequence, the centre in Garching got a
new name (Max Planck Computing and Data
Facility, MPCDF) to indicate the change in focus,
and was extended with data experts having
expertise in the mathematics and algorithms of typical
data analytics applications, which are largely
independent of any discipline. The idea is to carry out
collaborations between the centre and the various
institutes and their departments that cannot invest
in the specific knowledge required and that do not
have the local resources to store and manage all
data and to carry out the required computations.</p>
      <p>
        We will use the NoMaD (Novel Materials
Discovery) Repository project [
        <xref ref-type="bibr" rid="ref44">44</xref>
        ] which has been
selected as one of the European Centres of
Excellence projects as an example for the typical
collaboration between a leading research institute
in the MPS and its MPCDF centre.
      </p>
      <p>Theoretical material scientists worldwide are
doing experiments with a number of well-known
chemical software packages (some at petascale
performance) to compute possible characteristics
for materials. These simulations are typically run
on HPC machines after having carried out deep
optimization of the software code tuned to certain
architectures. Until now the resulting data has been
used to write scientific papers, but was not
considered valuable as such. This attitude is
changing because, as in other research
disciplines, researchers see a value in re-using
data in different contexts, in allowing others to do
new kinds of computations and in avoiding duplication
of work. The repository is meant to be a centre for
storing results of simulation runs, identified
by DOIs and described by proper metadata. Thus
proper data organization and stewardship is the basis of
the work.</p>
      <p>Fig. 7 indicates the intention of the Novel
Materials Discovery (NoMaD) project to
federate and aggregate all data stemming
from material experiments to enable easy access
and re-use.</p>
      <p>In collaboration with the researchers of the
Fritz Haber Institute the MPCDF experts are developing
software to transform the incoming data to a
normalized and compressed format, developing the
repository software, the user upload, access and
search interfaces, and the needed data management
tools. In addition, novel analytic tools are being
developed in collaboration between the involved
centres to allow graphical searches, to carry out
machine-learning based comparisons on data sets,
to do smart visualizations supporting voyaging
methods, etc. Typically all these operations on the
aggregated data will be executed by making use of
“trivial” parallelization techniques such as enabled
by Map-Reduce methods on appropriate Hadoop
clusters, i.e. the repository will be hosted at
MPCDF and the computations will be carried out
on computers offered by MPCDF.</p>
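      <p>The kind of “trivial” parallelization meant here can be illustrated with a small map and reduce sketch in plain Python. It does not use Hadoop and is not the NoMaD implementation: the records, field names and the question asked (lowest computed energy per material) are made up for illustration. The point is only that each map call depends on a single record, so the map step could be distributed across many nodes before the partial results are combined.</p>
      <preformat>
```python
from functools import reduce

# Hypothetical, made-up records standing in for normalised simulation results.
RESULTS = [
    {"material": "Si", "code": "code-A", "energy": -5.42},
    {"material": "Si", "code": "code-B", "energy": -5.40},
    {"material": "GaAs", "code": "code-A", "energy": -4.09},
    {"material": "GaAs", "code": "code-B", "energy": -4.12},
    {"material": "Si", "code": "code-C", "energy": -5.45},
]


def map_record(record):
    """Map step: emit one (key, value) pair per record, independently."""
    return record["material"], record["energy"]


def reduce_pairs(acc, pair):
    """Reduce step: keep the lowest energy seen so far for each key."""
    key, energy = pair
    acc[key] = min(energy, acc.get(key, energy))
    return acc


def lowest_energy(records):
    pairs = map(map_record, records)       # each call is independent of the others
    return reduce(reduce_pairs, pairs, {})


lowest = lowest_energy(RESULTS)
print(lowest)
```
      </preformat>
      <p>On a cluster the map step would run where the data is stored and only the small (key, value) pairs would travel to the reduce step.</p>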
    </sec>
    <sec id="sec-20">
      <title>7.2 Approaches in NL</title>
      <p>
        In other countries, for example the
Netherlands, new strategies are being tested as well. In
addition to strengthening domain specific centres of
different types, new centres have been established
to structure the data landscape. DANS [
        <xref ref-type="bibr" rid="ref45">45</xref>
        ] and
3TU [
        <xref ref-type="bibr" rid="ref46">46</xref>
        ] have received the task to specialise in
data management and curation. They should make
use of the data services of the national data and
compute centre SARA [
        <xref ref-type="bibr" rid="ref47">47</xref>
        ]. In addition the
eScience Centre [
        <xref ref-type="bibr" rid="ref48">48</xref>
        ] has been established to run
collaborative projects in which discipline experts
work together with experts whose centrally aggregated
expertise is shared, to meet the challenges of data intensive
science. All these national service providers are
requested to synchronise their activities to come to
an efficiently organised eco-system of
infrastructure pillars and services.
      </p>
    </sec>
    <sec id="sec-21">
      <title>8. Conclusions</title>
      <p>Data Intensive Science (DIS) is one facet of the
digital change which we are currently experiencing
and which will substantially change not only science but also
societies. If DIS is to be open to
many, so that its full innovative power can be exploited, rather than
exclusive to a few, it will depend on a change of
culture towards open data and accessibility of
services. In the European Union and its member
states, community-driven research infrastructures
and e-Infrastructures tackling common
cross-disciplinary challenges have been started to address
the needs for an efficient eco-system of services
enabling data intensive work. The US did not make
this distinction, but under the term
“cyberinfrastructure” both community-driven and
more commons-driven projects were initiated.</p>
      <p>After almost a decade of experience in
infrastructure building it is obvious that there are
still many social and technical barriers prohibiting
efficient and cost-effective data usage and
reproducible results. In fact one can argue that only
active infrastructure building made many of the
barriers visible to all stakeholders. About 15 years
passed between the invention of TCP/IP and its broad
uptake to enable efficient communication between
compute nodes. Several data
scientists and infrastructure builders from mainly
Europe, US and Australia agreed that it is time to
accelerate the process of overcoming the many
barriers for efficient data usage since waiting for
another decade to overcome the most severe
barriers is not acceptable. Setting up the RDA based on
similar principles as IETF (bottom-up, rough
consensus, running code, lean governance) was the
preferred choice of the data experts and this choice
was supported by the funding organizations.
With this background in mind it is not surprising
that almost all strong European infrastructure
centres are very active in EUDAT as well as in
RDA and that for example also ANDS and NDS
engage actively in RDA. The Max Planck
Computing and Data Facility for example will
coordinate RDA Europe from September 2015, and
its members are in the Technical Advisory Board,
are co-chairing the Data Foundation and Terminology
and Data Fabric Interest Groups and are leading a
Work Package in EUDAT. SARA and DANS, for
example, are also leading activities in EUDAT and
are actively engaged in RDA groups. NDS is, for
example, co-chairing the Data Fabric Interest
Group, and ANDS is represented in the Council and
Technical Advisory Board of RDA.</p>
      <p>In addition to accelerating global agreement
finding to improve data sharing and re-use, and thus
to enable inclusive data intensive science, two main
reasons can be mentioned for this engagement: a)
engaging their experts in cutting-edge developments
will make them fit for the coming challenges and b)
bringing in their expertise will influence decision
making. So far RDA is too young to draw final
conclusions on whether the
expectations were met.</p>
      <p>We need to accept that the data landscape is
changing rapidly and that new structures that have
been set up to facilitate data intensive sciences are
often still in a test phase. Essential questions in the
data domain are not yet fully answered, such as:
Which persistent structures need to be funded in
addition to libraries that often do not yet have the
skills to participate in the emerging data services
domain? What is the optimal division between
discipline specific and common services? What is
the best way to share specialised and
expensive data experts that are scarce? Which are
the common components that need to be specified
to come to global, interoperable and
well-maintained services supporting data intensive
sciences optimally?</p>
      <p>The EU and several of its member states as well
as the US decided to take an active role to exploit
the possibilities by taking concrete actions and by
asking data science experts to develop and test out
bottom-up driven models.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] MPI for Psycholinguistics, http://www.mpi.nl</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Tony</given-names>
            <surname>Hey</surname>
          </string-name>
          et al.,
          <source>The Fourth Paradigm - Data Intensive Scientific Discovery</source>
          ,
          <year>2009</year>
          , http://research.microsoft.com/en-us/collaboration/fourthparadigm/4th_paradigm_book_complete_lr.pdf
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>John</given-names>
            <surname>Wood</surname>
          </string-name>
          et al.,
          <source>Riding the Wave Report</source>
          ,
          <year>2012</year>
          , http://cordis.europa.eu/fp7/ict/e-infrastructure/docs/hlg-sdi-report.pdf
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] Human Brain Project: https://www.humanbrainproject.eu/</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Herman</given-names>
            <surname>Stehouwer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Peter</given-names>
            <surname>Wittenburg</surname>
          </string-name>
          ,
          <source>RDA Data Practice Report</source>
          ,
          <year>2014</year>
          , http://europe.rd-alliance.org/sites/default/files/RDA-Europe-D2.5-Second-Year-Report-RDA-Europe-ForumAnalysis-Programme.pdf
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <source>ESFRI Roadmap</source>
          ,
          <year>2006</year>
          , http://ec.europa.eu/research/infrastructures/index_en.cfm?pg=esfri-roadmap
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] Open Access, http://en.wikipedia.org/wiki/Open_access</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] http://research.microsoft.com/en-us/um/redmond/events/fs2010/presentations/michener_environ_data_mgmt_rfs_71210.pdf</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] Data Seal of Approval: http://datasealofapproval.org/en/</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] TCP/IP Protocol: http://en.wikipedia.org/wiki/Internet_protocol_suite</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Herman</given-names>
            <surname>Stehouwer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Peter</given-names>
            <surname>Wittenburg</surname>
          </string-name>
          ,
          <article-title>Principles for Data Sharing and Re-use: are they all the same?</article-title>
          ,
          <year>2015</year>
          , http://hdl.handle.net/11304/1aab3df4-f3ce-11e4-ac7e-860aa0063d1f
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Peter</given-names>
            <surname>Wittenburg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Leif</given-names>
            <surname>Laaksonen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Herman</given-names>
            <surname>Stehouwer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Raphael</given-names>
            <surname>Ritz</surname>
          </string-name>
          ,
          <source>Living with Data Management Plans</source>
          ,
          <year>2015</year>
          , http://hdl.handle.net/11304/ea286e5a-f3d1-11e4-ac7e-860aa0063d1f
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] ICRI 2012 Conference Copenhagen: http://www.icri2012.dk/www.ereg.me/ehome/index06e1.html</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] Research Data Alliance: http://rd-alliance.org</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Naoyuki</given-names>
            <surname>Tsunematsu</surname>
          </string-name>
          ,
          <source>RDA plenary Keynote</source>
          , San Diego,
          <year>2015</year>
          : https://rd-alliance.org/keynotenaoyuki-tsunematsu.html
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Daniel</given-names>
            <surname>Castro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Travis</given-names>
            <surname>Korte</surname>
          </string-name>
          ,
          <source>Open Data in the G8</source>
          ,
          <year>2015</year>
          , http://www2.datainnovation.org/2015-open-datag8.pdf
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] Research Data Alliance - Europe, http://europe.rd-alliance.org</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] RDA Case Statements, https://rd-alliance.org/working-and-interest-groups/casestatements.html</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] ESFRI Projects, http://ec.europa.eu/research/infrastructures/index_en.cfm?pg=esfri</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>EU</surname>
          </string-name>
          e-Infrastructures, http://cordis.europa.eu/fp7/ict/e-infrastructure/
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <surname>EUDAT</surname>
          </string-name>
          e-Infrastructure, http://www.eudat.eu
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Bernard</given-names>
            <surname>Schutz</surname>
          </string-name>
          et al.,
          <source>RDA Europe Science Workshop Report</source>
          ,
          <year>2014</year>
          , http://europe.rd-alliance.org/documents/publications-reports/rdaeurope-science-workshop-report
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>John</given-names>
            <surname>Wood</surname>
          </string-name>
          et al.,
          <source>The Data Harvest</source>
          ,
          <year>2014</year>
          , https://europe.rd-alliance.org/documents/publications-reports/dataharvest-how-sharing-research-data-can-yieldknowledge-jobs-and
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24] RDA Data Foundation and Terminology WG, https://rd-alliance.org/groups/data-foundation-andterminology-wg.html
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25] RDA Data Type Registry WG, https://rd-alliance.org/groups/data-type-registries-wg.html
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26] RDA PID Information Type WG, https://rd-alliance.org/groups/pid-information-types-wg.html
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27] RDA Practical Policy WG, https://rd-alliance.org/groups/practical-policy-wg.html
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28] RDA Metadata Standards Directory WG, https://rd-alliance.org/groups/metadata-standardsdirectory-working-group.html
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29] RDA Data Citation WG, https://rd-alliance.org/groups/data-citation-wg.html
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30] RDA Repository Audit and Certification WG, https://rd-alliance.org/groups/repository-audit-andcertification-dsa%E2%80%93wds-partnershipwg.html
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <source>Data Seal of Approval</source>
          , http://datasealofapproval.org/en/
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <source>World Data Systems</source>
          , https://www.icsu-wds.org/
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33] RDA Adoption Day, San Diego,
          <year>2015</year>
          , https://www.rd-alliance.org/plenary-meetings/fifthplenary/programme/adoption-day.html
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34] RDA Data Fabric IG, https://www.rd-alliance.org/group/data-fabric-ig.html
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>Bridget</given-names>
            <surname>Almas</surname>
          </string-name>
          et al.,
          <source>Data Management Trends, Principles and Components - What Needs to be Done Next?</source>
          ,
          <year>2015</year>
          , http://hdl.handle.net/11304/33430f2e-f598-11e4-ac7e-860aa0063d1f
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <source>ESFRI Roadmap</source>
          <year>2006</year>
          , http://ec.europa.eu/research/infrastructures/index_en.cfm?pg=esfri-roadmap&amp;section=roadmap-2006
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>[37] CLARIN Research Infrastructure, http://www.clarin.eu/</mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38] CLARIN Virtual Language Observatory, http://clarin.eu/content/virtual-languageobservatory
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39] PRACE e-Infrastructure, http://www.prace-ri.eu/
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [40] EC Open Science and Innovation, http://ec.europa.eu/research/conferences/2015/eraof-innovation/index.cfm
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [41] National Center for Supercomputing Applications, http://www.ncsa.illinois.edu/
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          [42] National Data Service, http://www.nationaldataservice.org/
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          [43] Australian National Data Service, http://www.ands.org.au/
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          [44] NoMaD - Novel Materials Discovery Project, http://nomad-repository.eu/cms/
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          [45] Data Archiving and Networked Service, http://www.dans.knaw.nl/nl
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          [46]
          <source>3TU Data Centrum</source>
          , http://datacentrum.3tu.nl/home/
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          [47] SURF Sara, https://surfsara.nl/
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>
          [48] Netherlands eScience Center, https://www.esciencecenter.nl/
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>