=Paper=
{{Paper
|id=Vol-1536/paper1
|storemode=property
|title=Specifying and Implementing Data Infrastructures Enabling Data Intensive Science
|pdfUrl=https://ceur-ws.org/Vol-1536/paper1.pdf
|volume=Vol-1536
|dblpUrl=https://dblp.org/rec/conf/rcdl/WittenburgS15
}}
==Specifying and Implementing Data Infrastructures Enabling Data Intensive Science==
Specifying and Implementing Data Infrastructures
Enabling Data Intensive Sciences
© Peter Wittenburg, Herman Stehouwer
Max Planck Data and Compute Center, Garching/Munich
peter.wittenburg@mpi.nl, herman.stehouwer@rzg.mpg.de
Proceedings of the XVII International Conference «Data Analytics and Management in Data Intensive Domains» (DAMDID/RCDL’2015), Obninsk, Russia
Abstract
Examples from Psycholinguistics – a humanities discipline – show that data intensive research is changing all scientific disciplines dramatically. Data intensive sciences pose unprecedented challenges in data management and processing. A survey in Europe showed clearly that most research departments are not prepared for this step and that the methods used to manage, curate and process data are inefficient and too costly. The Research Data Alliance (RDA), a bottom-up organized, global and cross-disciplinary initiative, has been established to accelerate the process of changing data practice. After only two years RDA produced its first concrete results, which now have to demonstrate their potential. In particular, the infrastructure builders are requested to act as early adopters of RDA results. The European Commission and its member states have taken serious steps to establish an eco-system of research infrastructures and e-Infrastructures anticipating the challenges imposed by the data deluge, which will enable broad uptake of the paradigm of data intensive science. Research organisations have recognised these challenges as well and have taken first steps to adapt their structures. However, we need to understand that we are in a phase of gigantic changes, which implies that the measures currently being taken need to be interpreted as tests on the way to new solid and sustainable structures.
1. Enabling Data Intensive Sciences
Quite a number of scientific institutes have been data oriented for a long time already. For instance, most of the research of the experimental and theoretical institutes of the Max Planck Society was based on data. Even an institute that belongs to the humanities section of the Max Planck Society, such as our former affiliation, the Institute for Psycholinguistics [1], was oriented from the start towards the analysis of speech, eye movement and gesture recordings, detecting meaningful patterns, and building models to simulate speech perception. In physics institutes (fusion, astronomy, etc.) of course much larger volumes of data were being processed and they can look back on a much longer history of data oriented work.
It was the book "The Fourth Paradigm – Data Intensive Scientific Discovery" [2], edited by Tony Hey and colleagues, that introduced “data intensive science” as the 4th paradigm of scientific discovery by referring to a talk given by J. Gray. It drew much attention to the concept behind this new paradigm. Gray distinguishes 4 paradigms that co-exist today: (1) Empirical Science describing natural phenomena, (2) Theoretical Science using models to achieve generalizations, (3) Computational Science simulating complex phenomena and (4) Data Exploration unifying theory, experiment and simulation. Indeed, we can observe that science is changing in so far as finding meaningful patterns in data sets becomes an essential approach. Increasingly more powerful and numerous sensors, improved network connections, more powerful and numerous computers and more advanced algorithms are key pillars for this development. The "Riding the Wave" [3] report created by a High Level Expert Group of the European Commission (EC) was one of the documents that summarized the specific data challenges and opportunities, and requested actions by the EC to enable data intensive sciences for a large number of researchers and not only for those that have sufficient funding to curate all the data and software that need to be integrated to make use of it.
We see a number of trends which we can summarize as follows:
• An increasing number of research disciplines adopted data intensive methods due to new technological and methodological possibilities. During the last decades these changes were extreme in biological and neurological disciplines.
• The amount of data and its complexity in terms of creation contexts, data types and relations are increasing dramatically.
• The Internet allows us to offer data via the web to be re-used by others.
• This enables us to combine data sets in new ways across institutional, national and discipline borders.
• Mathematical methods have advanced to cope models and in both examples data cannot come
with heterogeneous data sets and we see large from one project or institute, but from many
libraries with statistical, stochastic and research labs. Researchers doing this kind of
machine learning methods becoming available. research know how difficult it is to find, access and
• The total amount of available CPU and storage combine the required data. Such research is very
capacity allows researchers to do large cost intensive and raises the questions whether we
amounts of computations on increasingly large can continue without serious changes, and whether
data sets. the available infrastructures are sufficient.
Despite the increase in compute capacity, 2. Human Brain Project
however, we can also observe an increasing An even more extreme example for the shift
analysis gap, i.e. the fraction of data we are able to towards the 4th paradigm is taken from life
process in a way that we can extract knowledge is sciences. The recently started Human Brain Project
getting smaller. The reasons for the analysis gap (HBP) [4] (as an EC flagship project) has as visions
are many and not subject of discussion in this (a) to be able to simulate at physiological level
paper. first rat brains and in a follow up phase human
brains (in silico experiments) and (b) to predict
Two examples taken from a humanities brain diseases from patterns found in recorded data
discipline show the fundamental changes towards sets at an early stage. The main goal of the latter
data intensive science that could not have been (medical informatics sub project in HBP) is being
carried out a few years ago. When studying for illustrated in figure 1. Researchers would like to
example the evolution of human languages over correlate observed phenomena such as specific
thousands of years linguists until recently based deficits due to brain diseases with all types of
their theories on comparing fragmented recordings that can be found from corresponding
descriptions of colleagues about several languages. patients such as brain images of different types,
Currently, large feature matrices are extracted gene sequences, protein data and perhaps even
describing characteristics of all languages in a reaction time measurements. Without having a
particular region such as for example those spoken
in Austronesia and these matrices are fed into
phylogenetic algorithms to calculate most probable
dependency trees that indicate how languages may
have influenced each other over thousands of years.
For this research a large database is required and
also more powerful computers are needed than
linguists were using traditionally to let the
algorithms generate meaningful optima.
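As an illustration only, the step from a feature matrix to a tree can be sketched in a few lines. The sketch below assumes a tiny, invented binary feature matrix (the language names and feature values are hypothetical) and uses UPGMA, a simple distance-based clustering method, as a stand-in for the phylogenetic algorithms mentioned above. It is not the method used in the research referred to, which needs far larger matrices and more powerful inference.

```python
from itertools import combinations

# Hypothetical binary feature matrix: one row per language, one column per
# typological feature (1 = feature present). Names and values are invented.
FEATURES = {
    "LangA": [1, 1, 0, 0, 1, 0],
    "LangB": [1, 1, 0, 1, 1, 0],
    "LangC": [0, 0, 1, 1, 0, 1],
    "LangD": [0, 0, 1, 1, 1, 1],
}


def hamming(a, b):
    """Fraction of features on which two languages differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def upgma(features):
    """Build a tree by repeatedly merging the two closest clusters (UPGMA)."""
    # Each key is a tree (a name or a nested tuple); the value lists its members.
    clusters = {name: [name] for name in features}

    def dist(c1, c2):
        # Average pairwise distance between the members of two clusters.
        pairs = [(m1, m2) for m1 in clusters[c1] for m2 in clusters[c2]]
        return sum(hamming(features[m1], features[m2]) for m1, m2 in pairs) / len(pairs)

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda p: dist(*p))
        members = clusters.pop(c1) + clusters.pop(c2)
        clusters[(c1, c2)] = members
    return next(iter(clusters))


def to_newick(tree):
    """Serialise the nested tuples as a Newick-style string."""
    if isinstance(tree, str):
        return tree
    return "(" + ",".join(to_newick(t) for t in tree) + ")"


if __name__ == "__main__":
    print(to_newick(upgma(FEATURES)) + ";")
```

With the invented matrix above, the two pairs of languages that differ in only one feature are grouped first and then joined at the root. Real studies replace the Hamming distance and the greedy merging by probabilistic models of feature change, which is where the need for more compute power arises.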
Fig. 1: One example of this new paradigm as it is used in the neurosciences (HBP). Phenomena such as those created by specific brain diseases can be observed, yet there is no chance to model the complexity of the human brain well enough to make statements about their physiological origins. Data from various sources are correlated with the phenomena to find those patterns in the data that are causing the observed deficits.
The application of massive crowdsourcing techniques in linguistics, for example to understand human communication including multimodal interaction, can be used as another example to indicate the dramatic changes in research towards a data centric perspective. These techniques generate many parallel data streams originating from smartphones that need to be annotated immediately by machine processing tools to make them available for scientific studies. This automatic annotation requires smart pre-processing and smart data management. In this setup an increasing number of parallel operating detectors must be trained to detect patterns in speech and video streams in real time with the help of stochastic machines. It is simply the sheer amount of data that requires new ways of processing to enable this
type of research leading to better assumptions
about what guides our interactions.
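The setup of many detectors operating in parallel on many incoming streams can be sketched as follows. This is a minimal illustration under stated assumptions: the stream identifiers, the sample values and the two threshold-based detectors are hypothetical stand-ins for trained stochastic models, and a thread pool stands in for whatever real-time infrastructure such a project would actually use.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical incoming chunks: stream id -> list of signal energy samples.
STREAMS = {
    "phone-1/audio": [0.1, 0.9, 0.8, 0.2],
    "phone-2/audio": [0.0, 0.1, 0.7, 0.1],
}


def speech_detector(samples):
    """Toy stand-in for a trained model: flag samples with high energy."""
    return [i for i, s in enumerate(samples) if s > 0.5]


def silence_detector(samples):
    """Toy stand-in for a trained model: flag samples with very low energy."""
    return [i for i, s in enumerate(samples) if s < 0.15]


DETECTORS = {"speech": speech_detector, "silence": silence_detector}


def annotate(streams, detectors, max_workers=4):
    """Run every detector on every stream in parallel and collect annotations."""
    jobs = [(sid, name) for sid in streams for name in detectors]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        hits = pool.map(lambda job: detectors[job[1]](streams[job[0]]), jobs)
        # Keys are (stream id, detector name); values are the flagged sample indices.
        return {job: found for job, found in zip(jobs, hits)}


if __name__ == "__main__":
    for (stream, detector), indices in annotate(STREAMS, DETECTORS).items():
        print(stream, detector, indices)
```

Each (stream, detector) pair is an independent job, so adding a stream or a detector only adds jobs. That independence is what makes the annotation step scale with the number of smartphones contributing data.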
model of the human brain at hand this correlation
The basis of such methods as described in the would allow researchers nevertheless to detect
example above is the availability of large amounts co-occurring patterns in the data that seem to cause the
of data to estimate the many free parameters of the observed phenomena. Machine learning methods
are used to generate meaningful signatures from physical features in the data that can then be used to predict potential diseases in patients.
No assumptions are made about the structure and functioning of the brain, no assumptions are made about how genes may influence brain structure and functioning, etc., since we do not have sufficient knowledge in these areas. Nevertheless, by using a large database of aligned data it is assumed that researchers can relate physical patterns with phenomenological observations, first for early prediction, but later also for improved medication. Full brain simulations will typically cover spatial scales from nanometers (proteins) to centimeters (brain) and energy scales from 10 femtojoules at the biological level (genome, transcriptome, proteome) up to 1 joule at the complex brain level (cognition).
To achieve its goals the HBP defined in total 13 sub-projects, each of them having the size of a large project. Here we will briefly describe the new informatics-based platforms that are meant to offer the research community the possibility to work on human brain issues with the help of a set of strong and highly integrated tools:
• Neuroinformatics (searchable atlases and analysis of brain data)
• Brain Simulation (building and simulating multi-level models of brain circuits and functions, including for example models of neural microcircuits of up to a million neurons)
• Medical Informatics (see figure 1)
• Neuromorphic Computing (brain-like functions implemented in hardware)
• Neurorobotics (testing brain models and simulations in virtual environments)
• High Performance Computing (providing the necessary computing power by architectures that allow memory intensive applications and new ways of visually interacting with simulations)
HPC facilities at 4 centers can be used for the purposes of the HBP: Jülich (6 petaflops peak, 450 TB memory, 8 PB scratch file system), allowing simulations of up to 100 million neurons (scale of a mouse brain); Swiss CSCS (836 teraflops peak, 64 TB, 4 PB), in particular for software development and optimization; Barcelona SC (1 petaflops peak, 100 TB) for molecular-level simulations; and CINECA (2 petaflops, 200 TB, 5 PB), mainly for data analytics. In addition KIT Karlsruhe provides 3 PB of storage. All centers are linked with 10 Gbit/s. In the neuromorphic area SpiNNaker chips are being used that have 18 cores and share 128 MB RAM, allowing the simulation of 16,000 neurons with 8 million plastic synapses within a 1 W energy budget.
The goals are ambitious¹ and it is admitted that the gap between physiological modeling and cognition is still huge. However, the HBP indicates how data intensive science is pushed to its extremes in the life sciences: (a) huge amounts of data addressing many different levels of brain organization are needed to feed the atlases and to enable the analyses needed to feed and test the validity of the models, and (b) much computer power will be required to carry out the necessary computations, first within the project and afterwards by the interested researchers.
¹ It should be mentioned that there is a broad debate about the question whether the ambitions of the HBP are realistic.
In addition to the problems described in the next section, the HBP is confronted with difficult privacy and ethical issues making access to data even more problematic. Distributed data mining solutions, for example, are being investigated to overcome these problems.
3. Data Practices
A large survey about data practices [5], based on some 120 interactions with data practitioners² from various disciplines, and two RDA Europe workshops with leading European scientists [22] made very clear that the current data practices are not adequate to support such data intensive science in an efficient and cost-effective way.
² The term "data practitioner" is used here as a term describing the skills of data scientists, data managers, data stewards, data librarians, etc., since most of these terms are not well-defined yet.
The major findings of this survey can be summarized as:
• The ESFRI (European Strategy Forum on Research Infrastructures) [6] discussion process and its project initiatives, as well as recent developments in e-Infrastructures, raised much awareness about data issues, the practices and the interaction processes around data management and access crossing discipline boundaries.
• Open Access [7] to publications and now also to data is widely supported, but in practice there are so many hurdles that most data is still not available.
• Finding data that is re-usable for data intensive sciences using the web requires new mechanisms to establish trust. At this moment we are lacking such mechanisms.
• There is much legacy data out there, the integration of which into our re-usable data domain will cost an enormous amount of curation and thus funds. In addition, we are
still creating legacy-style data despite all advancements since it is not suitably organized and described, which is mainly due to a lack of trained experts and appropriate software.
• There is an increasing pressure on almost all departments to participate in data intensive sciences, but researchers see a lack of expertise in adequate data management and of workflow creation/maintenance skills. Currently researchers need to spend a large fraction of their time (in some cases up to 75%) to find, access and curate data to make it fit for their needs. In addition, the practice of many researchers of working with manual steps or with ad hoc scripts does not lead to reproducible science.
• Data management is still widely based on file systems, which do not allow capturing the increasing amount of “logical” information about the data (persistent identifiers, metadata, rights, relations, etc.). Ad hoc solutions are being used, amplifying the problem of "increasing data entropy" as W. Michener [8] called it (see figure 2).
• The use of persistent identifiers and metadata, which would help in identifying, finding and re-using data, is still in its infancy. Ad hoc solutions such as handling spreadsheets only work for the duration of projects and leave chaos afterwards given the increasing amount of data.
• Despite some efforts for specific databases there is in general a lack of explicitness with respect to structure and semantic descriptions of the content of data, which creates inefficiencies, in particular when users do not have direct relations with the creators.
• There is a clear trend towards using "trusted" centres which offer researchers the possibility to host, manage and access their data. However, there are many hurdles for centres to offer cross-border services, although economy of scale factors indicate that much can be gained due to the available expertise. Existing certification methods such as those defined by the Data Seal of Approval [9] need to be applied by the centres to raise the level of trust.
• It is widely agreed that there is a lack of expertise and knowledge about data issues (principles, organization, curation, etc.) and that we need to train a new generation of data practitioners. It is this lack of experts and expertise that hampers progress.
Fig. 2: The typical decrease of available information about stored data over time, as described by W. Michener, which results in great problems in making use of data. There are various factors and moments that lead to this decrease of information, such as when PhDs leave an institute without having documented their data properly, which is a very well-known phenomenon. Assigning persistent identifiers and creating appropriate metadata would help to reduce the speed of losing information.
Senior scientists agree that changes in data practices are urgently needed, but they hesitate to take steps for mainly two reasons:
• they lack guidance towards certain agreed solutions, which prevents investments,
• they lack the experts that would turn investments into appropriate solutions.
4. Achieving Changes through RDA
This raises the questions of who can give guidance in navigating the huge solution space with respect to data issues and how we can train the new generation towards harmonized solutions that guarantee more efficiency and cost-effectiveness, which finally will boost data intensive sciences. Here we would like to refer to the early phases of the Internet, where many solutions were suggested with different competing approaches. It took about 15 years until agreements on simple principles such as TCP/IP [10] for global networks were accepted. Basically these agreements led to the boost of connectivity from which we now profit.
Quite a number of policy level initiatives have established rules and principles and there seems to be wide agreement [11] about them. An increasing number of funders are also requesting that so-called data management plans be added to grant applications, which certainly raises the level of awareness about data issues for many researchers. But due to the problems described above there is also great uncertainty about how to create such plans so that they make sense for the many data use cases [12]. An increasing conviction emerged among some data practitioners and some funders that an acceleration of the process of coming to agreements that help changing data practices is urgently required. The Internet history seems to offer a possible approach: complement the policy level efforts by an essentially bottom-up driven initiative where data practitioners work on urgent barriers that need to
be overcome. To this end a first international workshop was organized at the ICRI conference 2012 [13] under the name "DAITF", which stands for Data Access and Interoperability Task Force. A joint effort of mainly European, US and Australian experts and funders then led to the birth of the Research Data Alliance (RDA) [14] in autumn 2012. We like to point to the similarity of some characteristics with the Internet Engineering Task Force; however, it is obvious that the data domain has many more facets and challenges to deal with.
We would like to cite Naoyuki Tsunematsu (Senior Advisor of the Japanese Council for Science and Technology) who pointed to two observations that are relevant in this context and that motivated Japan to join the Research Data Alliance [15].
• The value proposition for publicly funded research is about "stimulating competitiveness", but a new strand needs to be added, which is "knowledge discovery on smart data collections", where professional infrastructures and human skills are the key factors for success.
• There seems to be a correlation between a lack of motivation to share data in the Japanese academic world, and thus a lack of openness, and a decrease in the number of top-level international collaborations and of top-level papers, which is a concern for policy makers in Japan⁴.
⁴ The recent G8 Open Data Report [16] indicates that in the rating between G8 members Germany and Russia are even behind with respect to openness of data.
After the workshop at ICRI 2012 the European Commission, NSF and NIST in the US and the Australian Government accepted grant proposals from key experts in their respective regions that allowed the practitioners to start the RDA work, i.e. funding is given to consortia in the three regions. As one branch, the RDA Europe [17] project was funded as a regular EC project. The 3rd RDA Europe project will start as early as September 2015, allowing us to continue the work, and the EC’s new draft work programme 2016/17 indicates future perspectives for RDA. First, a steering board was established between the three funded initiatives to define a governance structure and procedures for RDA, and it started stimulating the practical work.
RDA decided to have a very simple structure where the key roles are given to the Working Groups and Interest Groups⁵ that meet at plenaries and other meetings. Every RDA member can decide to initiate such a group; to be successful, a case statement needs to be submitted that must fulfil a number of criteria [18]. A Council was set up that has an overseeing role to ensure balanced progress and adherence to quality rules and processes. A Technical Advisory Board that is elected by the RDA members⁶ will give advice to all actors on content aspects, i.e. respond to questions such as “do the intentions of the Working and Interest Groups meet the scope of RDA, do they fulfil the established requirements, do they involve existing and relevant initiatives, do they intend to remove practical barriers, etc.”. An Organisational Advisory Board that represents all organizations that are organizational members, and thus contribute some funds to the success of RDA, gives advice on organizational and administrative issues. In addition RDA has a Secretariat that needs to organise the plenaries, keep control of the processes and do a variety of other administrative and organisational tasks. A General Secretary has been appointed to lead the secretarial work and to take responsibility for managing RDA Global.
⁵ It should be noted here that the major difference between the two groups is that the WGs need to come up with tangible results after 18 months.
⁶ Everyone who agrees with the basic rules of RDA can become a member by registration.
While RDA Global is the platform where agreements are being achieved in the form of guidelines, procedures, interface and protocol specifications to overcome barriers, the regional branches such as RDA Europe have the task to raise awareness about RDA in their region, convince experts to participate, interact with many stakeholders to understand the needs and priorities, organize the adoption of RDA results, take care of training and education and contribute to the costs of RDA Global. RDA Europe for example organises a number of meetings to meet these requirements, such as interactions with the EC and member state ministries, European science organisations, leading European scientists, large scale European research infrastructures such as ESFRI projects [19], e-Infrastructures [20] such as EUDAT [21], and many research communities. The meetings with leading scientists [22] are of great importance and have led to useful recommendations for RDA, most of which will be implemented by RDA Europe from September 2015 on. The interactions with policy stakeholders led for example to the Data Harvest Report [23] setting priorities.
5. Early RDA Results
Thus RDA's mission is about building the many social and technical bridges that are required to make data intensive work much more efficient and thus to allow many researchers to participate in
extracting knowledge by processing virtual collections consisting of data coming from various providers, increasingly often across disciplines and borders. Here we want to briefly indicate the major results of the first working groups that finished after roughly 20 months (or that will finish within the coming few months) and their possible impact on changing practices.
5.1 Data Foundation and Terminology (DFT)
Based on many use cases from various disciplines and countries the DFT Working Group [24] came up with a simple core data model and a terminology for registered data. It introduces the notion of the Digital Object, which is represented by a bitstream, can be stored in various repositories, is identified by a persistent identifier and is described by metadata. The model includes a few further definitions, but it is important to note that these definitions are fundamental and independent of disciplines. If scientists worldwide adhered to such a simple model, we could much more easily understand each other when talking about data and would be able to build harmonized software leading to much higher interoperability.
5.2 Data Type Registries (DTR)
The DTR group [25] created a specification for data type registries that allow users to link data types of various sorts with functions (executable code). Data types can be simple types such as semantic categories (temperature, noun, etc.) or complex types such as scientific digital objects (complex annotated images, time series, tables, etc.). DTRs can be used for example to carry out mappings automatically when simple types such as “temperature” occur, or to start visualization software when complex types are found. Such DTRs would overcome the problems we so often have with unknown data types which we receive and do not know how to process and interpret. Thus we see an enormous impact for DTRs in daily practice.
5.3 PID Information Types (PIT)
The PIT working group [26] produced a common API (Application Programming Interface) to unify access to Persistent Identifier (PID) service providers. Currently there are different PID systems (Handle/DOI⁷, ARK, etc.) and many different service providers, all having their own regulations, which makes it very cumbersome to get for example the checksum of a Digital Object to check its identity and integrity. Applying this unified API together with some basic data types such as “checksum” would allow application programmers to simply provide one piece of software allowing them to deal with all PID service providers in the same way. Since PIDs will have such a central role in data management and access, the impact of a unified API will be enormous.
⁷ DOIs are Handles with a special prefix and are used to refer to published collections. Handle/DOI services are available worldwide.
5.4 Practical Policies (PP)
Data management and curation in particular are guided by specific policies which are then turned into executable procedures such as "replicating a data collection" or "checking digital objects' integrity" that are mostly used in federated environments. The PP group [27] is collecting many such practical policies from various institutions and projects, analysing and evaluating them and suggesting best practices which can then be offered as templates for proven operations. Thus, these templates have the potential to increase the trust level. The work of the group will not end, since there are so many areas where best practices can improve the quality and reproducibility of data practices. In collaboration with the EUDAT project the group is working on an open registry standard for such best practice PPs.
5.5 Metadata Standard Registry (MDR)
As has been described, the usage of proper metadata is still in its infancy and there are many reasons for this. One reason certainly is that many labs still do not know which metadata they should use, where they can find suitable vocabularies and tools, etc. The MDR group [28] offers a registry which allows researchers to look for the most suitable metadata schemas. This MDR will therefore help data practitioners that are looking for proper metadata solutions. More work in the metadata area is going on within RDA.
5.6 Data Citation (DC)
The Data Citation group [29] worked out suggestions of how to cite so-called dynamic data, i.e. data that changes while people are already working with it and referring to it. All data coming in from seismological sensors, for example, will immediately be used when it becomes available for processing, even if data samples in the sequences are missing, for example due to transmission delays. How can researchers refer back to these incomplete versions of data? This is a problem that many disciplines have, and this group worked out a suggestion how to solve this citation problem so that it could be implemented in all software and procedures.
5.7 Repository Audit and Certification (RAC)
As indicated above, quality assessment of repositories (centres) is increasingly important to raise the level of trust and the RAC group [30]
wants to come up with a unified standard. A few suggestions have been made, such as by the Data Seal of Approval [31] and the World Data System [32]. These two suggestions are already widely used and so similar that the responsible initiatives decided to join forces to make their guidelines compatible with each other. It is widely agreed that the resulting set of guidelines is a good basis to certify trusted repositories worldwide⁸.
⁸ We note here that there are several further certification schemes that go more in-depth on specific aspects, such as the “Security for Collaborating Infrastructures Assessment and Modification Record” (SCI) for security aspects, or the NESTOR seal (based on DIN 31644) or ISO 16363 certification for general data repository aspects. The DIN and ISO certifications are extremely detailed and thorough, and thus fairly costly to implement.
5.8 New RDA Phase
At the fifth plenary (P5) we had a first adoption day [33] where experts from different disciplines and institutions presented their way of making use of these early results. The presentations showed that the RDA results were not just an academic enterprise, but indeed fulfil concrete needs of early adopters, in particular since in some cases first implementation versions are available and can be used. Currently, RDA Europe is for example looking for further adopters of these results by offering funding for collaboration projects.
Fig. 3: The typical data creation and consumption cycle, at an abstract level, as it is being used in labs doing data intensive science. DFIG's questions are now which components are needed to run such a cycle efficiently and in a self-documenting way, and how these components need to interact. The figure also indicates how the working groups that have finished or are finishing fit into this cycle.
We should add here that RDA is obviously entering a new phase. While the first 5 working groups were started at the first plenary in March 2013, each of them focusing on its specific topic under high time pressure, the experts now understand that they need to synchronise more to achieve the needed coherence of all results. One consequence was to set up the Data Fabric Interest Group (DFIG), which is now bundling forces to understand all components that are required to come to efficient and reproducible data intensive sciences. Figure 3 indicates briefly the topic being addressed (a White Paper describes DFIG in more detail [34]). Data production and consumption in the daily data driven work can be depicted as a cycle where at a certain moment new raw data is being created, organised/registered in some form and put into a store. Researchers who want to make use of data define a new (virtual) collection by selecting data from repositories and then carry out some processing steps on it, which can be management or analytical operations. The result is a new collection of data which should be registered and stored again. The questions addressed now are which components are needed to run such a "fabric" efficiently and in a self-documenting way, and how these components should
researchers
influence facilitate
interact. Figure 3 also indicates how the finishing
working groups fit into this cycle. specifications
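The data creation and consumption cycle described above can be illustrated with a minimal sketch. It is not a DFIG or RDA specification: the class and function names (DigitalObject, Repository, process), the PID prefix and the metadata keys are all hypothetical and chosen only for illustration. The sketch assumes an in-memory store in which every object is registered under a PID, a virtual collection is just a list of PIDs, and every processing step registers its results together with a provenance record, which is one simple reading of "self-documenting".

```python
"""Toy model of the data creation and consumption cycle (illustrative only)."""
from dataclasses import dataclass, field
import uuid


@dataclass
class DigitalObject:
    pid: str          # persistent identifier assigned at registration
    content: bytes    # the data itself
    metadata: dict    # descriptive and provenance metadata


@dataclass
class Repository:
    """Hypothetical in-memory store with a built-in PID registry."""
    prefix: str = "hypothetical-prefix"   # assumption, not a real Handle prefix
    _objects: dict = field(default_factory=dict)

    def register(self, content: bytes, metadata: dict) -> str:
        """Store an object and return its newly minted PID."""
        pid = f"{self.prefix}/{uuid.uuid4()}"
        self._objects[pid] = DigitalObject(pid, content, dict(metadata))
        return pid

    def resolve(self, pid: str) -> DigitalObject:
        """Resolve a PID to the stored object."""
        return self._objects[pid]

    def select(self, predicate) -> list:
        """Define a virtual collection: a list of PIDs, the data is not copied."""
        return [pid for pid, obj in self._objects.items() if predicate(obj.metadata)]


def process(repo: Repository, collection: list, operation, description: str) -> list:
    """Apply an operation to every member and register each result with provenance."""
    results = []
    for pid in collection:
        source = repo.resolve(pid)
        derived = operation(source.content)
        provenance = {"derived_from": pid, "operation": description}
        results.append(repo.register(derived, provenance))
    return results


# One pass through the cycle: create, register, select, process, register again.
repo = Repository()
raw_a = repo.register(b"raw recording a", {"kind": "raw", "lab": "A"})
raw_b = repo.register(b"raw recording b", {"kind": "raw", "lab": "B"})
collection = repo.select(lambda m: m.get("kind") == "raw")
derived = process(repo, collection, bytes.upper, "uppercase")
```

The derived objects are ordinary registered objects, so they can be selected into further collections and the provenance chain can be followed back by resolving the PID stored under "derived_from".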
Currently the DFIG is collecting many Use Cases to build on what people are already doing and to abstract from these Use Cases to "common components" that are required. Such common components would include for example a global PID system10 providing PID registration and resolution mechanisms that can be used by everyone. Everyone interested is encouraged to contribute Use Cases that will influence the discussions about common components. A first paper to accelerate discussions has been made available by a number of distinguished experts from various regions [35].

10
The Handle System (http://www.handle.net/) is such a global PID system supervised and managed by the international DONA Foundation and it is also the basis of the DOI and other service providers such as EPIC in Europe.

5.9 RDA Summary
RDA is still a very young initiative and its success mainly depends on the willingness of data practitioners to spend time on global and cross-disciplinary11 problem solving, on the quality of their results, and on their uptake by scientific projects worldwide. In its early days, nothing in particular distinguished TCP/IP from other suggestions. It was its layered approach and robustly running code that finally convinced people worldwide to adopt the standard. RDA needs to do a lot to have similar success and it needs strong infrastructure pillars that provide and maintain services.

11
RDA also includes some disciplinary groups which are using the global nature of RDA to achieve community agreements.

6. Infrastructure Pillars
As described, RDA is only working on specifications and it is neither providing services nor maintaining code. It will rely on powerful centres and federations to provide the infrastructures that are finally required to transform specifications into real services that enable efficient data intensive sciences. In the same way we can state that researchers in general are not so much interested in specifications of interfaces, for example, but in the services that will facilitate their work. In a simplified way Figure 4 indicates the essential relationships between three parties: researchers as consumers of facilitating services, who would also like to influence specification building to ensure the emergence of useful services; infrastructures that are built compliant to the specifications to ensure interoperability of the services; and initiatives such as RDA which establish the specifications as a joint effort of data practitioners, i.e. researchers and infrastructure providers.

Fig. 4: It indicates schematically the essential relationships between researchers, infrastructures and the specification work such as in RDA.

Information infrastructures in our distributed landscape of data and computational services get very complex and involve several layers, which is sketched in the diagram drawn by the High Level Expert Group on Scientific Data (Figure 5) [3]. This diagram aims to work out the difference between discipline specific and common services that users (top layer) will probably use without noticing who provides the services they are using. Initiatives such as EUDAT were started to offer common services (bottom layer) and thus to complement the typical ESFRI layer (middle layer) with many European research infrastructures in various research disciplines.

Fig. 5: It schematically indicates 3 layers of the so-called Collaborative Data Infrastructure where community based infrastructures offer community specific services and e-Infrastructures offer common discipline crossing services. This was seen by the EC as a blueprint for funding programs.

The first ESFRI roadmap from 2006 [36] led to 44 research infrastructures and thus to an intensive and concerted European activity across many disciplines. Most of these infrastructure initiatives are heading towards building persistent distributed information infrastructures.
One example is the CLARIN initiative [37] in the area of language resources and technology which has recently achieved the status of an ERIC12. CLARIN is based on strong and federated centres in a variety of European countries that share the effort in defining standards together with the community, in aggregating digital language resources, in offering joint services, in managing and curating data with discipline specific knowledge, and others. The services offered by CLARIN include deposit possibilities, a joint metadata catalogue called Virtual Language Observatory [38], a distributed workflow tool allowing users to analyse texts in various languages, and many smaller services. However, CLARIN centres are not equipped to offer massive compute power to all possible users from all over Europe who may want to execute workflows or use large storage systems to manage large data sets. Therefore research infrastructures such as CLARIN make liaisons with e-Infrastructures such as EUDAT and pay for such common services. All research infrastructures from the different research domains are looking for similar options if they are data and compute oriented.

12
ERIC is a special organisational template invented to allow ESFRI research infrastructures to become European legal entities.

The ESFRI organisation and the EC are still actively starting new research infrastructures. To come to an optimized eco-system of information infrastructures all ESFRI projects and beyond (such as Human Brain Project) are seeking collaborations with e-Infrastructures such as PRACE [39] and EUDAT to make use of the advanced services that are offered by them.

6.1 EUDAT
EUDAT is a federation of well-resourced and partly national data and compute centres in various countries as Figure 6 indicates. Within its first three years EUDAT invested all efforts in developing 5 basic services in collaboration with initially 5 communities13 (climate modelling, earth plate observation, human physiology, biodiversity and language resources and technology). B2SHARE, B2DROP and B2FIND are services directed to the end users meant for dealing with long tail type data. B2SAFE is a service that allows replicating large data sets between a community centre and the EUDAT centre network. The B2STAGE service is meant to move data sets from the EUDAT store to the workspaces of powerful computers of different types (HPC, etc.) to carry out computations and to return the results. All data in EUDAT are registered, i.e. all digital objects have PIDs and are associated with metadata to make them findable and accessible.

13
Currently EUDAT is closely interacting with 32 communities.

Fig. 6 shows the federation of centres across Europe that is the basis of EUDAT’s e-Infrastructure and the 5 basic user services it offers to the research community. In addition to the 5 user services it established system services such as an authentication and authorisation infrastructure and a service to register and resolve persistent identifiers.

It should be added here that federating data centres and their collections was and is a major challenge and currently not scalable. The reason for this can be found mainly in the data organisations where each centre has chosen a different solution. This lack of interoperability, leading to enormous costs, is one of the reasons why EUDAT is very
much interested in harmonised solutions being worked out by RDA, for example in the DFT group. Due to this close interest EUDAT declared that it will try out RDA outputs where possible and thus act as an RDA testbed in Europe.

EUDAT just received its 2nd funding grant for 3 years which needs to be used to stabilize and improve the services being offered, work out a sustainable funding model and look for collaborations with other European e-Infrastructures such as PRACE. This led to an additional work item which is devoted to improving the exchange of data between EUDAT and PRACE and demonstrating this as an efficient service with the help of concrete data and compute bound projects. Future challenges are anticipated by also strengthening the work on executing automatic workflows. It is understood that data science needs to turn increasingly often to automatic and self-documenting workflows to make its results reproducible. Yet the challenges to let users quickly deploy and execute complex software close to where the data is stored, i.e. operate in a distributed environment, are huge and severe barriers need to be removed. But EUDAT needs to demonstrate that it finally can offer services similar to Amazon and other companies where users can execute their software in a virtual machine environment and basically pay for the cycles used.

In the coming period EUDAT will also be faced by a new initiative and request of the European Commission in the realm of Open Science and Innovation [40] called the European Open Science Cloud. The EC wants to have a “cloud service” for all European researchers without having defined its exact specifications yet. A high level expert group is being formed that will work out the requirements. According to EC experts the term “cloud service” is meant in the broad sense, i.e. it needs to include the necessary structures for persistent identifiers, metadata, relations, etc.

6.2 National Data Service (NDS)
Also in the USA an attempt is being made under the lead of NCSA [41] to set up a National Data Service (NDS) [42] and to offer cross-disciplinary data services similar to those of EUDAT in Europe and ANDS [43] in Australia. The NDS is an emerging vision for how scientists and researchers across all disciplines can find, reuse, and publish data. It wants to build on the data archiving and sharing efforts already underway within specific communities and to link them together with a common set of tools.

Currently the NDS is focusing on collaborations with some communities to find out what kind of services they are expecting. Yet the stakeholders are still discussing which concept will be the best to address the imminent challenges posed by the data deluge and the need to optimize data sharing and re-use in the USA. Recently the leading persons in RDA US agreed to ask NDS to act as national testbed center for RDA results.

7. National Level Pillars
Also at the national level in Europe new organisational structures are being tested and established to meet the challenges of data intensive sciences.

7.1 Max Planck Society
In the Max Planck Society an IT Strategy Committee was founded a few years ago to come up with advice on how to reshape the IT service structure in its organisation to maintain competitiveness of its research. With the introduction of parallel computers many years ago the Computer Centre in Garching got the task to provide not only high performance compute capacity but also to provide expertise in parallelising relevant domain specific software codes for simulation and analytics. In collaboration with domain experts such code was optimised allowing optimal use of HPC architectures. The optimal solution for such code parallelization was thus found by bringing together expertise and resources of each institute with central expertise and resources such as storage capacity and compute power. The strategy committee realized that the huge increase of data and the challenges of data intensive sciences require a new approach in so far as it makes sense to also provide central expertise and facilities in data management, curation and analytics.

As a consequence, the centre in Garching got a new name (Max Planck Computing and Data Facility, MPCDF) to indicate the change in focus, and was extended with data experts having expertise in mathematics and algorithms in typical data analytics applications which are widely discipline unspecific. The idea is to carry out collaborations between the centre and the various institutes and their departments that cannot invest in the specific knowledge required and that do not have the local resources to store and manage all data and to carry out the required computations.

We will use the NoMaD (Novel Materials Discovery) Repository project [44], which has been selected as one of the European Centres of Excellence projects, as an example for the typical collaboration between a leading research institute in the MPS and its MPCDF centre.
Theoretical material scientists worldwide are doing experiments with a number of well-known chemical software packages (some at petascale performance) to compute possible characteristics for materials. These simulations are typically run on HPC machines after having carried out deep optimization of the software code tuned to certain architectures. Until now the resulting data has been used to write scientific papers, but was not considered valuable as such. This attitude is changing due to the fact that, as in other research disciplines, the researchers see a value in re-using data in different contexts, in allowing others to do new kinds of computations and in preventing duplication of work. The repository is meant to be a centre for storing results of simulation runs being identified by DOIs and described by proper metadata. Thus proper data organization and stewardship is the basis of the work.

Fig. 7 indicates the intentions of the Novel Materials Discovery (NoMaD) project to federate and aggregate all data stemming from material experiments to enable easy access and re-use.

In collaboration with the researchers of the Fritz-Haber Institute the MPCDF experts are developing software to transform the incoming data to a normalized and compressed format, developing the repository software, the user upload, access and search interfaces, and the needed data management tools. In addition, novel analytic tools are being developed in collaboration between the involved centres to allow graphical searches, to carry out machine-learning based comparisons on data sets, to do smart visualizations supporting voyaging methods, etc. Typically all these operations on the aggregated data will be executed by making use of “trivial” parallelization techniques such as enabled by Map-Reduce methods on appropriate Hadoop clusters, i.e. the repository will be hosted at MPCDF and the computations will be carried out on computers offered by MPCDF.

7.2 Approaches in NL
Also in countries such as for example the Netherlands new strategies are being tested. In addition to strengthening domain specific centres of different types, new centres have been established to structure the data landscape. DANS [45] and 3TU [46] have received the task to specialise on data management and curation. They should make use of the data services of the national data and compute centre SARA [47]. In addition the eScience Centre [48] has been established to run collaborative projects where discipline experts and experts with centrally aggregated expertise are brought together to meet the challenges of data intensive science. All these national service providers are requested to synchronise their activities to come to an efficiently organised eco-system of infrastructure pillars and services.

8. Conclusions
Data Intensive Science (DIS) is one facet of the digital change which we are currently experiencing and which will change not only science but also societies substantially. DIS which will be open to many to exploit its full innovative power and not exclusive to a few will depend on a change of culture towards open data and accessibility of services. In the European Union and its member states community-driven research infrastructures and e-Infrastructures tackling common cross-disciplinary challenges have been started to address the needs for an efficient eco-system of services enabling data intensive work. The US did not make this distinction, but under the term “cyberinfrastructure” also community-driven and more commons-driven projects were initiated.

After almost a decade of experience in infrastructure building it is obvious that there are still many social and technical barriers prohibiting efficient and cost-effective data usage and reproducible results. In fact one can argue that only active infrastructure building made many of the barriers visible to all stakeholders. The time period between the invention of TCP/IP and its broad uptake to enable efficient communication between compute nodes took about 15 years. Several data scientists and infrastructure builders from mainly Europe, US and Australia agreed that it is time to accelerate the process of overcoming the many barriers for efficient data usage since waiting for another decade to overcome the most severe barriers is not acceptable. Setting up the RDA based on similar principles as IETF (bottom-up, rough consensus, running code, lean governance) was the preferred choice of the data experts and this choice was supported by the funding organizations.

With this background in mind it is not surprising that almost all strong European infrastructure
centres are very active in EUDAT as well as in RDA and that for example also ANDS and NDS engage actively in RDA. The Max Planck Computing and Data Facility for example will coordinate RDA Europe from September 2015, and its members are in the Technical Advisory Board, are co-chairing the Data Foundation and Terminology and Data Fabric Interest Groups and are leading a Work Package in EUDAT. SARA and DANS, for example, are also leading activities in EUDAT and are actively engaged in RDA groups. NDS is co-chairing for example the Data Fabric Interest Group and ANDS is represented in the Council and Technical Advisory Board of RDA.

In addition to accelerating global agreement finding to improve data sharing and re-use and thus to enable inclusive data intensive science, two main reasons can be mentioned for the engagement: a) engaging its experts in cutting-edge developments will make them fit for the coming challenges and b) bringing in their expertise will influence decision making. So far RDA is too young to present final conclusions about the question whether the expectations were met.

We need to accept that the data landscape is changing rapidly and that new structures that have been set up to facilitate data intensive sciences are often still in a test phase. Essential questions in the data domain are still not fully answered, such as: Which persistent structures need to be funded in addition to libraries that often do not yet have the skills to participate in the emerging data services domain? What is the optimal division between discipline specific and common services? What is the best way to share specialised and expensive data experts that are scarce? Which are the common components that need to be specified to come to global, interoperable and well-maintained services supporting data intensive sciences optimally?

The EU and several of its member states as well as the US decided to take an active role to exploit the possibilities by taking concrete actions and by asking data science experts to develop and test out bottom-up driven models.

9. References
[1] MPI for Psycholinguistics, http://www.mpi.nl
[2] Tony Hey et al., The Fourth Paradigm - Data Intensive Scientific Discovery, 2009, http://research.microsoft.com/en-us/collaboration/fourthparadigm/4th_paradigm_book_complete_lr.pdf
[3] John Wood et al., Riding the Wave Report, 2012, http://cordis.europa.eu/fp7/ict/e-infrastructure/docs/hlg-sdi-report.pdf
[4] Human Brain Project, https://www.humanbrainproject.eu/
[5] Herman Stehouwer, Peter Wittenburg, RDA Data Practice Report, 2014, http://europe.rd-alliance.org/sites/default/files/RDA-Europe-D2.5-Second-Year-Report-RDA-Europe-Forum-Analysis-Programme.pdf
[6] ESFRI Roadmap 2006, http://ec.europa.eu/research/infrastructures/index_en.cfm?pg=esfri-roadmap
[7] Open Access, http://en.wikipedia.org/wiki/Open_access
[8] http://research.microsoft.com/en-us/um/redmond/events/fs2010/presentations/michener_environ_data_mgmt_rfs_71210.pdf
[9] Data Seal of Approval, http://datasealofapproval.org/en/
[10] TCP/IP Protocol, http://en.wikipedia.org/wiki/Internet_protocol_suite
[11] Herman Stehouwer, Peter Wittenburg, Principles for Data Sharing and Re-use: are they all the same?, 2015, http://hdl.handle.net/11304/1aab3df4-f3ce-11e4-ac7e-860aa0063d1f
[12] Peter Wittenburg, Leif Laaksonen, Herman Stehouwer, Raphael Ritz, Living with Data Management Plans, 2015, http://hdl.handle.net/11304/ea286e5a-f3d1-11e4-ac7e-860aa0063d1f
[13] ICRI 2012 Conference Copenhagen, http://www.icri2012.dk/www.ereg.me/ehome/index06e1.html
[14] Research Data Alliance, http://rd-alliance.org
[15] Naoyuki Tsunematsu, RDA plenary Keynote, San Diego, 2015, https://rd-alliance.org/keynote-naoyuki-tsunematsu.html
[16] Daniel Castro, Travis Korte, Open Data in the G8, 2015, http://www2.datainnovation.org/2015-open-data-g8.pdf
[17] Research Data Alliance - Europe, http://europe.rd-alliance.org
[18] RDA Case Statements, https://rd-alliance.org/working-and-interest-groups/case-statements.html
[19] ESFRI Projects, http://ec.europa.eu/research/infrastructures/index_en.cfm?pg=esfri
[20] EU e-Infrastructures, http://cordis.europa.eu/fp7/ict/e-infrastructure/
[21] EUDAT e-Infrastructure, http://www.eudat.eu
[22] Bernard Schutz et al., RDA Europe Science Workshop Report, 2014, http://europe.rd-alliance.org/documents/publications-reports/rda-europe-science-workshop-report
[23] John Wood et al., The Data Harvest, 2014, https://europe.rd-alliance.org/documents/publications-reports/data-harvest-how-sharing-research-data-can-yield-knowledge-jobs-and
[24] RDA Data Foundation and Terminology WG, https://rd-alliance.org/groups/data-foundation-and-terminology-wg.html
[25] RDA Data Type Registry WG, https://rd-alliance.org/groups/data-type-registries-wg.html
[26] RDA PID Information Type WG, https://rd-alliance.org/groups/pid-information-types-wg.html
[27] RDA Practical Policy WG, https://rd-alliance.org/groups/practical-policy-wg.html
[28] RDA Metadata Standards Directory WG, https://rd-alliance.org/groups/metadata-standards-directory-working-group.html
[29] RDA Data Citation WG, https://rd-alliance.org/groups/data-citation-wg.html
[30] RDA Repository Audit and Certification WG, https://rd-alliance.org/groups/repository-audit-and-certification-dsa%E2%80%93wds-partnership-wg.html
[31] Data Seal of Approval, http://datasealofapproval.org/en/
[32] World Data Systems, https://www.icsu-wds.org/
[33] RDA Adoption Day, San Diego, 2015, https://www.rd-alliance.org/plenary-meetings/fifth-plenary/programme/adoption-day.html
[34] RDA Data Fabric IG, https://www.rd-alliance.org/group/data-fabric-ig.html
[35] Bridget Almas et al., Data Management Trends, Principles and Components - What Needs to be Done Next?, 2015, http://hdl.handle.net/11304/33430f2e-f598-11e4-ac7e-860aa0063d1f
[36] ESFRI Roadmap 2006, http://ec.europa.eu/research/infrastructures/index_en.cfm?pg=esfri-roadmap&section=roadmap-2006
[37] CLARIN Research Infrastructure, http://www.clarin.eu/
[38] CLARIN Virtual Language Observatory, http://clarin.eu/content/virtual-language-observatory
[39] PRACE e-Infrastructure, http://www.prace-ri.eu/
[40] EC Open Science and Innovation, http://ec.europa.eu/research/conferences/2015/era-of-innovation/index.cfm
[41] National Center for Supercomputing Applications, http://www.ncsa.illinois.edu/
[42] National Data Service, http://www.nationaldataservice.org/
[43] Australian National Data Service, http://www.ands.org.au/
[44] NoMaD - Novel Materials Discovery Project, http://nomad-repository.eu/cms/
[45] Data Archiving and Networked Services, http://www.dans.knaw.nl/nl
[46] 3TU Data Centrum, http://datacentrum.3tu.nl/home/
[47] SURF Sara, https://surfsara.nl/
[48] Netherlands eScience Center, https://www.esciencecenter.nl/