Specifying and Implementing Data Infrastructures
                Enabling Data Intensive Sciences
                                  © Peter Wittenburg, Herman Stehouwer
                         Max Planck Data and Compute Center, Garching/Munich
                         peter.wittenburg@mpi.nl, herman.stehouwer@rzg.mpg.de


Proceedings of the XVII International Conference «Data Analytics and
Management in Data Intensive Domains» (DAMDID/RCDL’2015), Obninsk, Russia,

Abstract

Examples from Psycholinguistics, a humanities discipline, show that data
intensive research is changing all scientific disciplines dramatically.
Data intensive sciences pose unprecedented challenges in data management
and processing. A survey in Europe showed clearly that most research
departments are not prepared for this step and that the methods used to
manage, curate and process data are inefficient and too costly. The
Research Data Alliance, as a bottom-up organized global and
cross-disciplinary initiative, has been established to accelerate the
process of changing data practice. After only two years RDA has produced
its first concrete results, which now have to demonstrate their potential.
In particular, the infrastructure builders are asked to act as early
adopters of RDA results. The European Commission and its member states have
taken serious steps to establish an ecosystem of research infrastructures
and e-Infrastructures, anticipating the challenges imposed by the data
deluge, which will enable broad uptake of the paradigm of data intensive
science. Research organisations have recognised these challenges as well
and have taken first steps to adapt their structures. However, we need to
understand that we are in a phase of gigantic changes, which implies that
the measures currently being taken need to be interpreted as tests on the
way to new solid and sustainable structures.

1. Enabling Data Intensive Sciences

Quite a number of scientific institutes have been data oriented for a long
time already. For instance, most of the research of the experimental and
theoretical institutes of the Max Planck Society was based on data. Even an
institute that belongs to the humanities section of the Max Planck Society,
such as our former affiliation, the Institute for Psycholinguistics [1],
was oriented from the start towards the analysis of speech, eye movement
and gesture recordings, detecting meaningful patterns, and building models
to simulate speech perception. In physics institutes (fusion, astronomy,
etc.), of course, much larger volumes of data were being processed, and
they can look back on a much longer history of data oriented work.

It was the book "The Fourth Paradigm – Data Intensive Scientific Discovery"
[2], edited by Tony Hey and colleagues, that introduced “data intensive
science” as the 4th paradigm of scientific discovery by referring to a talk
given by J. Gray. It drew much attention to the concept behind this new
paradigm. Gray distinguishes 4 paradigms that co-exist today: (1) Empirical
Science describing natural phenomena, (2) Theoretical Science using models
to achieve generalizations, (3) Computational Science simulating complex
phenomena and (4) Data exploration by unifying theory, experiment and
simulation. Indeed, we can observe that science is changing insofar as
finding meaningful patterns in data sets is becoming an essential approach.
Increasingly more powerful and numerous sensors, improved network
connections, more powerful and numerous computers and more advanced
algorithms are key pillars for this development. The "Riding the Wave" [3]
report created by a High Level Expert Group of the European Commission (EC)
was one of the documents that summarized the specific data challenges and
opportunities, and requested actions by the EC to enable data intensive
sciences for a large number of researchers and not only for those who have
sufficient funding to curate all the data and software that must be
integrated to make use of it.

We see a number of trends which we can summarize as follows:
• An increasing number of research disciplines have adopted data intensive
  methods due to new technological and methodological possibilities. During
  the last decades these changes were extreme in biological and
  neurological disciplines.
• The amount of data and its complexity in terms of creation contexts, data
  types and relations are increasing dramatically.
• The Internet allows us to offer data via the web to be re-used by others.
• This enables us to combine data sets in new ways across institutional,
  national and discipline borders.


• Mathematical methods have advanced to cope with heterogeneous data sets
  and we see large libraries with statistical, stochastic and machine
  learning methods becoming available.
• The total amount of available CPU and storage capacity allows researchers
  to do large amounts of computations on increasingly large data sets.

Despite the increase in compute capacity, however, we can also observe an
increasing analysis gap, i.e. the fraction of data that we are able to
process in such a way that we can extract knowledge is getting smaller. The
reasons for the analysis gap are many and are not the subject of this
paper.

Two examples taken from a humanities discipline show the fundamental
changes towards data intensive science; such research could not have been
carried out a few years ago. When studying, for example, the evolution of
human languages over thousands of years, linguists until recently based
their theories on comparing fragmented descriptions by colleagues of
several languages. Currently, large feature matrices are extracted that
describe characteristics of all languages in a particular region, such as
those spoken in Austronesia, and these matrices are fed into phylogenetic
algorithms to calculate the most probable dependency trees, which indicate
how languages may have influenced each other over thousands of years. This
research requires a large database and more powerful computers than
linguists traditionally used, so that the algorithms can generate
meaningful optima.

The application of massive crowd sourcing techniques in linguistics, for
example to understand human communication including multimodal interaction,
can be used as another example to indicate the dramatic changes in research
towards a data centric perspective. These techniques generate many parallel
data streams originating from smartphones that need to be annotated
immediately by machine processing tools to make them available for
scientific studies. This automatic annotation requires smart pre-processing
and smart data management. In this setup an increasing number of detectors
operating in parallel must be trained to detect patterns in speech and
video streams in real time with the help of stochastic machines. It is
simply the sheer amount of data that requires new ways of processing to
enable this type of research, leading to better assumptions about what
guides our interactions.

The basis of such methods as described in the example above is the
availability of large amounts of data to estimate the many free parameters
of the models, and in both examples data cannot come from one project or
institute, but must come from many research labs. Researchers doing this
kind of research know how difficult it is to find, access and combine the
required data. Such research is very cost intensive and raises the
questions of whether we can continue without serious changes, and whether
the available infrastructures are sufficient.

2. Human Brain Project

Fig. 1: An example of this new paradigm as used in the neurosciences (HBP).
Phenomena such as those caused by specific brain diseases can be observed,
yet there is no chance to model the complexity of the human brain in order
to make statements about their physiological origins. Data from various
sources are correlated with the phenomena to find those patterns in the
data that are causing the observed deficits.

An even more extreme example of the shift towards the 4th paradigm is taken
from the life sciences. The recently started Human Brain Project (HBP) [4]
(an EC flagship project) has as its visions (a) to be able to simulate at
the physiological level first rat brains and, in a follow-up phase, human
brains (in silico experiments) and (b) to predict brain diseases at an
early stage from patterns found in recorded data sets. The main goal of the
latter (the medical informatics sub-project in HBP) is illustrated in
figure 1. Researchers would like to correlate observed phenomena, such as
specific deficits due to brain diseases, with all types of recordings that
can be found from corresponding patients, such as brain images of different
types, gene sequences, protein data and perhaps even reaction time
measurements. Without having a model of the human brain at hand, this
correlation would nevertheless allow researchers to detect co-occurring
patterns in the data that seem to cause the observed phenomena. Machine
learning methods


are used to generate meaningful signatures from physical features in the
data that can then be used to predict potential diseases in patients.

No assumptions are made about the structure and functioning of the brain,
no assumptions are made about how genes may influence brain structure and
functioning, etc., since we don’t have sufficient knowledge in these areas.
Nevertheless, by using a large database of aligned data it is assumed that
researchers can relate physical patterns to phenomenological observations,
first for early prediction, but later also for improved medication. Full
brain simulations will typically cover spatial scales from nanometers
(proteins) to centimeters (brain) and energy scales from 10 femtojoules at
the biological level (genome, transcriptome, proteome) up to 1 joule at the
complex brain level (cognition).

To achieve its goals the HBP defined in total 13 sub-projects, each the
size of a large project. Here we will briefly describe the new
informatics-based platforms that are meant to offer the research community
the possibility to work on human brain issues with the help of a set of
strong and highly integrated tools:
• Neuroinformatics (searchable atlases and analysis of brain data)
• Brain Simulation (building and simulating multi-level models of brain
  circuits and functions, incl. for example models of neural microcircuits
  of up to a million neurons)
• Medical Informatics (see figure 1)
• Neuromorphic Computing (brain-like functions implemented in hardware)
• Neurorobotics (testing brain models and simulations in virtual
  environments)
• High Performance Computing (providing the necessary computing power by
  architectures that allow memory intensive applications and new ways of
  visually interacting with simulations)

HPC facilities at 4 centers can be used for the purposes of the HBP: Jülich
(6 petaflops peak, 450 TB memory, 8 PB scratch file system), allowing
simulations of up to 100 million neurons (the scale of a mouse brain);
Swiss CSCS (836 teraflops peak, 64 TB, 4 PB), in particular for software
development and optimization; Barcelona SC (1 petaflops peak, 100 TB) for
molecular-level simulations; and CINECA (2 petaflops, 200 TB, 5 PB), mainly
for data analytics. In addition KIT Karlsruhe provides 3 PB of storage. All
centers are linked at 10 Gbit/s. In the neuromorphic area SpiNNaker chips
are being used that have 18 cores and share 128 MB RAM, which allows the
simulation of 16,000 neurons with 8 million plastic synapses within a 1 W
energy budget.

The goals are ambitious¹ and it is admitted that the gap between
physiological modeling and cognition is still huge. However, the HBP
indicates how data intensive science is pushed to its extremes in the life
sciences: (a) huge amounts of data addressing many different levels of
brain organization are needed to feed the atlases and to enable the
analyses needed to feed and test the validity of the models, and (b) much
computer power will be required to carry out the necessary computations,
first within the project and afterwards by the interested researchers.

In addition to the problems described in the next section, the HBP is
confronted with difficult privacy and ethical issues that make access to
data even more problematic. Distributed data mining solutions, for example,
are being investigated to overcome these problems.

¹ It should be mentioned that there is a broad debate about the question of
  whether the ambitions of the HBP are realistic.

3. Data Practices

A large survey about data practices [5], based on some 120 interactions
with data practitioners² from various disciplines, and two RDA Europe
workshops with leading European scientists [22] made very clear that the
current data practices are not adequate to support such data intensive
science in an efficient and cost-effective way.

² The term "data practitioner" is used here to describe the skills of data
  scientists, data managers, data stewards, data librarians, etc., since
  these terms are mostly not well-defined yet.
³ European Strategy Forum on Research Infrastructures

The major findings of this survey can be summarized as:
• The ESFRI³ [6] discussion process and its project initiatives, as well as
  recent developments in e-Infrastructures, raised much awareness about
  data issues, the practices and the interaction processes around data
  management and access crossing discipline boundaries.
• Open Access [7] to publications and now also to data is widely supported,
  but in practice there are so many hurdles that most data is still not
  available.
• Finding data re-usable for data intensive sciences using the web requires
  new mechanisms to establish trust. At this moment we are lacking such
  mechanisms.
• There is much legacy data out there, the integration of which into our
  re-usable data domain will cost an enormous amount of curation and thus
  funds. In addition, we are


  still creating legacy-style data despite all advancements, since it is
  not suitably organized and described, which is mainly due to a lack of
  trained experts and appropriate software.
• There is an increasing pressure for almost all departments to participate
  in data intensive sciences, but researchers see a lack of expertise in
  adequate data management and of workflow creation/maintenance skills.
  Currently researchers need to spend a large fraction of their time (in
  some cases up to 75%) finding, accessing and curating data to make it fit
  for their needs. In addition, the practice of many researchers of working
  with manual steps or with ad hoc scripts does not lead to reproducible
  science.
• Data management is still widely based on file systems, which do not allow
  capturing the increasing amount of “logical” information about the data
  (persistent identifiers, metadata, rights, relations, etc.). Ad hoc
  solutions are being used, amplifying the problem of "increasing data
  entropy" as W. Michener [8] called it (see figure 2).
• The use of persistent identifiers and metadata, which would help in
  identifying, finding and re-using data, is still in its infancy. Ad hoc
  solutions such as handling spreadsheets only work for the duration of
  projects and leave chaos afterwards given the increasing amount of data.
• Despite some efforts for specific databases there is in general a lack of
  explicitness with respect to structure and semantic descriptions of the
  content of data, which creates inefficiencies in particular when users do
  not have direct relations with the creators.
• There is a clear trend towards using "trustful" centres that allow
  researchers to host, manage and access their data. However, there are
  many hurdles for centres to offer cross-border services, although
  economies of scale indicate that much can be gained due to the available
  expertise. Existing certification methods such as those defined by the
  Data Seal of Approval [9] need to be applied by the centres to raise the
  level of trust.
• It is widely agreed that there is a lack of expertise and knowledge about
  data issues (principles, organization, curation, etc.) and that we need
  to train a new generation of data practitioners. It is this lack of
  experts and expertise that hampers progress.

Fig. 2: The typical decrease over time of the available information about
stored data, as described by W. Michener, which results in great problems
in making use of data. There are various factors and moments that lead to
this decrease of information, such as when PhDs leave an institute without
having documented their data properly, which is a very well-known
phenomenon. Assigning persistent identifiers and creating appropriate
metadata would help to reduce the speed of losing information.

Senior scientists agree that changes in data practices are urgently needed,
but they hesitate to take steps for mainly two reasons:
• they lack guidance towards certain agreed solutions, which prevents
  investments,
• they lack the experts that would turn investments into appropriate
  solutions.

4. Achieving Changes through RDA

This raises the questions of who can give guidance in navigating the huge
solution space with respect to data issues, and how we can train the new
generation towards harmonized solutions that guarantee more efficiency and
cost-effectiveness, which finally will boost data intensive sciences. Here
we would like to refer to the early phases of the Internet, where many
solutions were suggested with different competing approaches. It took about
15 years until agreements on simple principles such as TCP/IP [10] for
global networks were accepted. Basically these agreements led to the boost
of connectivity from which we now profit.

Quite a number of policy level initiatives have established rules and
principles and there seems to be wide agreement [11] about them. An
increasing number of funders are also requesting that so-called data
management plans be added to grant applications, which certainly raises the
level of awareness about data issues for many researchers. But due to the
problems described above there is also great uncertainty about how to
create such plans so that they make sense for the many data use cases [12].
An increasing conviction emerged among some data practitioners and some
funders that an acceleration of the process of coming to agreements that
help change data practices is urgently required. The history of the
Internet seems to offer a possible approach: complement the policy level
efforts by an essentially bottom-up driven initiative where data
practitioners work on urgent barriers that need to


be overcome. To this end a first international workshop was organized at
the ICRI conference 2012 [13] under the name "DAITF", which stands for Data
Access and Interoperability Task Force. A joint effort from mainly
European, US and Australian experts and funders then led to the birth of
the Research Data Alliance (RDA) [14] in autumn 2012. We like to point to
the similarity of some characteristics with the Internet Engineering Task
Force; however, it is obvious that the data domain has many more facets and
challenges to deal with.

We would like to cite Naoyuki Tsunematsu (Senior Advisor of the Japanese
Council for Science and Technology), who pointed to two observations that
are relevant in this context and that motivated Japan to join the Research
Data Alliance [15].
• The value proposition for publicly funded research is about "stimulating
  competitiveness", but a new strand needs to be added, which is "knowledge
  discovery on smart data collections", where professional infrastructures
  and human skills are the key factors for success.
• There seems to be a correlation between a lack of motivation to share
  data in the Japanese academic world, and thus a lack of openness, and a
  decrease in the number of top-level international collaborations and of
  top-level papers, which is a concern for policy makers in Japan⁴.

⁴ The recent G8 Open Data Report [16] indicates that in the rating between
  G8 members Germany and Russia are even behind with respect to openness of
  data.

After the workshop at ICRI 2012 the European Commission, NSF and NIST in
the US and the Australian Government accepted grant proposals from key
experts in their respective regions that allowed the practitioners to start
the RDA work, i.e. funding is given to consortia in the three regions. As
one branch, the RDA Europe [17] project was funded as a usual EC project.
As early as September 2015 the 3rd RDA Europe project will start, allowing
us to continue the work, and the EC’s new draft work programme 2016/17
indicates future perspectives for RDA. First, a steering board was
established between the three funded initiatives to define a governance
structure and procedures for RDA, and it started stimulating the practical
work.

RDA decided to have a very simple structure where the key roles are given
to the Working Groups and Interest Groups⁵ that meet at plenaries and other
meetings. Every RDA member can decide to initiate such a group; to be
successful, a case statement needs to be submitted that must fulfil a
number of criteria [18]. A Council was set up with an oversight role to
ensure balanced progress and adherence to quality rules and processes. A
Technical Advisory Board that is elected by the RDA members⁶ will give
advice to all actors on content aspects, i.e. respond to questions such as
“do the intentions of the Working and Interest Groups meet the scope of
RDA, do they fulfil the established requirements, do they involve existing
and relevant initiatives, do they intend to remove practical barriers,
etc.”. An Organisational Advisory Board that represents all organizations
that are organizational members, and thus contribute some funds to the
success of RDA, gives advice on organizational and administrative issues.
In addition RDA has a Secretariat that needs to organise the plenaries,
keep control of the processes and do a variety of other administrative and
organisational tasks. A General Secretary has been appointed to lead the
secretarial work and to take responsibility for managing RDA global.

⁵ It should be noted here that the major difference between the two groups
  is that the WGs need to come up with tangible results after 18 months.
⁶ Everyone who agrees with the basic rules of RDA can become a member by
  registration.

While RDA global is the platform where agreements are being achieved in the
form of guidelines, procedures, interface and protocol specifications to
overcome barriers, the regional branches such as RDA Europe have the task
to raise awareness about RDA in their region, convince experts to
participate, interact with many stakeholders to understand the needs and
priorities, organize the adoption of RDA results, take care of training and
education and contribute to the costs of RDA Global. RDA Europe, for
example, organises a number of meetings to meet these requirements, such as
interacting with the EC and member state ministries, European science
organisations, leading European scientists, large scale European research
infrastructures such as ESFRI projects [19] and e-Infrastructures [20] such
as EUDAT [21], and many research communities. The meetings with leading
scientists [22] are of great importance and have led to useful
recommendations for RDA, most of which will be implemented by RDA Europe
from September 2015 on. The interactions with policy stakeholders led, for
example, to the Data Harvest Report [23] setting priorities.

5. Early RDA Results

Thus RDA's mission is about building the many social and technical bridges
that are required to make data intensive work much more efficient and thus
to allow many researchers to participate in


extracting knowledge by processing virtual                   “checksum” would allow application programmers
collections consisting of data coming from various           to simply provide one piece of software allowing
providers increasingly often across disciplines and          them to deal with all PID service providers in the
borders. Here we want to briefly indicate the major          same way. Since PIDs will have such a central role
results of the first working groups that finished            in data management and access the impact of a
after roughly 20 months (or that will finish within          unified API will be enormous.
the coming few months) and their possible impact
on changing practices.                                       5.4 Practical Policies (PP)
                                                                In particular data management and curation are
5.1 Data Foundation and Terminology                          guided by specific policies which are then turned
(DFT)                                                        into executable procedures such as "replicating a
   Based on many use cases from various                      data collection" or "checking digital objects'
disciplines and countries the DFT Working group              integrity" that are mostly used in federated
[24] came up with a simple core data model and a             environments. The PP group [27] is collecting
terminology for registered data. It introduces the           many such practical policies from various
notion of the Digital Object which is represented            institutions and projects, analysing and evaluating
by a bitstream, can be stored in various                     them and suggesting best practices which then can
repositories, is identified by a persistent identifier       be offered as templates for proven operations.
and described by metadata. The model includes a              Thus, these templates have the potential to increase
few further definitions, but it is important to note that    the trust level. The work of the group will not end
these definitions are fundamental and independent            since there are so many areas where best practices
of disciplines. If scientists worldwide would adhere         can improve the quality and reproducibility of data
to such a simple model we could much more easily             practices. In collaboration with the EUDAT project
understand each other when talking about data and            the group is working on an open registry standard
would be able to build harmonized software                   for such best practice PPs.
leading to much higher interoperability.
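
The core model described in 5.1 can be illustrated with a minimal sketch. It is not taken from the DFT specification: the class names, the field names and the identifier value are assumptions chosen for illustration, and the in-memory registry only stands in for a real PID resolution service.

```python
from dataclasses import dataclass, field


@dataclass
class DigitalObject:
    """A bitstream that is identified by a PID, described by metadata
    and stored in one or more repositories (names are illustrative)."""
    pid: str                                           # persistent identifier
    bitstream: bytes                                   # the content itself
    metadata: dict = field(default_factory=dict)       # descriptive metadata
    repositories: list = field(default_factory=list)   # where copies are stored


class Registry:
    """Toy in-memory stand-in for a PID registration and resolution service."""

    def __init__(self):
        self._objects = {}

    def register(self, obj: DigitalObject) -> None:
        # A PID must identify exactly one object.
        if obj.pid in self._objects:
            raise ValueError(f"PID already registered: {obj.pid}")
        self._objects[obj.pid] = obj

    def resolve(self, pid: str) -> DigitalObject:
        return self._objects[pid]


registry = Registry()
recording = DigitalObject(
    pid="hdl:00000/example-1",   # made-up identifier, not a real Handle
    bitstream=b"\x00\x01\x02",
    metadata={"title": "Example recording", "discipline": "linguistics"},
    repositories=["repo-a.example.org", "repo-b.example.org"],
)
registry.register(recording)
resolved = registry.resolve("hdl:00000/example-1")
print(resolved.metadata["title"], len(resolved.bitstream), resolved.repositories)
```

The point of the sketch is that nothing in it is discipline specific: any software that agrees on these few notions can exchange and interpret registered data in the same way.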
                                                             5.5 Metadata Standard Registry (MDR)
5.2 Data Type Registries (DTR)                                  As has been described the usage of proper
   The DTR group [25] created a specification for            metadata is still in its infancy and there are many
data type registries that allow users to link data           reasons for this. One reason certainly is that many
types of various sorts with functions (executable            labs still do not know which metadata they should
code). Data types can be simple types such as                use, where they can find suitable vocabularies and
semantic categories (temperature, noun, etc.) or             tools, etc. The MDR group [28] offers a registry
complex types such as scientific digital objects             which allows researchers to look for most suitable
(complex annotated images, time series, tables,              metadata schemas. Therefore this MDR will help
etc.). DTRs can be used for example to carry out             data practitioners that are looking for proper
mappings automatically when simple types such as             metadata solutions. More work in the metadata area
“temperature” occur or start for example                     is going on within RDA.
visualization software when complex types are
found. Such DTRs would overcome the problems                 5.6 Data Citation (DC)
we so often have with unknown data types which                  The Data Citation group [29] worked out
we receive and where we do not know how to                   suggestions of how to cite so-called dynamic data,
process and interpret them. Thus we see an                   i.e. data that changes while people are already
enormous impact for DTRs in daily practice.                  working with it and referring to it. All data coming
                                                             in from seismological sensors for example will
5.3 PID Information Types (PIT)                              immediately be used when it becomes available for
   The PIT working group [26] produced a                     processing even if data samples in the sequences
common API (Application Program Interface) to                are missing due to transmission delays for example.
unify access to Persistent Identifier (PID) service          How can researchers refer back to these incomplete
providers. Currently there are different PID                 versions of data? This is a problem that many
systems (Handle/DOI7, ARK, etc.) and many                    disciplines have and this group worked out a
different service providers all having their own             suggestion how to solve this citation problem so
regulations making it very cumbersome to get for             that it could be implemented in all software and
example the checksum of a Digital Object to check            procedures.
its identity and integrity. Applying this unified API
together with some basic data types such as                  5.7 Repository Audit and Certification
                                                             (RAC)
7
                                                                As indicated above quality assessment of
  DOIs are Handles with a special prefix and used            repositories (centres) is increasingly important to
to refer to published collections. Handle/DOI                raise the level of trust and the RAC group [30]
services are available worldwide.


wants to come up with a unified standard. A few                  looking for further adopters of these results by
suggestions have been made such as by Data Seal                  offering funding for collaboration projects.
of Approval [31] and World Data Systems [32].
These two suggestions are already widely used and                   We should add here that RDA is obviously
so similar that the responsible initiatives decided to           entering a new phase. While the first 5 working
join forces to make their guidelines compatible                  groups were started at the first plenary in March
with each other. It is widely agreed that the                    2013 each of them focusing on their specific topic
resulting set of guidelines is a good basis to certify           under high time pressure, the experts now
trusted repositories worldwide8.                                 understand that they need to synchronise more to
                                                                 achieve the needed coherence of all results. One
5.8 New RDA Phase                                                consequence was to set up the Data Fabric Interest
   At the fifth plenary (P5) we had a first adoption             Group (DFIG) which is now bundling forces to
day [33] where experts from different disciplines                understand all components that are required to
and institutions presented their way of making use               come to efficient and reproducible data intensive
of these early results. The presentations showed                 sciences. Figure 3 indicates briefly the topic being
that the RDA results were not just an academic                   addressed9. Data production and consumption in


     Fig. 3: An abstract view of the typical data creation and consumption cycle as used in labs doing data
    intensive sciences. DFIG now asks which components are needed to run such a cycle efficiently and in a
      self-documenting way, and how these components need to interact. The figure also indicates how the
                        working groups that have finished or are finishing fit into this cycle.

enterprise, but indeed fulfil concrete needs of early            the daily data driven work can be indicated by a
adopters in particular since in some cases first                 cycle where at a certain moment new raw data is
implementation versions are available and can be                 being created and in some form being
used. Currently, RDA Europe is for example                       organised/registered and put into a store.
                                                                 Researchers who want to make use of data define a
                                                                 new (virtual) collection by selecting data from
8
  We note here that there are several further                    repositories and then carry out some processing
certification schemes that go more in-depth on                   steps on it which can be management or analytical
specific aspects such as the “Security for                       operations. The result is a new collection of data
Collaborating Infrastructures Assessment and                     which should be registered and stored again. The
Modification Record” (SCI) for security aspects, or              questions addressed are now which components are
the NESTOR seal (based on DIN 31644) or ISO                      needed to run such a "fabric" efficiently and self-
16363 certification for general data repository                  documenting and how these components should
aspects. The DIN and ISO certifications are
                                                                 9
extremely detailed and thorough, and thus fairly                  A White Paper describes DFIG in more detail
costly to implement.                                             [34].


interact. Figure 3 also indicates how the finishing              [Fig. 4 diagram labels: researchers, specifications;
working groups fit into this cycle.                              relations: influence, facilitate, enable]

Currently the DFIG is collecting many Use Cases
to build on what people are already doing and to
abstract from these Use Cases to "common                          Fig. 4: It indicates schematically the essential
components" that are required. Such common                              relationships between researchers,
components would include for example a global                    infrastructures and the specification work such
PID system10 providing PID registration and                                          as in RDA.
resolution mechanisms that can be used by
everyone. Everyone interested should be motivated                specifications as a joint effort of data practitioners,
to contribute Use Cases that will influence the                  i.e. researchers and infrastructure providers.
discussions about common components. A first
paper to accelerate discussions has been made                       Information infrastructures in our distributed
available by a number of distinguished experts                   landscape of data and computational services get
from various regions [35].                                       very complex and involve several layers, which is
                                                                 sketched in the diagram drawn by the High Level
5.9 RDA Summary                                                  Expert Group on Scientific Data (Figure 5) [3].
   RDA is still a very young initiative and its                  This diagram aims to work out the difference
success mainly depends on the willingness of data                between discipline specific and common services
practitioners to spend time on global and cross-                 that users (top layer) will use probably without
disciplinary11 problem solving, on the quality of                noticing who will give the services they are using.
their results, and their uptake by scientific projects           Initiatives such as EUDAT were started to offer
worldwide. For TCP/IP in its early days, there was               common services (bottom layer) and thus to
nothing particular that distinguished it from other
suggestions. It was its layered approach and
robustly running code that finally convinced people
worldwide to adopt the standard. RDA needs to do
a lot to have similar success and it needs strong
infrastructure pillars that provide and maintain
services.

6. Infrastructure Pillars
   As described, RDA is only working on
specifications and it is neither providing services
nor maintaining code. It will rely on powerful
centres and federations to provide the
infrastructures that are finally required to transform
specifications into real services that enable efficient
data intensive sciences. In the same way we can                  Fig. 5: It schematically indicates 3 layers of the
state that researchers in general are not so much                so-called Collaborative Data Infrastructure where
interested in specifications of interfaces for                   community based infrastructures offer community
example, but in the services that will facilitate their         specific services and e-Infrastructures offer common
work. In a simplified way figure 4 indicates the              discipline crossing services. This was seen by the EC as
essential relationships between researchers as                             a blueprint for funding programs.
consumers of facilitating services who would also
like to influence specification building to ensure               complement the typical ESFRI layer (middle layer)
the emergence of useful services, infrastructures                with many European research infrastructures in
that are built compliant to the specifications to                various research disciplines.
ensure interoperability of the services and
initiatives such as RDA which establish the                         The first ESFRI roadmap from 2006 [36] led to
                                                                 44 research infrastructures leading to an intensive
10
                                                                 and concerted European activity across many
   The Handle System (http://www.handle.net/) is                 disciplines. Most of these infrastructure initiatives
such a global PID system supervised and managed                  are heading towards building persistent distributed
by the international DONA Foundation and it is                   information infrastructures.
also the basis of the DOI and other service providers
such as EPIC in Europe.
11
   RDA also includes some disciplinary groups
which are using the global nature of RDA to
achieve community agreements.


One example is the CLARIN initiative [37] in the                EUDAT to make use of the advanced services that
area of language resources and technology which                 are offered by them.
has recently achieved the status of an ERIC 12.
CLARIN is based on strong and federated centres                 6.1 EUDAT
in a variety of European countries that share the               EUDAT is a federation of well-resourced and
effort in defining standards together with the                  partly national data and compute centres in various


     Fig. 6 shows the federation of centres across Europe that is the basis of EUDAT’s e-Infrastructure and the 5
        basic user services it offers to the research community. In addition to the 5 user services it established
        system services such as an authentication and authorisation infrastructure and a service to register and
                                                resolve persistent identifiers.
community, in aggregating digital language                      countries as figure 6 indicates. Within its first three
resources, and in offering joint services, in                   years EUDAT invested all efforts in developing 5
managing and curating data with discipline specific             basic services in collaboration with, initially,
knowledge and others. The services offered by                   5 communities13 (climate modelling,
CLARIN include deposit possibilities, a joint                   earth plate observation, human physiology,
metadata catalogue called Virtual Language                      biodiversity and language resources and
Observatory [38], a distributed workflow tool                   technology). B2SHARE, B2DROP and B2FIND
allowing users to analyse texts in various languages            are services directed to the end users meant for
and many smaller services. However, CLARIN                      dealing with long tail type data. B2SAFE is a
centres are not equipped to offer massive compute               service that allows replicating large data sets
power to all possible users from all over Europe                between a community centre and the EUDAT
who may want to execute workflows or use large                  centre network. The B2STAGE service is meant to
storage systems to manage large data sets.                      move data sets from the EUDAT store to the
Therefore research infrastructures such as CLARIN               workspaces of powerful computers of different
make liaisons with e-Infrastructures such as                    types (HPC, etc.) to carry out computations and to
EUDAT and pay for such common services. All                     return the results. All data in EUDAT are
research infrastructures from the different research            registered, i.e. all digital objects have PIDs and are
domains are looking for similar options if they are             associated with metadata to make them findable
data and compute oriented.                                      and accessible.
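
The replication step mentioned for B2SAFE, combined with an integrity check of the kind discussed under Practical Policies, can be sketched as follows. This is not the B2SAFE interface: the function names, directory names and file content are assumptions, and a local file copy stands in for a transfer between centres.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def replicate(source: Path, target_dir: Path) -> Path:
    """Copy source into target_dir and verify the replica by checksum."""
    target_dir.mkdir(parents=True, exist_ok=True)
    replica = target_dir / source.name
    shutil.copyfile(source, replica)
    if sha256_of(source) != sha256_of(replica):
        raise IOError(f"replica of {source} does not match the original")
    return replica


# Hypothetical layout: one directory per centre inside a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    community = Path(tmp) / "community_centre"
    community.mkdir()
    original = community / "dataset.bin"
    original.write_bytes(b"example data" * 1000)
    replica = replicate(original, Path(tmp) / "data_centre")
    replica_ok = sha256_of(replica) == sha256_of(original)
    print("replica verified:", replica_ok)
```

Storing the checksum together with the PID record would let any later consumer repeat the same verification without access to the original copy.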

   The ESFRI organisation and the EC are still                     It should be added here that federating data
actively starting new research infrastructures. To              centres and their collections was and is a major
come to an optimized eco-system of information                  challenge and currently not scalable. The reason for
infrastructures all ESFRI projects and beyond (such             this can be found mainly in the data organisations
as Human Brain Project) are seeking collaborations              where each centre has chosen a different solution.
with e-Infrastructures such as PRACE [39] and                   This lack of interoperability leading to enormous
                                                                costs is one of the reasons why EUDAT is very
12
  ERIC is a special organisational template
                                                                13
invented to allow ESFRI research infrastructures to               Currently EUDAT is closely interacting with 32
become European legal entities.                                 communities.


much interested in harmonised solutions being               services they are expecting. Yet the stakeholders
worked out by RDA, for example in the DFT                   are still discussing which concept will be the best
group. Due to this close interest EUDAT declared            to address the eminent challenges posed by the data
that it will try out RDA outputs where possible and         deluge and the need to optimize data sharing and
thus act as an RDA testbed in Europe.                       re-use in the USA. Recently the leading persons in
                                                            RDA US agreed to ask NDS to act as national
   EUDAT just received its 2nd funding grant for 3           testbed center for RDA results.
years which needs to be used to stabilize and
improve the services being offered, work out a              7. National Level Pillars
sustainable funding model and look for                         Also at the national level in Europe new
collaborations     with    other     European     e-        organisational structures are being tested and
Infrastructures such as PRACE. This led to an               established to meet the challenges of data intensive
additional work item which is devoted to                    sciences.
improving the exchange of data between EUDAT
and PRACE and demonstrating this as an efficient            7.1 Max Planck Society
service with the help of concrete data and compute             In the Max Planck Society an IT Strategy
bound projects. Future challenges are anticipated           Committee was founded a few years ago to come
by also strengthening the work on executing                 up with advice how to reshape the IT service
automatic workflows. It is understood that data             structure in its organisation to maintain
science needs to turn increasingly often to                 competitiveness of its research. With the
automatic and self-documenting workflows to                 introduction of parallel computers many years ago
make its results reproducible. Yet the challenges to        the Computer Centre in Garching got the task to
let users quickly deploy and execute complex                provide not only high performance compute
software close to where the data is stored, i.e.            capacity but also to provide expertise in
operate in a distributed environment, are huge and          parallelising relevant domain specific software
severe barriers need to be removed. But EUDAT               codes for simulation and analytics. In collaboration
needs to demonstrate that it finally can offer              with domain experts such code was optimised
services similar to Amazon and other companies              allowing optimal use of HPC architectures. The
where users can execute their software in a virtual         optimal solution for such code parallelization was
machine environment and basically pay for the               thus found by bringing together expertise and
cycles used.                                                resources of each institute with central expertise
                                                            and resources such as storage capacity and compute
   In the coming period EUDAT will also be faced            power. The strategy committee realized that the
with a new initiative and request of the European           huge increase of data and the challenges of data
Commission in the realm of Open Science and                 intensive sciences require a new approach in so far
Innovation [40] called the European Open Science            as it makes sense to also provide central expertise
Cloud. The EC wants to have a “cloud service” for           and facilities in data management, curation and
all European researchers without having defined its         analytics.
exact specifications yet. A high level expert group
is being formed that will work out the                         As a consequence, the centre in Garching got a
requirements. According to EC experts the term              new name (Max Planck Computing and Data
“cloud service” is meant in the broad sense, i.e. it        Facility, MPCDF) to indicate the change in focus,
needs to include the necessary structures for               and was extended with data experts having
persistent identifiers, metadata, relations, etc.           expertise in mathematics and algorithms in typical
                                                            data analytics applications which are largely
6.2 National Data Service (NDS)                             discipline independent. The idea is to carry out
   Also in the USA an attempt is being made under           collaborations between the centre and the various
the lead of NCSA [41] to set up a National Data             institutes and their departments that cannot invest
Service (NDS) [42] and to offer similar cross-              in the specific knowledge required and that do not
disciplinary data services compared to EUDAT in             have the local resources to store and manage all
Europe and ANDS [43] in Australia. The NDS is               data and to carry out the required computations.
an emerging vision for how scientists and
researchers across all disciplines can find, reuse,            We will use the NoMaD (Novel Materials
and publish data. It wants to build on the data             Discovery) Repository project [44] which has been
archiving and sharing efforts already underway              selected as one of the European Centres of
within specific communities and to link them                Excellence projects as an example for the typical
together with a common set of tools.                        collaboration between a leading research institute
                                                            in the MPS and its MPCDF centre.
  Currently the NDS is focusing on collaborations
with some communities to find out what kind of


   Theoretical material scientists worldwide are              7.2 Approaches in NL
doing experiments with a number of well-known                    Also in countries such as for example the
chemical software packages (some at petascale                 Netherlands new strategies are being tested. In
performance) to compute possible characteristics              addition to strengthening domain specific centres of
for materials. These simulations are typically run            different types new centres have been established
on HPC machines after having carried out deep                 to structure the data landscape. DANS [45] and
optimization of the software code tuned to certain            3TU [46] have received the task to specialise on
architectures. Until now the resulting data has been          data management and curation. They should make
used to write scientific papers, but was not                  use of the data services of the national data and
considered valuable as such. This attitude is                 compute centre SARA [47]. In addition the
changing due to the fact that as in other research            eScience Centre [48] has been established to run
disciplines the researchers see a value in re-using           collaborative projects where discipline experts and
data in different contexts, in allowing others to do          experts with centrally aggregated expertise are
new kinds of computations and to prevent doubling             shared to meet the challenges of data intensive
the work. The repository is meant to be a centre for          science. All these national service providers are
storing results of simulation runs being identified           requested to synchronise their activities to come to
by DOIs and described by proper metadata. Thus                an     efficiently     organised     eco-system   of
proper data organization and stewardship is the basis of      infrastructure pillars and services.
the work.
                                                              8. Conclusions
                                                                 Data Intensive Science (DIS) is one facet of the
                                                              digital change which we are currently experiencing
                                                              and which will change not only science but also
                                                              societies substantially. DIS which will be open to
                                                              many to exploit its full innovative power and not
                                                              exclusive to a few will depend on a change of
                                                              culture towards open data and accessibility of
                                                              services. In the European Union and its member
                                                              states community-driven research infrastructures
                                                              and e-Infrastructures tackling common cross-
                                                              disciplinary challenges have been started to address
                                                              the needs for an efficient eco-system of services
    Fig. 7 indicates the intentions of the Novel              enabling data intensive work. The US did not make
 Materials Discovery (NoMaD) project to                       this distinction, but under the term
  federate and aggregate all data stemming                    “cyberinfrastructure” also community-driven and
 from material experiments to enable easy access              more commons-driven projects were initiated.
                     and re-use.
                                                                 After almost a decade of experience in
                                                              infrastructure building it is obvious that there are
   In collaboration with the researchers of the Fritz-        still many social and technical barriers prohibiting
Haber Institute the MPCDF experts are developing              efficient and cost-effective data usage and
software to transform the incoming data to a                  reproducible results. In fact one can argue that only
normalized and compressed format, developing the              active infrastructure building made many of the
repository software, the user upload, access and              barriers visible to all stakeholders. The time period
search interfaces, and the needed data management             between the invention of TCP/IP and its broad
tools. In addition, novel analytic tools are being            uptake to enable efficient communication between
developed in collaboration between the involved               compute nodes took about 15 years. Several data
centres to allow graphical searches, to carry out             scientists and infrastructure builders from mainly
machine-learning based comparisons on data sets,              Europe, US and Australia agreed that it is time to
to do smart visualizations supporting voyaging                accelerate the process of overcoming the many
methods, etc. Typically all these operations on the           barriers for efficient data usage since waiting for
aggregated data will be executed by making use of             another decade to overcome the most severe
“trivial” parallelization techniques such as enabled          barriers is not acceptable. Setting up the RDA based on
by Map-Reduce methods on appropriate Hadoop                   similar principles as IETF (bottom-up, rough
clusters, i.e. the repository will be hosted at               consensus, running code, lean governance) was the
MPCDF and the computations will be carried out                preferred choice of the data experts and this choice
on computers offered by MPCDF.                                was supported by the funding organizations.
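
The "trivial" parallelization referred to above can be sketched as a map step that treats every record independently, followed by a reduce step that combines the results per key. This is not the NoMaD software and it does not use Hadoop: the records, their values and the function names are made up for illustration, and a local thread pool stands in for a cluster.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical simulation records: (material, computed energy). Values are invented.
RECORDS = [
    ("Si", -5.42), ("Si", -5.40), ("GaAs", -4.10),
    ("GaAs", -4.15), ("Si", -5.43), ("C", -7.37),
]


def map_record(record):
    """Map step: emit a (key, value) pair for one record, independent of all others."""
    material, energy = record
    return material, energy


def reduce_pairs(pairs):
    """Reduce step: keep the lowest energy seen for each material."""
    lowest = {}
    for material, energy in pairs:
        if material not in lowest or energy < lowest[material]:
            lowest[material] = energy
    return lowest


# The map step needs no communication between workers, so it parallelizes trivially.
with ThreadPoolExecutor(max_workers=4) as pool:
    mapped = list(pool.map(map_record, RECORDS))

lowest_energy = reduce_pairs(mapped)
print(lowest_energy)
```

Because the map step has no dependencies between records, the same pattern scales from a thread pool to a cluster simply by distributing the records across more workers.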

                                                              With this background in mind it is not surprising
                                                              that almost all strong European infrastructure


centres are very active in EUDAT as well as in                 [4] Human Brain Project:
RDA and that for example also ANDS and NDS                     https://www.humanbrainproject.eu/
engage actively in RDA. The Max Planck                         [5] Herman Stehouwer, Peter Wittenburg, RDA
Computing and Data Facility for example will                   Data Practice Report, 2014, http://europe.rd-
coordinate RDA Europe from September 2015, and                 alliance.org/sites/default/files/RDA-Europe-D2.5-
its members are in the Technical Advisory Board,               Second-Year-Report-RDA-Europe-Forum-
co-chairing the Data Foundation and Terminology                Analysis-Programme.pdf
and Data Fabric Interest Groups and are leading a              [6] ESFRI Roadmap 2006,
Work Package in EUDAT. SARA and DANS, for                      http://ec.europa.eu/research/infrastructures/index_e
example, are also leading activities in EUDAT and              n.cfm?pg=esfri-roadmap
are actively engaged in RDA groups. NDS is co-                 [7] Open Access,
chairing for example the Data Fabric Interest                  http://en.wikipedia.org/wiki/Open_access
Group and ANDS is represented in the Council and               [8] http://research.microsoft.com/en-
Technical Advisory Board of RDA.                               us/um/redmond/events/fs2010/presentations/miche
                                                               ner_environ_data_mgmt_rfs_71210.pdf
   In addition to accelerating global agreement                [9] Data Seal of Approval:
finding to improve data sharing and re-use and thus            http://datasealofapproval.org/en/
to enable inclusive data intensive science two main            [10] TCP/IP Protocol:
reasons can be mentioned for the engagement: a)                http://en.wikipedia.org/wiki/Internet_protocol_suit
engaging its experts in cutting-edge developments              e
will make them fit for the coming challenges and b)            [11] Herman Stehouwer, Peter Wittenburg,
bringing in their expertise will influence decision            Principles for Data Sharing and Re-use: are they all
taking. So far RDA is too young to present final               the same?, 2015
conclusions about the question whether the                     http://hdl.handle.net/11304/1aab3df4-f3ce-11e4-
expectations were met.                                         ac7e-860aa0063d1f
                                                               [12] Peter Wittenburg, Leif Laaksonen, Hermann
   We need to accept that the data landscape is                Stehouwer, Raphael Ritz, Living with Data
changing rapidly and that new structures that have             Management Plans, 2015
been set up to facilitate data intensive sciences are          http://hdl.handle.net/11304/ea286e5a-f3d1-11e4-
often still in a test phase. Essential questions in the        ac7e-860aa0063d1f
data domain are still not fully answered yet such as:          [13] ICRI 2012 Conference Copenhagen:
Which persistent structures need to be funded in               http://www.icri2012.dk/www.ereg.me/ehome/index
addition to libraries that often do not yet have the           06e1.html
skills to participate in the emerging data services            [14] Research Data Alliance: http://rd-alliance.org
domain? What is the optimal division between                   [15] Naoyuki Tsunematsu, RDA plenary Keynote,
discipline specific and common services? What is               San Diego, 2015: https://rd-alliance.org/keynote-
the most optimal way to share specialised and                  naoyuki-tsunematsu.html
expensive data experts that are scarce? Which are              [16] Daniel Castro, Travis Korte, Open Data in the
the common components that need to be specified                G8, 2015,
to come to global, interoperable and well-                     http://www2.datainnovation.org/2015-open-data-
maintained services supporting data intensive                  g8.pdf
sciences optimally?                                            [17] Research Data Alliance - Europe,
                                                               http://europe.rd-alliance.org
   The EU and several of its member states as well             [18] RDA Case Statements, https://rd-
as the US decided to take an active role to exploit            alliance.org/working-and-interest-groups/case-
the possibilities by taking concrete actions and by            statements.html
asking data science experts to develop and test out            [19] ESFRI Projects,
bottom-up driven models.                                       http://ec.europa.eu/research/infrastructures/index_e
                                                               n.cfm?pg=esfri
9. References                                                  [20] EU e-Infrastructures,
[1] MPI for Psycholinguistics, http://www.mpi.nl               http://cordis.europa.eu/fp7/ict/e-infrastructure/
[2] Tony Hey et.al., The Fourth Paradigm - Data                [21] EUDAT e-Infrastructure, http://www.eudat.eu
Intensive Scientific Discovery, 2009,                          [22] Bernard Schutz et.al., RDA Europe Science
http://research.microsoft.com/en-                              Workshop Report, 2014, http://europe.rd-
us/collaboration/fourthparadigm/4th_paradigm_bo                alliance.org/documents/publications-reports/rda-
ok_complete_lr.pdf                                             europe-science-workshop-report
[3] John Wood et.al., Riding the Wave Report,                  [23] John Wood et.al., The Data Harvest, 2014,
2012, http://cordis.europa.eu/fp7/ict/e-                       https://europe.rd-
infrastructure/docs/hlg-sdi-report.pdf                         alliance.org/documents/publications-reports/data-


                                                          12
harvest-how-sharing-research-data-can-yield-                http://hdl.handle.net/11304/33430f2e-f598-11e4-
knowledge-jobs-and                                          ac7e-860aa0063d1f
[24] RDA Data Foundation and Terminology WG,                [36] ESFRI Roadmap 2006,
https://rd-alliance.org/groups/data-foundation-and-         http://ec.europa.eu/research/infrastructures/index_e
terminology-wg.html                                         n.cfm?pg=esfri-roadmap&section=roadmap-2006
[25] RDA Data Type Registry WG, https://rd-                 [37] CLARIN Research Infrastructure,
alliance.org/groups/data-type-registries-wg.html            http://www.clarin.eu/
[26] RDA PID Information Type WG, https://rd-               [38] CLARIN Virtual Language Observatory,
alliance.org/groups/pid-information-types-wg.html           http://clarin.eu/content/virtual-language-
[27] RDA Practical Policy WG, https://rd-                   observatory
alliance.org/groups/practical-policy-wg.html                [39] PRACE e-Infrastructure, http://www.prace-
[28] RDA Metadata Standards Directory WG,                   ri.eu/
https://rd-alliance.org/groups/metadata-standards-          [40] EC Open Science and Innovation,
directory-working-group.html                                http://ec.europa.eu/research/conferences/2015/era-
[29] RDA Data Citation WG, https://rd-                      of-innovation/index.cfm
alliance.org/groups/data-citation-wg.html                   [41] National Center for Supercomputer
[30] RDA Repository Audit and Certification WG,             Applications, http://www.ncsa.illinois.edu/
https://rd-alliance.org/groups/repository-audit-and-        [42] National Data Service,
certification-dsa%E2%80%93wds-partnership-                  http://www.nationaldataservice.org/
wg.html                                                     [43] Australian National Data Service,
[31] Data Seal of Approval,                                 http://www.ands.org.au/
http://datasealofapproval.org/en/                           [44] NoMaD - Novel Materials Discovery Project,
[32] World Data Systems, https://www.icsu-                  http://nomad-repository.eu/cms/
wds.org/                                                    [45] Data Archiving and Networked Service,
[33] RDA Adoption Day, San Diego, 2015,                     http://www.dans.knaw.nl/nl
https://www.rd-alliance.org/plenary-meetings/fifth-         [46] 3TU Data Centrum,
plenary/programme/adoption-day.html                         http://datacentrum.3tu.nl/home/
[34] RDA Data Fabric IG, https://www.rd-                    [47] SURF Sara, https://surfsara.nl/
alliance.org/group/data-fabric-ig.html                      [48] Netherlands eScience Center,
[35] Bridget Almas et.al., Data Management                  https://www.esciencecenter.nl/
Trends, Principles and Components - What Needs
to be Done Next?, 2015,


                                                       13