Specifying and Implementing Data Infrastructures Enabling Data Intensive Sciences © Peter Wittenburg, Herman Stehouwer Max Planck Data and Compute Center, Garching/Munich peter.wittenburg@mpi.nl, herman.stehouwer@rzg.mpg.de Abstract course much larger volumes of data were being Examples from Psycholinguistics – a humanities processed and they can look back to a much longer discipline – show that data intensive research is history of data oriented work. changing all scientific disciplines dramatically. Data intensive sciences pose unprecedented It was the book "The Fourth Paradigm – Data challenges in data management and processing. A Intensive Scientific Discovery" [2] edited by Tony survey in Europe showed clearly that most of the Hey and colleagues that introduced “data intensive research departments are not prepared for this step science” as the 4th paradigm of scientific discovery and that the methods that are used to manage, by referring to a talk given by J. Gray. It raised curate and process data are inefficient and too much attention for the concept behind this new costly. The Research Data Alliance, as a bottom- paradigm. Gray distinguishes 4 paradigms that are up organized global and cross-disciplinary co-existing today: (1) Empirical Science describing initiative, has been established to accelerate the natural phenomena, (2) Theoretical Science using process of changing data practice. After only two models to achieve generalizations, (3) years RDA produced its first concrete results, Computational Science simulating complex which have to demonstrate their potential. In phenomena and (4) Data exploration by unifying particular, the infrastructure builders are requested theory, experiment and simulation. Indeed, we can to act as early adopters of RDA results. 
The observe that science is changing in so far as finding European Commission and its member states have meaningful patterns in data sets becomes an taken serious steps to establish an eco-system of essential approach. Increasingly more powerful and research infrastructures and e-Infrastructures numerous sensors, improved network connections, anticipating the challenges imposed by the data more powerful and numerous computers and more deluge which will enable broad uptake of the advanced algorithms are key pillars for this paradigm of data intensive science. Research development. The "Riding the Wave" [3] report organisations have recognised these challenges as created by a High Level Expert Group of the well and taken first steps to adapt its structures. European Commission (EC) was one of the However, we need to understand that we are in a documents that summarized the specific data phase of gigantic changes which implies that challenges and opportunities, and requested actions measures currently being taken need to be by the EC to enable data intensive sciences for a interpreted as tests on the way to new solid and large number of researchers and not only those that sustainable structures. have sufficient funding to curate all data and software to be integrated to make use of it. 1. Enabling Data Intensive Sciences Quite a number of scientific institutes have been We see a number of trends which we can data oriented for a long time already. For instance, summarize as follows: most of the research of the experimental and x An increasing number of research disciplines theoretical institutes of the Max Planck Society was adopted data intensive methods due to new based on data. Even an institute that belongs to the technological and methodological possibilities. 
humanities section of the Max Planck Society such During the last decades these changes were as our former affiliation - the Institute for extreme in biological and neurological Psycholinguistics [1] was oriented from the start disciplines. towards the analysis of speech, eye movement and x The amount of data and its complexity in terms gesture recordings, detecting meaningful patterns, of creation contexts, data types and relations and building models to simulate speech perception. are increasing extremely. In physics institutes (fusion, astronomy, etc.) of x The Internet allows us to offer data via the web to be re-used by others. _______________________________________ x This enables us to combine data sets in new Proceedings of the XVII International ways across institutional, national and Conference «Data Analytics and Management discipline borders. in Data Intensive Domains» (DAMDID/RCDL’2015), Obninsk, Russia, 1 x Mathematical methods have advanced to cope models and in both examples data cannot come with heterogeneous data sets and we see large from one project or institute, but from many libraries with statistical, stochastic and research labs. Researchers doing this kind of machine learning methods becoming available. research know how difficult it is to find, access and x The total amount of available CPU and storage combine the required data. Such research is very capacity allows researchers to do large cost intensive and raises the questions whether we amounts of computations on increasingly large can continue without serious changes, and whether data sets. the available infrastructures are sufficient. Despite the increase in compute capacity, 2. Human Brain Project however, we can also observe an increasing An even more extreme example for the shift analysis gap, i.e. the fraction of data we are able to towards the 4th paradigm is taken from life process in a way that we can extract knowledge is sciences. 
The recently started Human Brain Project getting smaller. The reasons for the analysis gap (HBP) [4] (as an EC flagship project) has as visions are many and not subject of discussion in this (a) to be able to simulate at physiologocial level paper. first rat brains and in a follow up phase human brains (in silico experiments) and (b) to predict Two examples taken from a humanities brain diseases from patterns found in recorded data discipline show the fundamental changes towards sets at an early stage. The main goal of the latter data intensive science that could not have been (medical informatics sub project in HBP) is being carried out a few years ago. When studying for illustrated in figure 1. Researchers would like to example the evolution of human languages over correlate observed phenomena such as specific thousands of years linguists until recently based deficits due to brain diseases with all types of their theories on comparing fragmented recordings that can be found from corresponding descriptions of colleagues about several languages. patients such as brain images of different types, Currently, large feature matrices are extracted gene sequences, protein data and perhaps even describing characteristics of all languages in a reaction time measurements. Without having a particular region such as for example those spoken in Austronesia and these matrices are fed into phylogenetic algorithms to calculate most probable dependency trees that indicate how languages may have influenced each other over thousands of years. For this research a large database is required and also more powerful computers are needed than linguists were using traditionally to let the algorithms generate meaningful optima. The application of massive crowd sourcing techniques in linguistics for example to understand human communication including multimodal interaction can be used as another example to indicate the dramatic changes in research towards a data centric perspective. 
These techniques generate many parallel data streams originating from smartphones that need to be annotated immediately by machine processing tools to make them Fig. 1: One example for this new paradigm as it is available for scientific studies. This automatic used in neuro-sciences (HBP) is shown. For example annotation requires smart pre-processing and smart phenomena such as created by specific brain data management. In this setup an increasing diseases can be observed. Yet there is no chance to number of parallel operating detectors must be model the complexity of the human brain to make trained to detect patterns in speech and video statements about their physiological origins. Data streams in real time with the help of stochastic from various sources are correlated with the machines. It is simply the shear amount of data phenomena to find those patterns in the data that are requiring new ways of processing to enable this causing the observed deficits. type of research leading to better assumptions about what guides our interactions. model of the human brain at hand this correlation The basis of such methods as described in the would allow researchers nevertheless to detect co- example above is the availability of large amounts occuring patterns in the data that seem to cause the of data to estimate the many free parameters of the observed phenomena. Machine learning methods 2 are used to generate meaningful signatures from The goals are ambitious1 and it is admitted that physical features in the data that then can be used the gap between physiological modeling and to predict potential diseases from patients. cognition is still huge. 
However, the HBP indicates how data intensive science is pushed to its No assumptions are made about the structure and extremes in life sciences: (a) huge amounts of data functioning of the brain, no assumptions are made addressing many different levels of brain how genes may influence brain structure and organization are needed to feed the atlases, to functioning, etc. since we don’t have sufficient enable analyses needed to feed and test the validity knowledge in these areas. Nevertheless, by using a of the models and (b) much computer power will large database of aligned data it is assumed that be required to carry out the necessary computations researchers can relate physical patterns with first within the project and afterwards by the phenomenological observations first for early interested researchers. prediction, but later also for improved medication. Full brain simulations will typically cover spatial In addition to the problems described in the next scales from nanometers (proteins) to centimeters section the HBP is confronted with difficult privacy (brain) and energy scales from 10 femto Joule at and ethical issues making access to data even more biological (Genome, Transcriptome, Proteome) up problematic. Distributed data mining solutions are to 1 Joule at complex brain level (cognition). investigated to overcome these problems for example. To achieve its goals the HBP defined in total 13 sub-projects each of them having a size of a large 3. Data Practices project. 
Here we will briefly describe the new A large survey about data practices [5], based on informatics-based platforms that are meant to offer some 120 interactions with data practitioners2 from the research community the possibility to work on various disciplines, and two RDA Europe human brain issues with the help of a set of strong workshops with leading European scientists [22] and highly integrated tools: made very clear that the current data practices are x Neuroinformatics (searchable atlases and not adequate to support such data intensive science analysis of brain data) in an efficient and cost-effective way. x Brain Simulation (building and simulating multi-level models of brain circuits and The major findings of this survey can be functions, incl. for example models of neural summarized as: microcircuits of up to a million neurons) x The ESFRI3 [6] discussion process and its x Medical Informatics (see figure 1) project initiatives, as well as recent x Neuromorphic Computing (brain-like developments in e-Infrastructures, raised much functions implemented in hardware) awareness about data issues, the practices and x Neurorobotics (testing brain models and the interaction processes around data simulations in virtual environments) management and access crossing discipline x High Performance Computing (providing the boundaries. necessary computing power by architectures x Open Access [7] to publications and now also that allow memory intensive applications and to data is widely supported but in practice new ways of visually interacting with there are so many hurdles that most data is still simulations) not available. x Finding data re-usable for data intensive HPC facilities at 4 centers can be used for the sciences using the web requires new purposes of the HBP: Jülich (6 petaflops peak, 450 mechanisms to establish trust. At this moment TB memory, 8 PB scratch file system) allowing we are lacking such mechanisms. 
simulations to up to 100 Mio neurons (scale of x There is much legacy data out there the mouse brain), Swiss CSCS (836 teraflops peak, 64 integration of which in our re-usable data T, 4 PB) in particular for software development and domain will cost an enormous amount of optimization, Barcelona SC (1 petaflops peak, 100 curation and thus funds. In addition, we are TB) for molecular-level simulations, CINECA (2 petaflops, 200 TB, 5 PB) mainly for data analytics. 1 It should be mentioned that there is a broad debate In addition KIT Karlsruhe provides 3 PB of about the question whether the ambitions of the storage. All centers are linked with 10 Gbit/s. In the HBP are realistic. neuromorphic area SpiNNaker chips are being used 2 The term "data practitioner" is used here as a term that have 18 cores and share 128 MB RAM describing skills of data scientists, data managers, allowing to simulate 16.000 neurons with 8 Mio data stewards, data librarians, etc. since mostly plastic synapses with 1 W energy budget. these terms are not well-defined yet. 3 European Strategy Forum on Research Infrastructures 3 still creating legacy-style data despite all inefficiencies in particular when users do not advancements since it is not suitably organized have direct relations with the creators. and described, which is mainly due to a lack of x There is a clear trend towards using "trustful" trained experts and appropriate software. centres which offer researchers to host, x There is an increasing pressure for almost all manage and access their data. However, there departments to participate in data intensive are many hurdles for centres to offer cross- sciences, but researchers see a lack of expertise border services although economy of scale in adequate data management and workflow factors indicate that much can be gained due to creation/maintenance skills. Currently the available expertise. 
Existing certification researchers need to spend a large fraction of methods such as defined by Data Seal of their time (partly up to 75%) to find, access Approval [9] need to be applied by the centres and curate data to make it fit for their needs. In to raise the level of trust. addition, the practice of many researchers x It is widely agreed that there is a lack of working with manual steps or with ad hoc expertise and knowledge about data issues scripts does not lead to reproducible science. (principles, organization, curation, etc.) and x Data management is still widely based on file that we need to train a new generation of data systems which do not allow capturing the practitioners. It is this lack of experts and increasing amount of “logical” information expertise that hampers progress. Senior scientists agree that changes in data practices are urgently needed, but they hesitate to take steps for mainly two reasons: x they lack guidance towards certain agreed solutions which prevents investments, x they lack the experts that would turn investments into appropriate solutions. 4. Achieving Changes through RDA This raises the questions who can give guidance Fig. 2: The typical decrease of available in navigating in the huge solution space with information about data stored over time as respect to data issues and how can we train the new described by W. Michener is indicated generation towards harmonized solutions that which results in great problems in making guarantee more efficiency and cost-effectiveness use of data. There are various factors and which finally will boost data intensive sciences. moments that lead to this decrease of Here we would like to refer to the early phases of information such as when PhDs leave an the Internet where many solutions were suggested institute without having documented their with different competing approaches. 
It took about data properly which is a very well-known 15 years until agreements on simple principles such phenomenon. Assigning persistent identifiers as TCP/IP [10] for global networks were accepted. and creating appropriate metadata would Basically these agreements led to the boost of help to reduce the speed of losing connectivity which we can now take profit from. information. Quite a number of policy level initiatives have about the data (persistent identifiers, metadata, established rules and principles and there seems to rights, relations, etc.). Ad hoc solutions are be wide agreement [11] about them. An increasing being used amplifying the problem of number of funders are also requesting to add so- "increasing data entropy" as W. Michener [8] called data management plans to grant applications called it (see figure 2). which certainly raise the level of awareness about data issues for many researchers. But due to the x The use of persistent identifiers and metadata problems described above there is also great which would help in identifying, finding and uncertainty how to create such plans that make re-using data is still in its infancy. Ad hoc sense for the many data use cases [12]. An solutions such as handling spreadsheets do increasing conviction of some data practitioners only work for the duration of projects and and some funders emerged that an acceleration of leave chaos afterwards given the increasing the process to come to agreements that help amount of data. changing data practices is urgently required. The x Despite some efforts for specific databases Internet history seems to offer a possible approach: there is in general a lack of explicitness with complement the policy level efforts by an respect to structure and semantic descriptions essentially bottom-up driven initiative where data of the content of data which creates practitioners work on urgent barriers that need to 4 be overcome. To this end a first international meetings. 
Every RDA member can decide to workshop was organized at the ICRI conference initiate such a group and to be successful a case 2012 [13] under the name "DAITF" which stands statement needs to be submitted that must fulfil a for Data Access and Interoperability Task Force. A number of criteria [18]. A Council was setup that joint effort from mainly European, US American has an overlooking role to ensure balanced progress and Australian experts and funders led then to the and adherence to quality rules and processes. A birth of the Research Data Alliance (RDA) [14] in Technical Advisory Board that is elected by the autumn 2012. We like to use the similarity of some RDA members6 will give advice to all actors on characteristics with the Internet Engineering Task content aspects, i.e. respond on questions such as Force, however, it is obvious that the data domain “do the intentions of the Working and Interest has many more facets and challenges to deal with. Groups meet the scope of RDA, do they fulfil the established requirements, do they involve existing We would like to cite Naoyuki Tsunematsu and relevant initiatives, do they intend to remove (Senior Advisor of Japanese Council for Science practical barriers, etc.“. An Organisational and Technology) who pointed to two observations Advisory Board that represents all organizations relevant in this context and which motivated Japan that are organizational members and thus to join the Research Data Alliance [15]. contribute with some funds to the success of RDA x The value proposition for publically funded gives advice on organizational and administrative research is about "stimulating issues. In addition RDA has a Secretariat that competitiveness" but a new strand needs to be needs to organise the plenaries, keep control on the added which is "knowledge discovery on smart processes and doing a variety of other data collections" where professional administration/ organisational tasks. 
A General infrastructures and human skills are the key Secretary has been appointed leading the factors for success. secretarial work and taking responsibility for x There seems to be a correlation between a lack managing RDA global. of motivation to share data in the Japanese academic world and thus a lack of openness While RDA global is the platform where and a decrease in the number of top-level agreements are being achieved in form of international collaborations and of top-level guidelines, procedures, interface and protocol papers which is a concern for policy makers in specifications to overcome barriers, the regional Japan4. branches such as RDA Europe have the task to raise awareness about RDA in their region, After the workshop at ICRI 2012 the European convince experts to participate, interact with many Commission, NSF and NIST in the US and the stakeholders to understand the needs and priorities, Australian Government accepted grant proposals organize the adoption of RDA results, taking care from key experts in their respective regions that of training and education and contributing to the allowed the practitioners to start the RDA work, i.e. costs of RDA Global. RDA Europe for example funding is given to consortiums in the three organises a number of meetings to meet the regions. As one branch the RDA Europe [17] requirements such as interacting with the EC and project was funded as a usual EC project, in member state ministries, European science September 2015 already, the 3rd RDA Europe organisations, European leading scientists, large project will start to allow us to continue the work scale European research infrastructures such as and EC’s new draft work programme 2016/17 ESFRI projects [19] and e-Infrastructures [20] such indicates future perspectives for RDA. First, a as EUDAT [21] and many research communities. 
steering board was established between the three The meetings with leading scientists [22] are of funded initiatives to define a governance structure great importance and have led to useful and procedures for RDA, and it started stimulating recommendations for RDA, most of which will be the practical work. implemented by RDA Europe from September 2015 on. The interactions with policy stakeholders RDA decided to have a very simple structure where led for example to the Data Harvest Report [23] the key roles are given to the Working Groups and setting priorities. Interest Groups5 that meet at plenaries and other 5. Early RDA Results 4 Thus RDA's mission is about building the many The recent G8 Open Data Report [16] indicates social and technical bridges that are required to that in the rating between G8 members Germany make data intensive work much more efficient and and Russia are even behind with respect to thus to allow many researchers to participate in openness of data. 5 It should be noted here that the major difference 6 between the two groups is that the WGs need to Everyone who agrees with the basic rules of RDA come with tangible results after 18 months. can become a member by registration. 5 extracting knowledge by processing virtual “checksum” would allow application programmers collections existing of data coming from various to simply provide one piece of software allowing providers increasingly often across disciplines and them to deal with all PID service providers in the borders. Here we want to briefly indicate the major same way. Since PIDs will have such a central role results of the first working groups that finished in data management and access the impact of a after roughly 20 months (or that will finish within unified API will be enormous. the coming few months) and their possible impact on changing practices. 
5.4 Practical Policies (PP) In particular data management and curation are 5.1 Data Foundation and Terminology guided by specific policies which are then turned (DFT) into executable procedures such as "replicating a Based on many use cases from various data collection" or "checking digital objects' disciplines and countries the DFT Working group integrity" that are mostly used in federated [24] came up with a simple core data model and a environments. The PP group [27] is collecting terminology for registered data. It introduces the many such practical policies from various notion of the Digital Object which is represented institutions and projects, analysing and evaluating by a bitstream, can be stored in various them and suggesting best practices which then can repositories, is identified by a persistent identifier be offered as templates for proven operations. and described by metadata. The model includes a Thus, these templates have the potential to increase few further definitions, but important is to note that the trust level. The work of the group will not end these definitions are fundamental and independent since there are so many areas where best practices of disciplines. If scientists worldwide would adhere can improve the quality and reproducibility of data to such a simple model we could much more easily practices. In collaboration with the EUDAT project understand each other when talking about data and the group is working on an open registry standard would be able to build harmonized software for such best practice PPs. leading to much higher interoperability. 5.5 Metadata Standard Registry (MDR) 5.2 Data Type Registries (DTR) As has been described the usage of proper The DTR group [25] created a specification for metadata is still in its infancy and there are many data type registries that allow users to link data reasons for this. 
One reason certainly is that many types of various sorts with functions (executable labs still do not know which metadata they should code). Data types can be simple types such as use, where they can find suitable vocabularies and semantic categories (temperature, noun, etc.) or tools, etc. The MDR group [28] offers a registry complex types such as scientific digital objects which allows researchers to look for most suitable (complex annotated images, time series, tables, metadata schemas. Therefore this MDR will help etc.). DTRs can be used for example to carry out data practitioners that are looking for proper mappings automatically when simple types such as metadata solutions. More work in the metadata area “temperature” occur or start for example is going on within RDA. visualization software when complex types are found. Such DTRs would overcome the problems 5.6 Data Citation (DC) we so often have with unknown data types which The Data Citation group [29] worked out we receive and where we do not know how to suggestions of how to cite so-called dynamic data, process and interpret them. Thus we see an i.e. data that changes while people are already enormous impact for DTRs in daily practice. working with it and referring to it. All data coming in from seismological sensors for example will 5.3 PID Information Types (PIT) immediately be used when it becomes available for The PIT working group [26] produced a processing even if data samples in the sequences common API (Application Program Interface) to are missing due to transmission delays for example. unify access to Persistent Identifier (PID) service How can researchers refer back to these incomplete providers. Currently there are different PID versions of data? This is a problem that many systems (Handle/DOI7, AWK, etc.) 
and many disciplines have and this group worked out a different service providers all having their own suggestion how to solve this citation problem so regulations making it very cumbersome to get for that it could be implemented in all software and example the checksum of a Digital Object to check procedures. its identity and integrity. Applying this unified API together with some basic data types such as 5.7 Repository Audit and Certification (RAC) 7 As indicated above quality assessment of DOIs are Handles with a special prefix and used repositories (centres) is increasingly important to to refer to published collections. Handle/DOI raise the level of trust and the RAC group [30] services are available worldwide. 6 wants to come up with a unified standard. A few looking for further adopters of these results by suggestions have been made such as by Data Seal offering funding for collaboration projects. of Approval [31] and World Data Systems [32]. These two suggestions are already widely used and We should add here that RDA is obviously so similar that the responsible initiatives decided to entering a new phase. While the first 5 working join forces to make their guidelines compatible groups were started at the first plenary in March with each-other. It is widely agreed that the 2013 each of them focusing on their specific topic resulting set of guidelines is a good basis to certify under high time pressure, the experts now trusted repositories worldwide8. understand that they need to synchronise more to achieve the needed coherence of all results. One 5.8 New RDA Phase consequence was to set up the Data Fabric Interest At the fifth plenary (P5) we had a first adoption Group (DFIG) which is now bundling forces to day [33] where experts from different disciplines understand all components that are required to and institutions presented their way of making use come to efficient and reproducible data intensive of these early results. 
The presentations showed sciences. Figure 3 indicates briefly the topic being that the RDA results were not just an academic addressed9. Data production and consumption in Fig. 3: It indicates at an abstract level the typical data creation and consumption cycle as it is being used in the labs doing data intensive sciences. DFIG's questions are now which components are needed to run such a cycle efficient and self-documenting and how these components need to interact. The figure also indicates how the working groups that finished or are finishing fit into this cycle. enterprise, but indeed fulfil concrete needs of early the daily data driven work can be indicated by a adopters in particular since in some cases first cycle where at a certain moment new raw data is implementation versions are available and can be being created and in some form being used. Currently, RDA Europe is for example organised/registered and put into a store. Researchers who want to make use of data define a new (virtual) collection by selecting data from 8 We note here that there are several further repositories and then carry out some processing certification schemes that go more in-depth on steps on it which can be management or analytical specific aspects such as the “Security for operations. The result is a new collection of data Collaborating Infrastructures Assessment and which should be registered and stored again. The Modification Record” (SCI) for security aspects, or questions addressed are now which components are the NESTOR seal (based on DIN 31644) or ISO needed to run such a "fabric" efficiently and self- 16363 certification for general data repository documenting and how these components should aspects. The DIN and ISO certifications are 9 extremely detailed and thorough, and thus fairly A White Paper describes DFIG in more detail costly to implement. [34]. 7 researchers influence facilitate interact. Figure 3 also indicates how the finishing working groups fit into this cycle. 
Currently the DFIG is collecting many Use Cases to build on what people are already doing and to abstract from these Use Cases to "common components" that are required. Such common components would include for example a global PID system¹⁰ providing PID registration and resolution mechanisms that can be used by everyone. Everyone interested should be motivated to contribute Use Cases that will influence the discussions about common components. A first paper to accelerate discussions has been made available by a number of distinguished experts from various regions [35].

5.9 RDA Summary

RDA is still a very young initiative and its success mainly depends on the willingness of data practitioners to spend time on global and cross-disciplinary¹¹ problem solving, on the quality of their results, and their uptake by scientific projects worldwide. For TCP/IP in its early days, there was nothing particular that distinguished it from other suggestions. It was its layered approach and robustly running code that finally convinced people worldwide to adopt the standard. RDA needs to do a lot to have similar success and it needs strong infrastructure pillars that provide and maintain services.

6. Infrastructure Pillars

As described, RDA is only working on specifications and it is neither providing services nor maintaining code. It will rely on powerful centres and federations to provide the infrastructures that are finally required to transform specifications into real services that enable efficient data intensive sciences. In the same way we can state that researchers in general are not so much interested in specifications of interfaces for example, but in the services that will facilitate their work. In a simplified way figure 4 indicates the essential relationships between researchers as consumers of facilitating services who would also like to influence specification building to ensure the emergence of useful services, infrastructures that are built compliant to the specifications to ensure interoperability of the services and initiatives such as RDA which establish the specifications as a joint effort of data practitioners, i.e. researchers and infrastructure providers.

Fig. 4: It indicates schematically the essential relationships between researchers, infrastructures and the specification work such as in RDA.

Information infrastructures in our distributed landscape of data and computational services get very complex and involve several layers, which is sketched in the diagram drawn by the High Level Expert Group on Scientific Data (Figure 5) [3]. This diagram aims to work out the difference between discipline specific and common services that users (top layer) will use probably without noticing who will give the services they are using. Initiatives such as EUDAT were started to offer common services (bottom layer) and thus to complement the typical ESFRI layer (middle layer) with many European research infrastructures in various research disciplines.

Fig. 5: It schematically indicates 3 layers of the so-called Collaborative Data Infrastructure where community based infrastructures offer community specific services and e-Infrastructures offer common discipline crossing services. This was seen by the EC as a blueprint for funding programs.

The first ESFRI roadmap from 2006 [36] led to 44 research infrastructures leading to an intensive and concerted European activity across many disciplines. Most of these infrastructure initiatives are heading towards building persistent distributed information infrastructures.

¹⁰ The Handle System (http://www.handle.net/) is such a global PID system supervised and managed by the international DONA Foundation and it is also basis of the DOI and other service providers such as EPIC in Europe.

¹¹ RDA also includes some disciplinary groups which are using the global nature of RDA to achieve community agreements.
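To make the notion of "PID registration and resolution mechanisms" more concrete, the following is a minimal in-memory sketch. It is not the Handle System's API: the class `PidRegistry`, its methods, the record fields and the prefix `99999` are assumptions chosen for illustration only. The point it tries to show is that a PID stays stable while the location it resolves to may change.

```python
import itertools
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class PidRecord:
    url: str        # current location of the digital object
    checksum: str   # lets a client verify what it retrieved


class PidRegistry:
    """Toy registry issuing identifiers of the form <prefix>/<suffix>."""

    def __init__(self, prefix: str) -> None:
        self.prefix = prefix
        self._records: Dict[str, PidRecord] = {}
        self._counter = itertools.count(1)

    def register(self, url: str, checksum: str) -> str:
        # Registration: mint a new identifier and bind it to a record.
        pid = f"{self.prefix}/{next(self._counter):06d}"
        self._records[pid] = PidRecord(url, checksum)
        return pid

    def update_location(self, pid: str, new_url: str) -> None:
        # The identifier is persistent; only the record behind it changes.
        old = self._records[pid]  # raises KeyError for an unknown PID
        self._records[pid] = PidRecord(new_url, old.checksum)

    def resolve(self, pid: str) -> Optional[PidRecord]:
        # Resolution: look up the current record, or None if unregistered.
        return self._records.get(pid)


if __name__ == "__main__":
    registry = PidRegistry("99999")  # hypothetical prefix
    pid = registry.register("https://repo.example.org/objects/a", "abc123")
    registry.update_location(pid, "https://mirror.example.org/objects/a")
    print(pid, registry.resolve(pid))
```

A production system would add authentication, replication of the registry itself and richer record types; none of that is modelled here.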
One example is the CLARIN initiative [37] in the area of language resources and technology, which has recently achieved the status of an ERIC¹². CLARIN is based on strong and federated centres in a variety of European countries that share the effort in defining standards together with the community, in aggregating digital language resources, in offering joint services, in managing and curating data with discipline specific knowledge, and others. The services offered by CLARIN include deposit possibilities, a joint metadata catalogue called Virtual Language Observatory [38], a distributed workflow tool allowing users to analyse texts in various languages, and many smaller services. However, CLARIN centres are not equipped to offer massive compute power to all possible users from all over Europe who may want to execute workflows or use large storage systems to manage large data sets. Therefore research infrastructures such as CLARIN make liaisons with e-Infrastructures such as EUDAT and pay for such common services. All research infrastructures from the different research domains are looking for similar options if they are data and compute oriented.

¹² ERIC is a special organisational template invented to allow ESFRI research infrastructures to become European legal entities.

The ESFRI organisation and the EC are still actively starting new research infrastructures. To arrive at an optimized eco-system of information infrastructures, all ESFRI projects and others (such as the Human Brain Project) are seeking collaborations with e-Infrastructures such as PRACE [39] and EUDAT to make use of the advanced services that they offer.

6.1 EUDAT

EUDAT is a federation of well-resourced and partly national data and compute centres in various countries, as figure 6 indicates. Within its first three years EUDAT invested all efforts in developing 5 basic services in collaboration with initially 5 communities¹³ (climate modelling, earth plate observation, human physiology, biodiversity, and language resources and technology). B2SHARE, B2DROP and B2FIND are services directed to end users and meant for dealing with long tail type data. B2SAFE is a service that allows replicating large data sets between a community centre and the EUDAT centre network. The B2STAGE service is meant to move data sets from the EUDAT store to the workspaces of powerful computers of different types (HPC, etc.) to carry out computations and to return the results. All data in EUDAT are registered, i.e. all digital objects have PIDs and are associated with metadata to make them findable and accessible.

¹³ Currently EUDAT is closely interacting with 32 communities.

Fig. 6: The federation of centres across Europe that is the basis of EUDAT’s e-Infrastructure and the 5 basic user services it offers to the research community. In addition to the 5 user services it established system services such as an authentication and authorisation infrastructure and a service to register and resolve persistent identifiers.

It should be added here that federating data centres and their collections was and is a major challenge and is currently not scalable. The reason for this can be found mainly in the data organisations, where each centre has chosen a different solution. This lack of interoperability, which leads to enormous costs, is one of the reasons why EUDAT is very much interested in harmonised solutions being worked out by RDA, for example in the DFT group. Due to this close interest EUDAT declared that it will try out RDA outputs where possible and thus act as an RDA testbed in Europe.

EUDAT just received its second funding grant for 3 years, which needs to be used to stabilize and improve the services being offered, work out a sustainable funding model and look for collaborations with other European e-Infrastructures such as PRACE. This led to an additional work item which is devoted to improving the exchange of data between EUDAT and PRACE and demonstrating this as an efficient service with the help of concrete data and compute bound projects. Future challenges are anticipated by also strengthening the work on executing automatic workflows. It is understood that data science needs to turn increasingly often to automatic and self-documenting workflows to make its results reproducible. Yet the challenges of letting users quickly deploy and execute complex software close to where the data is stored, i.e. of operating in a distributed environment, are huge and severe barriers need to be removed. But EUDAT needs to demonstrate that it finally can offer services similar to Amazon and other companies where users can execute their software in a virtual machine environment and basically pay for the cycles used.

In the coming period EUDAT will also be faced with a new initiative and request of the European Commission in the realm of Open Science and Innovation [40], called the European Open Science Cloud. The EC wants to have a “cloud service” for all European researchers without having defined its exact specifications yet. A high level expert group is being formed that will work out the requirements. According to EC experts the term “cloud service” is meant in the broad sense, i.e. it needs to include the necessary structures for persistent identifiers, metadata, relations, etc.

6.2 National Data Service (NDS)

Also in the USA an attempt is being made under the lead of NCSA [41] to set up a National Data Service (NDS) [42] and to offer cross-disciplinary data services similar to those of EUDAT in Europe and ANDS [43] in Australia. The NDS is an emerging vision for how scientists and researchers across all disciplines can find, reuse, and publish data. It wants to build on the data archiving and sharing efforts already underway within specific communities and to link them together with a common set of tools.

Currently the NDS is focusing on collaborations with some communities to find out what kind of services they are expecting. Yet the stakeholders are still discussing which concept will be the best to address the imminent challenges posed by the data deluge and the need to optimize data sharing and re-use in the USA. Recently the leading persons in RDA US agreed to ask NDS to act as the national testbed center for RDA results.

7. National Level Pillars

Also at the national level in Europe new organisational structures are being tested and established to meet the challenges of data intensive sciences.

7.1 Max Planck Society

In the Max Planck Society an IT Strategy Committee was founded a few years ago to advise on how to reshape the IT service structure in the organisation to maintain the competitiveness of its research. With the introduction of parallel computers many years ago the Computer Centre in Garching got the task to provide not only high performance compute capacity but also expertise in parallelising relevant domain specific software codes for simulation and analytics. In collaboration with domain experts such code was optimised, allowing optimal use of HPC architectures. The optimal solution for such code parallelization was thus found by bringing together expertise and resources of each institute with central expertise and resources such as storage capacity and compute power. The strategy committee realized that the huge increase of data and the challenges of data intensive sciences require a new approach in so far as it makes sense to also provide central expertise and facilities in data management, curation and analytics.

As a consequence, the centre in Garching got a new name (Max Planck Computing and Data Facility, MPCDF) to indicate the change in focus, and was extended with data experts having expertise in mathematics and algorithms in typical data analytics applications which are widely discipline unspecific. The idea is to carry out collaborations between the centre and the various institutes and their departments that cannot invest in the specific knowledge required and that do not have the local resources to store and manage all data and to carry out the required computations.

We will use the NoMaD (Novel Materials Discovery) Repository project [44], which has been selected as one of the European Centres of Excellence projects, as an example for the typical collaboration between a leading research institute in the MPS and its MPCDF centre.

Theoretical material scientists worldwide are doing experiments with a number of well-known chemical software packages (some at petascale performance) to compute possible characteristics for materials. These simulations are typically run on HPC machines after having carried out deep optimization of the software code tuned to certain architectures. Until now the resulting data has been used to write scientific papers, but was not considered valuable as such. This attitude is changing because, as in other research disciplines, the researchers see a value in re-using data in different contexts, in allowing others to do new kinds of computations and in preventing duplication of work. The repository is meant to be a centre for storing results of simulation runs, identified by DOIs and described by proper metadata. Thus proper data organization and stewardship is the basis of the work.

Fig. 7: The intentions of the Novel Materials Discovery (NoMaD) project to federate and aggregate all data stemming from material experiments to enable easy access and re-use.

In collaboration with the researchers of the Fritz-Haber Institute the MPCDF experts are developing software to transform the incoming data to a normalized and compressed format, and are developing the repository software, the user upload, access and search interfaces, and the needed data management tools. In addition, novel analytic tools are being developed in collaboration between the involved centres to allow graphical searches, to carry out machine-learning based comparisons on data sets, to do smart visualizations supporting voyaging methods, etc. Typically all these operations on the aggregated data will be executed by making use of “trivial” parallelization techniques such as those enabled by Map-Reduce methods on appropriate Hadoop clusters, i.e. the repository will be hosted at MPCDF and the computations will be carried out on computers offered by MPCDF.

7.2 Approaches in NL

Also in countries such as the Netherlands new strategies are being tested. In addition to strengthening domain specific centres of different types, new centres have been established to structure the data landscape. DANS [45] and 3TU [46] have received the task to specialise in data management and curation. They should make use of the data services of the national data and compute centre SARA [47]. In addition the eScience Centre [48] has been established to run collaborative projects in which discipline experts work with experts from a central, shared pool to meet the challenges of data intensive science. All these national service providers are requested to synchronise their activities to arrive at an efficiently organised eco-system of infrastructure pillars and services.

8. Conclusions

Data Intensive Science (DIS) is one facet of the digital change which we are currently experiencing and which will change not only science but also societies substantially. A DIS that is open to many, so that its full innovative power can be exploited, and not exclusive to a few will depend on a change of culture towards open data and accessibility of services.

In the European Union and its member states community-driven research infrastructures and e-Infrastructures tackling common cross-disciplinary challenges have been started to address the needs for an efficient eco-system of services enabling data intensive work. The US did not make this distinction, but under the term “cyberinfrastructure” both community-driven and more commons-driven projects were initiated as well.

After almost a decade of experience in infrastructure building it is obvious that there are still many social and technical barriers prohibiting efficient and cost-effective data usage and reproducible results. In fact one can argue that only the active infrastructure building made many of the barriers visible to all stakeholders. It took about 15 years from the invention of TCP/IP to its broad uptake for efficient communication between compute nodes. Several data scientists and infrastructure builders, mainly from Europe, the US and Australia, agreed that it is time to accelerate the process of overcoming the many barriers to efficient data usage, since waiting another decade to overcome the most severe barriers is not acceptable. Setting up the RDA based on principles similar to those of the IETF (bottom-up, rough consensus, running code, lean governance) was the preferred choice of the data experts and this choice was supported by the funding organizations.

With this background in mind it is not surprising that almost all strong European infrastructure centres are very active in EUDAT as well as in RDA and that for example also ANDS and NDS engage actively in RDA. The Max Planck Computing and Data Facility for example will coordinate RDA Europe from September 2015, and its members are in the Technical Advisory Board, are co-chairing the Data Foundation and Terminology and Data Fabric Interest Groups, and are leading a Work Package in EUDAT. SARA and DANS, for example, are also leading activities in EUDAT and are actively engaged in RDA groups. NDS is co-chairing for example the Data Fabric Interest Group and ANDS is represented in the Council and Technical Advisory Board of RDA.

In addition to accelerating the finding of global agreements to improve data sharing and re-use, and thus to enable inclusive data intensive science, two main reasons can be mentioned for the engagement: a) engaging an organisation's experts in cutting-edge developments will make them fit for the coming challenges and b) bringing in their expertise will influence decision making. So far RDA is too young to present final conclusions on whether the expectations were met.

We need to accept that the data landscape is changing rapidly and that new structures that have been set up to facilitate data intensive sciences are often still in a test phase. Essential questions in the data domain are not yet fully answered, such as: Which persistent structures need to be funded in addition to libraries, which often do not yet have the skills to participate in the emerging data services domain? What is the optimal division between discipline specific and common services? What is the best way to share specialised and expensive data experts, who are scarce? Which are the common components that need to be specified to arrive at global, interoperable and well-maintained services supporting data intensive sciences optimally?

The EU and several of its member states as well as the US decided to take an active role to exploit the possibilities by taking concrete actions and by asking data science experts to develop and test out bottom-up driven models.

9. References

[1] MPI for Psycholinguistics, http://www.mpi.nl
[2] Tony Hey et al., The Fourth Paradigm - Data Intensive Scientific Discovery, 2009, http://research.microsoft.com/en-us/collaboration/fourthparadigm/4th_paradigm_book_complete_lr.pdf
[3] John Wood et al., Riding the Wave Report, 2012, http://cordis.europa.eu/fp7/ict/e-infrastructure/docs/hlg-sdi-report.pdf
[4] Human Brain Project, https://www.humanbrainproject.eu/
[5] Herman Stehouwer, Peter Wittenburg, RDA Data Practice Report, 2014, http://europe.rd-alliance.org/sites/default/files/RDA-Europe-D2.5-Second-Year-Report-RDA-Europe-Forum-Analysis-Programme.pdf
[6] ESFRI Roadmap 2006, http://ec.europa.eu/research/infrastructures/index_en.cfm?pg=esfri-roadmap
[7] Open Access, http://en.wikipedia.org/wiki/Open_access
[8] http://research.microsoft.com/en-us/um/redmond/events/fs2010/presentations/michener_environ_data_mgmt_rfs_71210.pdf
[9] Data Seal of Approval, http://datasealofapproval.org/en/
[10] TCP/IP Protocol, http://en.wikipedia.org/wiki/Internet_protocol_suite
[11] Herman Stehouwer, Peter Wittenburg, Principles for Data Sharing and Re-use: are they all the same?, 2015, http://hdl.handle.net/11304/1aab3df4-f3ce-11e4-ac7e-860aa0063d1f
[12] Peter Wittenburg, Leif Laaksonen, Herman Stehouwer, Raphael Ritz, Living with Data Management Plans, 2015, http://hdl.handle.net/11304/ea286e5a-f3d1-11e4-ac7e-860aa0063d1f
[13] ICRI 2012 Conference Copenhagen, http://www.icri2012.dk/www.ereg.me/ehome/index06e1.html
[14] Research Data Alliance, http://rd-alliance.org
[15] Naoyuki Tsunematsu, RDA plenary Keynote, San Diego, 2015, https://rd-alliance.org/keynote-naoyuki-tsunematsu.html
[16] Daniel Castro, Travis Korte, Open Data in the G8, 2015, http://www2.datainnovation.org/2015-open-data-g8.pdf
[17] Research Data Alliance - Europe, http://europe.rd-alliance.org
[18] RDA Case Statements, https://rd-alliance.org/working-and-interest-groups/case-statements.html
[19] ESFRI Projects, http://ec.europa.eu/research/infrastructures/index_en.cfm?pg=esfri
[20] EU e-Infrastructures, http://cordis.europa.eu/fp7/ict/e-infrastructure/
[21] EUDAT e-Infrastructure, http://www.eudat.eu
[22] Bernard Schutz et al., RDA Europe Science Workshop Report, 2014, http://europe.rd-alliance.org/documents/publications-reports/rda-europe-science-workshop-report
[23] John Wood et al., The Data Harvest, 2014, https://europe.rd-alliance.org/documents/publications-reports/data-harvest-how-sharing-research-data-can-yield-knowledge-jobs-and
[24] RDA Data Foundation and Terminology WG, https://rd-alliance.org/groups/data-foundation-and-terminology-wg.html
[25] RDA Data Type Registry WG, https://rd-alliance.org/groups/data-type-registries-wg.html
[26] RDA PID Information Type WG, https://rd-alliance.org/groups/pid-information-types-wg.html
[27] RDA Practical Policy WG, https://rd-alliance.org/groups/practical-policy-wg.html
[28] RDA Metadata Standards Directory WG, https://rd-alliance.org/groups/metadata-standards-directory-working-group.html
[29] RDA Data Citation WG, https://rd-alliance.org/groups/data-citation-wg.html
[30] RDA Repository Audit and Certification WG, https://rd-alliance.org/groups/repository-audit-and-certification-dsa%E2%80%93wds-partnership-wg.html
[31] Data Seal of Approval, http://datasealofapproval.org/en/
[32] World Data System, https://www.icsu-wds.org/
[33] RDA Adoption Day, San Diego, 2015, https://www.rd-alliance.org/plenary-meetings/fifth-plenary/programme/adoption-day.html
[34] RDA Data Fabric IG, https://www.rd-alliance.org/group/data-fabric-ig.html
[35] Bridget Almas et al., Data Management Trends, Principles and Components - What Needs to be Done Next?, 2015, http://hdl.handle.net/11304/33430f2e-f598-11e4-ac7e-860aa0063d1f
[36] ESFRI Roadmap 2006, http://ec.europa.eu/research/infrastructures/index_en.cfm?pg=esfri-roadmap&section=roadmap-2006
[37] CLARIN Research Infrastructure, http://www.clarin.eu/
[38] CLARIN Virtual Language Observatory, http://clarin.eu/content/virtual-language-observatory
[39] PRACE e-Infrastructure, http://www.prace-ri.eu/
[40] EC Open Science and Innovation, http://ec.europa.eu/research/conferences/2015/era-of-innovation/index.cfm
[41] National Center for Supercomputing Applications, http://www.ncsa.illinois.edu/
[42] National Data Service, http://www.nationaldataservice.org/
[43] Australian National Data Service, http://www.ands.org.au/
[44] NoMaD - Novel Materials Discovery Project, http://nomad-repository.eu/cms/
[45] Data Archiving and Networked Services, http://www.dans.knaw.nl/nl
[46] 3TU Data Centrum, http://datacentrum.3tu.nl/home/
[47] SURF Sara, https://surfsara.nl/
[48] Netherlands eScience Center, https://www.esciencecenter.nl/