=Paper=
{{Paper
|id=Vol-2066/cse2018paper03
|storemode=property
|title=Designing a Generic Research Data Infrastructure Architecture with Continuous Software Engineering
|pdfUrl=https://ceur-ws.org/Vol-2066/cse2018paper03.pdf
|volume=Vol-2066
|authors=Nelson Tavares de Sousa,Wilhelm Hasselbring,Tobias Weber,Dieter Kranzlmüller
|dblpUrl=https://dblp.org/rec/conf/se/SousaHWK18
}}
==Designing a Generic Research Data Infrastructure Architecture with Continuous Software Engineering==
Nelson Tavares de Sousa, Wilhelm Hasselbring: Software Engineering Group, Kiel University, Kiel, Germany, {tavaresdesousa, hasselbring}@email.uni-kiel.de

Tobias Weber, Dieter Kranzlmüller: Leibniz Supercomputing Centre, Bavarian Academy of Sciences and Humanities, Garching, Germany, {weber, kranzlmueller}@lrz.de

CSE 2018: 3rd Workshop on Continuous Software Engineering @ SE18, Ulm, Germany

Abstract: Long-living software systems undergo continuous development, including adaptations due to altering requirements or the addition of new features. This is an even greater challenge if neither all users nor all requirements are known at the initial design phase. In such a context, complex restructuring activities are much more probable if the challenges are not taken into account from the beginning. We introduce a combination of the concepts of domain-driven design and self-contained systems to meet these challenges within the system's architecture design. We show the merits of this approach by designing an architecture for a generic research data infrastructure, a use case where the mentioned challenges can be found. Embedding this approach within continuous software engineering allows changes to be implemented and integrated continuously, without neglecting other crucial properties such as maintainability and scalability.

Index Terms: Microservice, Self-Contained System, System-oriented Architecture, Continuous Software Engineering, Research Data Management

===I. Introduction===

Software systems with a heterogeneous set of stakeholders experience various challenges within their requirements engineering. The extent to which all stakeholders are unknown is a further factor which impedes an initial complete system specification. Long-living systems may also undergo a changing set of stakeholders and therefore shifting requirements. In continuous software engineering, this needs to be considered throughout all stages of a system's life span. For instance, beginning with the system specification, the design needs to be able to compensate for changing requirements at any given point in time. Continuous integration and deployment need to be implemented in a way that new or changed requirements can be integrated into the running infrastructure.

In the following, we will introduce an approach to meet these challenges. This approach is validated in a reference implementation for the project Generic Research Data Infrastructure (GeRDI). By abstracting and extrapolating the requirements from a limited set of stakeholders, we are able to extract different domains regarding the feature set of GeRDI, whereby in turn each major feature is implemented as a distinct self-contained system (SCS, see http://scs-architecture.org/). This allows us to be adaptable regarding the requirements set and also to benefit from different properties of both concepts, domain-driven design and self-contained systems. Additionally, through loose coupling we are able to integrate already existing software and services such as high-performance computing or cloud computing and storage. Furthermore, changes remain within affected self-contained systems and will not propagate to other self-contained systems.

In Section II we introduce the application domain for our reference implementation. Our approach to meet the mentioned challenges is presented in Section III with the resulting architecture. Section IV shows an implementation of one use case. Deployment and operation requirements derived from this approach will be discussed in Section V. Section VI presents our conclusions and provides an outlook to future work.

===II. Application Domain===

Data-intensive research requires appropriate management of the research data. However, present solutions for data storage often lead to inaccessible data silos, instead of providing research data in a findable, accessible, interoperable and reusable (FAIR) way [1]. Various initiatives in this field aim to reduce the barriers researchers face in establishing efficient management and processing of their research data. This focus on making data accessible and shareable reflects the key points of the European Open Science Cloud (EOSC) [2]. Apart from data, services which offer capabilities to process and analyze the data also need to be reusable and shareable with other researchers, not only to increase the impact of their research efforts, but also to make the research process more efficient, transparent, and reproducible.

The project GeRDI aims to provide an infrastructure which fosters FAIR data practices and also supports researchers in their data-driven workflow [3]. This is realized by integrating different domain services seamlessly into one single infrastructure. The continuous involvement of nine research groups from different research domains in the development process allows us to determine specific workflows and their involved services, in order to optimize the infrastructure for real-life use cases. As a reference, selected research cases of these research groups will be reimplemented and extended using the GeRDI infrastructure.

Fig. 1. The GeRDI Self-Contained Systems Architecture (a.k.a. Microservices Architecture). [Figure: the service domains Archive, Harvest, Search, Bookmark, Store, Preprocess, Analyze, and Publish, each shown with its UI, API, and DB or Storage layers, between a frontend integration layer and a backend integration layer.]

One research case is provided by the Environmental, Resource and Ecological Economics Group (EREE) of Kiel University.
In a report [4], published by the WWF, different economic and fishery management scenarios were evaluated in order to derive future changes in fishery catch rates. This research case illustrates a possible scientific workflow which GeRDI aims to support. Data is collected from multiple data repositories, aggregated and filtered in a preprocess step, and then passed to a computation model for scientific analysis and prediction of fishery catch rates. The other research groups contribute different research cases, to cover different research domains (such as digital humanities, hydrology, or socioeconomics) including different workflows and used services.

===III. Architecture===

Our goal is to design an architecture which allows us to react to changing requirements without major restructuring activities and which has a level of complexity that is as low as possible. To achieve this goal we rely on the strategy of domain-driven design (DDD) for the concept of our architecture. In DDD, complex systems are divided into bounded contexts in order to contain different domains within distinct components [5]. In our case, we derive different domains by clustering the required functionality of our research cases into sets of generic services. This is done by analyzing the workflows of all research cases and cutting them into different domains. As a result, we obtain a set of domains as shown in Figure 1. Each colored box depicts a service domain, enabling actions throughout the research data's life span. The service domains are all implemented as self-contained systems:

Archive depicts the data source, which in most research cases is a research data repository for long-term data archival. An interface between such a research data repository and our infrastructure is realized through Harvest, which collects the metadata, enriches it, and forwards it to a search index. With Search, a researcher can find relevant research data among all harvested, multidisciplinary research data repositories. A selection of relevant metadata for sharing search results or for further processing is then performed in Bookmark. After that, data is downloaded either to a local machine or a remote storage system (Store). The processing of data is divided into two stages. The first step is to normalize or pre-filter data in Preprocess. In Analyze, the actual analysis of the preprocessed data is performed to gain new scientific insights. The new data can then be uploaded to a research data repository (Publish). This closes a cycle (not included in Figure 1), as the uploaded data is again available in the research data repository.

The required functionality can be provided as a specific implementation within each service domain. Additionally, the implementations of all domains are able to communicate with each other through remote interfaces. As a result, we are able not only to reimplement our individual services, but also to implement and integrate required functionality in the future, by implementing it in compliance with the given interfaces.

As mentioned in Section I, for the implementation of such an architecture, we make use of SCS as an architectural style. In our architecture concept, we vertically decompose the system along the domains and are therefore able to use methods of DDD not only as a design concept but also for the implementation of our architecture. A self-contained system depicts a certain functionality and implements it as a full stack with a user interface layer, business logic layer and persistence layer, which can be implemented as microservices [6]. Microservice architectures facilitate scalability [7], as well as agility and reliability [8]. Communication between SCS should be reduced to a minimum level. In cases where communication is inevitable, it should occur through well-defined REST interfaces. Cross-cutting concerns regarding the implementation, such as an authentication and authorization infrastructure or system monitoring, are deployed within a backend integration layer.

A further layer for the frontend integration is also required, as each SCS implements its own user interface. The SCS approach is scalable regarding different aspects [7]. The architecture scales well regarding its functionality, as new functions can be continuously implemented and integrated as an SCS. This is enabled by the inherent loose coupling of SCS, which makes them interchangeable. It scales well with the number of developers, as each SCS can and should be developed by an individual team [8]. This enables a community-driven infrastructure, as external developer teams may contribute their functions continuously to the GeRDI infrastructure. Due to the option of instantiating multiple instances of one SCS, combined with load balancing in the frontend integration layer, this approach also has the potential to scale well with regard to performance.

Figure 1 also illustrates how we implement the reference architecture of GeRDI. Each colored box depicts an SCS and therefore a domain of the complete system. White boxes within each SCS show the different layers of user interface (UI), business logic (API) and persistence (DB or Storage) within the SCS. Grey boxes at the top and bottom of the figure show both integration layers. As we do not re-implement research data repositories, their frontend and backend layers are not integrated into the GeRDI frontend. Additionally, implementations of Harvest do not require a UI, which leads to the lack of the corresponding microservice.

===IV. Use Case Implementation===

For the (re)implementation of research cases, this architecture allows existing middleware software to be used wherever possible. Well-established software can be integrated into the infrastructure if it can be mapped to one domain and if it complies with its interfaces. Therefore, to implement complete research workflows, all required services need to be mapped first to the generic services model. Afterwards, each service is implemented as an SCS and integrated into the infrastructure.

Fig. 2. Research Case Mapping to the GeRDI Architecture. [Figure: the services and data sources of the EREE research case listed under the service domains Archive, Harvest, Search, Bookmark, Store, Preprocess, Analyze, and Publish.]

Figure 2 shows a mapping of required services and/or functions for the EREE research case mentioned in Section II. We see in Archive the different research data repositories which deliver the relevant data. These repositories already exist and will be integrated into GeRDI by connecting them with different implementations of Harvesters for each repository. Both Search and Bookmark depict certain attributes by which data can be searched and bookmarked. In our reference implementation for a search platform, we need to make sure we support search and bookmarking for these attributes. Storage is provided by local computers in this use case. Therefore, the Store domain must support interfaces to download data to a local machine and to use the saved data for further steps. Preprocess and Analyze modify the original data. In a first step, data is aggregated by using geographic information system (GIS) data or catch rates of large marine ecosystems (LMEs). Then, by feeding the preprocessed data into a model, predictions for future scenarios regarding the catch rates can be made. Thus, both service domains rely on computation tasks. As a last step, the predicted data is again published to a research data repository, Pangaea (https://www.pangaea.de/) in this case.

As already mentioned, we will reuse existing software wherever possible in our reference implementation. The implementations for Harvest are newly developed. Search makes use of Elasticsearch (https://www.elastic.co/products/elasticsearch) as a search platform. For storage capabilities, network file sharing systems, such as Samba (https://www.samba.org/), can be used for this use case. By implementing a facade, this can be made GeRDI compliant. The computation steps require resources which can be provided by a cloud provider or a compute center. To enable the computation of the Matlab model, we reuse Jupyter (https://jupyter.org/) and integrate it in combination with its Matlab kernel into the infrastructure. For the publication of the newly generated data, we use Pubflow (https://www.pubflow.uni-kiel.de/en), which provides functionality to upload research data to repositories.

Different research cases may use different implementations. However, to benefit from a broader set of research data repositories, we encourage the use of the same implementations for the service domains Harvest, Search, and Bookmark. The reference implementations for these domains will be open and accessible through a GeRDI portal.

===V. Integration and Deployment===

In this section we will briefly describe the requirements of an operational setup to run software as exemplified in Section IV.

The microservice architecture described in Section III necessitates a container-ready system to mirror the encapsulation. A registry is needed to disseminate the built images to the deployment contexts. Tagging the images is another requirement, since it facilitates the selection of compatible versions of the different SCS and enables the description of a release manifest (i.e. a list of image/version pairs including the external dependencies).

After passing the code reviews and all tests within the continuous integration (CI) process, the CI system builds the container image. This way the developers' assumptions with regard to the deployment context (e.g. available libraries) are encapsulated, thoroughly tested and ready to ship. The CI system needs to support this workflow.

We identified three deployment contexts: testing, staging and production. The testing context needs full automation, i.e. continuous deployment, and is used by developers to test and discuss features. Staging and production contexts are less automated since the robustness requirements are higher. They can therefore be classified as continuous delivery systems (i.e. manual work is necessary to deploy). Staging is not only used to prepare a release to the production context, but also as a preview for the stakeholders to facilitate an agile development approach. Several computing centers should be able to provide the computational resources to run all or parts of the three deployment contexts. At the same time some operational aspects need centralized services, such as monitoring and logging facilities. Some parts of the infrastructure (such as the search index) might profit from running on the same site to reduce performance penalties through network traffic. As a result, the deployment infrastructure needs to support fully automated and semi-automated deployment workflows and allow for transparent integration of compute nodes, without losing the possibility to pin containers to specific nodes if necessary. In addition to that, scalability and availability requirements necessitate container orchestration abilities such as on-the-fly scaling, node draining, and rolling updates.

Since the deployment infrastructure is also developing over time, its setup needs to be documented and automated by a provisioning and configuration management system. The scripts and configuration for such a system are also part of the release process. Releases therefore consist of the source code, the container images, the setup scripts for the infrastructure and the release manifest. All release assets need to be available to the public (open source licenses).

The following setup meets the requirements described above and will be used for GeRDI:

* Containerization: Docker (https://www.docker.com)
* Continuous Integration: Bamboo (https://de.atlassian.com/software/bamboo)
* Continuous Deployment/Delivery and container orchestration: Kubernetes (https://kubernetes.io)
* Provisioning and configuration management: Ansible (https://www.ansible.com)

In a recent literature review, only 6 out of 69 case studies were identified as discussing continuous practices in academic setups (cf. [9]). None of these describe the same requirements as pointed out in this section.

===VI. Conclusion & Outlook===

We introduced an approach to handle an incomplete set of requirements through an appropriate architecture design. Our approach combines domain-driven design and self-contained systems to provide an infrastructure which can be used for the implementation of different and also unknown functional requirements. With continuous software engineering we are able to continuously implement, deploy, and integrate functionality changes into a running system. The result is used for a generic research data infrastructure and allows existing and yet unknown use cases to be (re)implemented. As an example, we depicted one use case and introduced its implementation with this architecture. Challenges regarding the operation of such a system were also discussed and an appropriate setup was presented.

The development of GeRDI is in an early stage and therefore prototypical. Evaluations are required to show the benefits of this architecture in real-world usage. This includes the implementation of different use cases from other research domains, which will show whether the stated claims regarding its adaptability hold.

Yet to be validated are other topics such as monitoring, which is required for useful system scaling and performance evaluation. The deployment and operation of an authentication and authorization infrastructure for such a system is an additional challenge of greater importance, due to a broader set of possible service providers.

===Acknowledgements===

This work was supported by the DFG (German Research Foundation) with the GeRDI project (Grants No. BO818/16-1 and HA2038/6-1).

===References===

[1] M. D. Wilkinson, M. Dumontier, I. J. Aalbersberg, G. Appleton, M. Axton, A. Baak, N. Blomberg, J.-W. Boiten, L. B. da Silva Santos, P. E. Bourne et al., "The FAIR Guiding Principles for scientific data management and stewardship," Scientific Data, vol. 3, 2016.

[2] Commission High Level Expert Group on the European Open Science Cloud, "Realising the European Open Science Cloud," European Commission, Tech. Rep., 2013.

[3] R. Grunzke, T. Adolph, C. Biardzki, A. Bode, T. Borst, H.-J. Bungartz, A. Busch, A. Frank, C. Grimm, W. Hasselbring, A. Kazakova, A. Latif, F. Limani, M. Neumann, N. T. de Sousa, J. Tendel, I. Thomsen, K. Tochtermann, R. Müller-Pfefferkorn, and W. E. Nagel, "Challenges in Creating a Sustainable Generic Research Data Infrastructure," Softwaretechnik-Trends, vol. 37, no. 2, 2017.

[4] M. Quaas, J. Hoffmann, K. Kamin, L. Kleemann, and K. Schacht, "Fishing for Proteins," WWF, 2016.

[5] E. Evans, Domain-Driven Design: Tackling Complexity in the Heart of Software. Addison-Wesley Professional, 2004.

[6] J. Lewis and M. Fowler, "Microservices," 2014, http://martinfowler.com/articles/microservices.html.

[7] W. Hasselbring, "Microservices for Scalability: Keynote Talk Abstract," in Proceedings of the 7th ACM/SPEC on International Conference on Performance Engineering (ICPE 2016). New York, NY, USA: ACM, 2016, pp. 133–134.

[8] W. Hasselbring and G. Steinacker, "Microservice Architectures for Scalability, Agility and Reliability in E-Commerce," in 2017 IEEE International Conference on Software Architecture Workshops (ICSAW). Gothenburg, Sweden: IEEE, Apr. 2017, pp. 243–246.

[9] M. Shahin, M. A. Babar, and L. Zhu, "Continuous Integration, Delivery and Deployment: A Systematic Review on Approaches, Tools, Challenges and Practices," IEEE Access, vol. 5, pp. 3909–3943, 2017.
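
Two requirements from Section V, the release manifest (a list of image/version pairs including external dependencies) and the split between a fully automated testing context and manually gated staging and production contexts, can be illustrated with a small sketch. This is not GeRDI code: the class and function names, the registry host, the image names and all version numbers below are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class ImageRef:
    """One image/version pair of the release manifest (names are hypothetical)."""
    name: str
    version: str

    def reference(self, registry: str) -> str:
        # Tagged image reference as a container registry would expect it.
        return f"{registry}/{self.name}:{self.version}"


@dataclass
class ReleaseManifest:
    """Compatible SCS image versions plus external dependencies of one release."""
    release: str
    images: List[ImageRef]
    external: List[ImageRef] = field(default_factory=list)

    def image_references(self, registry: str) -> List[str]:
        return [image.reference(registry) for image in self.images]


# Testing is continuously deployed; staging and production are continuous
# delivery contexts, i.e. a manual approval step is required before deploying.
AUTOMATED_CONTEXTS = {"testing"}
MANUAL_CONTEXTS = {"staging", "production"}


def may_deploy(context: str, approved: bool = False) -> bool:
    """Return True if a release may be rolled out to the given context."""
    if context in AUTOMATED_CONTEXTS:
        return True
    if context in MANUAL_CONTEXTS:
        return approved
    raise ValueError(f"unknown deployment context: {context}")


if __name__ == "__main__":
    manifest = ReleaseManifest(
        release="0.1.0",  # hypothetical release number
        images=[ImageRef("harvest", "0.1.0"), ImageRef("search", "0.1.0")],
        external=[ImageRef("elasticsearch", "5.6")],  # hypothetical version
    )
    for ref in manifest.image_references("registry.example.org"):
        print(ref)
    print(may_deploy("testing"), may_deploy("production"))
```

Keeping the manifest as plain data means the same list of tagged images can be handed to whichever orchestration tool performs the rollout, while the approval flag models the manual step that distinguishes continuous delivery from continuous deployment.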