=Paper= {{Paper |id=Vol-2066/cse2018paper03 |storemode=property |title=Designing a Generic Research Data Infrastructure Architecture with Continuous Software Engineering |pdfUrl=https://ceur-ws.org/Vol-2066/cse2018paper03.pdf |volume=Vol-2066 |authors=Nelson Tavares de Sousa,Wilhelm Hasselbring,Tobias Weber,Dieter Kranzlmüller |dblpUrl=https://dblp.org/rec/conf/se/SousaHWK18 }} ==Designing a Generic Research Data Infrastructure Architecture with Continuous Software Engineering== https://ceur-ws.org/Vol-2066/cse2018paper03.pdf
 Designing a Generic Research Data Infrastructure
Architecture with Continuous Software Engineering
            Nelson Tavares de Sousa, Wilhelm Hasselbring                       Tobias Weber, Dieter Kranzlmüller
                        Software Engineering Group                               Leibniz Supercomputing Centre
                              Kiel University                            Bavarian Academy of Sciences and Humanities
                              Kiel, Germany                                           Garching, Germany
              {tavaresdesousa, hasselbring}@email.uni-kiel.de                   {weber, kranzlmueller}@lrz.de



Abstract—Long-living software systems undergo continuous development, including adaptations due to changing requirements or the addition of new features. This is an even greater challenge if neither all users nor all requirements are known in the initial design phase. In such a context, complex restructuring activities become much more likely if these challenges are not taken into account from the beginning. We introduce a combination of the concepts of domain-driven design and self-contained systems to meet these challenges within the system’s architecture design.

We show the merits of this approach by designing an architecture for a generic research data infrastructure, a use case in which the mentioned challenges can be found. Embedding this approach within continuous software engineering allows changes to be implemented and integrated continuously, without neglecting other crucial properties such as maintainability and scalability.

Index Terms—Microservice, Self-Contained System, System-oriented Architecture, Continuous Software Engineering, Research Data Management

I. INTRODUCTION

Software systems with a heterogeneous set of stakeholders experience various challenges within their requirements engineering. The extent to which stakeholders are unknown is a further factor that impedes a complete initial system specification. Long-living systems may also face a changing set of stakeholders and therefore shifting requirements. In continuous software engineering, this needs to be considered throughout all stages of a system’s life span. For instance, beginning with the system specification, the design needs to be able to compensate for changing requirements at any given point in time. Continuous integration and deployment need to be implemented in such a way that new or changed requirements can be integrated into the running infrastructure.

In the following, we will introduce an approach to meet these challenges. This approach is validated in a reference implementation for the project Generic Research Data Infrastructure (GeRDI). By abstracting and extrapolating the requirements from a limited set of stakeholders, we are able to extract different domains regarding the feature set of GeRDI, where in turn each major feature is implemented as a distinct self-contained system (SCS)¹. This allows us to be adaptable regarding the requirements set and also to benefit from different properties of both concepts, domain-driven design and self-contained systems. Additionally, through loose coupling we are able to integrate already existing software and services such as high-performance computing or cloud computing and storage. Furthermore, changes remain within affected self-contained systems and will not propagate to other self-contained systems.

¹ http://scs-architecture.org/

In Section II we introduce the application domain for our reference implementation. Our approach to meet the mentioned challenges is presented in Section III with the resulting architecture. Section IV shows an implementation of one use case. Deployment and operation requirements derived from this approach are discussed in Section V. Section VI presents our conclusions and provides an outlook to future work.

II. APPLICATION DOMAIN

Data-intensive research requires appropriate management of the research data. However, present solutions for data storage often lead to inaccessible data silos, instead of providing research data in a findable, accessible, interoperable and reusable (FAIR) way [1]. Various initiatives in this field aim to reduce the barriers researchers face in establishing efficient management and processing of their research data. This focus on making data accessible and shareable reflects the key points of the European Open Science Cloud (EOSC) [2]. Apart from data, services which offer capabilities to process and analyze the data also need to be reusable and shareable with other researchers, not only to increase the impact of their research efforts but also to make the research process more efficient, transparent, and reproducible.

The project GeRDI aims to provide an infrastructure which fosters FAIR data practices and also supports researchers in their data-driven workflow [3]. This is realized by integrating different domain services seamlessly into one single infrastructure. The continuous involvement of nine research groups from different research domains in the development process allows us to determine specific workflows and their involved services to optimize the infrastructure for real-life use cases. As a reference, selected research cases of these research groups will be reimplemented and extended using the GeRDI infrastructure.

One research case is provided by the Environmental, Resource and Ecological Economics Group (EREE) of Kiel




CSE 2018: 3rd Workshop on Continuous Software Engineering @ SE18, Ulm, Germany                                                    85
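[Figure 1 content: eight self-contained systems are shown side by side between a Frontend Integration layer (top) and a Backend Integration layer (bottom). External Frontend and External Backend boxes sit at the left, above and below Archive. Layers per system, from top to bottom:]

  Archive:    Archive UI, Archive API, Archive DB
  Harvest:    Harvest API, Harvest DB (no UI)
  Search:     Search UI, Query/Index API, Search DB
  Bookmark:   Bookmark UI, Bookmark API, Bookmark DB
  Store:      Store UI, Store API, Storage
  Preprocess: Preprocess UI, Preprocess API, Preprocess Storage
  Analyze:    Analyze UI, Analyze API, Analyze Storage
  Publish:    Publish UI, Publish API, Publish Storage
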
                            Fig. 1. The GeRDI Self-Contained Systems Architecture (a.k.a. Microservices Architecture)
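
Each system in Figure 1 consists of a user interface (UI), a business logic layer (API) and a persistence layer (DB or Storage). The following Python sketch writes this layering down as a small data model. It is illustrative only: the class `SelfContainedSystem`, its fields, the `external` flag and the helper `frontend_integration` are assumptions made for this example and are not part of GeRDI; only the layer names are taken from the figure.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SelfContainedSystem:
    """One service domain with its three layers (hypothetical model)."""
    name: str
    ui: Optional[str]        # None if the domain has no user interface
    api: str
    persistence: str
    external: bool = False   # assumption: marks systems that are not re-implemented


# Layer names as labeled in Figure 1.
SYSTEMS = [
    SelfContainedSystem("Archive", "Archive UI", "Archive API", "Archive DB", external=True),
    SelfContainedSystem("Harvest", None, "Harvest API", "Harvest DB"),
    SelfContainedSystem("Search", "Search UI", "Query/Index API", "Search DB"),
    SelfContainedSystem("Bookmark", "Bookmark UI", "Bookmark API", "Bookmark DB"),
    SelfContainedSystem("Store", "Store UI", "Store API", "Storage"),
    SelfContainedSystem("Preprocess", "Preprocess UI", "Preprocess API", "Preprocess Storage"),
    SelfContainedSystem("Analyze", "Analyze UI", "Analyze API", "Analyze Storage"),
    SelfContainedSystem("Publish", "Publish UI", "Publish API", "Publish Storage"),
]


def frontend_integration(systems: List[SelfContainedSystem]) -> List[str]:
    """Return the UIs that a frontend integration layer would have to combine."""
    return [s.ui for s in systems if s.ui is not None and not s.external]


if __name__ == "__main__":
    for ui in frontend_integration(SYSTEMS):
        print(ui)
```

In this model, Harvest contributes no entry because it has no UI, and Archive is skipped because existing research data repositories keep their own frontend.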



University. In a report published by the WWF [4], different economic and fishery management scenarios were evaluated in order to derive future changes in fishery catch rates. This research case illustrates a possible scientific workflow which GeRDI aims to support. Data is collected from multiple data repositories, aggregated and filtered in a preprocess step, and then passed to a computation model for scientific analysis and prediction of fishery catch rates. The other research groups contribute further research cases that cover different research domains (such as digital humanities, hydrology, or socioeconomics), including different workflows and services.

III. ARCHITECTURE

Our goal is to design an architecture which allows us to react to changing requirements without major restructuring activities and which has a level of complexity that is as low as possible. To achieve this goal we rely on the strategy of domain-driven design (DDD) for the concept of our architecture. In DDD, complex systems are divided into bounded contexts in order to contain different domains within distinct components [5]. In our case, we derive different domains by clustering the required functionality of our research cases into sets of generic services. This is done by analyzing the workflows of all research cases and cutting them into different domains. As a result, we obtain a set of domains as shown in Figure 1. Each colored box depicts a service domain, enabling actions throughout the research data’s life span. The service domains are all implemented as self-contained systems:

Archive represents the data source, which in most research cases is a research data repository for long-term data archival. An interface between such a research data repository and our infrastructure is realized through Harvest, which collects the metadata, enriches it, and forwards it to a search index. With Search, a researcher can find relevant research data among all harvested, multidisciplinary research data repositories. A selection of relevant metadata for sharing search results or for further processing is then performed in Bookmark. After that, data is downloaded either to a local machine or a remote storage system (Store). The processing of data is divided into two stages. The first step is to normalize or pre-filter data in Preprocess. In Analyze, the actual analysis of the preprocessed data is performed to gain new scientific insights. The new data can then be uploaded to a research data repository (Publish). This closes a cycle (not included in Figure 1) as the uploaded data is again available in the research data repository.

The required functionality can be provided as a specific implementation within each service domain. Additionally, the implementations of all domains are able to communicate with each other through remote interfaces. As a result, we are able not only to reimplement our individual services, but also to implement and integrate required functionality in the future, by implementing it in compliance with the given interfaces.

As mentioned in Section I, for the implementation of such an architecture, we make use of SCS as an architectural style. In our architecture concept, we vertically decompose the system along the domains and are therefore able to use methods of DDD not only as a design concept but also for the implementation of our architecture. A self-contained system covers a certain functionality and implements it as a full stack with a user interface layer, business logic layer and persistence layer, which can be implemented as microservices [6]. Microservice architectures facilitate scalability [7], as well as agility and reliability [8]. Communication between SCS should be reduced to a minimum. In cases where communication is inevitable, it should occur through well-defined REST interfaces. Cross-cutting concerns regarding the implementation, such as an authentication and authorization infrastructure or system monitoring, are deployed within a backend integration layer.

A further layer for the frontend integration is also required,




  Archive:    Sea Around Us; FAO Stat; FAO FishStatJ; SSP; GIS Data
  Harvest:    Adapter to the Repositories
  Search:     LMEs; Catch Data; Price; Trade Data; GIS LME & Countries
  Bookmark:   By Price and Trade Data for Fish Commodities
  Store:      Download onto Laptop
  Preprocess: Union GIS; LME Catches
  Analyze:    Model Combination; Prediction
  Publish:    Back to a Repository, e.g. Pangaea

                                            Fig. 2. Research Case Mapping to the GeRDI Architecture
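
The mapping in Figure 2 can also be expressed as data, which makes it possible to check a research case against the generic services model before anything is implemented. The following Python sketch is a hedged illustration: the names `DOMAINS`, `EREE_MAPPING` and `check_mapping` are hypothetical, and flagging a domain without any mapped service is an assumption of this sketch, since a research case may not need every domain. The entries themselves are copied from the figure.

```python
from typing import Dict, List

# The eight service domains of the GeRDI architecture (see Figure 1).
DOMAINS = ("Archive", "Harvest", "Search", "Bookmark",
           "Store", "Preprocess", "Analyze", "Publish")

# Services and functions of the EREE research case, as listed in Figure 2.
EREE_MAPPING: Dict[str, List[str]] = {
    "Archive": ["Sea Around Us", "FAO Stat", "FAO FishStatJ", "SSP", "GIS Data"],
    "Harvest": ["Adapter to the Repositories"],
    "Search": ["LMEs", "Catch Data", "Price", "Trade Data", "GIS LME & Countries"],
    "Bookmark": ["By Price and Trade Data for Fish Commodities"],
    "Store": ["Download onto Laptop"],
    "Preprocess": ["Union GIS", "LME Catches"],
    "Analyze": ["Model Combination", "Prediction"],
    "Publish": ["Back to a Repository, e.g. Pangaea"],
}


def check_mapping(mapping: Dict[str, List[str]]) -> List[str]:
    """Return a list of problems; an empty list means the mapping is consistent."""
    problems = []
    # Every key must name one of the generic service domains.
    for domain in mapping:
        if domain not in DOMAINS:
            problems.append(f"unknown domain: {domain}")
    # Assumption of this sketch: report domains that have no service mapped.
    for domain in DOMAINS:
        if not mapping.get(domain):
            problems.append(f"no service mapped to: {domain}")
    return problems


if __name__ == "__main__":
    print(check_mapping(EREE_MAPPING))
```

A research case that introduces a service outside the eight domains would show up as an "unknown domain" entry, which signals that it has not yet been mapped to the generic services model.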



as each SCS implements its own user interface. The SCS approach is scalable regarding different aspects [7]. The architecture scales well regarding its functionality, as new functions can be continuously implemented and integrated as an SCS. This is enabled by the inherent loose coupling of SCS, which makes them interchangeable. It scales well with the number of developers, as each SCS can and should be developed by an individual team [8]. This enables a community-driven infrastructure, as external developer teams may contribute their functions continuously to the GeRDI infrastructure. Due to the option of instantiating multiple instances of one SCS and additional load balancing in the frontend integration layer, this approach also offers the potential to scale well with regard to performance.

Figure 1 also illustrates how we implement the reference architecture of GeRDI. Each colored box depicts an SCS and therefore a domain of the complete system. White boxes within each SCS show the different layers of user interface (UI), business logic (API) and persistence (DB or Storage) within the SCS. Grey boxes at the top and bottom of the figure show both integration layers. As we do not re-implement research data repositories, their frontend and backend layers are not integrated into the GeRDI frontend. Additionally, implementations of Harvest do not require a UI, which is why the corresponding microservice is absent.

IV. USE CASE IMPLEMENTATION

For the (re)implementation of research cases, this architecture allows existing middleware software to be reused wherever possible. Well-established software can be integrated into the infrastructure if it can be mapped to one domain and if it complies with its interfaces. Therefore, to implement complete research workflows, all required services need to be mapped first to the generic services model. Afterwards, each service is implemented as an SCS and integrated into the infrastructure.

Figure 2 shows a mapping of required services and/or functions for the EREE research case mentioned in Section II. We see in Archive the different research data repositories which deliver the relevant data. These repositories already exist and will be integrated into GeRDI by connecting them with different implementations of Harvesters for each repository. Both Search and Bookmark list the attributes by which data can be searched and bookmarked. In our reference implementation for a search platform, we need to make sure we support the search and bookmarking for these attributes. Storage is provided by local computers in this use case. Therefore, the Store domain must support interfaces to download data to a local machine and to use the saved data for further steps. Preprocess and Analyze modify the original data. In a first step, data is aggregated by using geographic information system (GIS) data or catch rates of large marine ecosystems (LMEs). Then, by feeding the preprocessed data into a model, predictions for future scenarios regarding the catch rates can be made. Thus, both service domains rely on computation tasks. As a last step, the predicted data is again published to a research data repository, Pangaea² in this case.

² https://www.pangaea.de/

As already mentioned, we will reuse existing software wherever possible in our reference implementation. The implementations for Harvest are newly developed. Search makes use of Elasticsearch³ as a search platform. For storage capabilities, network file sharing systems, such as Samba⁴, can be used for this use case. By implementing a facade, this can be made GeRDI-compliant. The computation steps require resources which can be provided by a cloud provider or a compute center. To enable the computation of the Matlab model, we reuse Jupyter⁵ and integrate it in combination with its Matlab kernel into the infrastructure. For the publication of the newly generated data, we use Pubflow⁶, which provides functionality to upload research data to repositories.

³ https://www.elastic.co/products/elasticsearch
⁴ https://www.samba.org/
⁵ https://jupyter.org/
⁶ https://www.pubflow.uni-kiel.de/en

Different research cases may use different implementations. However, to benefit from a broader set of research data repositories, we encourage the use of the same implementations for the service domains Harvest, Search, and Bookmark. The reference implementations for these domains will be open and accessible through a GeRDI portal.

V. INTEGRATION AND DEPLOYMENT

In this section we will briefly describe the requirements of an operational setup to run software as exemplified in Section IV.

The microservice architecture described in Section III necessitates a container-ready system to mirror the encapsulation. A registry is needed to disseminate the built images to the deployment contexts. Tagging the images is another requirement, since it facilitates the selection of compatible versions of the different SCS and enables the description of a




release manifest (i.e. a list of image/version pairs including the external dependencies).

Once the code has passed the code reviews and all tests within the continuous integration (CI) process, the CI system builds the container image. This way the developers’ assumptions with regard to the deployment context (e.g. available libraries) are encapsulated, thoroughly tested and ready to ship. The CI system needs to support this workflow.

We identified three deployment contexts: testing, staging and production. The testing context needs full automation, i.e. continuous deployment, and is used by developers to test and discuss features. Staging and production contexts are less automated since the robustness requirements are higher. They can therefore be classified as continuous delivery systems (i.e. manual work is necessary to deploy). Staging is not only used to prepare a release to the production context, but also as a preview for the stakeholders to facilitate an agile development approach. Several computing centers should be able to provide the computational resources to run all or parts of the three deployment contexts. At the same time some operational aspects need centralized services, such as monitoring and logging facilities. Some parts of the infrastructure (such as the search index) might profit from running on the same site to reduce performance penalties through network traffic. As a result, the deployment infrastructure needs to support fully automated and semi-automated deployment workflows and allow for transparent integration of compute nodes, without losing the possibility to pin containers to specific nodes if necessary. In addition to that, scalability and availability requirements necessitate container orchestration abilities such as on-the-fly scaling, node draining, and rolling updates.

Since the deployment infrastructure is also developing over time, its setup needs to be documented and automated by a provisioning and configuration management system. The scripts and configuration for such a system are also part of the release process. Releases therefore consist of the source code, the container images, the setup scripts for the infrastructure and the release manifest. All release assets need to be available to the public (open source licenses).

The following setup meets the requirements described above and will be used for GeRDI:

  • Containerization: Docker⁷
  • Continuous Integration: Bamboo⁸
  • Continuous Deployment/Delivery and container orchestration: Kubernetes⁹
  • Provisioning and configuration management: Ansible¹⁰

⁷ https://www.docker.com
⁸ https://de.atlassian.com/software/bamboo
⁹ https://kubernetes.io
¹⁰ https://www.ansible.com

In a recent literature review only 6 out of 69 case studies were identified to discuss continuous practices in academic setups (cf. [9]). None of these describes the same requirements as pointed out in this section.

VI. CONCLUSION & OUTLOOK

We introduced an approach to handle an incomplete set of requirements through an appropriate architecture design. Our approach combines domain-driven design and self-contained systems to provide an infrastructure which can be used for the implementation of different and also as yet unknown functional requirements. With continuous software engineering we are able to continuously implement, deploy, and integrate functionality changes into a running system. The result is used for a generic research data infrastructure and allows existing and as yet unknown use cases to be (re)implemented. As an example, we depicted one use case and introduced its implementation with this architecture. Challenges regarding the operation of such a system were also discussed and an appropriate setup was presented.

The development of GeRDI is in an early stage and therefore prototypical. Evaluations are required to show the benefits of this architecture in real-world usage. This includes the implementation of different use cases of other research domains, which will show whether the stated claims regarding its adaptability hold.

Yet to be validated are other topics such as monitoring, which is required for useful system scaling and performance evaluation. The deployment and operation of an authentication and authorization infrastructure for such a system is an additional challenge of greater importance, due to a broader set of possible service providers.

ACKNOWLEDGEMENTS

This work was supported by the DFG (German Research Foundation) with the GeRDI project (Grants No. BO818/16-1 and HA2038/6-1).

REFERENCES

[1] M. D. Wilkinson, M. Dumontier, I. J. Aalbersberg, G. Appleton, M. Axton, A. Baak, N. Blomberg, J.-W. Boiten, L. B. da Silva Santos, P. E. Bourne et al., “The FAIR Guiding Principles for scientific data management and stewardship,” Scientific data, vol. 3, 2016.
[2] C. H. L. E. G. on the European Open Science Cloud, “Realising the European Open Science Cloud,” European Commission, Tech. Rep., 2013.
[3] R. Grunzke, T. Adolph, C. Biardzki, A. Bode, T. Borst, H.-J. Bungartz, A. Busch, A. Frank, C. Grimm, W. Hasselbring, A. Kazakova, A. Latif, F. Limani, M. Neumann, N. T. de Sousa, J. Tendel, I. Thomsen, K. Tochtermann, R. Müller-Pfefferkorn, and W. E. Nagel, “Challenges in Creating a Sustainable Generic Research Data Infrastructure,” Softwaretechnik-Trends, vol. 37, no. 2, 2017.
[4] M. Quaas, J. Hoffmann, K. Kamin, L. Kleemann, and K. Schacht, “Fishing for Proteins,” WWF, 2016.
[5] E. Evans, Domain-Driven Design: Tackling Complexity in the Heart of Software. Addison-Wesley Professional, 2004.
[6] J. Lewis and M. Fowler, “Microservices,” 2014, http://martinfowler.com/articles/microservices.html.
[7] W. Hasselbring, “Microservices for Scalability: Keynote Talk Abstract,” in Proceedings of the 7th ACM/SPEC on International Conference on Performance Engineering (ICPE 2016). New York, NY, USA: ACM, 2016, pp. 133–134.
[8] W. Hasselbring and G. Steinacker, “Microservice Architectures for Scalability, Agility and Reliability in E-Commerce,” in 2017 IEEE International Conference on Software Architecture Workshops (ICSAW). Gothenburg, Sweden: IEEE, Apr. 2017, pp. 243–246.
[9] M. Shahin, M. A. Babar, and L. Zhu, “Continuous Integration, Delivery and Deployment: A Systematic Review on Approaches, Tools, Challenges and Practices,” IEEE Access, vol. 5, pp. 3909–3943, 2017.



