Towards Evolution Capabilities in Data Pipelines

Kevin Kramer
University of Hagen, Universitätsstr. 1, 58097 Hagen, Germany
kevin.kramer@fernuni-hagen.de (K. Kramer)
https://www.fernuni-hagen.de/dbis/team/kevin.kramer.shtml (K. Kramer)

34th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), June 7-9, 2023, Hirsau, Germany
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

Abstract
Evolutionary change over time in the context of data pipelines is certain, especially with regard to the structure and semantics of data as well as to the pipeline operators. Dealing with these changes, i.e. providing long-term maintenance, is costly. The present work explores the need for evolution capabilities within pipeline frameworks. In this context, dealing with evolution is defined as a two-step process consisting of self-awareness and self-adaption. Furthermore, a conceptual requirements model is provided, which encompasses criteria for self-awareness and self-adaption and covers the dimensions data, operator, pipeline and environment. A lack of said capabilities in existing frameworks exposes a major gap. Filling this gap will be a significant contribution for practitioners and scientists alike. The present work envisions and lays the foundation for a framework which can handle evolutionary change.

Keywords
data pipeline, data evolution, operator evolution, data pipeline framework

1. Introduction

The last decade was characterized by ever increasing amounts of data. This also led to new technical demands in the context of data storage, transfer and analysis. In order to cope with these demands, complex new systems emerged, which in turn require maintenance. Providing this maintenance is costly, and even though the systems themselves might run as expected, changes over time, e.g. to the structure and semantics of data, inevitably induce a need to adjust the system's configuration to restore functionality. One estimate suggests that 50-70% of the total cost of a long running software system can be attributed to maintenance [1]. Data pipelines are an intuitive way to structure end-to-end data processing. The corresponding tools and frameworks are used in a wide field of domains and for an extensive amount of diverse applications. Still, they also need costly maintenance whenever change, i.e. evolution, happens. Adding evolution capabilities to data pipelines, and thereby reducing maintenance cost and human involvement, could be a big contribution for scientists and practitioners alike. The current work takes the first step in this direction by collecting requirements needed for such a system and by envisioning a data pipeline framework which fulfills these requirements.

The following sections are structured as follows. Section 2 describes the general concepts and challenges of evolution in data pipelines. Important terminology is defined and related work is shown in this section as well. In Section 3 a pipeline framework with evolution capabilities is envisioned and discussed. A conceptual requirements model, which focuses on these evolution capabilities, is presented in Section 4. Finally, the last section concludes the paper and outlines a roadmap for the community towards a pipeline framework with evolution capabilities.

2. Evolution in Data Pipelines

This section provides the basis for the current work by defining important concepts as well as presenting related work. Firstly, data pipelines and their components are introduced. Secondly, data pipeline frameworks including their benefits are showcased. Finally, evolution in the context of data pipelines is defined.

2.1. Data Pipelines

Data pipelines are used for a plethora of applications and domains such as bioinformatics [2, 3], manufacturing [4] and cybersecurity [5]. Broadly speaking, a data pipeline consists of three components: data source(s), operator(s) and data sink(s). Figure 1 (a) shows such a basic pipeline. Biswas et al. empirically studied the components and stages of 71 data science (DS) pipelines [6].
Their findings suggest that DS pipelines consist of a pre-processing phase, a model building phase and a post-processing phase. They further extracted tasks and sub-tasks associated with these phases. Sub-tasks are atomic operators in the context of a pipeline. The pre-processing phase consists of the tasks data acquisition, data preparation and storage, which represent the typical components of data engineering, and also includes the data source(s). The model building phase is comprised of the tasks feature engineering, modeling, training, evaluation as well as prediction. These tasks correspond to basic machine learning (ML) and data mining (DM) functions. The tasks included in the post-processing layer are interpretation, communication and deployment as well as all data sinks. The empirical results show that the pre-processing and the model building phases appeared in 96% of the examined DS pipelines, while the post-processing phase appeared in only 52% of the pipelines.

Figure 1: (a) A basic data processing pipeline consisting of a data source, operators and a data sink. (b) Self-awareness: the system perceives a disruption at the data source level. This could be the structural or semantic change of incoming data. (c) Self-adaption: the system automatically adapts to the perceived disruption by swapping the first operator for a different one.

Pipelines can be linear, i.e. one data source, a chain of operators and finally one data sink. Psallidas et al. empirically studied 8M Jupyter notebooks (https://www.jupyter.org/) from GitHub (https://www.github.com/) [7]. Their results, which were produced by mining and analyzing the abstract syntax trees of all notebooks, suggest that 80% of the pipelines are linear. The structure of pipelines can be interpreted as a directed acyclic graph (DAG), allowing for pipelines which can include several data sources and sinks as well as branching operators, i.e. operators which have more than one input or output. A widespread example of such non-linear data processing are extract-transform-load (ETL) pipelines. They are used to extract data from multiple heterogeneous sources, transform them to use a common schema and then load them into a data sink such as a data warehouse (which may become a data source itself in the following steps) [8]. Even though pipelines can be created using only functions and modules by chaining their inputs and outputs together [7], pipeline frameworks allow users to generate, maintain and administrate complex pipelines.
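To make the idea of chaining functions and modules into a pipeline concrete, the following minimal Python sketch builds such a linear pipeline without any framework; the file names, column names and operator names are hypothetical and chosen purely for illustration.

# A minimal sketch (illustrative only): a linear pipeline is just a chain of
# functions from a data source to a data sink. It assumes a hypothetical CSV
# file "measurements.csv" with a "value" column.
import csv
import json


def extract(path: str) -> list[dict]:           # data source
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


def clean(rows: list[dict]) -> list[dict]:      # operator 1
    return [r for r in rows if r.get("value") not in (None, "")]


def enrich(rows: list[dict]) -> list[dict]:     # operator 2
    return [{**r, "value_squared": float(r["value"]) ** 2} for r in rows]


def load(rows: list[dict], path: str) -> None:  # data sink
    with open(path, "w") as f:
        json.dump(rows, f)


# Linear pipeline: source -> operator -> operator -> sink
load(enrich(clean(extract("measurements.csv"))), "measurements.json")

A branching operator, i.e. one with more than one input or output, would turn such a chain into a DAG as described above.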
2.2. Pipeline Frameworks

The number of existing pipeline frameworks is overwhelming. A popular collection of pipeline tools on GitHub (https://www.github.com/pditommaso/awesome-pipeline) includes 122 pipeline frameworks. At the same time there is almost no scientific attention on the abstract concepts of these systems. Some conceptual work was done by Maymounkov [9]. The author proposes an important distinction in order to categorize pipeline frameworks: he divides frameworks into task-driven and data-driven. Task-driven frameworks are agnostic about the actual data and operations that occur during a pipeline run. Their focus lies on managing inter- and intra-pipeline dependencies and scheduling large numbers of pipelines in parallel. Popular proponents of this category are Luigi (https://www.github.com/spotify/luigi) and Apache Airflow (https://www.airflow.apache.org/). Data-driven frameworks are – to a varying degree – aware of the data they process and the included operations. These frameworks put a focus on data (and operator) lineage, also called provenance, i.e. they allow the user to retrace the history of a data artifact by saving and curating metadata on all steps of the artifact-producing pipeline. A popular data-driven framework which logs various metadata during pipeline runs is Dagster (https://www.dagster.io/). Some frameworks in this category enable data provenance by using a version control system similar to Git (https://www.git-scm.com/). A prominent example of this is Pachyderm (https://www.pachyderm.com/).

Comparing pipeline frameworks is made difficult by a number of factors: the sheer amount of different frameworks, the lack of a theoretical basis for analysis, the overlapping functionality and the differing ways to achieve the same goal within two frameworks. A thorough search of related work and literature focusing on such a comparison only revealed one paper [10]. Even though the analysis was geared towards a specific system and its requirements, the general results and especially the comparison criteria are a helpful first step towards distinguishing pipeline frameworks. Some of these criteria and their possible values include:

• Type: business, science, big data
• Model: script-based, event-based, adaptive, declarative and procedural
• Separation of concerns: asks whether or not high-level pipeline definitions can be separated from low-level data and operator implementations (see the sketch after this list)
• Language: general purpose language (GPL), domain specific language (DSL)
• Pipeline programming: text-based, graphical, visual
• Reusability: asks whether or not a framework provides tools for reusing existing pipeline definitions as well as individual components of a previously defined pipeline
• Containerization: asks if pipeline components, whole pipelines and the pipeline framework itself can be deployed in a container
• Monitoring: asks whether or not the framework allows for runtime observation of the system or if it grants logging capabilities
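As an illustration of the separation of concerns criterion, the following sketch (a hypothetical mini-API, not taken from any of the frameworks named above) keeps the high-level pipeline definition apart from the low-level operator implementations, so that either can change independently.

# Illustrative sketch of the "separation of concerns" criterion (hypothetical
# API): the high-level pipeline definition only names steps and their order,
# while low-level operator implementations are registered separately and can
# be swapped without touching the definition.
from typing import Callable

OPERATORS: dict[str, Callable] = {}


def register(name: str):
    def wrap(fn: Callable) -> Callable:
        OPERATORS[name] = fn
        return fn
    return wrap


@register("clean")
def drop_empty(rows):
    return [r for r in rows if r]


@register("enrich")
def add_field_count(rows):
    return [{**r, "n_fields": len(r)} for r in rows]


# High-level, declarative pipeline definition: just step names.
PIPELINE = ["clean", "enrich"]


def run(pipeline, data):
    for step in pipeline:
        data = OPERATORS[step](data)
    return data


print(run(PIPELINE, [{"a": 1}, {}]))  # [{'a': 1, 'n_fields': 1}]

A framework fulfilling this criterion lets the declarative definition survive even when individual operator implementations are replaced.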
Some of these results are referenced in Section 3. In Section 4 these basic criteria are extended with a special focus on evolution capabilities. The particularities resulting from evolution will be presented in more detail in the next subsection.

2.3. Pipeline Evolution

Evolution means change over time. In the realm of computer science, change can mean a lot of different things. The emergence and widespread adoption of a new data format (such as JSON, https://www.json.org) or programming model (such as MapReduce [11]) are examples of this. This type of evolution is often gradual and influenced by many different factors. In the context of data pipelines and corresponding frameworks, evolution can happen over different time frames, ranging from gradual to sudden. The main evolution factors are so-called disruptors, which can affect all components and their interactions with each other. The changes triggered by disruptors are diverse, but can be broadly categorized into data, operator and environment disruptors.

The structure and semantics of data might change, affecting data sources and sinks as well as data artifacts created within the pipeline, e.g. interim results. Structural changes in data might occur over time due to altered data producers or operators. Semantic changes in data can emerge from technical, legislative but also societal reasons.

Operator functionality might also experience evolution, e.g. after a software update, resulting in different APIs or a changed set of available (hyper)parameters. Another form of change in this context is choosing a different operator for a specific task which accepts the same input as the old one but produces a different output, e.g. a different data structure. This leads to the need to adapt the pipeline to fit this new operator.

Also, the environment in which the pipeline is run can change over time. For example, the hardware could change, resulting in more processing power or more cluster nodes becoming available. Possible adaptions to such change include increasing the number of pipelines running in parallel or utilizing bigger batch sizes in order to increase efficiency.

3. Pipeline Framework with Evolution Capabilities

In this section a pipeline framework with evolution capabilities is envisioned and discussed. Figure 2, based on a figure from [12], shows a graphical representation of the proposed framework. The outside of the figure is made up of the environment frame including goals and contracts as well as metadata and statistics. These elements represent the available resources, user objectives and metadata which the system gathered, stored and aggregated throughout its lifecycle. Within this frame there are essentially five columns. They represent (from left to right) data sources, operators and data sinks. The arrows connect the individual components and show two pipelines, each consisting of a data source, three operators and a data sink. Evolutionary change can happen at several points during a pipeline's lifecycle. In Figure 2 these disruption points are shown as red flashes. Structure and semantics of data might change at the data sources as well as within the pipeline. Evolution can also affect the operators and the environment in which the pipelines are run. In any case, an ideal pipeline framework could automatically adapt to these changes.

Figure 2: Pipeline framework and its components. Evolution can happen in the form of structural and semantic changes to the data during loading (1) and through operator processing (2) as well as to operators (3), e.g. after a software update. The environment, i.e. hardware, scaling, etc., might also change over time (4). Based on a figure from [12].
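To relate the disruptor categories from Section 2.3 to the disruption points (1) to (4) in Figure 2, the following sketch models them as simple data objects; the names and fields are assumptions made for illustration and do not stem from an existing framework.

# A rough, illustrative model of the disruption points sketched in Figure 2:
# disruptors are categorized into data, operator and environment changes and
# carry enough context for later self-awareness and self-adaption steps.
from dataclasses import dataclass
from enum import Enum, auto


class DisruptorKind(Enum):
    DATA_AT_SOURCE = auto()    # (1) structural/semantic change during loading
    DATA_IN_PIPELINE = auto()  # (2) change introduced by operator processing
    OPERATOR = auto()          # (3) e.g. changed API after a software update
    ENVIRONMENT = auto()       # (4) e.g. changed hardware or cluster size


@dataclass
class Disruption:
    kind: DisruptorKind
    component: str             # affected source, operator or resource
    detail: str                # human-readable description of the change


example = Disruption(DisruptorKind.OPERATOR, "tokenizer", "parameter 'lang' removed in v2.0")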
Concerning adaptability, an important distinction needs to be made. Generally speaking, it is possible to build pipelines in existing frameworks that are very flexible. One class of systems which are very flexible are adaptive workflows, first presented by van der Aalst et al. [13]. Besides being mainly task-driven, these systems adapt themselves based on strict, predefined rules. An example of such a system is AdaptFlow, presented in [14]. Given a treatment plan in the medical context, AdaptFlow can notice logical errors and choose a different path in the predefined workflow. This flexibility is completely dependent on and bounded by the treatment workflow. Generally speaking, the space of possible alterations, given such a flexible system, is significantly smaller than the space envisioned in the present work. This stems from the fact that a pipeline framework with evolution capabilities dynamically creates and alters this search space, in order to find an optimal solution, at different times during the system's lifecycle. This demonstrates that flexibility is not the same as adaptability. It is also possible to build meta pipelines especially for monitoring changes as well as adapting to these changes. Even though this is currently the most practical solution for achieving evolution capabilities in existing frameworks, this approach does not represent real evolution capabilities as they were defined in the previous sections. In any case, before adapting to evolution, the underlying changes need to be noticed and recognized.

3.1. Self-awareness

The first step in dealing with evolution is to be aware of change. Figure 1 (b) shows this step in dealing with evolution. Data-driven frameworks are usually more aware of change than task-driven ones since they provide more monitoring capabilities and allow for concepts such as reproducibility and provenance, which are closely related to evolution. A tool for inspecting pipelines which runs on existing Python code is mlinspect [15, 16]. It extracts the DAG structure of a pipeline and helps the user to identify problems and bugs. For example, it can help to identify a skewed data distribution which would lead to unfair [17] results. ArgusEyes [18] is a tool for inspecting classification pipelines which builds upon mlinspect. It enables the user to check whether best practices are applied while also providing various metadata to analyze pipelines. Even though these tools are not intended to track the evolution of pipelines and their components, but rather focus on helping practitioners with a specific issue, the underlying architecture can serve as useful guidance for the development of a pipeline framework with evolution capabilities. Another important aspect is to track data changes across pipeline steps. The authors of [19] present three measuring approaches that are utilized in order to deal with bias.

Monitoring capabilities, gathering and storing metadata as well as calculating and providing statistics on these findings are critical functionalities towards evolution capabilities in pipeline frameworks. They are necessary in all dimensions and are the basis for self-awareness. Tools like mlinspect and ArgusEyes, but also existing data-driven frameworks like Dagster, can be a starting point towards achieving such functionality. Perceiving change in operator results or contracts, leading to automatic operator swapping or parameter changes, is also fundamentally important. One project that can be of help in this regard is IBM Lale [20], which automatically creates optimal pipelines based on scikit-learn (https://www.scikit-learn.org/stable/) functions. Once the system is aware of change, it needs to adapt to the new circumstances.
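Before turning to self-adaption, the self-awareness step for the data dimension can be illustrated with the following minimal sketch, which compares the schema of newly arriving records against a stored snapshot and reports structural changes; it is a simplified assumption of how such a check could look, not the mechanism of mlinspect, ArgusEyes or Lale.

# Minimal, hypothetical sketch of noticing structural changes in data:
# infer a flat schema from sample records and diff it against a stored one.
def infer_schema(rows: list[dict]) -> dict[str, str]:
    schema: dict[str, str] = {}
    for row in rows:
        for key, value in row.items():
            schema.setdefault(key, type(value).__name__)
    return schema


def structural_changes(previous: dict[str, str], current: dict[str, str]) -> list[str]:
    changes = []
    for field in previous.keys() - current.keys():
        changes.append(f"field removed: {field}")
    for field in current.keys() - previous.keys():
        changes.append(f"field added: {field}")
    for field in previous.keys() & current.keys():
        if previous[field] != current[field]:
            changes.append(f"type changed: {field} ({previous[field]} -> {current[field]})")
    return changes


old = infer_schema([{"id": 1, "value": 3.2}])
new = infer_schema([{"id": "a-17", "value": 3.2, "unit": "mm"}])
print(structural_changes(old, new))  # ['field added: unit', 'type changed: id (int -> str)']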
3.2. Self-adaption

Automatically acting upon change can only be done with respect to a goal. This goal could be as simple as ensuring functionality and as complex as automatically optimizing the performance and accuracy of several big data pipelines running in parallel given certain hardware. Figure 1 (c) shows the adaption step, after a disruption has been perceived by the self-awareness capabilities. In this context it is decisive to formulate a goal, including a fitting representation, which the pipeline framework can use to evaluate decisions. The dimensions for pipeline and environment shown in the next section both contain the evolution requirement to provide an interface for goals. This reveals a potential conflict: a pipeline with the goal to achieve the best possible accuracy for a ML task might want to simulate a lot of different pipelines to find the best one and to achieve this goal. At the same time, simulations and tests might cost a lot of computational resources, which could stand in contrast to the environment dimension's goal to provide a certain performance to all pipelines. A pipeline framework with evolution capabilities needs to have dynamic functionality to deal with these kinds of conflicts.

The vision of self-adapting systems is not unique to the present work. The authors of [12] present four generations in data engineering for data science, ranging from simple data pre-processing to fully automated data curation. In [21] the authors envision a framework for multi-model databases which is self-adapting with regard to design and maintenance. Similar to the insight gained from tools like mlinspect and ArgusEyes in the context of evolution awareness, other self-adaptive systems can help to understand the underlying components and their interplay. For example, Hillenbrand et al. propose a system which automatically chooses an optimal data migration strategy given some constraints like service-level agreements [22]. Pachyderm, which runs natively in Kubernetes (https://kubernetes.io/), has a built-in system for distributed computing and scaling, which is very simple and should be considered in the context of the environment dimension. The empirical results of [10] showed a complete lack of a simulation environment in all studied frameworks. Simulation and the use of synthetic data [23] are important components which need to be incorporated, especially for the pipeline and environment dimensions, since their self-adaption strategies need a search space to optimize towards a goal.

4. Conceptual Requirements Model

As described in the previous sections, there is no framework with comprehensive evolution capabilities yet. This emphasizes the need for a requirements model encompassing important components and their interplay as well as system functionalities. The model presented in this section is conceptual, i.e. it was not derived through a structured method from the field of requirements engineering [24]. It rather evolved from technical talks with experienced colleagues and a rough analysis and comparison of existing pipeline frameworks. It can serve as the inception step for a structured requirements gathering process and furthermore helps with the testing of existing frameworks for their evolution capabilities.

The requirements are structured into two categories, self-awareness and self-adaption, as well as four dimensions:

• Data: data sources and sinks, structure and semantics of data
• Operator: modules and functions and their respective inputs and outputs
• Pipeline: creation and administration of pipelines
• Environment: available hardware and scheduling, scaling and orchestration of pipelines

Table 1 presents an overview of the requirements. The following sections describe the requirements listed in Table 1 in detail.

Table 1: Conceptual requirements and their corresponding dimensions, categorized into self-awareness and self-adaption

Category | Requirement | Dimension
Self-awareness | Collecting and storing metadata | all
Self-awareness | Versioning of metadata | all
Self-awareness | Versioning of component artifacts | all
Self-awareness | Versioning of configuration files | all
Self-awareness | Providing provenance capabilities | all
Self-awareness | Analyzing metadata and creating statistics | all
Self-awareness | Noticing structural changes | data
Self-awareness | Noticing semantic changes | data
Self-awareness | Noticing changes to contracts, APIs and interfaces | operator
Self-awareness | Noticing changes to available computing resources | environment
Self-awareness | Monitoring processing results and performance | operator, pipeline
Self-awareness | Providing an interface for goal definition | operator, pipeline, environment
Self-adaption | Initiating an adaption, based on the violation of a goal | operator, pipeline, environment
Self-adaption | Automatically swapping operators | operator, pipeline
Self-adaption | Automatically changing pipeline structure and components | pipeline
Self-adaption | Automatically optimizing resource distribution and scheduling | environment
Self-adaption | Providing a simulation space to test potential alterations | pipeline, environment

4.1. Self-awareness Requirements

Self-awareness means being aware of change. This change is always relative with respect to some previous state, i.e. in order to be self-aware, a system needs to store at least one previous state for comparison with the current state. Therefore, collecting and storing metadata over all dimensions is an integral requirement for a self-aware pipeline framework. Even though comparing two system states is sufficient to notice change, in many cases it would be beneficial to have a history of system states. Creating a versioned history of metadata allows for more complex concepts and techniques to be applied, e.g. extracting (meta)data distributions or using window-based anomaly detection to notice change. Versioning of metadata, component artifacts and configuration files would enable the self-aware system to notice different forms of change and to distinguish them. For example, it could differentiate between an abrupt change to the interface of an operator after a software update and the gradual decrease of data quality based on the wrong composition of pre-processing operators. Collecting and storing such data is important, but so is managing and curating it, which leads to the need for provenance capabilities over all dimensions. Also, providing tools to analyze metadata, for example to aggregate historic data into statistical values, is an important requirement. Aggregated data enables a different perspective on change.
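The following sketch illustrates the versioning idea with assumed names: every pipeline run appends a metadata snapshot, and change is detected against a window of previous states rather than only the last one, here with a simple threshold on a rolling statistic as a stand-in for window-based anomaly detection.

# Sketch of the metadata versioning requirement (illustrative assumptions):
# each run records a snapshot; drift is flagged when the latest value of a
# metric deviates strongly from the recent history.
import statistics
import time
from dataclasses import dataclass, field


@dataclass
class MetadataStore:
    history: list[dict] = field(default_factory=list)

    def record(self, **metrics) -> None:
        self.history.append({"ts": time.time(), **metrics})

    def drifted(self, metric: str, window: int = 10, tolerance: float = 3.0) -> bool:
        values = [h[metric] for h in self.history[-window:] if metric in h]
        if len(values) < window:
            return False  # not enough history yet
        mean, stdev = statistics.mean(values[:-1]), statistics.pstdev(values[:-1])
        return stdev > 0 and abs(values[-1] - mean) > tolerance * stdev


store = MetadataStore()
for rows in (1000, 1010, 995, 1005, 998, 1002, 1001, 997, 1003, 20):
    store.record(row_count=rows)
print(store.drifted("row_count"))  # True: the last run processed far fewer rows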
When looking at the data dimension, the two fundamental requirements a pipeline framework with evolution capabilities has to fulfill are noticing changes to the structure of data and noticing changes to the semantics of data. These disruptors almost always trigger an adaption and therefore, being aware of and dealing with them is of the utmost importance. The same can be said about the operator dimension. A changing operator interface will most certainly result in an erroneous pipeline. Hence, noticing such change is a critical requirement. Changes to the environment do not necessarily result in non-functioning pipelines, but rather influence the performance. Still, noticing changes to the environment, e.g. available hardware, is important to achieve framework performance goals, such as optimal utilization of available resources. A similar approach needs to be taken for operator and pipeline goals. Processing results and performance of individual operators as well as pipelines need to be monitored, in order to compare these results to predefined goals. Diverse metrics for goal definition can be imagined, ranging from speed and throughput performance to data quality and model accuracy. This leads to the framework requiring an interface for goal definition. This interface allows the user to specify objectives with respect to individual operators, pipelines and the whole framework. At the same time, this goal definition is used for comparison with the current as well as historic states of the system, to notice change and possibly initiate an adaption.
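A goal definition interface could, for example, look like the following sketch; the class, scope and metric names are illustrative assumptions, and the point is merely that declared objectives are matched against monitored metrics to decide whether an adaption should be initiated.

# Hypothetical sketch of a goal-definition interface: users declare objectives
# per operator, pipeline or framework, and monitored metrics are checked
# against them.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Goal:
    scope: str                       # e.g. "operator:train", "pipeline:etl", "framework"
    metric: str                      # e.g. "throughput_rows_s", "accuracy", "null_ratio"
    satisfied: Callable[[float], bool]


goals = [
    Goal("pipeline:etl", "throughput_rows_s", lambda v: v >= 500.0),
    Goal("operator:train", "accuracy", lambda v: v >= 0.9),
]


def violations(observed: dict[tuple[str, str], float]) -> list[Goal]:
    """Return goals whose observed metric violates the objective."""
    return [g for g in goals if not g.satisfied(observed.get((g.scope, g.metric), float("nan")))]


observed = {("pipeline:etl", "throughput_rows_s"): 342.0, ("operator:train", "accuracy"): 0.93}
print([g.metric for g in violations(observed)])  # ['throughput_rows_s'] -> initiate an adaption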
4.2. Self-adaption Requirements

Once the system is aware of a significant change, it triggers an adaption. Based on the dimension in which the adaption should occur, i.e. operator, pipeline or environment, the prerequisites for all possible adaption operations are checked. This first step towards an adaption is an important requirement for a pipeline framework with evolution capabilities, since it creates a search space for possible adjustments. The operations which make up these adjustments represent crucial requirements as well. They include the automatic swapping of an operator, the automatic change of pipeline structure and/or components, as well as the automatic optimization of resource distribution and pipeline scheduling. The search space of all possible operations is transformed into a simulation space, in which possible alterations are tested. This space connects the user's goal definitions with the self-awareness metadata, while at the same time providing simulation and optimization capabilities, in order to find an optimal adaption.
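The interplay of search space, simulation space and goal can be illustrated with the following simplified sketch, in which candidate operators for one pipeline step are scored on sample data before the best one is swapped in; all names and the scoring goal are assumptions made for illustration.

# Simplified sketch of the simulation-space idea: evaluate candidate adaptions
# (here, alternative operators) on held-back sample data against a goal.
from typing import Callable

Rows = list[dict]


def parse_v1(rows: Rows) -> Rows:   # current operator, broken by a schema change
    return [{"value": float(r["val"])} for r in rows if "val" in r]


def parse_v2(rows: Rows) -> Rows:   # candidate replacement for the new schema
    return [{"value": float(r["measurement"])} for r in rows if "measurement" in r]


def simulate(candidates: list[Callable[[Rows], Rows]], sample: Rows,
             goal: Callable[[Rows], float]) -> Callable[[Rows], Rows]:
    """Score each candidate on the sample and return the best one."""
    return max(candidates, key=lambda op: goal(op(sample)))


sample = [{"measurement": "3.5"}, {"measurement": "4.1"}]
best = simulate([parse_v1, parse_v2], sample, goal=lambda out: len(out))  # goal: keep as many rows as possible
print(best.__name__)  # parse_v2 is selected and could now be swapped in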
5. Conclusion and Future Work

The present work defined and showcased data pipelines and their corresponding frameworks. Evolution in the context of these systems was introduced and a conceptual requirements model was proposed, comprised of all components of such systems, categorized by self-awareness and self-adaption and structured into four dimensions. By envisioning a system which fulfills these requirements, a first step was made towards a framework which would need less maintenance based on its self-awareness and self-adaption, i.e. evolution capabilities. This type of framework could be a substantial contribution for scientists and practitioners alike.

The paper is concluded with a set of steps that need to be taken by the community towards achieving evolution capabilities in data pipelines. First of all, a proper requirements model using concepts and methods of requirements engineering must be constructed. This must include a structured requirements gathering process comprised of talking to stakeholders who would benefit from the proposed system, as well as an in-depth analysis of existing concepts and techniques with regard to self-awareness and self-adaption. As a result, this step would produce a system specification encompassing requirements, including non-functional ones, use cases and a basic software architecture, as well as formal definitions of new terms. In the next step, these results need to be compared to existing frameworks and tools, in order to find working solutions, but also gaps. All dimensions must be thoroughly analyzed and the system specification must be iteratively adjusted. During this phase, software engineering and architecture principles which support evolution capabilities must be derived from existing systems and incorporated into the specification. The secondary goal of this step is to either find a framework which provides a good basis for evolution capabilities – at least with respect to a certain dimension – or to discover the need to conceptualize and implement the missing components from scratch. In any case, the next step would be the creation of a prototype. As a final step, this prototype must be evaluated and validated, given the system specification.

Acknowledgments

The author wants to thank Meike Klettke, Stefanie Scherzinger, and Uta Störl for many prolific discussions as well as helpful suggestions with regard to evolution capabilities in data pipelines, without which the present work would not have been possible.

References

[1] J. Koskinen, H. Lahtonen, T. Tilus, Software Maintenance Cost Estimation and Modernization Support, ELTIS-project, Technical Report, University of Jyväskylä, Information Technology Research Institute, 2003.
[2] B. Fjukstad, L. A. Bongo, A Review of Scalable Bioinformatics Pipelines, Data Sci. Eng. 2 (2017).
[3] J. A. Novella, P. E. Khoonsari, S. Herman, D. Whitenack, M. Capuccini, J. Burman, K. Kultima, O. Spjuth, Container-based Bioinformatics with Pachyderm, Bioinform. 35 (2019).
[4] A. Ismail, H. Truong, W. Kastner, Manufacturing Process Data Analysis Pipelines: A Requirements Analysis and Survey, J. Big Data 6 (2019).
[5] M. M. Koushki, I. Y. Abualhaol, A. D. Raju, Y. Zhou, R. S. Giagone, S. Huang, On Building Machine Learning Pipelines for Android Malware Detection: a Procedural Survey of Practices, Challenges and Opportunities, Cybersecur. 5 (2022).
[6] S. Biswas, M. Wardat, H. Rajan, The Art and Practice of Data Science Pipelines: A Comprehensive Study of Data Science Pipelines In Theory, In-The-Small, and In-The-Large, in: ICSE, ACM, 2022.
[7] F. Psallidas, Y. Zhu, B. Karlas, J. Henkel, M. Interlandi, S. Krishnan, B. Kroth, K. V. Emani, W. Wu, C. Zhang, M. Weimer, A. Floratou, C. Curino, K. Karanasos, Data Science Through the Looking Glass: Analysis of Millions of GitHub Notebooks and ML.NET Pipelines, SIGMOD Rec. 51 (2022).
[8] P. Vassiliadis, A Survey of Extract-Transform-Load Technology, in: D. Taniar, L. Chen (Eds.), Integrations of Data Warehousing, Data Mining and Database Technologies - Innovative Approaches, Information Science Reference, 2011.
[9] P. Maymounkov, Koji: Automating Pipelines with Mixed-semantics Data Sources, CoRR abs/1901.01908 (2019). arXiv:1901.01908.
[10] M. Matskin, S. Tahmasebi, A. Layegh, A. H. Payberah, A. Thomas, N. Nikolov, D. Roman, A Survey of Big Data Pipeline Orchestration Tools from the Perspective of the DataCloud Project, in: DAMDID/RCDL, volume 3036 of CEUR Workshop Proceedings, CEUR-WS.org, 2021.
[11] J. Dean, S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, in: OSDI, USENIX Association, 2004.
[12] M. Klettke, U. Störl, Four Generations in Data Engineering for Data Science, Datenbank-Spektrum 22 (2022).
[13] W. M. P. van der Aalst, T. Basten, H. M. W. Verbeek, P. A. C. Verkoulen, M. Voorhoeve, Adaptive Workflow: On the Interplay between Flexibility and Support, in: Proceedings of the 1st International Conference on Enterprise Information Systems, Setubal, Portugal, 27-30 March 1999, ICEIS Secretariat, Escola Superior de Tecnologia de Setúbal, Portugal, 1999.
[14] U. Greiner, J. Ramsch, B. Heller, M. Löffler, R. Müller, E. Rahm, Adaptive Guideline-based Treatment Workflows with AdaptFlow, in: Computer-based Support for Clinical Guidelines and Protocols - Proceedings of the Symposium on Computerized Guidelines and Protocols, CGP 2004, Prague, Czech Republic, 12-14 April, 2004, volume 101 of Studies in Health Technology and Informatics, IOS Press, 2004.
[15] S. Grafberger, J. Stoyanovich, S. Schelter, Lightweight Inspection of Data Preprocessing in Native Machine Learning Pipelines, in: CIDR, 2021.
[16] S. Grafberger, S. Guha, J. Stoyanovich, S. Schelter, MLINSPECT: A Data Distribution Debugger for Machine Learning Pipelines, in: SIGMOD, ACM, 2021.
[17] J. Stoyanovich, B. Howe, H. V. Jagadish, Responsible Data Management, Proc. VLDB Endow. 13 (2020).
[18] S. Schelter, S. Grafberger, S. Guha, O. Sprangers, B. Karlas, C. Zhang, Screening Native Machine Learning Pipelines with ArgusEyes, in: CIDR, 2022.
[19] M. Klettke, A. Lutsch, U. Störl, Kurz erklärt: Measuring Data Changes in Data Engineering and their Impact on Explainability and Algorithm Fairness, Datenbank-Spektrum 21 (2021).
[20] G. Baudart, M. Hirzel, K. Kate, P. Ram, A. Shinnar, J. Tsay, Pipeline Combinators for Gradual AutoML, in: NeurIPS, 2021.
[21] I. Holubová, P. Koupil, J. Lu, Self-adapting Design and Maintenance of Multi-Model Databases, in: B. C. Desai, P. Z. Revesz (Eds.), IDEAS, ACM, 2022.
[22] A. Hillenbrand, U. Störl, S. Nabiyev, M. Klettke, Self-adapting Data Migration in the Context of Schema Evolution in NoSQL Databases, Distributed Parallel Databases 40 (2022).
[23] M. Abufadda, K. Mansour, A Survey of Synthetic Data Generation for Machine Learning, in: ACIT, IEEE, 2021.
[24] S. Wagner, D. M. Fernández, M. Felderer, A. Vetrò, M. Kalinowski, R. J. Wieringa, D. Pfahl, T. Conte, M. Christiansson, D. Greer, C. Lassenius, T. Männistö, M. Nayebi, et al., Status Quo in Requirements Engineering: A Theory and a Global Family of Surveys, in: Software Engineering, volume P-310 of LNI, Gesellschaft für Informatik e.V., 2021.