<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Capabilities in Data Pipelines</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kevin Kramer</string-name>
          <email>kevin.kramer@fernuni-hagen.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Hagen</institution>
          ,
          <addr-line>Universitätsstr. 1, 58097 Hagen</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Evolutionary change over time in the context of data pipelines is certain, especially with regard to the structure and semantics of data as well as to the pipeline operators. Dealing with these changes, i.e. providing long-term maintenance, is costly. The present work explores the need for evolution capabilities within pipeline frameworks. In this context dealing with evolution is defined as a two-step process consisting of self-awareness and self-adaption. Furthermore, a conceptual requirements model is provided, which encompasses criteria for self-awareness and self-adaption as well as covering the dimensions data, operator, pipeline and environment. A lack of said capabilities in existing frameworks exposes a major gap. Filling this gap will be a significant contribution for practitioners and scientists alike. The present work envisions and lays the foundation for a framework which can handle evolutionary change.</p>
      </abstract>
      <kwd-group>
        <kwd>data pipeline</kwd>
        <kwd>data evolution</kwd>
        <kwd>operator evolution</kwd>
        <kwd>data pipeline framework</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The last decade was characterized by ever-increasing amounts of data. This led to new technical demands on data storage, transfer and analysis. To cope with these demands, complex new systems emerged, which in turn require maintenance. Providing this maintenance is costly: even though the systems themselves might run as expected, changes over time, e.g. to the structure and semantics of data, inevitably induce a need to adjust a system's configuration to restore functionality. One estimate suggests that 50-70% of the total cost of a long-running software system can be attributed to maintenance [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Data pipelines are an [...] maintenance whenever change, i.e. evolution, happens. Adding evolution capabilities to data pipelines, and thereby reducing maintenance cost and human involvement, could be a significant contribution for scientists and practitioners alike. The current work takes the first step in this direction by collecting the requirements for such a system and by envisioning a data pipeline framework which fulfills these requirements.
      </p>
      <p>
        The remainder of this paper is structured as follows. Section 2 describes the general concepts and challenges of evolution in data pipelines; it also defines important terminology and presents related work. In Section 3 a pipeline framework with evolution [...]
      </p>
      <p>
        [...] domains such as bioinformatics [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ], manufacturing [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and cybersecurity [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Broadly speaking, a data pipeline consists of three components: data source(s), operator(s) and data sink(s). Figure 1 (a) shows such a basic pipeline.
      </p>
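      <p>To make the three components concrete, the following minimal Python sketch chains a data source, two operators and a data sink into a linear pipeline. It is an illustration only and is not taken from any of the frameworks discussed in this paper; the names run_pipeline, drop_missing and to_fahrenheit as well as the sample records are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def run_pipeline(source, operators, sink):
    """Read records from the source, apply each operator in order, write to the sink."""
    data = source
    for operator in operators:
        data = operator(data)
    for record in data:
        sink.append(record)
    return sink


def drop_missing(records):
    # Operator 1: discard records without a temperature reading.
    return (r for r in records if r.get("temp_c") is not None)


def to_fahrenheit(records):
    # Operator 2: derive a new field from an existing one.
    return ({**r, "temp_f": r["temp_c"] * 9 / 5 + 32} for r in records)


# Hypothetical data source: an in-memory list of records.
source = [
    {"id": 1, "temp_c": 20.0},
    {"id": 2, "temp_c": None},
    {"id": 3, "temp_c": 25.0},
]

# Hypothetical data sink: a plain list that collects the results.
result = run_pipeline(source, [drop_missing, to_fahrenheit], sink=[])
```
]]></preformat>
      <p>Because each operator only consumes and produces an iterable of records, an operator can be reordered or replaced without touching the source or the sink, as long as its input and output structure stay compatible with its neighbours.</p>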
      <sec id="sec-1-1">
        <p>
          Biswas et al. empirically studied the components and stages of 71 data science (DS) pipelines [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. Their
findings suggest that DS pipelines consist of a pre-processing
phase, a model building phase and a post-processing phase.
        </p>
        <p>They further extracted tasks and sub-tasks associated with these phases. Sub-tasks are atomic operators in the context of a pipeline. The pre-processing phase consists of the tasks data acquisition, data preparation and storage, which represent the typical components of data engineering; it also includes the data source(s). The model building phase comprises the tasks feature engineering, modeling, training, evaluation and prediction. These tasks correspond to basic machine learning (ML) and data mining (DM) functions. The tasks included in the post-processing phase are interpretation, communication and deployment as well as all data sinks. The empirical results show that the pre-processing and the model building phases appeared in 96% of the examined DS pipelines, whereas the post-processing phase only appeared in 52% of pipelines.</p>
        <p>
          Pipelines can be linear, i.e. consist of one data source, a chain of operators and finally one data sink. Psallidas et al. empirically studied 8M Jupyter notebooks<sup>1</sup> from GitHub<sup>2</sup> [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. Their results, which were produced by mining and analyzing the abstract syntax trees of all notebooks, suggest that 80% of the pipelines are linear. More generally, the structure of a pipeline can be interpreted as a directed acyclic graph (DAG), which allows for pipelines that include several data sources and sinks as well as branching operators, i.e. operators which have more than one input or output. A widespread example of such non-linear data processing is extract, transform, load (ETL). ETL pipelines are used to extract data from multiple heterogeneous sources, transform them to use a common schema and then load them into a data sink such as a data warehouse (which may become a data source itself in the following steps) [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. Even though pipelines can be created using only functions and modules by chaining their inputs and outputs together [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], pipeline frameworks allow users to generate, maintain and administrate complex pipelines.
        </p>
        <p><sup>1</sup> https://www.jupyter.org/, <sup>2</sup> https://www.github.com/</p>
        <p>© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License. CEUR, ceur-ws.org</p>
        <sec id="sec-1-1-1">
          <title>2.2. Pipeline Frameworks</title>
          <p>
            The number of existing pipeline frameworks is overwhelming. A popular collection of pipeline tools on GitHub<sup>3</sup> includes 122 pipeline frameworks. At the same time, there is almost no scientific attention on the abstract concepts of these systems. Some conceptual work was done by Maymounkov [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ]. The author proposes an important distinction in order to categorize pipeline frameworks: he divides them into task-driven and data-driven frameworks. Task-driven frameworks are agnostic about the actual data and operations that occur during a pipeline run. Their focus lies on managing inter- and intra-pipeline dependencies and scheduling large numbers of pipelines in parallel. Popular representatives of this category are Luigi<sup>4</sup> and Apache Airflow<sup>5</sup>. Data-driven frameworks are, to a varying degree, aware of the data they process and the included operations. These frameworks put a focus on data (and operator) lineage, also called provenance, i.e. they allow the user to retrace the history of a data artifact by saving and curating metadata on all steps of the artifact-producing pipeline. A popular data-driven framework which logs various metadata during pipeline runs is Dagster<sup>6</sup>. Some frameworks in this category enable data provenance by using a version control system similar to Git<sup>7</sup>. A prominent example of this is Pachyderm<sup>8</sup>.
          </p>
          <p>
            Comparing pipeline frameworks is made difficult by a number of factors: the sheer number of different frameworks, the lack of a theoretical basis for analysis, the overlapping functionality and the differing ways to achieve the same goal within two frameworks. A thorough search of related work and literature focusing on such comparisons revealed only one paper [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ]. Even though the analysis was geared towards a specific system and its requirements, the general results and especially the comparison criteria are a helpful first step towards distinguishing pipeline frameworks. Some of these criteria and their possible values include:
          </p>
          <list list-type="bullet">
            <list-item><p>Type: business, science, big data</p></list-item>
            <list-item><p>Model: script-based, event-based, adaptive, declarative and procedural</p></list-item>
            <list-item><p>Separation of concerns: asks whether or not high-level pipeline definitions can be separated from low-level data and operator implementations</p></list-item>
            <list-item><p>Language: general purpose language (GPL), domain specific language (DSL)</p></list-item>
            <list-item><p>Pipeline programming: text-based, graphical, visual</p></list-item>
            <list-item><p>Reusability: asks whether or not a framework provides tools for reusing existing pipeline definitions as well as individual components of a previously defined pipeline</p></list-item>
            <list-item><p>Containerization: asks whether pipeline components, whole pipelines and the pipeline framework itself can be deployed in a container</p></list-item>
            <list-item><p>Monitoring: asks whether or not the framework allows for runtime observation of the system or provides logging capabilities</p></list-item>
          </list>
          <p>Some of these results are referenced in Section 3. In Section 4 these basic criteria are extended with a special focus on evolution capabilities. The particularities resulting from evolution are presented in more detail in the next subsection.</p>
        </sec>
        <sec id="sec-1-1-2">
          <title>2.3. Pipeline Evolution</title>
          <p>
            Evolution means change over time. In the realm of computer science, change can mean a lot of different things. The emergence and widespread adoption of a new data format (such as JSON<sup>9</sup>) or programming model (such as MapReduce [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ]) are examples of this. This type of evolution is often gradual and influenced by many different factors. In the context of data pipelines and corresponding frameworks, evolution can happen over different time frames, ranging from gradual to sudden. The main evolution factors are so-called disruptors, which can affect all components and their interactions with each other. The changes triggered by disruptors are diverse, but can be broadly categorized into data, operator and environment disruptors.
          </p>
          <p>The structure and semantics of data might change, affecting data sources and sinks as well as data artifacts created within the pipeline, e.g. interim results. Structural changes in data might occur over time due to altered data producers or operators. Semantic changes in data can emerge for technical, legislative but also societal reasons.</p>
          <p>Operator functionality might also experience evolution, e.g. after a software update, resulting in different APIs or a changed set of available (hyper)parameters. Another form of change in this context is choosing a different operator for a specific task which accepts the same input as the old one but produces a different output, e.g. a different data structure. This leads to the need to adapt the pipeline to fit this new operator.</p>
          <p>Also, the environment in which the pipeline is run can change over time. For example, the hardware could change, resulting in more processing power or more cluster nodes becoming available. Possible ways of adapting to such change include increasing the number of pipelines running in parallel or utilizing bigger batch sizes in order to increase efficiency.</p>
          <p><sup>8</sup> https://www.pachyderm.com/, <sup>9</sup> https://www.json.org</p>
        </sec>
      </sec>
      <sec id="sec-1-2">
        <title>3. Pipeline Framework with Evolution Capabilities</title>
        <p>
          In this section a pipeline framework with evolution
capabilities is envisioned and discussed. Figure 2, based
on a figure from [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], shows a graphical representation
of the proposed framework. The outside of the figure is
made up of the environment frame including goals and
contracts as well as metadata and statistics. These
elements represent the available resources, user objectives
and metadata, which the system gathered, stored and
aggregated throughout its lifecycle. Within this frame
there are essentially five columns. They represent (from
left to right) data sources, operators and data sinks. The
arrows connect the individual components and show two
pipelines, each consisting of a data source, three
operators and a data sink. Evolutionary change can happen
at several points during a pipeline’s lifecycle. In
Figure 2 these disruption points are shown as red flashes.
        </p>
        <p>Structure and semantics of data might change at the data sources as well as within the pipeline. Evolution can also affect the operators and the environment in which the pipelines are run. In any case, an ideal pipeline framework could automatically adapt to these changes.</p>
        <p>
          Concerning adaptability, an important distinction needs to be made. Generally speaking, it is possible to build pipelines in existing frameworks that are very flexible. One class of very flexible systems is adaptive workflows, first presented by van der Aalst et al. [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. Besides being mainly task-driven, these systems adapt themselves based on strict, predefined rules. An example of such a system is AdaptFlow, presented in [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. Given a treatment plan in the medical context, AdaptFlow can notice logical errors and choose a different path in the predefined workflow. This flexibility is completely dependent on and bounded by the treatment workflow. Generally speaking, the space of possible alterations, given such a flexible system, is significantly smaller than the space envisioned in the present work. This stems from the fact that a pipeline framework with evolution capabilities dynamically creates and alters this search space, in order to find an optimal solution, at different times during the system’s lifecycle. This demonstrates that flexibility is not the same as adaptability. It is also possible to build meta pipelines especially for monitoring changes as well as adapting to these changes. Even though this is currently the most practical solution for achieving evolution capabilities in existing frameworks, this approach does not represent real evolution capabilities as they were defined in the previous sections. In any case, before adapting to evolution, the underlying changes need to be noticed and recognized.
        </p>
        <sec id="sec-1-2-1">
          <title>3.1. Self-awareness</title>
          <p>
            The first step in dealing with evolution is to be aware of change. Figure 1 (b) shows this step. Data-driven frameworks are usually more aware of change than task-driven ones, since they provide more monitoring capabilities and allow for concepts such as reproducibility and provenance, which are closely related to evolution. A tool for inspecting pipelines which runs on existing Python code is mlinspect [
            <xref ref-type="bibr" rid="ref15 ref16">15, 16</xref>
            ]. It extracts the DAG structure of a pipeline and helps the user to identify problems and bugs. For example, it can help to identify a skewed data distribution which would lead to unfair [
            <xref ref-type="bibr" rid="ref17">17</xref>
            ] results. ArgusEyes [
            <xref ref-type="bibr" rid="ref18">18</xref>
            ] is a tool for inspecting classification pipelines which builds upon mlinspect. It enables the user to check whether best practices are applied while also providing various metadata to analyze pipelines. Even though these tools are not intended to track the evolution of pipelines and their components, but rather focus on helping practitioners with a specific issue, the underlying architecture can serve as useful guidance for the development of a pipeline framework with evolution capabilities. Another important aspect is to track data changes across pipeline steps. The authors of [
            <xref ref-type="bibr" rid="ref19">19</xref>
            ] present three measuring approaches that are utilized in order to deal with bias.
          </p>
          <p>
            Monitoring capabilities, gathering and storing metadata as well as calculating and providing statistics on these findings are critical functionalities towards evolution capabilities in pipeline frameworks. They are necessary in all dimensions and are the basis for self-awareness. Tools like mlinspect and ArgusEyes, but also existing data-driven frameworks like Dagster, can be a starting point towards achieving such functionality. Perceiving change in operator results or contracts, leading to the automatic swapping of operators or a change of parameters, is also fundamentally important. One project that can be of help in this regard is IBM Lale [
            <xref ref-type="bibr" rid="ref20">20</xref>
            ], which automatically creates optimal pipelines based on scikit-learn<sup>10</sup> functions. Once the system is aware of change, it needs to adapt to the new circumstances.
          </p>
        </sec>
        <sec id="sec-1-2-2">
          <title>3.2. Self-adaption</title>
          <p>Automatically acting upon change can only be done with respect to a goal. This goal could be as simple as ensuring functionality and as complex as automatically optimizing the performance and accuracy of several big data pipelines running in parallel given certain hardware. Figure 1 (c) shows the adaption step, after a disruption has been perceived by the self-awareness capabilities. In this context it is decisive to formulate a goal including a fitting representation, which the pipeline framework can use to evaluate decisions. The dimensions for pipeline and environment shown in the last section both contain the evolution requirement to provide an interface for goals. This reveals a potential conflict: a pipeline with the goal to achieve the best possible accuracy for an ML task might want to simulate a lot of different pipelines to find the best one. At the same time, simulations and tests might cost a lot of computational resources, which could stand in contrast to the environment dimension’s goal to provide a certain performance to all pipelines. A pipeline framework with evolution capabilities needs to have dynamic functionality to deal with these kinds of conflicts.</p>
          <p>
            The vision of self-adapting systems is not unique to the present work. The authors of [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ] present four generations in data engineering for data science, ranging from simple data pre-processing to fully automated data curation. In [
            <xref ref-type="bibr" rid="ref21">21</xref>
            ] the authors envision a framework for multi-model databases, which is self-adapting with regard to design and maintenance. Similar to the insight gained from tools like mlinspect and ArgusEyes in the context of evolution awareness, other self-adaptive systems can help to understand the underlying components and their interplay. For example, Hillenbrand et al. propose a system which automatically chooses an optimal data migration strategy given some constraints like service-level agreements [
            <xref ref-type="bibr" rid="ref22">22</xref>
            ]. Pachyderm, which runs natively in Kubernetes<sup>11</sup>, has a built-in system for distributed computing and scaling, which is very simple and should be considered in the context of the environment dimension. The empirical results of [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ] showed a complete lack of a simulation environment in all studied frameworks. Simulation and the use of synthetic data [
            <xref ref-type="bibr" rid="ref23">23</xref>
            ] are important components, which need to be incorporated especially for the pipeline and environment dimensions, since their self-adaption strategies need a search space to optimize towards a goal.
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>4. Conceptual Requirements Model</title>
      <p>
        As described in the previous sections, there is no framework with comprehensive evolution capabilities yet. This emphasizes the need for a requirements model, encompassing important components and their interplay as well as system functionalities. The model presented in this section is conceptual, i.e. it was not derived through a structured method from the field of requirements engineering [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]. It rather evolved from technical talks with experienced colleagues and a rough analysis and comparison of existing pipeline frameworks. It can serve as the inception step for a structured requirements gathering process and furthermore helps with the testing of existing frameworks for their evolution capabilities.
      </p>
      <p>The requirements are structured into two categories, self-awareness and self-adaption, as well as four dimensions:</p>
      <list list-type="bullet">
        <list-item><p>Data: Data sources and sinks, structure and semantics of data</p></list-item>
        <list-item><p>Operator: Modules and functions and their respective inputs and outputs</p></list-item>
        <list-item><p>Pipeline: Creation and administration of pipelines</p></list-item>
        <list-item><p>Environment: Available hardware and scheduling, scaling and orchestration of pipelines</p></list-item>
      </list>
      <p>Table 1 presents an overview of the requirements. The following sections describe the requirements listed in Table 1 in detail.</p>
      <p>Table 1. Overview of the requirements, grouped by category (Self-awareness, Self-adaption) and Dimension.</p>
      <list list-type="bullet">
        <list-item><p>Collecting and storing metadata</p></list-item>
        <list-item><p>Versioning of metadata</p></list-item>
        <list-item><p>Versioning of component artifacts</p></list-item>
        <list-item><p>Versioning of configuration files</p></list-item>
        <list-item><p>Providing provenance capabilities</p></list-item>
        <list-item><p>Analyzing metadata and creating statistics</p></list-item>
        <list-item><p>Noticing structural changes</p></list-item>
        <list-item><p>Noticing semantic changes</p></list-item>
        <list-item><p>Noticing changes to contracts, APIs and interfaces</p></list-item>
        <list-item><p>Noticing changes to available computing resources</p></list-item>
        <list-item><p>Monitoring processing results and performance</p></list-item>
        <list-item><p>Providing an interface for goal definition</p></list-item>
        <list-item><p>Initiating an adaption, based on the violation of a goal</p></list-item>
        <list-item><p>Automatically swapping operators</p></list-item>
        <list-item><p>Automatically changing pipeline structure and components</p></list-item>
        <list-item><p>Automatically optimizing resource distribution and scheduling</p></list-item>
        <list-item><p>Providing a simulation space to test potential alteration</p></list-item>
      </list>
      <sec id="sec-2-1">
        <title>4.1. Self-awareness Requirements</title>
        <p>Self-awareness means being aware of change. This change is always relative to some previous state, i.e. in order to be self-aware, a system needs to store at least one previous state for comparison with the current state. Therefore, collecting and storing metadata over all dimensions is an integral requirement for a self-aware pipeline framework. Even though comparing two system states is sufficient to notice change, in many cases it would be beneficial to have a history of system states. Creating a versioned history of metadata allows for more complex concepts and techniques to be applied, e.g. extracting (meta)data distributions or using window-based anomaly detection to notice change. Versioning of metadata, component artifacts and configuration files would enable the self-aware system to notice different forms of change and distinguish them. For example, it could differentiate between an abrupt change to the interface of an operator after a software update and the gradual decrease of data quality, based on the wrong composition of pre-processing operators. Collecting and storing such data is important, but so is managing and curating it, which leads to the need for provenance capabilities over all dimensions. Providing tools to analyze metadata, for example to aggregate historic data into statistical values, is also an important requirement. Aggregated data enables a different perspective on change.</p>
        <p>When looking at the data dimension, the two fundamental requirements a pipeline framework with evolution capabilities has to fulfill are noticing changes to the structure of data and noticing changes to the semantics of data. These disruptors almost always trigger an adaption, and therefore being aware of and dealing with them is of utmost importance. The same can be said about the operator dimension. A changing operator interface will most certainly result in an erroneous pipeline. Hence, noticing such change is a critical requirement. Changes to the environment do not necessarily result in non-functioning pipelines, but rather influence the performance. Still, noticing changes to the environment, e.g. available hardware, is important to achieve framework performance goals, such as optimal utilization of available resources. A similar approach needs to be taken for operator and pipeline goals. Processing results and performance of individual operators as well as pipelines need to be monitored, in order to compare these results to predefined goals. Diverse metrics for goal definition can be imagined, ranging from speed and throughput performance to data quality and model accuracy. This leads to the framework requiring an interface for goal definition. This interface allows the user to specify objectives with respect to individual operators, pipelines and the whole framework. At the same time, this goal definition is used for comparison with the current as well as historic states of the system, to notice change and possibly initiate an adaption.</p>
      </sec>
      <sec id="sec-2-2">
        <title>4.2. Self-adaption Requirements</title>
        <p>Once the system is aware of a significant change, it triggers an adaption. Based on the dimension in which the adaption should occur, i.e. operator, pipeline or environment, the prerequisites for all possible adaption operations are checked. This first step towards an adaption is an important requirement for a pipeline framework with evolution capabilities, since it creates a search space for possible adjustments. The operations which make up these adjustments represent crucial requirements as well. They include the automatic swapping of an operator, the automatic change of pipeline structure and/or components, as well as the automatic optimization of resource distribution and pipeline scheduling. The search space of all possible operations is transformed into a simulation space, in which possible alterations are tested. This space connects the user’s goal definitions with the self-awareness metadata, while at the same time providing simulation and optimization capabilities, in order to find an optimal adaption.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Conclusion and Future Work</title>
      <p>The present work defined and showcased data pipelines and their corresponding frameworks. Evolution in the context of these systems was introduced and a conceptual requirements model was proposed, comprised of all components of such systems, categorized by self-awareness and self-adaption and structured into four dimensions. By envisioning a system which fulfills these requirements, a first step was made towards a framework which would need less maintenance based on its self-awareness and self-adaption, i.e. evolution capabilities. This type of framework could be a substantial contribution for scientists and practitioners alike.</p>
      <p>The paper is concluded with a set of steps that need to be taken by the community towards achieving evolution capabilities in data pipelines. First of all, a proper requirements model using concepts and methods of requirements engineering must be constructed. This must include a structured requirements gathering process comprised of talking to stakeholders, who would benefit from the proposed system, as well as an in-depth analysis of existing concepts and techniques with regard to self-awareness and self-adaption. As a result, this step would produce a system specification encompassing requirements, including non-functional ones, use-cases and a basic software architecture, as well as formal definitions of new terms. In the next step, these results need to be compared to existing frameworks and tools, in order to find working solutions, but also gaps. All dimensions must be thoroughly analyzed and the system specification must be iteratively adjusted. During this phase, software engineering and architecture principles which support evolution capabilities must be derived from existing systems and be incorporated into the specification. The secondary goal of this step is to either find a framework which provides a good basis for evolution capabilities (at least with respect to a certain dimension), or to discover the need to conceptualize and implement the missing components from scratch. In any case, the next step would be the creation of a prototype. As a final step, this prototype must be evaluated and validated against the system specification.</p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <sec id="sec-4-1">
        <p>The author wants to thank Meike Klettke, Stefanie Scherzinger, and Uta Störl for many prolific discussions as well as helpful suggestions with regard to evolution capabilities in data pipelines, without which the present work would not have been possible.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Koskinen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lahtonen</surname>
          </string-name>
          , T. Tilus,
          <article-title>Software Maintenance Cost Estimation and Modernization Support, ELTIS-project</article-title>
          ,
          <source>Technical Report</source>
          , University of Jyväskylä, Information Technology Research Institute,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B.</given-names>
            <surname>Fjukstad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. A.</given-names>
            <surname>Bongo</surname>
          </string-name>
          ,
          <article-title>A Review of Scalable Bioinformatics Pipelines</article-title>
          ,
          <source>Data Sci. Eng</source>
          .
          <volume>2</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Novella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. E.</given-names>
            <surname>Khoonsari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Herman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Whitenack</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Capuccini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Burman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kultima</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Spjuth</surname>
          </string-name>
          ,
          <article-title>Container-based Bioinformatics with Pachyderm</article-title>
          ,
          <source>Bioinform</source>
          .
          <volume>35</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ismail</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Truong</surname>
          </string-name>
          , W. Kastner,
          <article-title>Manufacturing Process Data Analysis Pipelines: A Requirements Analysis and Survey</article-title>
          ,
          <source>J. Big Data</source>
          <volume>6</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M. M.</given-names>
            <surname>Koushki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. Y.</given-names>
            <surname>Abualhaol</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. D.</given-names>
            <surname>Raju</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Giagone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <article-title>On Building Machine Learning Pipelines for Android Malware Detection: a Procedural Survey of Practices, Challenges and Opportunities</article-title>
          ,
          <source>Cybersecur</source>
          .
          <volume>5</volume>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Biswas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wardat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Rajan</surname>
          </string-name>
          ,
          <article-title>The Art and Practice of Data Science Pipelines: A Comprehensive Study of Data Science Pipelines In Theory, In-The-Small, and In-The-Large</article-title>
          , in: ICSE, ACM,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>F.</given-names>
            <surname>Psallidas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Karlas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Henkel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Interlandi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Krishnan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kroth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. V.</given-names>
            <surname>Emani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Weimer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Floratou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Curino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Karanasos</surname>
          </string-name>
          ,
          <article-title>Data Science Through the Looking Glass: Analysis of Millions of GitHub Notebooks and ML.NET Pipelines</article-title>
          ,
          <source>SIGMOD Rec</source>
          .
          <volume>51</volume>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>P.</given-names>
            <surname>Vassiliadis</surname>
          </string-name>
          ,
          <article-title>A Survey of Extract-Transform-Load Technology</article-title>
          , in:
          <string-name>
            <given-names>D.</given-names>
            <surname>Taniar</surname>
          </string-name>
          , L. Chen (Eds.),
          <source>Integrations of Data Warehousing, Data Mining and Database Technologies - Innovative Approaches</source>
          , Information Science Reference,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>P.</given-names>
            <surname>Maymounkov</surname>
          </string-name>
          ,
          <article-title>Koji: Automating Pipelines with Mixed-semantics Data Sources</article-title>
          , CoRR abs/1901.01908 (
          <year>2019</year>
          ). arXiv:1901.01908.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Matskin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tahmasebi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Layegh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. H.</given-names>
            <surname>Payberah</surname>
          </string-name>
          , A. Thomas,
          <string-name>
            <given-names>N.</given-names>
            <surname>Nikolov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Roman</surname>
          </string-name>
          ,
          <article-title>A Survey of Big Data Pipeline Orchestration Tools from the Perspective of the DataCloud Project</article-title>
          , in: DAMDID/RCDL, volume
          <volume>3036</volume>
          of
          <source>CEUR Workshop Proceedings</source>
          , CEUR-WS.org,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ghemawat</surname>
          </string-name>
          ,
          <article-title>MapReduce: Simplified Data Processing on Large Clusters</article-title>
          , in: OSDI, USENIX Association,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Klettke</surname>
          </string-name>
          , U. Störl,
          <article-title>Four Generations in Data Engineering for Data Science</article-title>
          ,
          <source>Datenbank-Spektrum</source>
          <volume>22</volume>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>W. M. P.</given-names>
            <surname>van der Aalst</surname>
          </string-name>
          , T. Basten,
          <string-name>
            <given-names>H. M. W.</given-names>
            <surname>Verbeek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. A. C.</given-names>
            <surname>Verkoulen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Voorhoeve</surname>
          </string-name>
          ,
          <article-title>Adaptive workflow-on the interplay between flexibility and support</article-title>
          , in:
          <source>Proceedings of the 1st International Conference on Enterprise Information Systems</source>
          , Setubal, Portugal, 27-30 March
          <year>1999</year>
          , ICEIS Secretariat, Escola Superior de Tecnologia de Setúbal, Portugal,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>U.</given-names>
            <surname>Greiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ramsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Heller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Löffler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Müller</surname>
          </string-name>
          , E. Rahm,
          <article-title>Adaptive guideline-based treatment workflows with AdaptFlow</article-title>
          , in:
          <source>Computer-based Support for Clinical Guidelines and Protocols - Proceedings of the Symposium on Computerized Guidelines and Protocols, CGP 2004</source>
          , Prague, Czech Republic, 12-14 April
          <year>2004</year>
          , volume
          <volume>101</volume>
          of Studies in Health Technology and Informatics
          , IOS Press,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>S.</given-names>
            <surname>Grafberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Stoyanovich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Schelter</surname>
          </string-name>
          ,
          <article-title>Lightweight Inspection of Data Preprocessing in Native Machine Learning Pipelines</article-title>
          , in: CIDR,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>S.</given-names>
            <surname>Grafberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Guha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Stoyanovich</surname>
          </string-name>
          , S. Schelter,
          <article-title>MLINSPECT: A Data Distribution Debugger for Machine Learning Pipelines</article-title>
          , in: SIGMOD, ACM,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J.</given-names>
            <surname>Stoyanovich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Howe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. V.</given-names>
            <surname>Jagadish</surname>
          </string-name>
          ,
          <article-title>Responsible Data Management</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>13</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>S.</given-names>
            <surname>Schelter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Grafberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Guha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Sprangers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Karlas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Screening Native Machine Learning Pipelines with ArgusEyes</article-title>
          , in: CIDR,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>M.</given-names>
            <surname>Klettke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lutsch</surname>
          </string-name>
          , U. Störl,
          <article-title>Kurz erklärt: Measuring Data Changes in Data Engineering and their Impact on Explainability and Algorithm Fairness</article-title>
          ,
          <source>Datenbank-Spektrum</source>
          <volume>21</volume>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>G.</given-names>
            <surname>Baudart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hirzel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kate</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ram</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shinnar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tsay</surname>
          </string-name>
          ,
          <article-title>Pipeline combinators for gradual AutoML</article-title>
          , in: NeurIPS,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>I.</given-names>
            <surname>Holubová</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Koupil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <article-title>Self-adapting Design and Maintenance of Multi-Model Databases</article-title>
          , in:
          <string-name>
            <given-names>B. C.</given-names>
            <surname>Desai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. Z.</given-names>
            <surname>Revesz</surname>
          </string-name>
          (Eds.), IDEAS, ACM,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hillenbrand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Störl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Nabiyev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Klettke</surname>
          </string-name>
          ,
          <article-title>Self-adapting Data Migration in the Context of Schema Evolution in NoSQL Databases</article-title>
          ,
          <source>Distributed Parallel Databases</source>
          <volume>40</volume>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>M.</given-names>
            <surname>Abufadda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Mansour</surname>
          </string-name>
          ,
          <article-title>A Survey of Synthetic Data Generation for Machine Learning</article-title>
          , in: ACIT, IEEE,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>S.</given-names>
            <surname>Wagner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Fernández</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Felderer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vetrò</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kalinowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Wieringa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Pfahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Conte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Christiansson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Greer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lassenius</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Männistö</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nayebi</surname>
          </string-name>
          , et al.,
          <article-title>Status Quo in Requirements Engineering: A Theory and a Global Family of Surveys</article-title>
          , in: Software Engineering, volume P-310 of LNI, Gesellschaft für Informatik e.V.,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>