Towards Evolution Capabilities in Data Pipelines

Kevin Kramer
University of Hagen, Universitätsstr. 1, 58097 Hagen, Germany
kevin.kramer@fernuni-hagen.de (K. Kramer)
https://www.fernuni-hagen.de/dbis/team/kevin.kramer.shtml (K. Kramer)

34th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), June 7-9, 2023, Hirsau, Germany
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

Abstract
Evolutionary change over time in the context of data pipelines is certain, especially with regard to the structure and semantics of data as well as to the pipeline operators. Dealing with these changes, i.e. providing long-term maintenance, is costly. The present work explores the need for evolution capabilities within pipeline frameworks. In this context, dealing with evolution is defined as a two-step process consisting of self-awareness and self-adaption. Furthermore, a conceptual requirements model is provided, which encompasses criteria for self-awareness and self-adaption and covers the dimensions data, operator, pipeline and environment. A lack of said capabilities in existing frameworks exposes a major gap. Filling this gap will be a significant contribution for practitioners and scientists alike. The present work envisions and lays the foundation for a framework which can handle evolutionary change.

Keywords
data pipeline, data evolution, operator evolution, data pipeline framework

1. Introduction

The last decade was characterized by ever increasing amounts of data. This also led to new technical demands in the context of data storage, transfer and analysis. In order to cope with these demands, complex new systems emerged, which in turn require maintenance. Providing this maintenance is costly, and even though the systems themselves might run as expected, changes over time, e.g. to the structure and semantics of data, inevitably induce a need to adjust the system's configuration to restore functionality. One estimate suggests that 50-70% of the total cost of a long running software system can be attributed to maintenance [1]. Data pipelines are an intuitive way to structure end-to-end data processing. The corresponding tools and frameworks are used in a wide field of domains and for an extensive amount of diverse applications. Still, they also need costly maintenance whenever change, i.e. evolution, happens. Adding evolution capabilities to data pipelines, and thereby reducing maintenance cost and human involvement, could be a big contribution for scientists and practitioners alike. The current work takes the first step in this direction by collecting requirements needed for such a system and by envisioning a data pipeline framework which fulfills these requirements.

The following sections are structured as follows. Section 2 describes the general concepts and challenges of evolution in data pipelines. Important terminology is defined and related work is shown in this section as well. In Section 3 a pipeline framework with evolution capabilities is envisioned and discussed. A conceptual requirements model, which focuses on these evolution capabilities, is presented in Section 4. Finally, the last section concludes the paper and outlines a roadmap for the community towards a pipeline framework with evolution capabilities.

2. Evolution in Data Pipelines

This section provides the basis for the current work by defining important concepts as well as presenting related work. Firstly, data pipelines and their components are introduced. Secondly, data pipeline frameworks including their benefits are showcased. Finally, evolution in the context of data pipelines is defined.

2.1. Data Pipelines

Data pipelines are used for a plethora of applications and domains such as bioinformatics [2, 3], manufacturing [4] and cybersecurity [5]. Broadly speaking, a data pipeline consists of three components: data source(s), operator(s) and data sink(s). Figure 1 (a) shows such a basic pipeline. Biswas et al. empirically studied the components and stages of 71 data science (DS) pipelines [6].
Their findings suggest that DS pipelines consist of a pre-processing phase, a model building phase and a post-processing phase. They further extracted tasks and sub-tasks associated with these phases. Sub-tasks are atomic operators in the context of a pipeline. The pre-processing phase consists of the tasks data acquisition, data preparation and storage, which represent the typical components of data engineering, and also includes the data source(s). The model building phase is comprised of the tasks feature engineering, modeling, training, evaluation as well as prediction. These tasks correspond to basic machine learning (ML) and data mining (DM) functions. The tasks included in the post-processing layer are interpretation, communication and deployment as well as all data sinks. The empirical results show that the pre-processing and the model building phases appeared in 96% of the examined DS pipelines, while the post-processing phase appeared in only 52% of the pipelines.

Figure 1: (a) A basic data processing pipeline consisting of a data source, operators and a data sink. (b) Self-awareness: the system perceives a disruption at the data source level. This could be the structural or semantic change of incoming data. (c) Self-adaption: the system automatically adapts to the perceived disruption by swapping the first operator for a different one.

Pipelines can be linear, i.e. one data source, a chain of operators and finally one data sink. Psallidas et al. empirically studied 8M Jupyter notebooks (https://www.jupyter.org/) from GitHub (https://www.github.com/) [7]. Their results, which were produced by mining and analyzing the abstract syntax trees of all notebooks, suggest that 80% of the pipelines are linear. The structure of pipelines can be interpreted as a directed acyclic graph (DAG), allowing for pipelines which can include several data sources and sinks as well as branching operators, i.e. operators which have more than one input or output. A widespread example of such non-linear data processing are extract-transform-load (ETL) pipelines. They are used to extract data from multiple heterogeneous sources, transform them to use a common schema and then load them into a data sink such as a data warehouse (which may become a data source itself in the following steps) [8]. Even though pipelines can be created using only functions and modules by chaining their inputs and outputs together [7], pipeline frameworks allow users to generate, maintain and administrate complex pipelines.
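To make the idea of chaining functions and modules into a pipeline concrete, the following minimal Python sketch builds such a linear pipeline without any framework; the file names, column names and operator names are hypothetical and chosen purely for illustration.

# A minimal sketch (illustrative only): a linear pipeline is just a chain of
# functions from a data source to a data sink. It assumes a hypothetical CSV
# file "measurements.csv" with a "value" column.
import csv
import json


def extract(path: str) -> list[dict]:           # data source
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


def clean(rows: list[dict]) -> list[dict]:      # operator 1
    return [r for r in rows if r.get("value") not in (None, "")]


def enrich(rows: list[dict]) -> list[dict]:     # operator 2
    return [{**r, "value_squared": float(r["value"]) ** 2} for r in rows]


def load(rows: list[dict], path: str) -> None:  # data sink
    with open(path, "w") as f:
        json.dump(rows, f)


# Linear pipeline: source -> operator -> operator -> sink
load(enrich(clean(extract("measurements.csv"))), "measurements.json")

A branching operator, i.e. one with more than one input or output, would turn such a chain into a DAG as described above.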
2.2. Pipeline Frameworks

The number of existing pipeline frameworks is overwhelming. A popular collection of pipeline tools on GitHub (https://www.github.com/pditommaso/awesome-pipeline) includes 122 pipeline frameworks. At the same time there is almost no scientific attention on the abstract concepts of these systems. Some conceptual work was done by Maymounkov [9]. The author proposes an important distinction in order to categorize pipeline frameworks: he divides frameworks into task-driven and data-driven. Task-driven frameworks are agnostic about the actual data and operations that occur during a pipeline run. Their focus lies on managing inter- and intra-pipeline dependencies and scheduling large numbers of pipelines in parallel. Popular proponents of this category are Luigi (https://www.github.com/spotify/luigi) and Apache Airflow (https://www.airflow.apache.org/). Data-driven frameworks are – to a varying degree – aware of the data they process and the included operations. These frameworks put a focus on data (and operator) lineage, also called provenance, i.e. they allow the user to retrace the history of a data artifact by saving and curating metadata on all steps of the artifact-producing pipeline. A popular data-driven framework which logs various metadata during pipeline runs is Dagster (https://www.dagster.io/). Some frameworks in this category enable data provenance by using a version control system similar to Git (https://www.git-scm.com/). A prominent example of this is Pachyderm (https://www.pachyderm.com/).

Comparing pipeline frameworks is made difficult by a number of factors: the sheer amount of different frameworks, the lack of a theoretical basis for analysis, the overlapping functionality and the differing ways to achieve the same goal within two frameworks. A thorough search of related work and literature focusing on such a comparison only revealed one paper [10]. Even though the analysis was geared towards a specific system and its requirements, the general results and especially the comparison criteria are a helpful first step towards distinguishing pipeline frameworks. Some of these criteria and their possible values include:

• Type: business, science, big data
• Model: script-based, event-based, adaptive, declarative and procedural
• Separation of concerns: asks whether or not high-level pipeline definitions can be separated from low-level data and operator implementations (see the sketch after this list)
• Language: general purpose language (GPL), domain specific language (DSL)
• Pipeline programming: text-based, graphical, visual
• Reusability: asks whether or not a framework provides tools for reusing existing pipeline definitions as well as individual components of a previously defined pipeline
• Containerization: asks if pipeline components, whole pipelines and the pipeline framework itself can be deployed in a container
• Monitoring: asks whether or not the framework allows for runtime observation of the system or if it grants logging capabilities
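As an illustration of the separation of concerns criterion, the following sketch (a hypothetical mini-API, not taken from any of the frameworks named above) keeps the high-level pipeline definition apart from the low-level operator implementations, so that either can change independently.

# Illustrative sketch of the "separation of concerns" criterion (hypothetical
# API): the high-level pipeline definition only names steps and their order,
# while low-level operator implementations are registered separately and can
# be swapped without touching the definition.
from typing import Callable

OPERATORS: dict[str, Callable] = {}


def register(name: str):
    def wrap(fn: Callable) -> Callable:
        OPERATORS[name] = fn
        return fn
    return wrap


@register("clean")
def drop_empty(rows):
    return [r for r in rows if r]


@register("enrich")
def add_field_count(rows):
    return [{**r, "n_fields": len(r)} for r in rows]


# High-level, declarative pipeline definition: just step names.
PIPELINE = ["clean", "enrich"]


def run(pipeline, data):
    for step in pipeline:
        data = OPERATORS[step](data)
    return data


print(run(PIPELINE, [{"a": 1}, {}]))  # [{'a': 1, 'n_fields': 1}]

A framework fulfilling this criterion lets the declarative definition survive even when individual operator implementations are replaced.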
Some of these results are referenced in Section 3. In Section 4 these basic criteria are extended with a special focus on evolution capabilities. The particularities resulting from evolution will be presented in more detail in the next subsection.

2.3. Pipeline Evolution

Evolution means change over time. In the realm of computer science, change can mean a lot of different things. The emergence and widespread adoption of a new data format (such as JSON, https://www.json.org) or programming model (such as MapReduce [11]) are examples of this. This type of evolution is often gradual and influenced by many different factors. In the context of data pipelines and corresponding frameworks, evolution can happen over different time frames, ranging from gradual to sudden. The main evolution factors are so-called disruptors, which can affect all components and their interactions with each other. The changes triggered by disruptors are diverse, but can be broadly categorized into data, operator and environment disruptors.

The structure and semantics of data might change, affecting data sources and sinks as well as data artifacts created within the pipeline, e.g. interim results. Structural changes in data might occur over time due to altered data producers or operators. Semantic changes in data can emerge from technical, legislative but also societal reasons.

Operator functionality might also experience evolution, e.g. after a software update, resulting in different APIs or a changed set of available (hyper)parameters. Another form of change in this context is choosing a different operator for a specific task which accepts the same input as the old one but produces a different output, e.g. a different data structure. This leads to the need to adapt the pipeline to fit this new operator.

Also, the environment in which the pipeline is run can change over time. For example, the hardware could change, resulting in more processing power or more cluster nodes becoming available. Possible adaptions to such change include increasing the number of pipelines running in parallel or utilizing bigger batch sizes in order to increase efficiency.

3. Pipeline Framework with Evolution Capabilities

In this section a pipeline framework with evolution capabilities is envisioned and discussed. Figure 2, based on a figure from [12], shows a graphical representation of the proposed framework. The outside of the figure is made up of the environment frame including goals and contracts as well as metadata and statistics. These elements represent the available resources, user objectives and metadata which the system gathered, stored and aggregated throughout its lifecycle. Within this frame there are essentially five columns. They represent (from left to right) data sources, operators and data sinks. The arrows connect the individual components and show two pipelines, each consisting of a data source, three operators and a data sink. Evolutionary change can happen at several points during a pipeline's lifecycle. In Figure 2 these disruption points are shown as red flashes. Structure and semantics of data might change at the data sources as well as within the pipeline. Evolution can also affect the operators and the environment in which the pipelines are run. In any case, an ideal pipeline framework could automatically adapt to these changes.

Figure 2: Pipeline framework and its components. Evolution can happen in the form of structural and semantic changes to the data during loading (1) and through operator processing (2) as well as to operators (3), e.g. after a software update. The environment, i.e. hardware, scaling, etc., might also change over time (4). Based on a figure from [12].
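To relate the disruptor categories from Section 2.3 to the disruption points (1) to (4) in Figure 2, the following sketch models them as simple data objects; the names and fields are assumptions made for illustration and do not stem from an existing framework.

# A rough, illustrative model of the disruption points sketched in Figure 2:
# disruptors are categorized into data, operator and environment changes and
# carry enough context for later self-awareness and self-adaption steps.
from dataclasses import dataclass
from enum import Enum, auto


class DisruptorKind(Enum):
    DATA_AT_SOURCE = auto()    # (1) structural/semantic change during loading
    DATA_IN_PIPELINE = auto()  # (2) change introduced by operator processing
    OPERATOR = auto()          # (3) e.g. changed API after a software update
    ENVIRONMENT = auto()       # (4) e.g. changed hardware or cluster size


@dataclass
class Disruption:
    kind: DisruptorKind
    component: str             # affected source, operator or resource
    detail: str                # human-readable description of the change


example = Disruption(DisruptorKind.OPERATOR, "tokenizer", "parameter 'lang' removed in v2.0")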
Concerning adaptability, an important distinction needs to be made. Generally speaking, it is possible to build pipelines in existing frameworks that are very flexible. One class of systems which are very flexible are adaptive workflows, first presented by van der Aalst et al. [13]. Besides being mainly task-driven, these systems adapt themselves based on strict, predefined rules. An example of such a system is AdaptFlow, presented in [14]. Given a treatment plan in the medical context, AdaptFlow can notice logical errors and choose a different path in the predefined workflow. This flexibility is completely dependent on and bounded by the treatment workflow. Generally speaking, the space of possible alterations, given such a flexible system, is significantly smaller than the space envisioned in the present work. This stems from the fact that a pipeline framework with evolution capabilities dynamically creates and alters this search space, in order to find an optimal solution, at different times during the system's lifecycle. This demonstrates that flexibility is not the same as adaptability. It is also possible to build meta pipelines especially for monitoring changes as well as adapting to these changes. Even though this is currently the most practical solution for achieving evolution capabilities in existing frameworks, this approach does not represent real evolution capabilities as they were defined in the previous sections. In any case, before adapting to evolution, the underlying changes need to be noticed and recognized.

3.1. Self-awareness

The first step in dealing with evolution is to be aware of change. Figure 1 (b) shows this step in dealing with evolution. Data-driven frameworks are usually more aware of change than task-driven ones since they provide more monitoring capabilities and allow for concepts such as reproducibility and provenance, which are closely related to evolution. A tool for inspecting pipelines which runs on existing Python code is mlinspect [15, 16]. It extracts the DAG structure of a pipeline and helps the user to identify problems and bugs. For example, it can help to identify a skewed data distribution which would lead to unfair [17] results. ArgusEyes [18] is a tool for inspecting classification pipelines which builds upon mlinspect. It enables the user to check whether best practices are applied while also providing various metadata to analyze pipelines. Even though these tools are not intended to track the evolution of pipelines and their components, but rather focus on helping practitioners with a specific issue, the underlying architecture can serve as useful guidance for the development of a pipeline framework with evolution capabilities. Another important aspect is to track data changes across pipeline steps. The authors of [19] present three measuring approaches that are utilized in order to deal with bias.

Monitoring capabilities, gathering and storing metadata as well as calculating and providing statistics on these findings are critical functionalities towards evolution capabilities in pipeline frameworks. They are necessary in all dimensions and are the basis for self-awareness. Tools like mlinspect and ArgusEyes, but also existing data-driven frameworks like Dagster, can be a starting point towards achieving such functionality. Perceiving change in operator results or contracts, leading to automatic operator swapping or parameter changes, is also fundamentally important. One project that can be of help in this regard is IBM Lale [20], which automatically creates optimal pipelines based on scikit-learn (https://www.scikit-learn.org/stable/) functions. Once the system is aware of change, it needs to adapt to the new circumstances.
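Before turning to self-adaption, the self-awareness step for the data dimension can be illustrated with the following minimal sketch, which compares the schema of newly arriving records against a stored snapshot and reports structural changes; it is a simplified assumption of how such a check could look, not the mechanism of mlinspect, ArgusEyes or Lale.

# Minimal, hypothetical sketch of noticing structural changes in data:
# infer a flat schema from sample records and diff it against a stored one.
def infer_schema(rows: list[dict]) -> dict[str, str]:
    schema: dict[str, str] = {}
    for row in rows:
        for key, value in row.items():
            schema.setdefault(key, type(value).__name__)
    return schema


def structural_changes(previous: dict[str, str], current: dict[str, str]) -> list[str]:
    changes = []
    for field in previous.keys() - current.keys():
        changes.append(f"field removed: {field}")
    for field in current.keys() - previous.keys():
        changes.append(f"field added: {field}")
    for field in previous.keys() & current.keys():
        if previous[field] != current[field]:
            changes.append(f"type changed: {field} ({previous[field]} -> {current[field]})")
    return changes


old = infer_schema([{"id": 1, "value": 3.2}])
new = infer_schema([{"id": "a-17", "value": 3.2, "unit": "mm"}])
print(structural_changes(old, new))  # ['field added: unit', 'type changed: id (int -> str)']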
3.2. Self-adaption

Automatically acting upon change can only be done with respect to a goal. This goal could be as simple as ensuring functionality and as complex as automatically optimizing the performance and accuracy of several big data pipelines running in parallel given certain hardware. Figure 1 (c) shows the adaption step, after a disruption has been perceived by the self-awareness capabilities. In this context it is decisive to formulate a goal, including a fitting representation, which the pipeline framework can use to evaluate decisions. The dimensions for pipeline and environment shown in the next section both contain the evolution requirement to provide an interface for goals. This reveals a potential conflict: a pipeline with the goal to achieve the best possible accuracy for a ML task might want to simulate a lot of different pipelines to find the best one and to achieve this goal. At the same time, simulations and tests might cost a lot of computational resources, which could stand in contrast to the environment dimension's goal to provide a certain performance to all pipelines. A pipeline framework with evolution capabilities needs to have dynamic functionality to deal with these kinds of conflicts.

The vision of self-adapting systems is not unique to the present work. The authors of [12] present four generations in data engineering for data science, ranging from simple data pre-processing to fully automated data curation. In [21] the authors envision a framework for multi-model databases which is self-adapting with regard to design and maintenance. Similar to the insight gained from tools like mlinspect and ArgusEyes in the context of evolution awareness, other self-adaptive systems can help to understand the underlying components and their interplay. For example, Hillenbrand et al. propose a system which automatically chooses an optimal data migration strategy given some constraints like service-level agreements [22]. Pachyderm, which runs natively in Kubernetes (https://kubernetes.io/), has a built-in system for distributed computing and scaling, which is very simple and should be considered in the context of the environment dimension. The empirical results of [10] showed a complete lack of a simulation environment in all studied frameworks. Simulation and the use of synthetic data [23] are important components which need to be incorporated, especially for the pipeline and environment dimensions, since their self-adaption strategies need a search space to optimize towards a goal.

4. Conceptual Requirements Model

As described in the previous sections, there is no framework with comprehensive evolution capabilities yet. This emphasizes the need for a requirements model encompassing important components and their interplay as well as system functionalities. The model presented in this section is conceptual, i.e. it was not derived through a structured method from the field of requirements engineering [24]. It rather evolved from technical talks with experienced colleagues and a rough analysis and comparison of existing pipeline frameworks. It can serve as the inception step for a structured requirements gathering process and furthermore helps with the testing of existing frameworks for their evolution capabilities.

The requirements are structured into two categories, self-awareness and self-adaption, as well as four dimensions:

• Data: data sources and sinks, structure and semantics of data
• Operator: modules and functions and their respective inputs and outputs
• Pipeline: creation and administration of pipelines
• Environment: available hardware and scheduling, scaling and orchestration of pipelines

Table 1 presents an overview of the requirements. The following sections describe the requirements listed in Table 1 in detail.

Table 1: Conceptual requirements and their corresponding dimensions, categorized into self-awareness and self-adaption

Category | Requirement | Dimension
Self-awareness | Collecting and storing metadata | all
Self-awareness | Versioning of metadata | all
Self-awareness | Versioning of component artifacts | all
Self-awareness | Versioning of configuration files | all
Self-awareness | Providing provenance capabilities | all
Self-awareness | Analyzing metadata and creating statistics | all
Self-awareness | Noticing structural changes | data
Self-awareness | Noticing semantic changes | data
Self-awareness | Noticing changes to contracts, APIs and interfaces | operator
Self-awareness | Noticing changes to available computing resources | environment
Self-awareness | Monitoring processing results and performance | operator, pipeline
Self-awareness | Providing an interface for goal definition | operator, pipeline, environment
Self-adaption | Initiating an adaption, based on the violation of a goal | operator, pipeline, environment
Self-adaption | Automatically swapping operators | operator, pipeline
Self-adaption | Automatically changing pipeline structure and components | pipeline
Self-adaption | Automatically optimizing resource distribution and scheduling | environment
Self-adaption | Providing a simulation space to test potential alterations | pipeline, environment

4.1. Self-awareness Requirements

Self-awareness means being aware of change. This change is always relative with respect to some previous state, i.e. in order to be self-aware, a system needs to store at least one previous state for comparison with the current state. Therefore, collecting and storing metadata over all dimensions is an integral requirement for a self-aware pipeline framework. Even though comparing two system states is sufficient to notice change, in many cases it would be beneficial to have a history of system states. Creating a versioned history of metadata allows for more complex concepts and techniques to be applied, e.g. extracting (meta)data distributions or using window-based anomaly detection to notice change. Versioning of metadata, component artifacts and configuration files would enable the self-aware system to notice different forms of change and to distinguish them. For example, it could differentiate between an abrupt change to the interface of an operator after a software update and the gradual decrease of data quality based on the wrong composition of pre-processing operators. Collecting and storing such data is important, but so is managing and curating it, which leads to the need for provenance capabilities over all dimensions. Also, providing tools to analyze metadata, for example to aggregate historic data into statistical values, is an important requirement. Aggregated data enables a different perspective on change.
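The following sketch illustrates the versioning idea with assumed names: every pipeline run appends a metadata snapshot, and change is detected against a window of previous states rather than only the last one, here with a simple threshold on a rolling statistic as a stand-in for window-based anomaly detection.

# Sketch of the metadata versioning requirement (illustrative assumptions):
# each run records a snapshot; drift is flagged when the latest value of a
# metric deviates strongly from the recent history.
import statistics
import time
from dataclasses import dataclass, field


@dataclass
class MetadataStore:
    history: list[dict] = field(default_factory=list)

    def record(self, **metrics) -> None:
        self.history.append({"ts": time.time(), **metrics})

    def drifted(self, metric: str, window: int = 10, tolerance: float = 3.0) -> bool:
        values = [h[metric] for h in self.history[-window:] if metric in h]
        if len(values) < window:
            return False  # not enough history yet
        mean, stdev = statistics.mean(values[:-1]), statistics.pstdev(values[:-1])
        return stdev > 0 and abs(values[-1] - mean) > tolerance * stdev


store = MetadataStore()
for rows in (1000, 1010, 995, 1005, 998, 1002, 1001, 997, 1003, 20):
    store.record(row_count=rows)
print(store.drifted("row_count"))  # True: the last run processed far fewer rows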
When looking at the data dimension, the two fundamental requirements a pipeline framework with evolution capabilities has to fulfill are noticing changes to the structure of data and noticing changes to the semantics of data. These disruptors almost always trigger an adaption and therefore, being aware of and dealing with them is of the utmost importance. The same can be said about the operator dimension. A changing operator interface will most certainly result in an erroneous pipeline. Hence, noticing such change is a critical requirement. Changes to the environment do not necessarily result in non-functioning pipelines, but rather influence the performance. Still, noticing changes to the environment, e.g. available hardware, is important to achieve framework performance goals, such as optimal utilization of available resources. A similar approach needs to be taken for operator and pipeline goals. Processing results and performance of individual operators as well as pipelines need to be monitored, in order to compare these results to predefined goals. Diverse metrics for goal definition can be imagined, ranging from speed and throughput performance to data quality and model accuracy. This leads to the framework requiring an interface for goal definition. This interface allows the user to specify objectives with respect to individual operators, pipelines and the whole framework. At the same time, this goal definition is used for comparison with the current as well as historic states of the system, to notice change and possibly initiate an adaption.
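A goal definition interface could, for example, look like the following sketch; the class, scope and metric names are illustrative assumptions, and the point is merely that declared objectives are matched against monitored metrics to decide whether an adaption should be initiated.

# Hypothetical sketch of a goal-definition interface: users declare objectives
# per operator, pipeline or framework, and monitored metrics are checked
# against them.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Goal:
    scope: str                       # e.g. "operator:train", "pipeline:etl", "framework"
    metric: str                      # e.g. "throughput_rows_s", "accuracy", "null_ratio"
    satisfied: Callable[[float], bool]


goals = [
    Goal("pipeline:etl", "throughput_rows_s", lambda v: v >= 500.0),
    Goal("operator:train", "accuracy", lambda v: v >= 0.9),
]


def violations(observed: dict[tuple[str, str], float]) -> list[Goal]:
    """Return goals whose observed metric violates the objective."""
    return [g for g in goals if not g.satisfied(observed.get((g.scope, g.metric), float("nan")))]


observed = {("pipeline:etl", "throughput_rows_s"): 342.0, ("operator:train", "accuracy"): 0.93}
print([g.metric for g in violations(observed)])  # ['throughput_rows_s'] -> initiate an adaption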
4.2. Self-adaption Requirements

Once the system is aware of a significant change, it triggers an adaption. Based on the dimension in which the adaption should occur, i.e. operator, pipeline or environment, the prerequisites for all possible adaption operations are checked. This first step towards an adaption is an important requirement for a pipeline framework with evolution capabilities, since it creates a search space for possible adjustments. The operations which make up these adjustments represent crucial requirements as well. They include the automatic swapping of an operator, the automatic change of pipeline structure and/or components, as well as the automatic optimization of resource distribution and pipeline scheduling. The search space of all possible operations is transformed into a simulation space, in which possible alterations are tested. This space connects the user's goal definitions with the self-awareness metadata, while at the same time providing simulation and optimization capabilities, in order to find an optimal adaption.
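The interplay of search space, simulation space and goal can be illustrated with the following simplified sketch, in which candidate operators for one pipeline step are scored on sample data before the best one is swapped in; all names and the scoring goal are assumptions made for illustration.

# Simplified sketch of the simulation-space idea: evaluate candidate adaptions
# (here, alternative operators) on held-back sample data against a goal.
from typing import Callable

Rows = list[dict]


def parse_v1(rows: Rows) -> Rows:   # current operator, broken by a schema change
    return [{"value": float(r["val"])} for r in rows if "val" in r]


def parse_v2(rows: Rows) -> Rows:   # candidate replacement for the new schema
    return [{"value": float(r["measurement"])} for r in rows if "measurement" in r]


def simulate(candidates: list[Callable[[Rows], Rows]], sample: Rows,
             goal: Callable[[Rows], float]) -> Callable[[Rows], Rows]:
    """Score each candidate on the sample and return the best one."""
    return max(candidates, key=lambda op: goal(op(sample)))


sample = [{"measurement": "3.5"}, {"measurement": "4.1"}]
best = simulate([parse_v1, parse_v2], sample, goal=lambda out: len(out))  # goal: keep as many rows as possible
print(best.__name__)  # parse_v2 is selected and could now be swapped in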
5. Conclusion and Future Work

The present work defined and showcased data pipelines and their corresponding frameworks. Evolution in the context of these systems was introduced and a conceptual requirements model was proposed, comprised of all components of such systems, categorized by self-awareness and self-adaption and structured into four dimensions. By envisioning a system which fulfills these requirements, a first step was made towards a framework which would need less maintenance based on its self-awareness and self-adaption, i.e. evolution capabilities. This type of framework could be a substantial contribution for scientists and practitioners alike.

The paper is concluded with a set of steps that need to be taken by the community towards achieving evolution capabilities in data pipelines. First of all, a proper requirements model using concepts and methods of requirements engineering must be constructed. This must include a structured requirements gathering process comprised of talking to stakeholders who would benefit from the proposed system, as well as an in-depth analysis of existing concepts and techniques with regard to self-awareness and self-adaption. As a result, this step would produce a system specification encompassing requirements, including non-functional ones, use cases and a basic software architecture, as well as formal definitions of new terms. In the next step, these results need to be compared to existing frameworks and tools, in order to find working solutions, but also gaps. All dimensions must be thoroughly analyzed and the system specification must be iteratively adjusted. During this phase, software engineering and architecture principles which support evolution capabilities must be derived from existing systems and incorporated into the specification. The secondary goal of this step is to either find a framework which provides a good basis for evolution capabilities – at least with respect to a certain dimension – or to discover the need to conceptualize and implement the missing components from scratch. In any case, the next step would be the creation of a prototype. As a final step, this prototype must be evaluated and validated, given the system specification.

Acknowledgments

The author wants to thank Meike Klettke, Stefanie Scherzinger, and Uta Störl for many prolific discussions as well as helpful suggestions with regard to evolution capabilities in data pipelines, without which the present work would not have been possible.

References

[1] J. Koskinen, H. Lahtonen, T. Tilus, Software Maintenance Cost Estimation and Modernization Support, ELTIS-project, Technical Report, University of Jyväskylä, Information Technology Research Institute, 2003.
[2] B. Fjukstad, L. A. Bongo, A Review of Scalable Bioinformatics Pipelines, Data Sci. Eng. 2 (2017).
[3] J. A. Novella, P. E. Khoonsari, S. Herman, D. Whitenack, M. Capuccini, J. Burman, K. Kultima, O. Spjuth, Container-based Bioinformatics with Pachyderm, Bioinform. 35 (2019).
[4] A. Ismail, H. Truong, W. Kastner, Manufacturing Process Data Analysis Pipelines: A Requirements Analysis and Survey, J. Big Data 6 (2019).
[5] M. M. Koushki, I. Y. Abualhaol, A. D. Raju, Y. Zhou, R. S. Giagone, S. Huang, On Building Machine Learning Pipelines for Android Malware Detection: a Procedural Survey of Practices, Challenges and Opportunities, Cybersecur. 5 (2022).
[6] S. Biswas, M. Wardat, H. Rajan, The Art and Practice of Data Science Pipelines: A Comprehensive Study of Data Science Pipelines In Theory, In-The-Small, and In-The-Large, in: ICSE, ACM, 2022.
[7] F. Psallidas, Y. Zhu, B. Karlas, J. Henkel, M. Interlandi, S. Krishnan, B. Kroth, K. V. Emani, W. Wu, C. Zhang, M. Weimer, A. Floratou, C. Curino, K. Karanasos, Data Science Through the Looking Glass: Analysis of Millions of GitHub Notebooks and ML.NET Pipelines, SIGMOD Rec. 51 (2022).
[8] P. Vassiliadis, A Survey of Extract-Transform-Load Technology, in: D. Taniar, L. Chen (Eds.), Integrations of Data Warehousing, Data Mining and Database Technologies - Innovative Approaches, Information Science Reference, 2011.
[9] P. Maymounkov, Koji: Automating Pipelines with Mixed-semantics Data Sources, CoRR abs/1901.01908 (2019). arXiv:1901.01908.
[10] M. Matskin, S. Tahmasebi, A. Layegh, A. H. Payberah, A. Thomas, N. Nikolov, D. Roman, A Survey of Big Data Pipeline Orchestration Tools from the Perspective of the DataCloud Project, in: DAMDID/RCDL, volume 3036 of CEUR Workshop Proceedings, CEUR-WS.org, 2021.
[11] J. Dean, S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, in: OSDI, USENIX Association, 2004.
[12] M. Klettke, U. Störl, Four Generations in Data Engineering for Data Science, Datenbank-Spektrum 22 (2022).
[13] W. M. P. van der Aalst, T. Basten, H. M. W. Verbeek, P. A. C. Verkoulen, M. Voorhoeve, Adaptive Workflow: On the Interplay between Flexibility and Support, in: Proceedings of the 1st International Conference on Enterprise Information Systems, Setubal, Portugal, 27-30 March 1999, ICEIS Secretariat, Escola Superior de Tecnologia de Setúbal, Portugal, 1999.
[14] U. Greiner, J. Ramsch, B. Heller, M. Löffler, R. Müller, E. Rahm, Adaptive Guideline-based Treatment Workflows with AdaptFlow, in: Computer-based Support for Clinical Guidelines and Protocols - Proceedings of the Symposium on Computerized Guidelines and Protocols, CGP 2004, Prague, Czech Republic, 12-14 April, 2004, volume 101 of Studies in Health Technology and Informatics, IOS Press, 2004.
[15] S. Grafberger, J. Stoyanovich, S. Schelter, Lightweight Inspection of Data Preprocessing in Native Machine Learning Pipelines, in: CIDR, 2021.
[16] S. Grafberger, S. Guha, J. Stoyanovich, S. Schelter, MLINSPECT: A Data Distribution Debugger for Machine Learning Pipelines, in: SIGMOD, ACM, 2021.
[17] J. Stoyanovich, B. Howe, H. V. Jagadish, Responsible Data Management, Proc. VLDB Endow. 13 (2020).
[18] S. Schelter, S. Grafberger, S. Guha, O. Sprangers, B. Karlas, C. Zhang, Screening Native Machine Learning Pipelines with ArgusEyes, in: CIDR, 2022.
[19] M. Klettke, A. Lutsch, U. Störl, Kurz erklärt: Measuring Data Changes in Data Engineering and their Impact on Explainability and Algorithm Fairness, Datenbank-Spektrum 21 (2021).
[20] G. Baudart, M. Hirzel, K. Kate, P. Ram, A. Shinnar, J. Tsay, Pipeline Combinators for Gradual AutoML, in: NeurIPS, 2021.
[21] I. Holubová, P. Koupil, J. Lu, Self-adapting Design and Maintenance of Multi-Model Databases, in: B. C. Desai, P. Z. Revesz (Eds.), IDEAS, ACM, 2022.
[22] A. Hillenbrand, U. Störl, S. Nabiyev, M. Klettke, Self-adapting Data Migration in the Context of Schema Evolution in NoSQL Databases, Distributed Parallel Databases 40 (2022).
[23] M. Abufadda, K. Mansour, A Survey of Synthetic Data Generation for Machine Learning, in: ACIT, IEEE, 2021.
[24] S. Wagner, D. M. Fernández, M. Felderer, A. Vetrò, M. Kalinowski, R. J. Wieringa, D. Pfahl, T. Conte, M. Christiansson, D. Greer, C. Lassenius, T. Männistö, M. Nayebi, et al., Status Quo in Requirements Engineering: A Theory and a Global Family of Surveys, in: Software Engineering, volume P-310 of LNI, Gesellschaft für Informatik e.V., 2021.