
1613-0073

Capabilities in Data Pipelines

Kevin Kramer

kevin.kramer@fernuni-hagen.de

University of Hagen, Universitätsstr. 1, 58097 Hagen, Germany

Evolutionary change over time in the context of data pipelines is certain, especially with regard to the structure and semantics of data as well as to the pipeline operators. Dealing with these changes, i.e. providing long-term maintenance, is costly. The present work explores the need for evolution capabilities within pipeline frameworks. In this context dealing with evolution is defined as a two-step process consisting of self-awareness and self-adaption. Furthermore, a conceptual requirements model is provided, which encompasses criteria for self-awareness and self-adaption as well as covering the dimensions data, operator, pipeline and environment. A lack of said capabilities in existing frameworks exposes a major gap. Filling this gap will be a significant contribution for practitioners and scientists alike. The present work envisions and lays the foundation for a framework which can handle evolutionary change.

data pipeline, data evolution, operator evolution, data pipeline framework

1. Introduction

The last decade was characterized by ever increasing amounts of data. This also led to new technical demands in the context of data storage, transfer and analysis. In order to cope with these demands, complex new systems emerged, which in turn require maintenance. Providing this maintenance is costly, and even though the systems themselves might run as expected, changes over time, e.g. to the structure and semantics of data, inevitably induce a need to adjust a system's configuration to restore functionality. One estimate suggests that 50-70% of the total cost of a long-running software system can be attributed to maintenance [1]. Data pipelines are an example of such systems, requiring maintenance whenever change, i.e. evolution, happens. Adding evolution capabilities to data pipelines, and thereby reducing maintenance cost and human involvement, could be a big contribution for scientists and practitioners alike. The current work takes the first step in this direction by collecting requirements needed for such a system and by envisioning a data pipeline framework which fulfills these requirements.

The remainder of this paper is structured as follows. Section 2 describes the general concepts and challenges of evolution in data pipelines. Important terminology is defined and related work is shown in this section as well. In Section 3 a pipeline framework with evolution capabilities is envisioned. Section 4 presents a conceptual requirements model and Section 5 concludes the paper.

2. Evolution in Data Pipelines

Data pipelines are used in domains such as bioinformatics [2, 3], manufacturing [4] and cybersecurity [5]. Broadly speaking, a data pipeline consists of three components: data source(s), operator(s) and data sink(s). Figure 1 (a) shows such a basic pipeline.
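The three components named above can be illustrated with a minimal sketch of a linear pipeline. The function and component names (`run_linear_pipeline`, `read_measurements`, `drop_missing`, `to_percent`) are hypothetical and chosen for illustration only; they are not taken from any existing pipeline framework.

```python
from typing import Callable, Iterable, List

Record = dict
Source = Callable[[], Iterable[Record]]
Operator = Callable[[Iterable[Record]], Iterable[Record]]
Sink = Callable[[Iterable[Record]], None]


def run_linear_pipeline(source: Source, operators: List[Operator], sink: Sink) -> None:
    """Pull data from the source, apply each operator in order, pass the result to the sink."""
    data = source()
    for operator in operators:
        data = operator(data)
    sink(data)


# Hypothetical components, for illustration only.
def read_measurements() -> Iterable[Record]:
    return [{"sensor": "a", "value": 1.5}, {"sensor": "b", "value": None}]


def drop_missing(records: Iterable[Record]) -> Iterable[Record]:
    return [r for r in records if r["value"] is not None]


def to_percent(records: Iterable[Record]) -> Iterable[Record]:
    return [{**r, "value": r["value"] * 100} for r in records]


# A plain list acts as the data sink in this sketch.
collected: List[Record] = []
run_linear_pipeline(read_measurements, [drop_missing, to_percent], collected.extend)
```

In this sketch the operators are chained by passing each output to the next input, which corresponds to the simplest way of building a pipeline without a dedicated framework.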

Biswas et al. empirically studied the components and stages of 71 data science (DS) pipelines [6]. Their findings suggest that DS pipelines consist of a pre-processing phase, a model building phase and a post-processing phase. They further extracted tasks and sub-tasks associated with these phases. Sub-tasks are atomic operators in the context of a pipeline. The pre-processing phase consists of the tasks data acquisition, data preparation and storage, which represent the typical components of data engineering, and also includes the data source(s). The model building phase comprises the tasks feature engineering, modeling, training, evaluation and prediction. These tasks correspond to basic machine learning (ML) and data mining (DM) functions. The tasks included in the post-processing layer are interpretation, communication and deployment as well as all data sinks. The empirical results show that the pre-processing and the model building phases appeared in 96% of examined DS pipelines, whereas the post-processing phase only appeared in 52% of pipelines.

Pipelines can be linear, i.e. consist of one data source, a chain of operators and finally one data sink. Psallidas et al. empirically studied 8M Jupyter notebooks (https://www.jupyter.org/) from GitHub (https://www.github.com/) [7]. Their results, which were produced by mining and analyzing the abstract syntax trees of all notebooks, suggest that 80% of the pipelines are linear. The structure of pipelines can be interpreted as a directed acyclic graph (DAG), allowing for pipelines which include several data sources and sinks as well as branching operators, i.e. operators which have more than one input or output. Extract-transform-load (ETL) pipelines are a widespread example of such non-linear data processing. They are used to extract data from multiple heterogeneous sources, transform them to use a common schema and then load them into a data sink such as a data warehouse (which may become a data source itself in the following steps) [8]. Even though pipelines can be created using only functions and modules by chaining their inputs and outputs together [7], pipeline frameworks allow users to generate, maintain and administrate complex pipelines.

CEUR Workshop Proceedings (ceur-ws.org). © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License.

2.2. Pipeline Frameworks

The number of existing pipeline frameworks is overwhelming. A popular collection of pipeline tools at GitHub includes 122 pipeline frameworks. At the same time there is almost no scientific attention on the abstract concepts of these systems. Some conceptual work was done by Maymounkov [9]. The author proposes an important distinction in order to categorize pipeline frameworks. He divides frameworks into task-driven and data-driven. Task-driven frameworks are agnostic about actual data and operations that occur during a pipeline run. Their focus lies on managing inter- and intra-pipeline dependencies and scheduling large numbers of pipelines in parallel. Popular proponents of this category are Luigi and Apache Airflow. Data-driven frameworks are, to a varying degree, aware of the data they process and the included operations. These frameworks put a focus on data (and operator) lineage, also called provenance, i.e. they allow the user to retrace the history of a data artifact by saving and curating metadata on all steps of the artifact-producing pipeline. A popular data-driven framework which logs various metadata during pipeline runs is Dagster. Some frameworks in this category enable data provenance by using a version control system similar to Git. A prominent example of this is Pachyderm (https://www.pachyderm.com/).

Comparing pipeline frameworks is made difficult by a number of factors: the sheer number of different frameworks, the lack of a theoretical basis for analysis, the overlapping functionality and the differing ways to achieve the same goal within two frameworks. A thorough search of related work and literature focusing on such a comparison revealed only one paper [10]. Even though the analysis was geared towards a specific system and its requirements, the general results and especially the comparison criteria are a helpful first step towards distinguishing pipeline frameworks. Some of these criteria and their possible values include:

• Type: business, science, big data
• Model: script-based, event-based, adaptive, declarative and procedural
• Separation of concerns: asks whether or not high-level pipeline definitions can be separated from low-level data and operator implementations
• Language: general purpose language (GPL), domain specific language (DSL)
• Pipeline programming: text-based, graphical, visual
• Reusability: asks whether or not a framework provides tools for reusing existing pipeline definitions as well as individual components of a previously defined pipeline
• Containerization: asks if pipeline components, whole pipelines and the pipeline framework itself can be deployed in a container
• Monitoring: asks whether or not the framework allows for runtime observation of the system or provides logging capabilities

Some of these results are referenced in Section 3. In Section 4 these basic criteria are extended with a special focus on evolution capabilities. The particularities resulting from evolution will be presented in more detail in the next subsection.

2.3. Pipeline Evolution

Evolution means change over time. In the realm of computer science change can mean a lot of different things. The emergence and widespread adoption of a new data format (such as JSON, https://www.json.org) or programming model (such as MapReduce [11]) are examples of this. This type of evolution is often gradual and influenced by many different factors. In the context of data pipelines and corresponding frameworks evolution can happen over different time frames, ranging from gradual to sudden. The main evolution factors are so-called disruptors, which can affect all components and their interactions with each other. The changes triggered by disruptors are diverse, but can be broadly categorized into data, operator and environment disruptors.

The structure and semantics of data might change, affecting data sources and sinks as well as data artifacts created within the pipeline, e.g. interim results. Structural changes in data might occur over time due to altered data producers or operators. Semantic changes in data can emerge from technical, legislative but also societal reasons.

Operator functionality might also experience evolution, e.g. after a software update, resulting in different APIs or a changed set of available (hyper)parameters. Another form of change in this context is choosing a different operator for a specific task which accepts the same input as the old one but produces a different output, e.g. a different data structure. This leads to the need to adapt the pipeline to fit this new operator.

Also, the environment in which the pipeline is run can change over time. For example, the hardware could change, resulting in more processing power or more cluster nodes becoming available. Possible adaptions to such change include increasing the number of pipelines running in parallel or utilizing bigger batch sizes in order to increase efficiency.

3. Pipeline Framework with Evolution Capabilities

In this section a pipeline framework with evolution capabilities is envisioned and discussed. Figure 2, based on a figure from [12], shows a graphical representation of the proposed framework. The outside of the figure is made up of the environment frame including goals and contracts as well as metadata and statistics. These elements represent the available resources, user objectives and metadata, which the system gathered, stored and aggregated throughout its lifecycle. Within this frame there are essentially five columns. They represent (from left to right) data sources, three columns of operators and data sinks. The arrows connect the individual components and show two pipelines, each consisting of a data source, three operators and a data sink. Evolutionary change can happen at several points during a pipeline's lifecycle. In Figure 2 these disruption points are shown as red flashes.

Structure and semantics of data might change at the data sources as well as within the pipeline. Evolution can also affect the operators and the environment in which the pipelines are run. In any case, an ideal pipeline framework could automatically adapt to these changes.
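To make the notion of a structural data change more concrete, the following minimal sketch compares the field names and value types of two data batches. It assumes that data arrives as lists of flat dictionaries; the helper names `infer_schema` and `diff_schemas` as well as the example batches are hypothetical.

```python
from typing import Any, Dict, List


def infer_schema(records: List[Dict[str, Any]]) -> Dict[str, str]:
    """Map each field name to the sorted, '|'-joined type names observed for it."""
    observed: Dict[str, set] = {}
    for record in records:
        for field, value in record.items():
            observed.setdefault(field, set()).add(type(value).__name__)
    return {field: "|".join(sorted(types)) for field, types in observed.items()}


def diff_schemas(old: Dict[str, str], new: Dict[str, str]) -> Dict[str, List[str]]:
    """Report fields that were added, removed or changed their type between two schemas."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "retyped": sorted(f for f in old.keys() & new.keys() if old[f] != new[f]),
    }


# Hypothetical batches from the same data source at two points in time.
previous_batch = [{"id": 1, "temp": 21.5}]
current_batch = [{"id": "1", "temp_c": 21.5}]

changes = diff_schemas(infer_schema(previous_batch), infer_schema(current_batch))
```

Such a comparison only covers structural change. A semantic change, for example a field that keeps its name and type but switches its unit of measurement, would pass unnoticed and requires additional metadata.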

Concerning adaptability, an important distinction needs to be made. Generally speaking, it is possible to build pipelines in existing frameworks that are very flexible. One class of very flexible systems is adaptive workflows, first presented by van der Aalst et al. [13]. Besides being mainly task-driven, these systems adapt themselves based on strict, predefined rules. An example of such a system is AdaptFlow, presented in [14]. Given a treatment plan in the medical context, AdaptFlow can notice logical errors and choose a different path in the predefined workflow. This flexibility is completely dependent on and bounded by the treatment workflow. Generally speaking, the space of possible alterations, given such a flexible system, is significantly smaller than the space envisioned in the present work. This stems from the fact that a pipeline framework with evolution capabilities dynamically creates and alters this search space, in order to find an optimal solution, at different times during the system's lifecycle. This demonstrates that flexibility is not the same as adaptability. It is also possible to build meta pipelines especially for monitoring changes as well as adapting to these changes. Even though this is currently the most practical solution for achieving evolution capabilities in existing frameworks, this approach does not represent real evolution capabilities as they were defined in the previous sections. In any case, before adapting to evolution, the underlying changes need to be noticed and recognized.

3.1. Self-awareness

The first step in dealing with evolution is to be aware of change. Figure 1 (b) shows this step in dealing with evolution. Data-driven frameworks are usually more aware of change than task-driven ones since they provide more monitoring capabilities and allow for concepts such as reproducibility and provenance, which are closely related to evolution. A tool for inspecting pipelines which runs on existing Python code is mlinspect [15, 16]. It extracts the DAG structure of a pipeline and helps the user to identify problems and bugs. For example, it can help to identify a skewed data distribution which would lead to unfair [17] results. ArgusEyes [18] is a tool for inspecting classification pipelines which builds upon mlinspect. It enables the user to check whether best practices are applied while also providing various metadata to analyze pipelines. Even though these tools are not intended to track the evolution of pipelines and their components, but rather focus on helping practitioners with a specific issue, the underlying architecture can serve as useful guidance for the development of a pipeline framework with evolution capabilities. Another important aspect is to track data changes across pipeline steps. The authors of [19] present three measuring approaches that are utilized in order to deal with bias.

Monitoring capabilities, gathering and storing metadata as well as calculating and providing statistics on these findings are critical functionalities towards evolution capabilities in pipeline frameworks. They are necessary in all dimensions and are the basis for self-awareness. Tools like mlinspect and ArgusEyes, but also existing data-driven frameworks like Dagster, can be a starting point towards achieving such functionality. Perceiving change in operator results or contracts, leading to the automatic swapping of operators or a parameter change, is also fundamentally important. One project that can be of help in this regard is IBM Lale [20], which automatically creates optimal pipelines based on scikit-learn functions. Once the system is aware of change, it needs to adapt to the new circumstances.
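Perceiving a change in an operator's contract can be sketched with Python's standard `inspect` module. The sketch below assumes that operators are plain Python functions and that a snapshot of the interface is stored between pipeline runs; the helpers `interface_snapshot` and `interface_changes` and the two operator versions are hypothetical.

```python
import inspect
from typing import Callable, Dict, List


def interface_snapshot(operator: Callable) -> Dict[str, str]:
    """Record each parameter name with the repr of its default value, or '<required>'."""
    snapshot = {}
    for name, param in inspect.signature(operator).parameters.items():
        has_default = param.default is not inspect.Parameter.empty
        snapshot[name] = repr(param.default) if has_default else "<required>"
    return snapshot


def interface_changes(old: Dict[str, str], new: Dict[str, str]) -> List[str]:
    """List added, removed and changed parameters between two interface snapshots."""
    changes = []
    for name in sorted(old.keys() | new.keys()):
        if name not in new:
            changes.append(f"removed parameter: {name}")
        elif name not in old:
            changes.append(f"added parameter: {name}")
        elif old[name] != new[name]:
            changes.append(f"changed default: {name} {old[name]} -> {new[name]}")
    return changes


# Two hypothetical versions of the same operator, before and after a software update.
def scale_v1(values, factor=1.0):
    return [v * factor for v in values]


def scale_v2(values, factor=1.0, clip=None):
    scaled = [v * factor for v in values]
    return scaled if clip is None else [min(v, clip) for v in scaled]


detected = interface_changes(interface_snapshot(scale_v1), interface_snapshot(scale_v2))
```

A snapshot of this kind covers only the operator's signature. Changes in the operator's results with an unchanged signature would have to be noticed by monitoring its outputs.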

3.2. Self-adaption

Automatic acting upon change can only be done with respect to a goal. This goal could be as simple as ensuring functionality and as complex as automatically optimizing the performance and accuracy of several big data pipelines running in parallel given certain hardware. Figure 1 (c) shows the adaption step, after a disruption has been perceived by the self-awareness capabilities. In this context it is decisive to formulate a goal including a fitting representation, which the pipeline framework can use to evaluate decisions. The dimensions for pipeline and environment shown in the last section both contain the evolution requirement to provide an interface for goals. This reveals a potential conflict: A pipeline with the goal to achieve the best possible accuracy for an ML task might want to simulate a lot of different pipelines to find the best one and to achieve this goal. At the same time, simulations and tests might cost a lot of computational resources, which could stand in contrast to the environment dimension's goal to provide a certain performance to all pipelines. A pipeline framework with evolution capabilities needs to have dynamic functionality to deal with these kinds of conflicts.

The vision of self-adapting systems is not unique to the present work. The authors of [12] present four generations in data engineering for data science, ranging from simple data pre-processing to fully automated data curation. In [21] the authors envision a framework for multi-model databases, which is self-adapting with regard to design and maintenance. Similar to the insight gained from tools like mlinspect and ArgusEyes in the context of evolution awareness, other self-adaptive systems can help to understand the underlying components and their interplay. For example, Hillenbrand et al. propose a system which automatically chooses an optimal data migration strategy given some constraints like service-level agreements [22]. Pachyderm, which runs natively in Kubernetes, has a built-in system for distributed computing and scaling, which is very simple and should be considered in the context of the environment dimension. The empirical results of [10] showed a complete lack of a simulation environment in all studied frameworks. Simulation and the use of synthetic data [23] are important components, which need to be incorporated especially for the pipeline and environment dimensions since their self-adaption strategies need a search space to optimize towards a goal.

4. Conceptual Requirements Model

As described in the previous sections, there is no framework with comprehensive evolution capabilities yet. This emphasizes the need for a requirements model, encompassing important components and their interplay as well as system functionalities. The model presented in this section is conceptual, i.e. it was not derived through a structured method from the field of requirements engineering [24]. It rather evolved from technical talks with experienced colleagues and a rough analysis and comparison of existing pipeline frameworks. It can serve as the inception step for a structured requirements gathering process and furthermore helps with the testing of existing frameworks for their evolution capabilities.

The requirements are structured into two categories, self-awareness and self-adaption, as well as four dimensions.

• Data: Data sources and sinks, structure and semantics of data
• Operator: Modules and functions and their respective inputs and outputs
• Pipeline: Creation and administration of pipelines
• Environment: Available hardware and scheduling, scaling and orchestration of pipelines

Table 1 presents an overview of the requirements. The following sections describe the requirements listed in Table 1 in detail.

4.1. Self-awareness Requirements

Self-awareness means being aware of change. This change is always relative with respect to some previous state, i.e. in order to be self-aware, a system needs to store at least one previous state for comparison with the current state. Therefore, collecting and storing metadata over all dimensions is an integral requirement for a self-aware pipeline framework. Even though comparing two system states is sufficient to notice change, in many cases it would be beneficial to have a history of system states. Creating a versioned history of metadata allows for more complex concepts and techniques to be applied, e.g. extracting (meta)data distributions or using window-based anomaly detection to notice change. Versioning of metadata, component artifacts and configuration files would enable the self-aware system to notice different forms of change and distinguish them. For example, it could differentiate between an abrupt change to the interface of an operator after a software update and the gradual decrease of data quality, based on the wrong composition of preprocessing operators. Collecting and storing such data is important, but so is managing and curating it.
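The idea of keeping a history of metadata and applying window-based anomaly detection to it can be sketched as follows. The sketch assumes a single numeric metric per pipeline run (for instance the ratio of missing values) and uses a simple z-score over a sliding window as a stand-in for more elaborate detection techniques; the class name `MetadataHistory` and the parameters `window` and `threshold` are hypothetical.

```python
from collections import deque
from statistics import mean, stdev
from typing import Deque


class MetadataHistory:
    """Keeps the last `window` values of one metadata metric, e.g. the null ratio per run."""

    def __init__(self, window: int = 5, threshold: float = 3.0) -> None:
        self.values: Deque[float] = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, value: float) -> bool:
        """Store the value and report whether it deviates from the window preceding it."""
        is_change = False
        if len(self.values) >= 2:
            mu, sigma = mean(self.values), stdev(self.values)
            if sigma == 0:
                # With zero spread in the window, any different value counts as a change.
                is_change = value != mu
            else:
                is_change = abs(value - mu) / sigma > self.threshold
        self.values.append(value)
        return is_change


# Hypothetical null ratios of six consecutive pipeline runs.
history = MetadataHistory(window=5, threshold=3.0)
flags = [history.observe(v) for v in [0.02, 0.03, 0.02, 0.03, 0.02, 0.40]]
```

Only the last run is flagged in this example. A sliding window of this kind reacts to abrupt change; a gradual decrease of data quality shifts the window along with it and would need a comparison against older, versioned states.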

Managing and curating this metadata leads to the need for provenance capabilities over all dimensions. Providing tools to analyze metadata, for example to aggregate historic data into statistical values, is also an important requirement. Aggregated data enables a different perspective of change.

Table 1: Overview of the requirements

Self-awareness
• Collecting and storing metadata
• Versioning of metadata
• Versioning of component artifacts
• Versioning of configuration files
• Providing provenance capabilities
• Analyzing metadata and creating statistics
• Noticing structural changes
• Noticing semantic changes
• Noticing changes to contracts, APIs and interfaces
• Noticing changes to available computing resources
• Monitoring processing results and performance
• Providing an interface for goal definition

Self-adaption
• Initiating an adaption, based on the violation of a goal
• Automatically swapping operators
• Automatically changing pipeline structure and components
• Automatically optimizing resource distribution and scheduling
• Providing a simulation space to test potential alterations

When looking at the data dimension, the two fundamental requirements a pipeline framework with evolution capabilities has to fulfill are noticing changes to the structure of data and noticing changes to the semantics of data. These disruptors almost always trigger an adaption and therefore being aware of and dealing with them is of utmost importance. The same can be said about the operator dimension. A changing operator interface will most certainly result in an erroneous pipeline. Hence, noticing such change is a critical requirement. Changes to the environment do not necessarily result in non-functioning pipelines, but rather influence the performance. Still, noticing changes to the environment, e.g. available hardware, is important to achieve framework performance goals, such as optimal utilization of available resources. A similar approach needs to be taken for operator and pipeline goals. Processing results and performance of individual operators as well as pipelines need to be monitored, in order to compare these results to predefined goals. Diverse metrics for goal definition can be imagined, ranging from speed and throughput performance to data quality and model accuracy. This leads to the framework requiring an interface for goal definition. This interface allows the user to specify objectives with respect to individual operators, pipelines and the whole framework. At the same time, this goal definition is used for comparison with the current as well as historic states of the system, to notice change and possibly initiate an adaption.

4.2. Self-adaption Requirements

Once the system is aware of a significant change, it triggers an adaption. Based on the dimension in which the adaption should occur, i.e. operator, pipeline or environment, the prerequisites for all possible adaption operations are checked. This first step towards an adaption is an important requirement for a pipeline framework with evolution capabilities, since it creates a search space for possible adjustments. The operations which make up these adjustments represent crucial requirements as well. They include the automatic swapping of an operator, the automatic change of pipeline structure and/or components, as well as the automatic optimization of resource distribution and pipeline scheduling. The search space of all possible operations is transformed into a simulation space, in which possible alterations are tested. This space connects the user's goal definitions with the self-awareness metadata, while at the same time providing simulation and optimization capabilities, in order to find an optimal adaption.

5. Conclusion and Future Work

The present work defined and showcased data pipelines and their corresponding frameworks. Evolution in the context of these systems was introduced and a conceptual requirements model was proposed, comprised of all components of such systems, categorized by self-awareness and self-adaption and structured into four dimensions. By envisioning a system which fulfills these requirements, a first step was made towards a framework which would need less maintenance based on its self-awareness and self-adaption, i.e. evolution capabilities. This type of framework could be a substantial contribution for scientists and practitioners alike.

The paper is concluded with a set of steps that need to be taken by the community towards achieving evolution capabilities in data pipelines. First of all, a proper requirements model using concepts and methods of requirements engineering must be constructed. This must include a structured requirements gathering process comprised of talking to stakeholders, who would benefit from the proposed system, as well as an in-depth analysis of existing concepts and techniques with regard to self-awareness and self-adaption. As a result, this step would produce a system specification encompassing requirements, including non-functional ones, use-cases and a basic software architecture, as well as formal definitions of new terms. In the next step, these results need to be compared to existing frameworks and tools, in order to find working solutions, but also gaps. All dimensions must be thoroughly analyzed and the system specification must be iteratively adjusted. During this phase, software engineering and architecture principles which support evolution capabilities must be derived from existing systems and be incorporated into the specification. The secondary goal of this step is to either find a framework which provides a good basis for evolution capabilities (at least with respect to a certain dimension), or to discover the need to conceptualize and implement the missing components from scratch. In any case, the next step would be the creation of a prototype. As a final step, this prototype must be evaluated and validated, given the system specification.

Acknowledgments

The author wants to thank Meike Klettke, Stefanie Scherzinger, and Uta Störl for many prolific discussions as well as helpful suggestions with regard to evolution capabilities in data pipelines, without which the present work would not have been possible.

References

[1] Koskinen, Lahtonen, T. Tilus, Software Maintenance Cost Estimation and Modernization Support, ELTIS-project, Technical Report, University of Jyväskylä, Information Technology Research Institute, 2003.

[2] Fjukstad, L. A. Bongo, A Review of Scalable Bioinformatics Pipelines, Data Sci. Eng. 2 (2017).

[3] J. A. Novella, P. E. Khoonsari, Herman, Whitenack, Capuccini, Burman, Kultima, Spjuth, Container-based Bioinformatics with Pachyderm, Bioinform. 35 (2019).

[4] Ismail, Truong, W. Kastner, Manufacturing Process Data Analysis Pipelines: A Requirements Analysis and survey, J. Big Data 6 (2019).

[5] M. M. Koushki, I. Y. Abualhaol, A. D. Raju, Y. Zhou, R. S. Giagone, S. Huang, On Building Machine Learning Pipelines for Android Malware Detection: a Procedural Survey of Practices, Challenges and Opportunities, Cybersecur. 5 (2022).

[6] Biswas, Wardat, Rajan, The Art and Practice of Data Science Pipelines: A Comprehensive Study of Data Science Pipelines In Theory, In-The-Small, and In-The-Large, in: ICSE, ACM, 2022.

[7] Psallidas, Zhu, Karlas, Henkel, Interlandi, Krishnan, Kroth, K. V. Emani, Wu, Zhang, Weimer, Floratou, Curino, Karanasos, Data Science Through the Looking Glass: Analysis of Millions of GitHub Notebooks and ML.NET Pipelines, SIGMOD Rec. 51 (2022).

[8] Vassiliadis, A Survey of Extract-Transform-Load Technology, in: D. Taniar, L. Chen (Eds.), Integrations of Data Warehousing, Data Mining and Database Technologies - Innovative Approaches, Information Science Reference, 2011.

[9] Maymounkov, Koji: Automating Pipelines with Mixed-semantics Data Sources, CoRR abs/1901.01908 (2019). arXiv:1901.01908.

[10] Matskin, Tahmasebi, Layegh, A. H. Payberah, A. Thomas, Nikolov, Roman, A Survey of Big Data Pipeline Orchestration Tools from the Perspective of the DataCloud Project, in: DAMDID/RCDL, volume 3036 of CEUR Workshop Proceedings, CEUR-WS.org, 2021.

[11] Dean, Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, in: OSDI, USENIX Association, 2004.

[12] Klettke, U. Störl, Four Generations in Data Engineering for Data Science, Datenbank-Spektrum 22 (2022).

[13] W. M. P. van der Aalst, T. Basten, H. M. W. Verbeek, P. A. C. Verkoulen, M. Voorhoeve, Adaptive workflow-on the interplay between flexibility and support, in: Proceedings of the 1st International Conference on Enterprise Information Systems, Setubal, Portugal, 27-30 March 1999, ICEIS Secretariat, Escola Superior de Tecnologia de Setúbal, Portugal, 1999.

[14] Greiner, Ramsch, Heller, Löffler, Müller, E. Rahm, Adaptive guideline-based treatment workflows with adaptflow, in: Computer-based Support for Clinical Guidelines and Protocols - Proceedings of the Symposium on Computerized Guidelines and Protocols, CGP 2004, Prague, Czech Republic, 12-14 April, 2004, volume 101 of Studies in Health Technology and Informatics, IOS Press, 2004.

[15] Grafberger, Stoyanovich, Schelter, Lightweight Inspection of Data Preprocessing in Native Machine Learning Pipelines, in: CIDR, 2021.

[16] Grafberger, Guha, Stoyanovich, S. Schelter, MLINSPECT: A Data Distribution Debugger for Machine Learning Pipelines, in: SIGMOD, ACM, 2021.

[17] Stoyanovich, Howe, H. V. Jagadish, Responsible Data Management, Proc. VLDB Endow. 13 (2020).

[18] Schelter, Grafberger, Guha, Sprangers, Karlas, Zhang, Screening Native Machine Learning Pipelines with ArgusEyes, in: CIDR, 2022.

[19] Klettke, Lutsch, U. Störl, Kurz erklärt: Measuring Data Changes in Data Engineering and their Impact on Explainability and Algorithm Fairness, Datenbank-Spektrum 21 (2021).

[20] Baudart, Hirzel, Kate, Ram, Shinnar, Tsay, Pipeline combinators for gradual AutoML, in: NeurIPS, 2021.

[21] Holubová, Koupil, Lu, Self-adapting Design and Maintenance of Multi-Model Databases, in: B. C. Desai, P. Z. Revesz (Eds.), IDEAS, ACM, 2022.

[22] Hillenbrand, Störl, Nabiyev, Klettke, Self-adapting Data Migration in the Context of Schema Evolution in NoSQL Databases, Distributed Parallel Databases 40 (2022).

[23] Abufadda, Mansour, A Survey of Synthetic Data Generation for Machine Learning, in: ACIT, IEEE, 2021.

[24] Wagner, D. M. Fernández, Felderer, Vetrò, Kalinowski, R. J. Wieringa, Pfahl, Conte, Christiansson, Greer, Lassenius, Männistö, Nayebi, et al., Status Quo in Requirements Engineering: A Theory and a Global Family of Surveys, in: Software Engineering, volume P-310 of LNI, Gesellschaft für Informatik e.V., 2021.