=Paper=
{{Paper
|id=Vol-3741/paper44
|storemode=property
|title=Data Pipelines Assessment: The Role of Data Engine Deployment Models
|pdfUrl=https://ceur-ws.org/Vol-3741/paper44.pdf
|volume=Vol-3741
|authors=Claudio A. Ardagna,Valerio Bellandi,Marco Luzzara,Antongiacomo Polimeno
|dblpUrl=https://dblp.org/rec/conf/sebd/ArdagnaBLP24
}}
==Data Pipelines Assessment: The Role of Data Engine Deployment Models==
Claudio A. Ardagna1,∗,†, Valerio Bellandi1,†, Marco Luzzara1,† and Antongiacomo Polimeno1,†
1 Università Degli Studi di Milano, Department of Computer Science, Via Celoria 18, Milano, Italy

SEBD 2024: 32nd Symposium on Advanced Database Systems, June 23-26, 2024, Villasimius, Sardinia, Italy
∗ Corresponding author.
† These authors contributed equally.
claudio.ardagna@unimi.it (C. A. Ardagna); valerio.bellandi@unimi.it (V. Bellandi); marco.luzzara@unimi.it (M. Luzzara); antongiacomo.polimeno@unimi.it (A. Polimeno)
ORCID: 0000-0001-7426-4795 (C. A. Ardagna); 0000-0003-4473-6258 (V. Bellandi); 0009-0003-1197-568X (A. Polimeno)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
In this paper, we explore different deployment models for data engines and elucidate their implications on data pipeline behavior. Specifically, we examine the impact on data sharing, data protection, pipeline uptime and latency, and the feasibility of moving segments of typical data engines to the edge. Our work demonstrates the consequences of various deployment strategies on non-functional properties of data pipelines, focusing on availability, performance, and privacy. By considering the interplay between data engine deployment and data pipeline requirements, stakeholders can make informed decisions to optimize the efficiency and effectiveness of data-driven systems.

Keywords
Data Engine, Deployment Models, Non-Functional Assessment, Privacy

1. Introduction
The last decades have been characterized by multiple ICT revolutions, from service to cloud-edge computing, from mobile systems to 5G and Internet of Things (IoT), and from big data to machine learning (ML). These technological enhancements have led to a scenario where data production, collection, and analysis are carried out at an unprecedented rate [1, 2], while distributed systems are increasingly non-deterministic and built on miniaturized services composed and executed across the cloud-edge continuum. Data are today the cornerstone of innovation, driving advancements in a variety of sectors from healthcare to finance and beyond. However, as the volume and variety of data continue to expand, so do concerns surrounding privacy and security. In addition, data stand as the lifeblood of ICT infrastructures, driving innovation, decision-making, and efficiency across various domains. The significance of data within ICT infrastructures cannot be overstated, as they serve as the foundation upon which modern systems are built and optimized.
Data engines (aka data platforms) are yet another type of modern system and consist of many components for data management, often implemented as micro-services. Different architectural solutions for data engine deployment address the peculiarities of complex data-driven environments. In the past, the choice of the specific deployment model typically hinged on technical feasibility, without considering the impact of a specific deployment on the non-functional posture (e.g., performance, privacy) of the data-driven systems.
The research community has recently started considering data engine deployment as yet another dimension of data pipeline validation and verification [3], where any changes to the data engine must be assessed on the basis of data (pipeline) peculiarities, including data source, sensitivity, type, and volume, to name but a few. In this work, we claim that a specific data engine deployment model has a direct impact on the behavior of the data engine itself and on its ability to satisfy the non-functional requirements requested by the target data pipeline. We discuss three possible deployments (Sections 3, 4, and 5) and demonstrate their impact on the final data pipeline behavior, particularly concerning the sharing and protection of data, on one side, and data engine availability and performance, on the other side (Section 6). We finally evaluate our deployment models using three data pipelines in the domains of e-commerce, finance, and healthcare (Section 7).

2. Reference Architecture
Our reference architecture incorporates a common data engine [4, 5] that orchestrates essential building blocks for effective data management and analysis: i) data storage for extensive data volumes accessed for analytical and operational purposes; ii) resource manager for allocating computational resources, ensuring timely data processing; iii) data analytics for querying and analyzing large datasets; iv) data processing for manipulating data in batch and real-time workflows; v) data visualization for engaging and informative data representations. Initially, the architecture collects data from various sources, directing them to a central data storage repository accessible for processing and analytics. The resource manager coordinates these elements, ensuring structured and efficient resource allocation. Data processing and analytics components can write back to the central storage, which is then accessed for data visualization to present and interpret the processed data.

3. Centralized Deployment
The centralized deployment deploys the entire stack in a single location (e.g., in the cloud). It is well-suited for scenarios where data proximity is not a critical factor, and the primary concern is the efficient management of resources and data processing.

3.1. Description
Figure 1 presents the architecture of a centralized deployment built on two main blocks: i) data sources, ii) data engine. Data sources, positioned at various layers of a distributed infrastructure, gather data from sensors, devices, and network (edge and cloud) nodes. These data are transferred to the data engine via communication queues or APIs, and stored in the cloud-based data engine. The data then undergo processing steps such as preparation and analysis.

Figure 1: High-level architecture for a centralized deployment.

Centralized deployment integrates key components in a single stack, typically located in a data center or cloud environment, to simplify management and enhance function accessibility for efficient data handling and analysis.
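As an illustration of the edge-to-cloud transfer described above, the following minimal Python sketch shows an edge data source pushing a reading to the centralized engine through an HTTP ingestion API; the endpoint, credential, and payload fields are hypothetical, and a communication queue (e.g., Kafka) could equally be used in place of a REST call.

    # Minimal sketch of an edge data source pushing raw readings to a centralized
    # data engine over an HTTP ingestion API. Endpoint, token, and payload fields
    # are hypothetical; a communication queue could be used instead.
    import time
    import requests

    INGESTION_URL = "https://data-engine.example.org/api/v1/ingest"  # hypothetical endpoint
    API_TOKEN = "..."  # hypothetical credential

    def read_sensor() -> dict:
        # Placeholder for an actual sensor/device read-out.
        return {"device_id": "edge-042", "temperature": 21.7, "ts": time.time()}

    def push_reading(reading: dict) -> None:
        # Raw (primary) data leave the edge and travel to the cloud, which is the
        # behavior that later limits the privacy level of this deployment model.
        response = requests.post(
            INGESTION_URL,
            json=reading,
            headers={"Authorization": f"Bearer {API_TOKEN}"},
            timeout=5,
        )
        response.raise_for_status()

    if __name__ == "__main__":
        push_reading(read_sensor())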
3.2. Pros and Cons
The centralized deployment holds several advantages that contribute to its appeal in various contexts, where simple management and resource availability stand out: i) single point of management, as the centralized deployment streamlines administrative tasks, reducing complexity and facilitating more efficient oversight of the entire centralized system; ii) resource availability, built on cloud functionalities with particular reference to scalability and elasticity. The final user has the impression of having infinite resources at their disposal, further empowering their ability to efficiently execute resource-intensive data processing; iii) end-to-end data pipeline control, giving the final user the possibility of managing the entire data pipeline in a single point, leading to a cohesive and orchestrated approach; iv) centralized authentication and authorization, where access to data is centrally regulated, reducing the complexities associated with user syncing and access management across multiple locations.
The centralized deployment, however, introduces important challenges that call for careful consideration: i) single point of failure, posing a significant risk, as any malfunctions or outages in the centralized infrastructure can disrupt the entire system. This risk can be mitigated by adopting disaster recovery and high-availability protocols; ii) transfer costs and scalability limitations, since continuous streams of data towards a centralized cluster for processing can result in significant data transfers, leading to high costs in terms of time and resources. Additionally, scalability limitations may impede the adaptability of the architecture to growing data volumes and increasing demands; iii) increased latency, impacting the ability of the user to carry out real-time processing when data need to be moved from data sources to the cloud; iv) increased risk of data breaches and data leaks, as data traveling from distributed edge locations to the centralized cluster traverse various untrusted network points, where they can be intercepted or accessed by unauthorized parties.

4. Decentralized Deployment
The decentralized deployment deploys the entire stack closer to the data at the edge. It is well-suited for scenarios where data proximity is a critical factor, and the main focus is on efficient resource management and data processing with privacy in mind. It can be complex and costly to implement in certain contexts.

Figure 2: High-level architecture for a decentralized deployment.

4.1. Description
Figure 2 presents the architecture of a decentralized deployment. The decentralized deployment includes all building blocks defined in Section 2. The entire stack is deployed at the edge, closer to the data sources. From a technical standpoint, it is necessary to limit the complexity of the deployed stack to reduce costs. For example, data storage is simpler and deployed on fewer machines, while the resource manager only manages the resources available at the edge.

4.2. Pros and Cons
Some of the advantages of a centralized deployment (i.e., single point of management, end-to-end pipeline control, and centralized authentication and authorization) are also valid in a decentralized deployment.
The decentralized deployment provides two additional benefits: i) increased data protection, since sensitive information does not have to travel to external servers, minimizing the risk of data breaches; ii) reduced data transfer costs, by deploying processing components in proximity of data sources.
The decentralized deployment, however, introduces important challenges that call for careful consideration: i) increased complexity, due to the shift of the entire stack to the edge; ii) resource limitations, where the system performance is constrained by the computational resources at the edge, which are typically less powerful than those available in a data center; iii) decreased security, because data at the edge lack the protection provided by a typical data center, making them more vulnerable to physical attacks; iv) environmental factors (e.g., temperature and humidity) that can impact the performance of the system.

5. Hybrid Deployment
The hybrid deployment combines centralized and decentralized deployments to fully unleash the potential of microservices technologies. It deploys the building blocks of the data engine where convenient.

Figure 3: High-level architecture for a hybrid deployment.

5.1. Description
Figure 3 presents the architecture of a hybrid deployment. It is built on two main parts, one deployed in the cloud and the other deployed at the edge. The same stack discussed for the centralized deployment (Section 3) is used in the cloud. We note that the resource manager must be compatible with, and support communication with, its counterpart at the edge. A minimal version of the stack, with i) a temporary storage to manage ingestion and ii) a resource manager that receives instructions from the one located in the cloud, is used at the edge. A distributed resource manager permits some autonomy at the edge, enabling it to independently manage its own resources and tasks, such as job scheduling or data transformation operations.

5.1.1. Pros and Cons
The hybrid deployment provides a balanced solution that merges the advantages of both centralized and decentralized deployment models: i) optimized resource allocation and monitoring, where resources are dynamically allocated based on workload demands, maximizing utilization and performance across the distributed environment. By partially processing data at the edge, computational tasks can be offloaded from the centralized cloud infrastructure, reducing latency and bandwidth usage; ii) increased fault tolerance, because the system remains operational, albeit potentially with limited functionality, in the event of central server failure; iii) data confinement, since data are kept within explicitly defined processing boundaries, reducing the likelihood of accidental data exfiltration.
The hybrid deployment, however, introduces an additional challenge: i) increased complexity, due to the need to support data partitioning, fault tolerance, coordination, resource management, and troubleshooting.
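To make the hybrid division of work concrete, the following minimal Python sketch shows how an edge node could keep primary records in its temporary storage and forward only derived, secondary data to the cloud part of the engine; paths, field names, and the upload endpoint are hypothetical and serve only as an illustration of the model.

    # Minimal sketch of the hybrid split: raw (primary) records stay in the edge
    # temporary storage, and only aggregated (secondary) data are forwarded to the
    # cloud part of the data engine. Paths, fields, and endpoint are hypothetical.
    import json
    import statistics
    import requests

    EDGE_BUFFER = "/var/edge/buffer/readings.jsonl"  # hypothetical temporary storage
    CLOUD_INGESTION_URL = "https://data-engine.example.org/api/v1/secondary"  # hypothetical

    def load_primary_records(path: str) -> list[dict]:
        with open(path) as f:
            return [json.loads(line) for line in f]

    def to_secondary(records: list[dict]) -> dict:
        # Aggregation at the edge: only derived values leave the premises.
        temps = [r["temperature"] for r in records]
        return {
            "device_count": len({r["device_id"] for r in records}),
            "avg_temperature": statistics.mean(temps),
            "max_temperature": max(temps),
        }

    def forward_to_cloud(summary: dict) -> None:
        requests.post(CLOUD_INGESTION_URL, json=summary, timeout=5).raise_for_status()

    if __name__ == "__main__":
        forward_to_cloud(to_secondary(load_primary_records(EDGE_BUFFER)))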
6. Mapping Non-Functional Properties on Architectural Deployments
We discuss how architectural deployments can impact the non-functional properties availability, performance, and privacy of the specific data engine and the data pipelines executed on it. Each property is assigned a level of strength in {low, medium, high}, as presented in Table 1.

Table 1: Property strength in {low, medium, high}.
a) Availability: Low (single replica, single zone); Medium (multiple replicas, single zone); High (multiple replicas, multiple zones).
b) Performance: Low (resources do not scale); Medium (resources have a limited ability to scale); High (resources can scale without any limits).
c) Privacy: Low (both primary and secondary data are transferred); Medium (only secondary data are transferred); High (no data are transferred).

6.1. Non-Functional Properties
6.1.1. Property Availability
It is the availability that a specific deployment model can guarantee to a data engine and, in turn, to the data pipelines executed on it. We define property availability as follows.
Definition 1 (Property Availability). Property Availability 𝑝𝑎 models the system uptime as a function of the number of system replicas (i.e., single, multiple) and their deployment across different zones (i.e., single zone, multiple geographically-distributed zones).
The levels of strength associated with property availability are as follows. Low availability refers to a single replica with data stored on a single zone. If a node fails, the data may become inaccessible until the node is repaired or replaced, resulting in significant downtime. Low availability is generally not suitable for critical systems, but may be acceptable for non-critical data or systems, where cost savings are a priority. Medium availability refers to multiple replicas with data distributed across a single zone. While there is some level of fault tolerance, the system may experience downtime or slower response time if a node fails, as the remaining nodes may become overloaded with requests, or the entire zone may experience downtime. Medium availability may be acceptable for systems where occasional downtime or slower response times can be tolerated. High availability refers to multiple replicas with data distributed across different geographically-distributed zones. Even if one or more nodes or zones fail, the system can still serve the data from the remaining nodes or zones. This ensures that the data are always accessible, providing a high level of fault tolerance. High availability is often associated with systems that cannot afford downtime and where data loss is unacceptable.
6.1.2. Property Performance
It is the performance that a specific deployment model can guarantee to a data engine and, in turn, to the data pipelines executed on it. We define property performance as follows.
Definition 2 (Property Performance). Property Performance 𝑝𝑝 models the system performance as a function of the amount of available resources and the time required for moving data from the sources to the data engine. The former aspect considers both static scenarios where resources are assigned a priori and dynamic scenarios where resources can elastically scale. The latter aspect depends on the volume of data to be moved and the corresponding network resources.
The levels of strength associated with property performance are as follows. Low performance refers to resources that do not scale. This means that the amount of resources is fixed and cannot be increased to handle higher demand. If the demand exceeds the capacity of the resources, the system may experience significant performance issues. Medium performance refers to resources that have a limited ability to scale. The system can handle moderate increases in demand, but may struggle or experience performance degradation if the demand increases significantly.
High performance refers to resources that can scale virtually with no limits. The system can add more resources to meet an increasing demand. This is often seen in cloud-based systems where resources can be added or removed as needed.
6.1.3. Property Privacy
It is the level of privacy that a specific deployment model can guarantee to a data engine and, in turn, to the data pipelines executed on it. We define property privacy as follows.
Definition 3 (Property Privacy). Property Privacy 𝑝𝑝𝑟 models the level of data protection guaranteed by a specific deployment model as a function of the amount and type (either primary or secondary) of data that are exchanged between different systems.
The levels of strength associated with privacy are as follows. Low privacy refers to a scenario where data transfer is requested for both primary data collected from the source and secondary data that have been previously pre-processed. These data can include sensitive information, so it is important to have robust security measures in place to protect them during transfer. Medium privacy refers to a scenario where data transfer is requested for secondary data only. Primary data collected from the source are not transferred, reducing the risk associated with sensitive information leakage. Security measures are still important to protect data during transfer. High privacy refers to a scenario where no data transfer is requested. This is the highest level of privacy, as it eliminates the risk of sensitive information being intercepted or misused during data transfer. However, it also means that the benefits of data sharing, such as collaboration and data analysis, are forgone.

6.2. Mapping
Table 2 describes the correlation between non-functional properties in Section 6.1 and the architectural deployments in Sections 3–5. We use the following symbols to denote the support provided by a deployment model for a specific property level: ✓ to denote full support, ~ to denote that the property level can be supported with certain limitations, ✗ to denote no support. The subsequent analysis delves into the details of this mapping.

Table 2: Mapping between non-functional property strengths and architectural deployment models (levels L, M, H for each property).
Centralized: Availability ✓ ✓ ✓; Performance ✓ ✓ ~; Privacy ✓ ✗ ✗.
Decentralized: Availability ✓ ~ ✗; Performance ✓ ~ ✗; Privacy ✓ ✓ ✓.
Hybrid: Availability ✓ ✓ ✓; Performance ✓ ✓ ~; Privacy ✓ ✓ ✓.

Centralized Deployment takes full advantage of cloud capabilities, ensuring optimal availability and solid performance, while introducing several privacy challenges. It supports all availability levels due to the native support for geographically distributed replicas provided by the cloud. We note that the support for high availability (i.e., multiple replicas stored across geographically distributed zones) mitigates the impact of a single point of failure. It streamlines resource allocation and monitoring, facilitating system scalability and elasticity without constraints. However, performance could degrade (high latency) when large volumes of data must be transferred from the sources to the cloud. For this reason, while the low and medium performance levels are fully supported, the high level might not always be feasible. Finally, the frequent transfer of raw and unprocessed data to the central server in the cloud raises concerns about unauthorized access or interception, limiting privacy to the low level.
Decentralized Deployment executes on a restricted set of resources at the edge, which negatively affects availability and performance. On the other hand, data locality guarantees high privacy. It lacks support for geographically distributed zones. Nevertheless, it can accommodate multiple replicas, sustaining low and medium levels of availability. Operating within a decentralized environment entails coping with limited and less powerful resources. However, proximity to the data source can mitigate latency, thereby increasing the efficiency of data processing. Decentralized deployment thus supports performance levels low and medium. Finally, leveraging decentralized data storage can fortify privacy measures by reducing exposure to a single point of attack and minimizing data transfer. Decentralized deployment ensures privacy across all levels, from low to high.
Hybrid Deployment offers better properties on average. It enjoys the benefits given by cloud resource availability, as well as the protection/anonymization of sensitive data at the edge. Hybrid deployment breaks the monolithic approach followed by centralized and decentralized deployment models, distributing critical components across the cloud-edge continuum. It supports availability across all levels, from low to high. Although this approach may ensure performance at all levels, from low to high, potential bottlenecks in data transfer can still arise. Finally, it enhances privacy by minimizing exposure to a single point of attack and reducing data transfer. Source data can be first pre-processed at the edge and then transferred to the cloud. Hybrid deployment ensures privacy at all levels, from low to high.
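As an illustration of how this mapping can support deployment selection, the following minimal Python sketch encodes Table 2 as a dictionary and checks a pipeline's required property levels against it; the encoding and the helper function are only an illustration, not a tool provided by the deployment models themselves.

    # Minimal sketch: Table 2 encoded as data, plus a helper that checks whether a
    # deployment model can support the non-functional requirements of a pipeline.
    # "full", "limited", and "none" correspond to the check mark, tilde, and cross.
    SUPPORT = {
        "centralized":   {"availability": {"low": "full", "medium": "full", "high": "full"},
                          "performance":  {"low": "full", "medium": "full", "high": "limited"},
                          "privacy":      {"low": "full", "medium": "none", "high": "none"}},
        "decentralized": {"availability": {"low": "full", "medium": "limited", "high": "none"},
                          "performance":  {"low": "full", "medium": "limited", "high": "none"},
                          "privacy":      {"low": "full", "medium": "full", "high": "full"}},
        "hybrid":        {"availability": {"low": "full", "medium": "full", "high": "full"},
                          "performance":  {"low": "full", "medium": "full", "high": "limited"},
                          "privacy":      {"low": "full", "medium": "full", "high": "full"}},
    }

    def satisfies(model: str, requirements: dict[str, str]) -> bool:
        # A model satisfies a pipeline if every required level is supported,
        # possibly with limitations.
        return all(SUPPORT[model][prop][level] != "none"
                   for prop, level in requirements.items())

    # Example: a pipeline requiring high privacy and medium availability rules out
    # the centralized deployment and keeps the decentralized and hybrid ones.
    print([m for m in SUPPORT if satisfies(m, {"privacy": "high", "availability": "medium"})])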
7. Evaluation
We propose three reference scenarios that evaluate the application of our deployment models in varying contexts. Throughout this section, we also present selected code excerpts illustrating the deployment processes using Docker Compose or Kubernetes.

7.1. Centralized Deployment Scenario - E-commerce Platform Analytics
Context: An e-commerce platform that analyzes users’ behavior, sales data, and product trends to optimize its offerings and marketing strategies.
Use Case: Using the centralized deployment model, the platform can aggregate data from various sources into a single data center or cloud environment, facilitating complex analytics and machine learning processes to derive actionable insights.
Benefits: The centralized deployment model offers high analytics performance and availability. It handles large volumes of data and supports intensive computational tasks, which are crucial for real-time analytics and decision-making in a dynamic e-commerce environment. Analytics performance, however, comes at the cost of increased data transfer.
Technological Architecture: The centralized deployment model for e-commerce analytics employs Hadoop Distributed File System (HDFS) as data storage, Yet Another Resource Negotiator (YARN) as resource manager, Apache Spark for data processing and analytics, Apache Superset for data visualization, and Hive as query engine.
Deployment Configurations: This section presents the configurations for a basic centralized setup. For brevity, we propose a Docker-Compose file that only includes HDFS, YARN, and Spark. We use YARN as resource manager, because the availability of (potentially unlimited) resources in the cloud makes it a robust and widely adopted solution. Modern orchestrators like Kubernetes can be used as well, while introducing additional maintenance overhead and complexity.

    services:
      namenode:
        image: apache/hadoop
        command: ["hdfs", "namenode"]
        ports:
          - 9870:9870
        env_file:
          - ./config
      datanode:
        image: apache/hadoop
        command: ["hdfs", "datanode"]
        env_file:
          - ./config
      resourcemanager:
        image: apache/hadoop
        command: ["yarn", "resourcemanager"]
        ports:
          - 8088:8088
        env_file:
          - ./config
      nodemanager:
        image: apache/hadoop
        command: ["yarn", "nodemanager"]
        env_file:
          - ./config
      sparkmaster:
        image: spark
        entrypoint: ["bash", "-c", "$$SPARK_HOME/sbin/start-master.sh --host sparkmaster && sleep inf"]
      sparkworker:
        image: spark
        entrypoint: ["bash", "-c", "$$SPARK_HOME/sbin/start-worker.sh spark://sparkmaster:7077 && sleep inf"]

The Docker-Compose file creates a Namenode, which manages the metadata for the HDFS file system, and a Datanode, which contains the actual user data. Regarding YARN, the Docker-Compose file creates a Resourcemanager and a Nodemanager to efficiently manage the resources across the cluster nodes. Finally, it configures a Spark cluster with two containers: the first one becomes the master upon executing the start-master.sh script, while the second one becomes a worker upon executing the start-worker.sh script, passing the master endpoint as a parameter.

7.2. Decentralized Deployment Scenario - Patients’ Health Monitoring
Context: A healthcare system that monitors patients’ health data in real-time across various devices and locations to provide immediate care and intervention.
Use Case: The decentralized deployment model allows patient data to be processed locally at each healthcare facility or via patient monitoring devices, ensuring quick response times and reducing the need to transfer sensitive data over the network.
Benefits: The decentralized deployment model enhances data protection and privacy, both of which are critical in the healthcare sector. It also facilitates real-time monitoring and decision-making by processing data close to their sources.
Technological Architecture: The decentralized deployment model for the patients’ monitoring system employs Minio as data storage, Kubernetes as resource manager, Spark for data processing, Apache Superset for data visualization, and Trino as query engine.
Deployment Configurations: This section outlines the configurations for a decentralized setup utilizing Kubernetes, Minio, and Spark. Initially, Kubernetes establishes a pod for the Minio server, which is then accessible as a service named minio-service.
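The Spark code packaged into the application image (the main.py referenced by the SparkApplication manifest shown later in this section) is not reported here; a minimal PySpark sketch of what it might look like follows, assuming an illustrative bucket and column layout, the default Minio credentials, and that the s3a connector jars are available in the image.

    # Minimal sketch of a possible main.py for the patients' monitoring scenario.
    # Bucket, object layout, column names, and credentials are hypothetical; the
    # hadoop-aws jars must be present in the spark-app image for s3a to work.
    from pyspark.sql import SparkSession, functions as F

    spark = (SparkSession.builder
             .appName("patients-monitoring")
             .config("spark.hadoop.fs.s3a.endpoint", "http://minio-service:9000")
             .config("spark.hadoop.fs.s3a.access.key", "minioadmin")   # hypothetical
             .config("spark.hadoop.fs.s3a.secret.key", "minioadmin")   # hypothetical
             .config("spark.hadoop.fs.s3a.path.style.access", "true")
             .getOrCreate())

    # Read the raw readings that edge devices store in the local Minio bucket.
    readings = spark.read.json("s3a://patients/readings/")

    # Simple per-patient aggregation, kept close to the data source.
    aggregates = (readings.groupBy("patient_id")
                  .agg(F.avg("heart_rate").alias("avg_heart_rate"),
                       F.max("heart_rate").alias("max_heart_rate")))

    # Results remain at the edge, in line with the decentralized deployment model.
    aggregates.write.mode("overwrite").parquet("s3a://patients/aggregates/")

    spark.stop()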
Subsequently, Spark is deployed on Kubernetes to leverage the dynamic resource allocation it provides. However, integrating Spark with Kubernetes is not straightforward and requires the Spark-Operator, an operator designed specifically for managing Spark applications within the Kubernetes ecosystem. The most efficient installation method for the Spark-Operator is through Helm.1 Following this, a Docker image that contains the Spark code to be executed is created. The final step involves defining a Kubernetes resource of type SparkApplication, introduced by the Spark-Operator, for deploying the applications. Below is an example YAML configuration used in this setup.

    apiVersion: "sparkoperator.k8s.io/v1beta2"
    kind: SparkApplication
    metadata:
      name: spark-with-minio
    spec:
      type: Python
      pythonVersion: "3"
      mode: cluster
      image: "spark-app"
      mainApplicationFile: local:///app/main.py
      sparkVersion: "3.3.1"
      driver:
        cores: 1
        memory: "1024m"
        labels:
          version: 3.3.1
      executor:
        cores: 1
        instances: 2
        memory: "1024m"
        labels:
          version: 3.3.1

The location of the Spark application is specified in the mainApplicationFile property, while driver and executor specify the system requirements for the driver and the executors, respectively.

1 https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/quick-start-guide.md#installation

7.3. Hybrid Deployment Scenario - Financial Services Risk Analysis
Context: A financial institution that analyzes transaction data for real-time fraud detection, while also conducting deeper, historical risk analysis to refine its fraud detection algorithms.
Use Case: The hybrid deployment model can be used to initially process transaction data at the edge (local bank servers) for immediate fraud detection. Simultaneously, data are sent to a centralized cloud server for more complex, long-term risk analysis and model refinement.
Benefits: The hybrid deployment model leverages low-latency processing at the edge for immediate fraud detection and robust computational resources in the cloud for deep analytics. It offers a solution that ensures real-time responsiveness and advanced analytical capabilities.
Technological Architecture: The hybrid deployment model for the risk analysis system employs Minio as data storage at the edge and HDFS as data storage in the cloud, Kubernetes as resource manager, Spark for data processing, Apache Superset for data visualization, and Hive as query engine. The configurations for a hybrid deployment resemble and extend those presented for the centralized and decentralized deployments. The configurations of the decentralized deployment are used as is; the centralized deployment is migrated to Kubernetes, with the need to define all Kubernetes configuration files. The hybrid model introduces several complexities in the setup process due to the orchestration layer shared between cloud and edge nodes. A detailed explanation of how to efficiently deploy hybrid architectures using Kubernetes is beyond the scope of this paper.

8. Related Work
Distributed systems have been studied from several angles across the ICT evolution, focusing on their design, development, deployment, and the evaluation of their non-functional behavior. Recently, particular emphasis has been given to big data architectures, focusing on the design and implementation of big data systems, their deployment on the cloud-edge continuum, as well as the evaluation of their performance and scalability (e.g., [6, 7]). In the literature, solutions based on the Apache ecosystem are still widespread. Aissi et al. [8] present an architecture based on HDFS, Spark, and Hive to process and analyze data from a smart farm. However, the necessity for enhanced resource utilization, especially in edge systems, has spurred the design and implementation of complex frameworks centered around orchestrators like Kubernetes. An example of these frameworks is described by Corodescu et al. [9] and outlines the importance of data locality when data must be processed in a distributed environment. Mosa et al. [10] propose MICADO, another platform designed for the deployment of scalable and autonomously managed solutions.
The focus of their work has been the containerization of the Hadoop stack, enabling more effective orchestration in a cloud-native environment. Considering the challenges associated with the design and implementation of a Big Data architecture, Iatropoulou et al. [11] introduce the Big Data Apps Composition Environment (BDACE), a set of components, tools, and best practices that improve the reliability and flexibility of the solution being developed. BDACE is one of the few approaches that address the non-functional property of security, albeit focusing solely on the aspect of authorization. The impact of distributed systems on the safety, security, and privacy of humans has then been considered, with particular reference to system trustworthiness in terms of governance, risk, and compliance. Several assurance techniques [12] have been defined with the aim of proving a specific system behavior in terms of non-functional property support. Today, certification is considered by policymakers, regulators, and industrial stakeholders as the most suitable assurance technique for the verification of non-functional properties (e.g., availability, confidentiality, privacy) of distributed systems [13]. Certification followed the distributed system evolution. It was initially used to verify traditional software-based systems [14] and later applied to service- and cloud-based system certification [12]. In this context, Anisetti et al. [15] proposed a multi-dimensional certification scheme for distributed systems, which evaluates distributed applications across several dimensions, including the development process, the verification process, and the target distributed application itself. Anisetti et al. [16] also presented a security assurance methodology for big data pipelines grounded on the DevSecOps paradigm to support reliable security and privacy by design. To the best of our knowledge, there is a lack of studies like the one in this paper that systematically investigate the relationship between big data architectures and deployment models, and their impact on non-functional properties. A first solution has been discussed in [3], where a novel assurance process for Big Data holistically evaluates the big data pipelines and the ecosystem underneath to provide a comprehensive measure of their trustworthiness. However, the proposed assurance process does not evaluate the impact of deployment models on the overall trustworthiness.

9. Conclusions
We explored different deployment models for data engines in the cloud-edge continuum and shed light on their impact on data analytics pipeline behavior. Critical aspects, such as data sharing and protection, pipeline uptime and latency, have been explored considering the non-functional properties availability, performance, and privacy. Our results highlighted the significance of data engine deployment as a critical dimension of data pipeline validation and verification.
Acknowledgments
Research supported, in part, by i) project “BA-PHERD - Big Data Analytics Pipeline for the Identification of Heterogeneous Extracellular non-coding RNAs as Disease Biomarkers”, funded by the European Union - NextGenerationEU, under the National Recovery and Resilience Plan (NRRP) Mission 4 Component 2 Investment Line 1.1: “Fondo Bando PRIN 2022” (CUP G53D23002910006), ii) project MUSA - Multilayered Urban Sustainability Action, funded by the European Union - NextGenerationEU, under the National Recovery and Resilience Plan (NRRP) Mission 4 Component 2 Investment Line 1.5: Strengthening of research structures and creation of R&D “innovation ecosystems”, set up of “territorial leaders in R&D” (CUP G43C22001370007, Code ECS00000037), iii) project SERICS (PE00000014) under the NRRP MUR program funded by the EU - NextGenerationEU, iv) Università degli Studi di Milano under the program “Piano di Sostegno alla Ricerca”. Views and opinions expressed are however those of the authors only and do not necessarily reflect those of the European Union or the Italian MUR. Neither the European Union nor the Italian MUR can be held responsible for them.

References
[1] European Commission, D2.5 Second Report on Policy Conclusions, Data Market Study D2.5, European Commission, 2023. European Data Market Study 2021–2023.
[2] Domo, Data never sleeps 11.0, https://web.archive.org/web/20230315000000/https://www.domo.com/learn/data-never-sleeps-11, 2022. Accessed: 2024-03-18.
[3] M. Anisetti, C. A. Ardagna, F. Berto, An assurance process for Big Data trustworthiness, Future Generation Computer Systems 146 (2023) 34–46. URL: https://www.sciencedirect.com/science/article/pii/S0167739X23001371. doi:10.1016/j.future.2023.04.003.
[4] J. Wang, Y. Yang, T. Wang, R. S. Sherratt, J. Zhang, Big data service architecture: a survey, Journal of Internet Technology 21 (2020) 393–405.
[5] T. R. Rao, P. Mitra, R. Bhatt, A. Goswami, The big data system, components, tools, and technologies: a survey, Knowledge and Information Systems 60 (2019) 1165–1245.
[6] J. Dongarra, B. Tourancheau, D. Balouek-Thomert, E. G. Renart, A. R. Zamani, A. Simonet, M. Parashar, Towards a computing continuum: Enabling edge-to-cloud integration for data-driven workflows, Int. J. High Perform. Comput. Appl. 33 (2019) 1159–1174. doi:10.1177/1094342019877383.
[7] J. C. S. Dos Anjos, K. J. Matteussi, P. R. R. De Souza, G. J. A. Grabher, G. A. Borges, J. L. V. Barbosa, G. V. González, V. R. Q. Leithardt, C. F. R. Geyer, Data processing model to perform big data analytics in hybrid infrastructures, IEEE Access 8 (2020) 170281–170294. doi:10.1109/ACCESS.2020.3023344.
[8] M. E. M. El Aissi, S. Benjelloun, Y. Lakhrissi, S. E. H. B. Ali, A scalable smart farming big data platform for real-time and batch processing based on lambda architecture, Journal of System and Management Sciences 13 (2023) 17–30.
[9] A.-A. Corodescu, N. Nikolov, A. Q. Khan, A. Soylu, M. Matskin, A. H. Payberah, D. Roman, Big data workflows: Locality-aware orchestration using software containers, Sensors 21 (2021) 8212.
[10] A. Mosa, T. Kiss, G. Pierantoni, J. DesLauriers, D. Kagialis, G. Terstyanszky, Towards a cloud native big data platform using MICADO, in: 2020 19th International Symposium on Parallel and Distributed Computing (ISPDC), IEEE, 2020, pp. 118–125.
[11] S. Iatropoulou, P. Petrou, S. Karagiorgou, D. Alexandrou, Towards platform-agnostic and autonomous orchestration of big data services, in: 2021 IEEE Seventh International Conference on Big Data Computing Service and Applications (BigDataService), IEEE, 2021, pp. 1–8.
[12] C. Ardagna, R. Asal, E. Damiani, Q. Vu, From Security to Assurance in the Cloud: A Survey, ACM CSUR 48 (2015).
[13] C. A. Ardagna, N. Bena, Non-functional certification of modern distributed systems: A research manifesto, in: Proc. of IEEE SSE 2023, Chicago, IL, USA, 2023.
[14] D. S. Herrmann, Using the Common Criteria for IT security evaluation, CRC Press, 2002.
[15] M. Anisetti, C. A. Ardagna, N. Bena, Multi-dimensional certification of modern distributed systems, IEEE Transactions on Services Computing 16 (2023).
[16] M. Anisetti, N. Bena, F. Berto, G. Jeon, A DevSecOps-based assurance process for big data analytics, in: Proc. of IEEE ICWS 2022, Barcelona, Spain, 2022.