Big Data Pipeline Discovery through Process Mining: Challenges and Research Directions⋆

Simone Agostinelli, Dario Benvenuti, Francesca De Luzi, and Andrea Marrella
Sapienza Università di Roma, Rome, Italy
firstname.lastname@uniroma1.it

Abstract. Big Data pipelines are essential for leveraging Dark Data, i.e., data that are collected but never used and turned into value. However, tapping their potential requires going beyond the current approaches and frameworks for managing their life-cycle. In this paper, we present the challenges associated with the achievement of the Pipeline Discovery task, which aims to learn the structure of a Big Data pipeline by extracting, processing and interpreting huge amounts of event data produced by several data sources. Then, we discuss how traditional Process Mining solutions can potentially be employed and customized to overcome such challenges, outlining a research agenda for future work in this area.

Keywords: Big Data Pipeline · Pipeline Discovery · Process Mining.

1 Introduction

With the recent developments of Internet of Things (IoT) and cloud-based technologies, massive amounts of data are generated by heterogeneous sources and stored through dedicated cloud solutions. Organizations often generate much more data than they are able to interpret, and current Cloud Computing technologies cannot fully meet the requirements of Big Data processing applications and their data transfer overheads [4]. A large share of these data is stored for compliance purposes only and never turned into value, thus becoming Dark Data: not only an untapped asset, but also a risk for organizations [10]. Examples of Dark Data range from server log files, which can give clues about the enactment of an organization's workflows, to old files that may not seem relevant (e.g., drafts of internal projects) but are often interesting and valuable for external attackers who aim to exploit them for monetary gain.

Big Data pipelines (or simply data pipelines) are composite workflows for processing data with non-trivial properties, commonly referred to as the Vs of Big Data (e.g., volume, velocity, etc.) [17]. Tapping their potential is a key aspect to leverage and, at the same time, protect Dark Data [6]. In this direction, the DataCloud project¹ aims to realize novel methods and tools for the effective management of the data pipeline life-cycle in the context of Cloud Computing.

⋆ Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
¹ DataCloud is a Research and Innovation project funded by the European Commission under the Horizon 2020 program (Grant number 101016835). The project runs for three years, between 2021 and 2023. URL: https://datacloudproject.eu/

The main objective of the project is to develop a software ecosystem consisting of new languages, methods and tools for supporting data pipelines on heterogeneous resources. Six life-cycle phases will be covered: (1) pipeline discovery, (2) pipeline definition, (3) pipeline simulation, (4) resource provisioning, (5) pipeline deployment and (6) pipeline adaptation. In this paper, we focus on the phase of pipeline discovery, whose target is to provide robust techniques to learn the structure of data pipelines by extracting, processing and interpreting huge amounts of event data produced by several data sources.
To achieve this ambitious yet unexplored research goal, the idea is to employ (and potentially customize) existing Process Mining solutions for the discovery and analysis of data pipelines. In this paper, after presenting in Section 2 the background on data pipelines and the challenges to properly conduct the discovery task, in Section 3 we discuss how existing process mining solutions can be exploited to tackle and overcome the identified challenges, towards the definition of novel approaches for pipeline discovery. Finally, in Section 4, we conclude the paper.

2 Background and Challenges on Pipeline Discovery

The literature on Big Data processing and analytics has often neglected research on pipeline discovery, working under the assumption that the anatomy of data pipelines is already known at the outset, before running any Big Data processing feature. A few relevant approaches exist that aim at studying the structure of data pipelines. In [15], a framework that reveals key layers and components to design data pipelines for manufacturing systems is presented. In [16], the authors derive a set of data and system requirements for implementing equipment maintenance applications in industrial environments, and propose an information system model that provides a scalable and resilient data pipeline for integrating, processing and analysing industrial data. However, to date, there is no explicit research study that investigates the issue of pipeline discovery.

Consequently, even if the concept of "Big Data pipeline" can be traced back to 2012 (cf. [18]), the literature lacks a shared understanding of what a data pipeline is and how it can be defined. For instance, in [15] the authors refer to a data pipeline as the "path through which Big Data is transmitted, stored, processed and analyzed". In [14], a data pipeline is defined as "a complex chain of interconnected activities from data generation through data reception, where the output of one activity becomes the input of the next one". Similarly, in [11], a data pipeline is "a set of data processing elements connected in series, often executed in time-sliced fashion, where the output of one element is the input of the next one". Then, in [19], data pipelines are described as a "mechanism to decompose complex analyses of large data sets into a series of simpler tasks, with independently tuned components for each task".

The above definitions confirm that there is no unified specification of the concept of data pipeline; nonetheless, some common features that are inherently related to it can be identified (a minimal code sketch illustrating them follows the list):

– A data pipeline consists of chains of processing elements that manipulate and interact with data sets;
– The outcome of a processing element of a data pipeline becomes the input of the next element in the pipeline;
– Each processing element of a data pipeline interacts with data sets considered as "big", i.e., for which at least one of the Vs dimensions is verified to hold.
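To make these features concrete, the following minimal, purely illustrative Python sketch models a data pipeline as a chain of processing elements, where each element consumes the output of the previous one. All names (ingest, clean, enrich, run_pipeline) are hypothetical; a real Big Data pipeline would execute such elements as distributed jobs over data sets exhibiting at least one of the Vs, rather than as in-memory generators.

from typing import Callable, Iterable, Iterator

# A "stage" consumes an iterable of records and yields processed records,
# so the output of one processing element is the input of the next.
Stage = Callable[[Iterable[dict]], Iterator[dict]]

def ingest(records: Iterable[dict]) -> Iterator[dict]:
    """First processing element: pass raw records downstream."""
    yield from records

def clean(records: Iterable[dict]) -> Iterator[dict]:
    """Second element: drop records without a payload."""
    for record in records:
        if record.get("payload") is not None:
            yield record

def enrich(records: Iterable[dict]) -> Iterator[dict]:
    """Third element: annotate each record with a running count."""
    for count, record in enumerate(records, start=1):
        yield {**record, "seen_so_far": count}

def run_pipeline(source: Iterable[dict], stages: list[Stage]) -> Iterator[dict]:
    """Chain the stages so each one consumes the previous stage's output."""
    stream: Iterable[dict] = source
    for stage in stages:
        stream = stage(stream)
    yield from stream

if __name__ == "__main__":
    raw = [{"payload": "a"}, {"payload": None}, {"payload": "b"}]
    for record in run_pipeline(raw, [ingest, clean, enrich]):
        print(record)

Note that the chaining itself is the only structural invariant: the sketch encodes the first two features directly, while the third (the "big" nature of the data sets) is a property of the workload, not of the code shape.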
With this knowledge at hand, we performed many rounds of interviews with the five business case partners involved in the DataCloud project (cf. also Section 4), which were useful not only to confirm the validity of the three above features that characterize a data pipeline, but also to identify four major challenges to be tackled towards the development of a robust pipeline discovery approach:

C1 Event Data Extraction: The challenge is to analyze and turn torrents of raw data stored in several data sources, or exchanged within the underlying Cloud Computing infrastructures, into valuable event data that reveal the events that concretely happened in day-to-day operations.

C2 Event Log Generation: Event data may contain interleaved information related to the enactment of different data pipelines, or of multiple instances of the same data pipeline. Moreover, many events may need to be filtered out of the analysis, since they do not refer to processing elements that manipulate data (e.g., events that track the sending or receiving of notifications). Therefore, the generation of event logs from the set of event data is strongly required to (later) learn the structure of a data pipeline. Each entry of the generated event logs should possess at least the following characteristics (see the sketch after this list): (i) a case identifier that maps each event to a case, (ii) a timestamp that records when the event happened, (iii) the processing element associated with the event, and (iv) the set of data processed or manipulated during the event enactment.

C3 Pipeline Structure Learning: This challenge is about the analysis and interpretation of event logs to learn the pipelines' structure and to extract valuable insights related to their performance and compliance.

C4 Dark Data Analysis: This is the hardest part, because it requires knowing what to look for and where to look within the event data, without deploying intrusive agents that manipulate the systems and networks of an organization. Identifying Dark Data through the analysis of data pipelines would enable unlocking their semantics and understanding whether some of them provide insights and, ultimately, a certain business value.
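As a concrete illustration of the minimal event log entry described in C2, the following sketch (hypothetical field and case names, Python 3.9+) encodes the four required attributes; in practice, such entries would be serialized to a standard interchange format such as XES or a CSV-based event log.

from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class PipelineEvent:
    """One entry of a generated event log, carrying the four minimal
    attributes listed in C2 (field names are hypothetical)."""
    case_id: str                # (i) maps the event to a pipeline instance (case)
    timestamp: datetime         # (ii) when the event happened
    processing_element: str     # (iii) the element associated with the event
    data_items: set[str] = field(default_factory=set)  # (iv) data read or written

# Two events belonging to the same pipeline case:
log = [
    PipelineEvent("case-42", datetime(2021, 5, 1, 9, 0), "ingest",
                  {"sensor_dump.csv"}),
    PipelineEvent("case-42", datetime(2021, 5, 1, 9, 5), "clean",
                  {"sensor_dump.csv", "cleaned.parquet"}),
]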
3 Pipeline Discovery through Process Mining

Even if a shared definition of a data pipeline is still a research challenge, from the previous section it is evident that many similarities exist between the concepts of "data pipeline" and "business process". The main difference is that every element of a data pipeline is meant to manipulate some (big) data set, whereas business processes include activities that do not necessarily interact with any kind of data. In fact, in the Business Process Management (BPM) field, data flow is usually not considered a first-class citizen [8].

Nonetheless, the discovery of data pipelines resembles the discovery of business processes [3], as both require an event log as a starting point to enact the discovery task. For this reason, within the DataCloud project, we investigate how the (customized) use of Process Mining solutions [1] may support the development of novel techniques to achieve the pipeline discovery task. Process mining is a family of data analysis techniques that enable decision makers to discover process models from data (process discovery), compare expected and actual behaviours (conformance checking), and enrich models with information retrieved from data (process enhancement). Process mining focuses on the real execution of processes, as reflected by the footprint of reality logged (in the form of explicit event logs) by the software systems of an organization.

Within DataCloud, we are investigating and elaborating the following research solutions, inspired by the process mining literature, in order to tackle the challenges presented in Section 2:

S1 Human-in-the-Loop Methodology for Extracting Event Data. The literature on process mining provides a number of semi-automated methods to support organizations in extracting event data from data sources, such as PM² and L∗ [9, 20]. However, their application is often hampered by the considerable preparation effort that needs to be conducted by human experts at different stages of the extraction procedure [7]. This issue is even more severe in the presence of heterogeneous data sources that store huge amounts of data, as in the case of data pipelines. Furthermore, to date, there is no deep understanding of how human experts should be involved in the process of event data extraction. Within DataCloud, we aim to tackle C1 by enhancing the existing event data extraction methods through the identification and specification of the manual activities that human experts need to perform in the context of event data extraction. This will include, for example, activities like the assessment of the quality of the available data sources, the detection of those data elements that relate to events, etc.

S2 Pre-processing, Clustering and Filtering Techniques. To tackle C2, i.e., to reduce the overall dataset complexity by extracting only the fragments relevant for an effective event log generation, we will work on the realization of four techniques: (i) Segmentation pre-processes the event data to identify the events that belong to the same pipeline (i.e., case); (ii) Aggregation reduces complexity and improves the structure of discovery results by merging multiple events into larger ones; (iii) Clustering partitions the event log to discover simpler models for each partition of a complex pipeline; (iv) Filtering removes potential outliers from the log. Concerning (ii), (iii) and (iv), the literature on process mining provides many solutions that can be potentially customized and re-used in the context of data pipelines [12]. On the other hand, segmentation is a rather unexplored topic in process mining, since it is assumed that any event in a log is always associated with a known case. In practice, the majority of information systems do not record case identifiers explicitly. To mitigate this issue, we will leverage our previous work on segmentation performed in the Robotic Process Automation field [2] to semi-automatically detect the different pipeline cases from a log.

S3 Pipeline Discovery Algorithm. To learn the structure of a data pipeline, we aim to leverage existing process discovery algorithms from process mining [3], enhancing them with techniques coming from different areas, ranging from data mining to automated planning in Artificial Intelligence, as already experienced in [13]. Our solution is to realize a pipeline discovery algorithm that enables not only efficiently building the sequence flow of the discovered pipelines, but also learning all the data flows and event-based conditions that ruled their execution (see the sketch after this list).
S4 Conformance Checking Technique for Dark Data Analysis. To tackle C4, we aim to customize existing conformance checking techniques [5] to replay the streams of event data filtered out during the event log generation phase (and stored in a dedicated Dark database) over the structure of the discovered data pipelines. The target is to understand whether some of the discarded data can be exploited to improve the quality or the business value of the identified data pipelines. Of course, the definition of specific threshold values to quantify whether a dark data item should be restored into an event log must be investigated and specified as well.
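Under the assumption that an event log with the C2 attributes is available, S3 and S4 could be prototyped on top of the open-source pm4py library. The sketch below uses pm4py's simplified interface (version 2.x; file names, column names and the fitness threshold are hypothetical): it discovers a model from the generated event log with the Inductive Miner, used here only as a stand-in for the envisioned pipeline discovery algorithm, and then replays the events previously diverted to the Dark database via token-based replay.

import pandas as pd
import pm4py

# Hypothetical CSV produced by the event log generation phase (C2), with
# columns case_id, processing_element and timestamp.
df = pm4py.format_dataframe(
    pd.read_csv("pipeline_events.csv"),
    case_id="case_id",
    activity_key="processing_element",
    timestamp_key="timestamp",
)
log = pm4py.convert_to_event_log(df)

# S3 (sketch): discover the pipeline structure with an existing process
# discovery algorithm (here the Inductive Miner).
net, initial_marking, final_marking = pm4py.discover_petri_net_inductive(log)

# S4 (sketch): replay the events previously filtered into the Dark database
# over the discovered structure; well-fitting traces are candidates for
# being restored into the event log.
dark_df = pm4py.format_dataframe(
    pd.read_csv("dark_events.csv"),
    case_id="case_id",
    activity_key="processing_element",
    timestamp_key="timestamp",
)
dark_log = pm4py.convert_to_event_log(dark_df)
diagnostics = pm4py.conformance_diagnostics_token_based_replay(
    dark_log, net, initial_marking, final_marking
)
FITNESS_THRESHOLD = 0.8  # hypothetical value, cf. the open issue in S4
candidates = [d for d in diagnostics if d["trace_fitness"] >= FITNESS_THRESHOLD]
print(f"{len(candidates)} dark traces exceed the fitness threshold")

Whether a fixed fitness threshold is the right restoration criterion is precisely the open question raised at the end of S4.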
4 Concluding Remarks

The expected impact of the DataCloud project is to lower the technological entry barriers for the incorporation of Big Data pipelines in organizations' workflows and to make them accessible to a wider set of stakeholders regardless of the hardware infrastructure. In this context, we discussed the key considerations around the concept of pipeline discovery and suggested a number of research challenges and potential ways to tackle them by employing process mining solutions, to serve as a research agenda for the future.

All the proposed solutions for pipeline discovery will be validated through a selection of complementary business cases offered by four SMEs and a large company, targeting higher mobile business revenues in smart marketing campaigns, reduced live streaming production costs of sport events, trustworthy eHealth patient data management, and reduced time to production and better analytics in Industry 4.0 manufacturing.

Acknowledgments. This work has been supported by the Horizon 2020 project DataCloud (Grant number 101016835).

References

1. van der Aalst, W.: Data Science in Action, pp. 3–23. Springer (2016)
2. Agostinelli, S., Marrella, A., Mecella, M.: Automated Segmentation of User Interface Logs. In: Robotic Process Automation, pp. 201–222. De Gruyter (2021)
3. Augusto, A., Conforti, R., Dumas, M., La Rosa, M., Maggi, F.M., Marrella, A., Mecella, M., Soo, A.: Automated Discovery of Process Models from Event Logs: Review and Benchmark. IEEE Transactions on Knowledge and Data Engineering 31(4) (2019)
4. Barika, M., Garg, S., Zomaya, A.Y., Wang, L., van Moorsel, A., Ranjan, R.: Orchestrating big data analysis workflows in the cloud: Research challenges, survey, and future directions. ACM Computing Surveys 52(5) (2019)
5. Carmona, J., van Dongen, B., Solti, A., Weidlich, M.: Conformance Checking. Springer (2018)
6. Chakrabarty, S., Joshi, R.S.: Dark Data: People to People Recovery. In: ICT Analysis and Applications, pp. 247–254. Springer (2020)
7. Diba, K., Batoulis, K., Weidlich, M., Weske, M.: Extraction, correlation, and abstraction of event data for process mining. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 10(3), e1346 (2020)
8. Dumas, M., La Rosa, M., Mendling, J., Reijers, H.A.: Fundamentals of Business Process Management. Springer (2013)
9. van Eck, M.L., Lu, X., Leemans, S.J.J., van der Aalst, W.M.P.: PM²: A process mining project methodology. In: Advanced Information Systems Engineering (CAiSE). pp. 297–313. Springer (2015)
10. Gimpel, G.: Bringing dark data into the light: Illuminating existing IoT data lost within your organization. Business Horizons 63(4), 519–530 (2020)
11. Gressling, T.: Data Science in Chemistry: Artificial Intelligence, Big Data, Chemometrics and Quantum Computing with Jupyter. De Gruyter (2020)
12. Mannhardt, F., de Leoni, M., Reijers, H.A., van der Aalst, W.M.P., Toussaint, P.J.: Guided process discovery: A pattern-based approach. Information Systems 76 (2018)
13. Marrella, A., Lespérance, Y.: Synthesizing a library of process templates through partial-order planning algorithms. In: Enterprise, Business-Process and Information Systems Modeling, pp. 277–291. Springer (2013)
14. Munappy, A.R., Bosch, J., Olsson, H.H.: Data pipeline management in practice: Challenges and opportunities. In: International Conference on Product-Focused Software Process Improvement. pp. 168–184. Springer (2020)
15. Oleghe, O., Salonitis, K.: A framework for designing data pipelines for manufacturing systems. Procedia CIRP 93, 724–729 (2020)
16. O'Donovan, P., Leahy, K., Bruton, K., O'Sullivan, D.T.: An industrial big data pipeline for data-driven analytics maintenance applications in large-scale smart manufacturing facilities. Journal of Big Data 2(1), 1–26 (2015)
17. Plale, B., Kouper, I.: The centrality of data: Data lifecycle and data pipelines. In: Data Analytics for Intelligent Transportation Systems, pp. 91–111. Elsevier (2017)
18. Rabl, T., Jacobsen, H.A.: Big data generation. In: Specifying Big Data Benchmarks. pp. 20–27. Springer (2014)
19. Raman, K., Swaminathan, A., Gehrke, J., Joachims, T.: Beyond myopic inference in big data pipelines. In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 86–94 (2013)
20. van Eck, M.L., Lu, X., Leemans, S.J.J., van der Aalst, W.M.P.: PM²: A process mining project methodology. In: International Conference on Advanced Information Systems Engineering (CAiSE). pp. 297–313. Springer (2015)