Data Inaccuracy-aware Design of Business Processes

Yotam Evron (supervised by Pnina Soffer and Anna Zamansky)
University of Haifa, Mount Carmel, Haifa 3498838, Israel
yevron@is.haifa.ac.il

Abstract. Business processes are designed under the assumption that the data used by the process is an accurate reflection of reality. However, this assumption does not always hold, and situations of data inaccuracy may occur, with substantial consequences for the process and for business goals. Until now, data inaccuracy has mainly been addressed in the area of business process management as a possible exception at runtime, to be resolved through exception handling mechanisms [9]. Design-time analysis of potential data inaccuracy has so far been largely overlooked. In this paper we describe a research agenda for developing a method that supports process modelers in designing processes that are more robust with respect to data inaccuracy.

Keywords: Business Process Modelling, Data in Business Processes, Data Inaccuracy, Formal Analysis

1 Introduction

Business processes consist of activities, which are executed by humans and by resources with the support of information systems, in order to accomplish specific business goals. An information system holds data which should inform process participants about the current state and serve as a basis for decisions and for selecting the process paths to be taken. An underlying implicit assumption is that the data as recorded and used by the system is an accurate representation of the values in the real world. Based on this assumption, process-aware information systems (PAIS) can operate as closed systems, and actions can be triggered based on the data values in the system, without a need to actually “sense” the values in the real world.

Nevertheless, it is well known that the data stored in the database of an information system is not always completely reliable [2], and situations may occur in which data values do not match the real-world values they should reflect. Such inaccurate data values may affect the ability to reach the process goals. We use the term data inaccuracy to refer to such situations.

Since the data stored in an information system has an impact on business goals, it is important to investigate such situations in order to improve processes. While several works have addressed the issue of data quality at process design [3][8][12], to the best of our knowledge, no existing modeling formalism provides explicit support for representing and reasoning about potential data inaccuracy at design time. Until recently, data inaccuracy has mainly been addressed in the area of business process management as a possible exception at runtime [9].

The overarching goal of this research is to develop a method that will assist process designers in building more robust processes, resilient to data inaccuracy. This will be done by incorporating considerations of data inaccuracy into business process design. In this paper we present a detailed research plan for achieving this goal.

The rest of the paper is organized as follows: Section 2 provides background and related work, and Section 3 introduces the main research questions. Based on these questions, the approach and research methodology are presented in Section 4. In Section 5 we discuss the evaluation methods, and in Section 6 the conclusions and expected contributions.

2 Related work

Data quality has been extensively addressed outside the context of business processes [2][6].
In business process research, only limited work has addressed data quality. Rodríguez et al. [8] introduced a data quality-aware model which allows modelers to specify data quality requirements in business process models. Their approach is semi-formal and is not accompanied by a conceptual or analytical method. Gharib and Giorgini [4] introduce a goal-oriented approach to model and analyze information quality (IQ) requirements in business processes from a socio-technical perspective. Their work mainly focuses on human interaction and does not provide comprehensive support that includes computational methods.

Data accuracy is an important dimension of data quality. Soffer [12] was the first to suggest an explicit analysis of data inaccuracy at design time, providing a conceptual formulation of the problem and discussing potential consequences of data inaccuracy in business processes. This work provides the main conceptual basis for our research.

3 Research Question

The main objective of this research is to develop an approach that enables assessing and reducing the consequences of data inaccuracy during process design. Our main research question is: How can we incorporate considerations of data inaccuracy in business process design? This question gives rise to several sub-questions:

1. How can we formally represent a business process in a way that enables reasoning about data inaccuracy?
2. How can we utilize this formal representation to identify relevant properties of processes that might be associated with data inaccuracy?
3. How can we utilize this representation and these properties as a basis for a method that supports process designers in developing processes that are more robust to data inaccuracy?

4 Approach

4.1 Basic Premises of Our Approach

Following Soffer [12], we take the state-based view of the Generic Process Model (GPM) [11], in which a process takes place in a domain that is typically captured by a set of state variables, whose values at a given moment reflect the domain state at that moment. A subdomain is a part of the domain described by a subset of the domain state variables. Note that there are many ways to partition a given domain into subdomains, and different partitions can reflect different views of the process domain. A process is viewed as a sequence of state transitions, governed by a transformation law. However, not all state variables are relevant (or need to be “sensed”) in order to make a transition. Thus, the domain may be decomposed into independent subdomains in some parts of the process.

Observation 1. A process may involve multiple sequences of transitions in several independent subdomains.

As explained in [11], subdomains which operate in parallel or independently may reach a state where a dependency between them exists. Considering the threads of transitions in these subdomains, we call such states synchronization points.

Observation 2. Sequences of transitions which take place concurrently in independent subdomains merge at synchronization points.

In many cases some of these subdomains include only state variables (of the physical world), while others include and rely on data items stored in information systems, under the assumption that these reflect corresponding domain variables. (We make a basic assumption of a well-designed data structure, meaning that all relevant domain variables are represented by corresponding data items.) However, as already noted above, this is not always realistic:

Observation 3. A discrepancy between a state variable value x_i and its corresponding data item d_i might occur.

In what follows we refer to such a discrepancy as data inaccuracy.
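To make Observation 3 concrete, consider the following minimal Python sketch. It is our own illustration, not part of the proposed formalism, and all names in it are hypothetical: the real-world state and the information system record are held separately, so a discrepancy between a state variable x_i and its data item d_i can only surface where it is explicitly sensed.

```python
# Minimal illustration of Observation 3 (hypothetical names, not part of
# the proposed formal method): state variables x_i live in the real world,
# data items d_i live in the information system, and the two can drift apart.

real_world = {"agreed_date": "2016-06-12"}   # state variable x_i
info_system = {"agreed_date": "2016-06-21"}  # data item d_i (mis-recorded)

def act_on_data_only(data):
    # An independent subdomain acts on IS data alone; nothing here can
    # reveal that d_i no longer matches x_i.
    return "proceeding based on recorded date " + data["agreed_date"]

def synchronization_point(var):
    # Only here is the real-world value "sensed" and compared to the
    # recorded one, exposing the data inaccuracy if there is one.
    return real_world[var] == info_system[var]

print(act_on_data_only(info_system))
if not synchronization_point("agreed_date"):
    print("data inaccuracy detected at the synchronization point")
```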
As long as a subdomain relying only on a state variable x_i operates independently of a subdomain which includes d_i (and vice versa), the existence of data inaccuracy (a discrepancy between x_i and d_i) cannot be recognized. It will only be recognized at a synchronization point between these two subdomains. However, by then it may be too late: the process might get “stuck”, or some compensating action may be needed. Therefore, the exact place of a synchronization point matters for mitigating the risks of data inaccuracy.

As a running example, consider a process of organizing and executing a business party by a catering company. From the company’s perspective, the goal of the process is reaching a state where the business party has been completed successfully. Let us assume a customer met the company’s representatives, who recorded the agreed upon details (such as date, type of food, price, etc.) as data in the company’s IS. Now the company can execute the process on its own, without a need for any further information in order to proceed. This reflects Observation 1: the company executing a part of the process is an independent subdomain, depending solely on the IS data items, without “sensing” their values outside the IS (e.g., by comparing the date registered in the IS to the date known to the customers). At some point, however, when the party eventually takes place, the company’s independent subdomain and the customer’s subdomain merge, coming to a synchronization point (Observation 2).

Now consider a scenario in which the planned date has been falsely recorded in the information system and does not reflect the actual agreed upon date. This is a manifestation of data inaccuracy (Observation 3). As noted, it is not recognized as long as each subdomain operates independently, but it will necessarily be detected at the synchronization point: the company will have everything ready for the planned date recorded in the IS, while the guests will arrive at the meeting point on the agreed upon date as they know it (see “Party execution” in Fig. 1). Since these do not match, the party cannot take place.

Fig. 1. Conceptual visualization of our example

This example highlights the important role synchronization points play in the detection and handling of data inaccuracy situations.

4.2 Formal Representation of Processes with Data Inaccuracy

Based on GPM, we view a process as a sequence of transformations, in which state variables x_i obtain values and data items d_i are updated to reflect these changes; as a result, data inaccuracy may be introduced in d_i with respect to x_i. In this case, the inaccuracy cannot be detected in an independent subdomain relying only on d_i until it reaches a synchronization point with a subdomain containing x_i. This needs to be explicitly captured in formal representations of processes.

While the GPM view is useful for our general conceptual understanding of data inaccuracy, it does not provide the computational analysis capabilities needed for analyzing models of specific processes. For this purpose, a model representation which supports design and analysis operations on specific processes is needed. Our first research stage therefore has two goals:

1. To propose a conceptual view of processes which builds on GPM and incorporates the notion of synchronization points (with respect to a given data item) between independent subdomains.
2. To propose a more explicit model representation, which can be used for designing specific processes and performing formal analysis.

For achieving the first goal, our starting point is to extend GPM and define in precise terms the general notions of data inaccuracy and synchronization points which emerge from our observations above.

For achieving the second goal, we have posed a number of requirements that a process modelling language should meet in order to be appropriate for the explicit representation of concepts related to data awareness. These include, among others, explicit representation of data and the ability to distinguish between independent subdomains. After a preliminary examination of several well-studied languages (such as Petri nets, YAWL, workflow nets with data, and BPMN) against these requirements, we decided to adopt the Petri-net based formalism of workflow nets with data (WFD-nets) [10], due to the popularity of Petri nets and their advanced computational analysis techniques, which can serve as a basis for process analysis with respect to data inaccuracy. WFD-nets are data-based extensions of workflow nets, which can be naturally adapted for our purposes.

4.3 Identifying Properties of Processes Associated with Data Inaccuracy

Based on the formal model, we intend to investigate two properties of processes related to data inaccuracy. The first is soundness [1], an important and well-studied formal property of processes. A process is sound if and only if three requirements are satisfied: option to complete, proper completion, and no dead transitions. Current techniques for verifying soundness are mostly restricted to the process control flow and do not consider data. One of our key observations in this research is that data inaccuracy has a direct impact on the soundness of processes: a process that is sound may in fact not reach proper completion due to the presence of data inaccuracy. Hence, soundness can only be guaranteed if we know whether the data is accurate.

This naturally leads to the second property we intend to investigate: awareness of (the existence of) data inaccuracy. As mentioned above, in an independent subdomain operating solely on the basis of data items (without sensing their real-world values), we may be unaware of data inaccuracy until a synchronization point is reached, where the inaccuracy is detected and we become aware of it. Following this, we can decompose a process into aware and unaware parts. Moreover, if a data item d_i is read in an unaware (with respect to d_i) part of the process, it might be used with an inaccurate value, and thus might hamper the process from reaching its goals.

Returning to soundness, in its usual sense it can only be established in the case of data inaccuracy awareness (DIA). Data inaccuracy unawareness, on the other hand, poses a threat to soundness. This leads to the notion of soundness with respect to DIA (DI-soundness). Our goal at this stage is to formally establish and investigate the properties of DIA and DI-soundness. We intend to provide formal definitions of DI-awareness and DI-soundness, and to develop algorithms for assessing them on specific process models.
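While these definitions and algorithms are still to be developed, the following Python sketch conveys the intended intuition under simple assumptions of our own (the task structure, names, and traversal are illustrative, not the planned algorithm): tasks of a WFD-net-like model carry read sets, some tasks act as synchronization points for a data item, and a traversal marks the reads that occur while the process is still unaware of possible inaccuracy in that item.

```python
# A sketch of assessing DI-awareness on a WFD-net-like structure (our own
# illustration; the actual definitions and algorithms are future work).
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    reads: set = field(default_factory=set)
    sync_for: set = field(default_factory=set)   # items "reality-checked" here
    successors: list = field(default_factory=list)

def unaware_reads(start: Task, item: str) -> set:
    """Tasks that read `item` before any synchronization point for it."""
    risky, seen = set(), set()
    queue = deque([(start, False)])  # (task, aware of item's accuracy?)
    while queue:
        task, aware = queue.popleft()
        if (task.name, aware) in seen:
            continue
        seen.add((task.name, aware))
        aware = aware or item in task.sync_for  # a sync point exposes inaccuracy
        if not aware and item in task.reads:
            risky.add(task.name)                # may act on an inaccurate value
        for nxt in task.successors:
            queue.append((nxt, aware))
    return risky

# Toy process: "plan" records the date, "prepare" uses it, "execute" is the NSP.
execute = Task("execute", reads={"date"}, sync_for={"date"})
prepare = Task("prepare", reads={"date"}, successors=[execute])
plan = Task("plan", successors=[prepare])
print(unaware_reads(plan, "date"))  # {'prepare'}: a DI-unaware read of "date"
```

In this toy net, a process would be threatened in its DI-soundness by the unaware read in "prepare"; a reality check inserted before it would shrink the unaware part to nothing.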
4.4 A Method for Supporting Process Design

At design time we can anticipate two kinds of data inaccuracy situations that may materialize at runtime: data inaccuracy which we know about, and data inaccuracy which exists but which we are not aware of. In the first case, correction of the data before its use might be needed, possibly with compensation to avoid negative consequences. In both cases inaccurate data might be used, causing unexpected negative consequences. Our envisioned method seeks to address both situations, based on DI-awareness and DI-soundness (see Section 4.3).

Processes have synchronization points which are inherent in their design. When such a synchronization point is reached, it might be too late to avoid negative consequences; thus, a mechanism that helps avoid these situations is required. Furthermore, it is possible to artificially introduce additional synchronization points as a control mechanism for data accuracy. To distinguish between these two notions, we refer to the former as “natural” synchronization points (NSPs) and to the latter as “controlled” synchronization points (CSPs). NSPs are located at places where independent subdomains become mutually dependent as part of the process design. A CSP is a synchronization point which can be artificially added to the process for a certain data item, thereby enforcing a “reality check”. However, such reality checks are costly, and should only be used if necessary. This gives rise to the following questions: Inaccuracy of which data items may pose threats to soundness, or require substantial compensation? For such data items, where would be a good place to “plant” CSPs, if at all? Could CSPs for several critical items be put together?

Our third goal is to develop a method for supporting a modeler in the design of processes, allowing him/her to make processes more robust with respect to data inaccuracy. In particular, this will allow the modeler to make informed decisions on the need to insert CSPs, and to find the right places for doing so. To this end we plan to progress in the following stages: (i) develop a mechanism for identifying a need for CSPs, (ii) propose an algorithm for identifying critical data items, (iii) based on (ii), propose a method for calculating the most appropriate points in the process at which to add CSPs, (iv) propose a method for assessing which of the synchronization points need compensating paths to maintain soundness, and (v) develop a modeling tool with visual support for the proposed method and with process verification capabilities.

We intend to use CSPs as a means for avoiding negative consequences of data inaccuracy. They can form a useful solution since they allow better control of a data item’s value. An important question arises: where and when should CSPs be added in the process, and which are the relevant data items worth examining in order to reduce the consequences of data inaccuracy? Essentially, CSPs will assist us in determining the possibility of encountering a specific data inaccuracy situation at runtime.

To identify the relevant data items worth examining, we will use data impact analysis. In general, data impact analysis examines the influence of a specific data item on other elements (e.g., activities, decisions, data items) of a business process. Tsoury et al. [13] provide a method for data impact analysis which can serve as a basis for identifying critical data items. Based on a formal process model definition, relationships among elements of a specific process model (including data) are stored in a database and can be queried to identify chained dependencies. We will adapt this method by introducing synchronization points into it, identifying critical data items by assessing their impact on the process, as sketched below.
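As a first approximation of such an adapted impact analysis, consider the following Python sketch. It is our own toy example, not the method of [13]: element names, the edge relation, and the ranking are all assumptions. Direct dependencies are stored as edges, a data item is ranked by the elements that transitively depend on it, and items whose inaccuracy would propagate into a synchronization point are flagged as CSP candidates.

```python
# Toy chained-dependency query for identifying critical data items
# (illustrative assumptions only, inspired by data impact analysis [13]).

depends_on = {  # element -> elements it directly feeds
    "d_date":  ["check_availability", "schedule_staff"],
    "d_price": ["invoice"],
    "check_availability": ["confirm_order"],
    "schedule_staff":     ["party_execution"],  # "party_execution" is an NSP
    "confirm_order":      ["party_execution"],
}

def impact(item: str) -> set:
    """All elements transitively affected by `item` (chained dependencies)."""
    affected, stack = set(), [item]
    while stack:
        for nxt in depends_on.get(stack.pop(), []):
            if nxt not in affected:
                affected.add(nxt)
                stack.append(nxt)
    return affected

# Items whose inaccuracy propagates into a synchronization point, and whose
# impact set is large, are prioritized as candidates for a CSP.
for d in ("d_date", "d_price"):
    reach = impact(d)
    print(d, len(reach), "party_execution" in reach)
```

Here "d_date" affects four elements including the NSP, whereas "d_price" affects only the invoice, so a CSP for the date would be considered first.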
To deal with the questions of where and when to insert CSPs in the process, we will develop a time-dependent cost function which includes a probabilistic calculation of the cost implications at every point and time in a given process, and which will assist us in redesigning the process. The decision variable of this function is where and when to add CSPs. Note that in some situations it will not be worthwhile to add a CSP, since the insertion of a CSP also has costs. We can view such situations as a tradeoff between improving the robustness of the process and avoiding high costs (note that “cost” is a general term, and can also refer to other performance indicators such as time). Robustness considerations may point towards handling data inaccuracy as early as possible, but this may imply higher costs or delays. The developed function will enable evaluating alternative solutions that consider additional CSPs; a toy sketch of this tradeoff is given at the end of this section.

In addition, we will investigate the impact of each NSP on the DI-soundness property. For example, if we evaluate a specific NSP as crucial for the process to become DI-sound, we will have to incorporate alternative routes for this NSP in the process (to maintain its soundness property).

Last, we will develop a modeling tool (or extend an existing one) with visual support for using our method. It will enable marking the independent subdomains and their synchronization points, assessing the costs (or any other user-defined parameter) of CSPs, analyzing a process in terms of DIA areas, and verifying its DI-soundness.
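Returning to the cost function mentioned above, the following Python sketch illustrates the intended tradeoff. The real function is yet to be developed; all numbers, names, and the linear growth of compensation cost are our own assumptions: checking early pays the CSP cost up front, while detecting the inaccuracy only at a later point multiplies the compensation cost.

```python
# Toy version of the envisioned time-dependent cost tradeoff (illustrative
# assumptions only): a CSP at step t costs `check_cost` now, whereas an
# undetected inaccuracy keeps accruing compensation cost until detection.
from typing import Optional

def expected_cost(csp_at: Optional[int], nsp_at: int, p_inaccurate: float,
                  check_cost: float, compensation_per_step: float) -> float:
    # Without a CSP, the inaccuracy only surfaces at the natural
    # synchronization point (NSP); a CSP caps the damage at its own step.
    detection = csp_at if csp_at is not None else nsp_at
    fixed = check_cost if csp_at is not None else 0.0
    return fixed + p_inaccurate * compensation_per_step * detection

# Compare no CSP with an early and a late CSP, given an NSP at step 10.
for placement in (None, 3, 8):
    print(placement, expected_cost(placement, nsp_at=10, p_inaccurate=0.05,
                                   check_cost=1.0, compensation_per_step=4.0))
# None -> 2.0, 3 -> 1.6, 8 -> 2.6: an early CSP pays off, a late one does not.
```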
5 Evaluation

As in any design science research, our developed artifacts need to be evaluated. The literature offers many evaluation techniques for design science research, e.g., [5][7]. Our evaluation relates to two main criteria: usefulness and usability.

5.1 Usefulness evaluation

The main objective in this phase is the validation of the properties of DIA and DI-soundness, and of their use for predicting situations of data inaccuracy in a given process. Later on, we can assess our method for proposing improvements to the process.

The above will be performed using a collection of case studies currently being conducted in our research group, such as the vehicle industry (small organization, purchasing process), weapon development and manufacturing (large organization, employee training processes), and a municipality (large organization, water meter management process). In these case studies, information about data inaccuracy situations is being collected, based on interviews with stakeholders, concrete process models, and event logs.

In our evaluation, we will investigate whether applying our developed method could help (a) predict existing data inaccuracy situations, and (b) support earlier detection of data inaccuracy at runtime. To this end, we will first transform the process models obtained from the organizations into WFD-based models (enriched with synchronization points). Next, we will apply our algorithms for identifying possible data inaccuracy situations based on DIA and DI-soundness, analyze the results, and compare them to the baseline of the case study. We expect our method to predict the majority of the data inaccuracies which might exist in a given process. Moreover, we may predict additional possible data inaccuracy situations which were not indicated in the case study; these might require further investigation in the organization.

After exploring all data inaccuracy situations and comparing them to the data collected from the organizations, we will make suggestions for improving the robustness of the processes with respect to data inaccuracy by introducing CSPs in a cost-effective manner. When possible, we will present these suggestions to the process owners and ask them to scrutinize the suggestions and identify new insights (if any). In addition, using the relevant event logs, we will measure the percentage of cases in which data inaccuracy was spotted at a later point in the process than the introduced CSPs. These cases represent the improvement achieved by the redesigned process.

5.2 Usability evaluation

A second important evaluation criterion is usability. Practically, we will examine whether humans can easily use our method, and what difficulties are encountered. We will conduct a controlled experiment to test whether the suggested method, as implemented in a modeling tool, supports modelers in designing more robust processes. The participants of the experiment will be process designers (if possible) or students. The experiments will rely on a repeatable procedure: with and without the new method (or with different modalities of the modeling tool). The subjects will be asked to design or redesign processes while considering possible data inaccuracy issues. The experiments should provide insight regarding the usability of the method and its perceived usefulness.

6 Conclusion and Expected Contribution

Data has an enormous impact on business processes. Until now, data inaccuracy has mainly been addressed in the area of business process management as a possible exception at runtime, to be resolved through exception handling mechanisms [9]. In this research we focus on the analysis of data inaccuracy at design time, introducing the concept of synchronization points and their effect on data inaccuracy awareness (DIA) and soundness. Hence, an innovative aspect of our proposal is the possibility to explicitly address potential consequences of data inaccuracy at design time. This has the potential to improve the robustness of processes with respect to data inaccuracy.

The expected contributions of this study are both to research and to practice. For research, this study will open directions for investigating additional process properties that relate to the accuracy of data. It will also highlight the need to study and quantify related risks. For practice, our research will provide support and guidance for designing robust processes, enabling the prediction of data inaccuracy situations so that organizations can investigate their existing processes and improve them.

Acknowledgements. The author is supported by the Israel Science Foundation under grant agreement no. 856/13.

References

1. van der Aalst, W.M.P., van Hee, K.M., ter Hofstede, A.H.M., Sidorova, N., Verbeek, H.M.W., Voorhoeve, M., Wynn, M.T.: Soundness of workflow nets: classification, decidability, and analysis. Formal Aspects of Computing 23(3), 333-363, 2011.
2. Agmon, N., Ahituv, N.: Assessing data reliability in an information system. Journal of Management Information Systems, pp. 34-44, 1987.
3. Cappiello, C., Caro, A., Rodríguez, A., Caballero, I.: An approach to design business processes addressing data quality issues. In: ECIS 2013 Proceedings, Paper 216, 2013.
4. Gharib, M., Giorgini, P.: Modeling and reasoning about information quality requirements in business processes. In: Enterprise, Business-Process and Information Systems Modeling, pp. 231-245. Springer International Publishing, 2015.
5. Helfert, M., Donnellan, B.: The case for design science utility - evaluation of design science artefacts within the IT Capability Maturity Framework. In: International Workshop on IT Artefact Design & Work Practice Intervention, Barcelona, 2012.
6. Orr, K.: Data quality and systems theory. Communications of the ACM 41(2), 66-71, 1998.
7. Pries-Heje, J., Baskerville, R., Venable, J.R.: Strategies for design science research evaluation. In: ECIS 2008 Proceedings, Paper 87, 2008.
8. Rodríguez, A., Caro, A., Cappiello, C., Caballero, I.: A BPMN extension for including data quality requirements in business process modeling. In: Business Process Model and Notation, LNBIP 125, pp. 116-125. Springer, Berlin, 2012.
9. Russell, N., van der Aalst, W.M.P., ter Hofstede, A.H.M.: Workflow exception patterns. In: Dubois, E., Pohl, K. (eds.) Advanced Information Systems Engineering (CAiSE 2006), LNCS 4001, pp. 288-302. Springer, Heidelberg, 2006.
10. Sidorova, N., Stahl, C., Trčka, N.: Workflow soundness revisited: checking correctness in the presence of data while staying conceptual. In: Advanced Information Systems Engineering. Springer, Berlin Heidelberg, 2010.
11. Soffer, P., Kaner, M., Wand, Y.: Assigning ontological meaning to workflow nets. Journal of Database Management 21(3), 1-35, 2010.
12. Soffer, P.: Mirror, mirror on the wall, can I count on you at all? Exploring data inaccuracy in business processes. In: Enterprise, Business-Process and Information Systems Modeling, pp. 14-25. Springer, Berlin Heidelberg, 2010.
13. Tsoury, A., Soffer, P., Reinhartz-Berger, I.: Towards impact analysis of data in business processes. In: BPMDS, 2016.