Towards Understandability Evaluation of Business Process Models using Activity Textual Analysis

Andrii Kopp, Dmytro Orlovskyi and Sergey Orekhov

National Technical University “Kharkiv Polytechnic Institute”, Kyrpychova str. 2, Kharkiv, 61002, Ukraine

Abstract
Business process modeling serves two purposes. First, business analysts create business process models to understand, analyze, and improve process scenarios, and to find and eliminate weak spots and bottlenecks in organizational activities. Second, business process models support requirements engineering in software development projects. In both cases, the quality of the created business process models is the core issue. Poor models are similar to text documents written with mistakes: they are not understandable, which may negatively impact the real processes they represent and the software workflows they describe. However, existing studies in the field of business process model quality mostly focus on the structural analysis of models using size, complexity, and other metrics with thresholds, while the textual analysis of activity labels is omitted. Therefore, in this paper, we propose an approach to the analysis of business process model understandability that takes into account best practices of activity labeling. The proposed approach uses natural language processing techniques, so a software tool was developed to perform experiments with a set of business process models. According to the obtained results, we suggest considering both textual and structural quality to achieve understandable business process models, because the correlation between these metrics is negligible (0.0171): well-structured models can have unclear activity labels and vice versa.

Keywords
Business Process Model, Model Quality, Model Understandability, Textual Analysis.

1.
Introduction: Related Work and Problem Statement

Business processes are organized sequences of activities that take different kinds of input and produce value for customers, e.g. goods or services. Business Process Management (BPM) is now a widely used management approach. It is based on business process modeling, a visual representation of organizational activities, events, and decisions using graphical diagrams. Business process models are the most valuable assets of the BPM lifecycle. They help to design, analyze, improve, and automate organizational workflows [1].

Business process modeling helps stakeholders to understand, capture (i.e. document using graphical models), analyze, and improve enterprise workflows. The analysis stage includes performance measurement and error detection activities, which help to improve the captured business processes [2].

MoMLeT+DS 2022: 4th International Workshop on Modern Machine Learning Technologies and Data Science, November, 25-26, 2022, Leiden-Lviv, The Netherlands-Ukraine
EMAIL: kopp93@gmail.com (A. Kopp); orlovskyi.dm@gmail.com (D. Orlovskyi); sergey.v.orekhov@gmail.com (S. Orekhov)
ORCID: 0000-0002-3189-5623 (A. Kopp); 0000-0002-8261-2988 (D. Orlovskyi); 0000-0002-5040-5861 (S. Orekhov)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)

1.1. Related Work

According to the latest survey, companies that practice the BPM approach use various business process modeling notations to document business operations [3]:
• 64% of respondents use BPMN (Business Process Model and Notation);
• 18% of survey participants use EPC (Event-driven Process Chain);
• 4% of organizations use IDEF-based notations, e.g. IDEF0 and DFD (Data Flow Diagram).
Other survey participants use less popular business process modeling notations; however, BPMN is the leader and currently the de facto standard for business process modeling [3].

According to [4], BPMN models describe workflows as sequences of tasks and events connected using control flows (Fig. 1). Moreover, business processes described using the BPMN notation contain start events and end events that signal their beginning and completion (Fig. 1). Hence, the simplest BPMN business process consists of events and activities [4]:
• events represent things that happen in an instant;
• activities are work units that have a set duration.

Events and activities are also logically related in a business process workflow using sequences. A sequence means that one event or activity is followed by another event or activity [4]. Fig. 1 shows the most basic business process structure, described using BPMN graphical notation, which consists of events (start and end) and activities connected using sequences (also referred to as arcs).

Figure 1: The most basic business process structure described using BPMN graphical notation [4]

According to Fig. 1, when describing a business process using BPMN graphical notation, the modeler should answer the following questions:
• “When does a new instance of the business process start?” for the start event;
• “When does the instance complete?” for the end event;
• “What should be done at a particular process step?” for activities.

Events are usually named as nouns followed by verbs in past participle form (e.g. “order received”, “order fulfilled”), which is quite intuitive. However, empirical studies have shown that real-world business process models created by many practitioners do not always follow naming conventions for activities [5]. The verb-object labeling style (i.e. a verb in infinitive form followed by a noun: “submit order”, “confirm order”, etc.) is recommended for activity labels [5].
This rule is even included in the Seven Process Modeling Guidelines (7PMG) by Mendling et al. [6]. Fig. 2 demonstrates all the essential elements of BPMN graphical notation [7].

Figure 2: The essential elements of BPMN graphical notation [7]

Advanced business process models created using BPMN graphical notation may contain additional elements to represent branching and merging workflow scenarios, business process boundaries, and participants. Gateways (Fig. 2) are elements that define parallel (AND), inclusive (OR), or exclusive (XOR) branching within workflow scenarios. Pools describe the boundaries of business processes, while lanes define different roles of business process participants [7].

According to [8], various metrics and thresholds exist to evaluate BPMN models:
• size (i.e. the number of tasks, events, gateways, and control flows);
• gateway mismatch (the sum of gateway pairs of different types);
• connectivity coefficient (the number of arcs divided by the number of nodes);
• control flow complexity (the sum of gateways weighted by their possible combinations of states after the split).
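To make two of the listed metrics concrete, the following minimal Python sketch computes the connectivity coefficient and the control flow complexity for a small model. The model, the node names, and the function names are hypothetical. The gateway weights (n for an XOR split with n branches, 2^n - 1 for an OR split, 1 for an AND split) are an assumption based on the commonly cited definition of control flow complexity and are not taken from [8].

```python
from collections import Counter


def connectivity_coefficient(nodes, arcs):
    """Number of arcs divided by the number of nodes."""
    return len(arcs) / len(nodes)


def control_flow_complexity(gateway_kinds, arcs):
    """Sum of split gateways weighted by their possible outcome states.

    Assumed weights: XOR split with n branches -> n,
    OR split -> 2**n - 1, AND split -> 1.
    """
    fan_out = Counter(source for source, _ in arcs)
    total = 0
    for gateway, kind in gateway_kinds.items():
        n = fan_out[gateway]
        if n < 2:
            continue  # a join (or pass-through) gateway is not a split
        if kind == "XOR":
            total += n
        elif kind == "OR":
            total += 2 ** n - 1
        elif kind == "AND":
            total += 1
    return total


# Hypothetical model: start -> A -> XOR split -> (B | C) -> end
nodes = ["start", "A", "xor1", "B", "C", "end"]
arcs = [("start", "A"), ("A", "xor1"), ("xor1", "B"),
        ("xor1", "C"), ("B", "end"), ("C", "end")]
gateway_kinds = {"xor1": "XOR"}

print(connectivity_coefficient(nodes, arcs))
print(control_flow_complexity(gateway_kinds, arcs))
```

Under these assumptions, replacing the XOR split with an OR split raises the complexity contribution of the same gateway from 2 to 3, which illustrates why OR gateways are penalized by modeling guidelines.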
Other studies also focus mostly on size metrics when evaluating business process models from the understandability and maintainability points of view:
• the authors of [9] analyzed a large collection of BPMN models created by practitioners and found that improper usage of splits and joins, message flows, decomposition, and labeling leads to poor quality of business process models;
• the authors of [10] propose control-flow complexity metrics and corresponding threshold values, obtained using data mining techniques, to help designers evaluate the quality of business process models;
• the authors of [11] stress the importance of high-quality business process models as inputs for requirements engineering, since the quality of BPMN models influences the software quality; however, this study proposes quality checklists for model reviewers instead of metrics and formal approaches to verify the business process model quality.

Within the context of BPMN and quality assurance, we also found two studies, [12] and [13], that consider the quality of the business process itself rather than the quality of the model that represents it.

1.2. Problem Statement

Thus, poorly designed business process models are hard to understand and maintain, and they cannot be efficiently used to document business operations, measure business performance, or find workflow errors that may reduce organizational performance. However, existing studies mostly focus on structural analysis of BPMN model flow using size and control-flow metrics with thresholds, while less attention is paid to the textual analysis of activity labels used in business process models. Hence, in this study, we propose to pay more attention to labeling styles used for business process model activities (i.e. tasks and collapsed sub-processes) when analyzing the understandability of BPMN models.
The soundness of the business process model structure is extremely important for readers to properly understand process scenarios, decisions, occurring events, and other important workflow elements. However, improper naming of activities may obscure which particular tasks should be completed at each step of the business process scenario or which sub-processes should be initiated. Such misunderstanding caused by invalid activity labels can negatively impact business processes and software guided by business process models with these poorly described activities.

Let us formally describe a business process model as a coherent directed labeled graph [14]:

$$BPGraph = (N, F, L, \lambda), \qquad (1)$$

where:
• $N$ is the set of business process elements, which includes subsets of activities $A$, events $E$, and gateways $G$;
• $A$ is the set of activities;
• $E$ is the set of events, which includes subsets of start events $E^s$, intermediate events $E^i$, and end events $E^e$;
• $G$ is the set of gateways, which includes subsets of XOR gateways $G^{xor}$, AND gateways $G^{and}$, and OR gateways $G^{or}$;
• $F$ is the set of sequence flows between business process elements, $F \subseteq N \times N$;
• $L$ is the set of labels defined for business process elements and sequence flows;
• $\lambda$ is the mapping that assigns labels to business process elements and sequence flows, $\lambda : N \cup F \to L$.

Thus, the formal statement of high-quality business process modeling that achieves understandable diagrams may be given as follows:

$$\begin{cases} Q_{Structural}(BPGraph) \to \max, \\ Q_{Textual}(BPGraph) \to \max, \end{cases} \qquad (2)$$

where:
• $Q_{Structural}$ is the mapping that assigns structural quality values to business process models, $Q_{Structural} : BPGraph \to [0, 1]$;
• $Q_{Textual}$ is the mapping that assigns textual quality values to business process models, $Q_{Textual} : BPGraph \to [0, 1]$.
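As an illustration of the graph (1), the following minimal Python sketch stores the sets of activities, events, gateways, sequence flows, and the labeling mapping, and fills them from a BPMN 2.0 XML string. The class name `BPGraph`, the function `parse_bpmn`, the element identifiers, and the rule that recognizes activities, events, and gateways by tag-name suffix are assumptions made for this example and do not describe the authors' tool. The sample labels follow the order-handling example used later in the paper.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

# Namespace of the BPMN 2.0 semantic model elements
BPMN_NS = "{http://www.omg.org/spec/BPMN/20100524/MODEL}"


@dataclass
class BPGraph:
    activities: set = field(default_factory=set)  # A
    events: set = field(default_factory=set)      # E
    gateways: set = field(default_factory=set)    # G
    flows: set = field(default_factory=set)       # F, pairs (source, target)
    labels: dict = field(default_factory=dict)    # labeling: element id -> label

    @property
    def nodes(self):
        """N is the union of activities, events, and gateways."""
        return self.activities | self.events | self.gateways


def parse_bpmn(xml_text):
    """Build a BPGraph from a BPMN 2.0 XML string (simplified assumption)."""
    graph = BPGraph()
    for element in ET.fromstring(xml_text).iter():
        if not element.tag.startswith(BPMN_NS):
            continue  # skip diagram-interchange and vendor extension tags
        kind = element.tag[len(BPMN_NS):]
        ident = element.get("id")
        if kind == "sequenceFlow":
            graph.flows.add((element.get("sourceRef"), element.get("targetRef")))
        elif kind.lower().endswith("task") or kind == "subProcess":
            graph.activities.add(ident)
        elif kind.endswith("Event"):
            graph.events.add(ident)
        elif kind.endswith("Gateway"):
            graph.gateways.add(ident)
        else:
            continue
        if element.get("name"):
            graph.labels[ident] = element.get("name")
    return graph


# Hypothetical identifiers; labels follow the paper's order-handling example
SAMPLE_BPMN = """<definitions xmlns="http://www.omg.org/spec/BPMN/20100524/MODEL">
  <process id="p1">
    <startEvent id="e1" name="Order received"/>
    <task id="a1" name="Confirm order"/>
    <task id="a2" name="Send goods"/>
    <endEvent id="e2" name="Order fulfilled"/>
    <sequenceFlow id="f1" sourceRef="e1" targetRef="a1"/>
    <sequenceFlow id="f2" sourceRef="a1" targetRef="a2"/>
    <sequenceFlow id="f3" sourceRef="a2" targetRef="e2"/>
  </process>
</definitions>"""

graph = parse_bpmn(SAMPLE_BPMN)
print(sorted(graph.labels[a] for a in graph.activities))
```

Keeping the activity labels separate from the structure in this way means that textual and structural quality can be computed independently from the same parsed graph.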
Equation (2) formally describes the problem of business process modeling, according to which a created BPMN diagram should be of maximum structural and textual quality [5]. The graph (1) can be built automatically by processing a BPMN file, which is an XML (eXtensible Markup Language) document created according to the schema of the BPMN 2.0 format [15]. Hence, we suggest the following workflow for the understandability evaluation of BPMN 2.0 business process descriptions (Fig. 3).

Figure 3: The BPMN 2.0 business process models understandability evaluation workflow

The proposed approach (Fig. 3) may not only allow evaluation of the understandability of BPMN models based on the textual analysis of business process activities but also answer the following question: “does the structural quality of business process models affect their textual quality?”. This may help to formulate recommendations for business process modelers to pay attention not only to the structural soundness of created diagrams but also to the textual quality of the described business process steps, in order to achieve better understandability of models and make sure they serve their purpose.

Therefore, in this study, we need an approach to the textual analysis of business process model activity labels to elaborate the techniques of understandability evaluation of BPMN diagrams. We assume that our approach may include the use of Natural Language Processing (NLP) techniques and work with collections of BPMN 2.0 files, so a software tool should be developed to perform experiments with a set of business process models.

In general, this study considers business process modeling using BPMN graphical notation and aims to improve the quality of created models so that stakeholders can understand them for organizational activity analysis and software engineering.

The rest of this paper is organized as follows.
Section 2 outlines the textual analysis approach for the evaluation of business process model understandability. Section 3 proposes the structural analysis of business process models based on metrics and thresholds. Section 4 includes experiments, analysis, and discussion of the obtained results.

2. Textual Analysis of Business Process Model Activity Labels

2.1. Activity Labels Extraction from BPMN Models

Before outlining the proposed approach, let us demonstrate a sample BPMN 2.0 business process model and its file representation (Fig. 4). According to this example (Fig. 4), the “process” tag includes all core business process items such as events (i.e. “startEvent” and “endEvent”), activities (i.e. “task”), and sequence flows (i.e. “sequenceFlow”) [16]. Thus, it is quite easy to read such an XML document and represent it formally using the coherent directed labeled graph (1).

Figure 4: Example of BPMN 2.0 model translation into the graph (1)

The described graph (Fig. 4) consists of the following sets of business process items:
• start events $E^s = \{e_1^s\}$;
• end events $E^e = \{e_1^e\}$;
• activities $A = \{a_1, a_2\}$;
• sequence flows $F = \{f_1, f_2, f_3\}$.

In addition, the mapping $\lambda$ assigns labels to business process elements and sequence flows, which can be extracted using the “name” attribute of the respective tags (Fig. 4):
• $\lambda(e_1^s) =$ “Order received”, using the “name” attribute of the “startEvent” tag;
• $\lambda(a_1) =$ “Confirm order”, using the “name” attribute of the first “task” tag;
• $\lambda(a_2) =$ “Send goods”, using the “name” attribute of the second “task” tag;
• $\lambda(e_1^e) =$ “Order fulfilled”, using the “name” attribute of the “endEvent” tag.

Therefore, it is possible to obtain the set of activity labels $L^{activity} \subseteq L$:

$$L^{activity} = \{\, l_i^{activity}, \; i = 1, \ldots, |A| \,\}, \qquad (3)$$

where $l_i^{activity}$ is the label assigned to the $i$-th activity $a_i \in A$, $i = 1, \ldots, |A|$.

2.2.
Activity Labels Analysis Method based on Natural Language Processing

Let us describe the proposed method of textual analysis of business process model activity labels extracted from BPMN 2.0 documents (3).

1. Tokenize each activity label $l_i^{activity} \in L^{activity}$, $i = 1, \ldots, |A|$, to get bags of words that correspond to each of the business process activities:

$$\tau : L^{activity} \to W^{activity}, \qquad (4)$$

where:
• $\tau$ is the mapping that assigns a bag of words $w_i^{activity} \in W^{activity}$ to each activity label $l_i^{activity} \in L^{activity}$, $i = 1, \ldots, |A|$;
• $W^{activity}$ is the collection of bags of words $w_i^{activity} \in W^{activity}$ formed for each activity label $l_i^{activity} \in L^{activity}$, $i = 1, \ldots, |A|$.

2. For each word of the tokenized activity labels (4), define one or several parts of speech to which it belongs:

$$\pi : w_i^{activity} \to PoS, \qquad (5)$$

where:
• $\pi$ is the mapping that assigns one or several parts of speech $PoS_i \subseteq PoS$ to each word that belongs to the bag of words $w_i^{activity} \in W^{activity}$ created for each activity label $l_i^{activity} \in L^{activity}$, $i = 1, \ldots, |A|$;
• $PoS$ is the set of all parts of speech that can be assigned to each of the words in tokenized activity labels, $PoS = \{Noun, Verb, Adjective, Adverb\}$.

3. For each activity label, check its length (i.e. the number of words it contains) and, if the label consists of at least two words, check whether the first and second words are a verb and a noun respectively (5):

$$\forall\, l_i^{activity} \in L^{activity} : \; q_i^{activity}\!\left(l_i^{activity}\right) = \begin{cases} 0, & \left| w_i^{activity} \right| < 2, \; w_i^{activity} = \tau\!\left(l_i^{activity}\right), \\ 1, & Verb \in \pi\!\left(w_{i,0}^{activity}\right) \wedge Noun \in \pi\!\left(w_{i,1}^{activity}\right), \\ 0, & \text{else}, \end{cases} \quad i = 1, \ldots, |A|, \qquad (6)$$

where $w_{i,0}^{activity}$ and $w_{i,1}^{activity}$ are the first and second words of the label, and $q_i^{activity}$ is the logical predicate that returns 1 for activity labels that match the verb-object labeling style and 0 for activity labels that do not, $q_i^{activity} \in \{0, 1\}$.

4.
Calculate the textual quality as the ratio between the number of activities whose labels match the verb-object labeling style (6) and the total number of business process activities:

$$Q_{Textual}(BPGraph) = \frac{1}{|A|} \sum_{i=1}^{|A|} q_i^{activity}\!\left(l_i^{activity}\right). \qquad (7)$$

Fig. 5 demonstrates the algorithm of the proposed activity labels analysis method.

Figure 5: The algorithm of activity labels analysis method

Activity label tokenization and part-of-speech assignment to the extracted words can be achieved using NLP software components, which will be used for experiments in Section 4.

3. Structural Analysis of Business Process Models based on Metrics and Thresholds

Let us also describe the method for structural analysis of business process models in order to answer the question of how the structural quality of business process models affects their textual quality.

1. Calculate values of the basic structural metrics proposed in [5] and [6] to manage the business process model’s structural quality:

$$M_{Structural} = \left( |N|, \, |E^s|, \, |E^e|, \, |G^{or}| \right), \qquad (8)$$

where:
• $|N|$ is the number of nodes;
• $|E^s|$ is the number of start events;
• $|E^e|$ is the number of end events;
• $|G^{or}|$ is the number of OR gateways.

2. Using the business process modeling guidelines defined in [5] and [6], the following threshold values can be defined for the respective structural metrics (8):

$$T_{Structural} = (31, \, 2, \, 2, \, 0). \qquad (9)$$

The given threshold values (9) reflect the business process modeling guidelines suggested by the authors of [5] and [6], which say:
• do not use more than 31 nodes;
• do not use more than 2 start and end events;
• do not use OR gateways.

These threshold values (9) were also confirmed in the latest paper by Mendling et al. [17].

3.
Then, using values of the basic structural metrics (8) and corresponding threshold values (9), calculate the structural quality as the average of inverse sigmoid function results:

$$Q_{Structural}(BPGraph) = \frac{1}{|M_{Structural}|} \sum_{j=1}^{|M_{Structural}|} v\!\left(m_j, t_j\right), \qquad (10)$$

where:
• $m_j$ is the value of the $j$-th structural metric (8);
• $t_j$ is the threshold value for the $j$-th structural metric (9);
• $v(m_j, t_j)$ is the function that returns values in the range $[0, 1]$:

$$v\!\left(m_j, t_j\right) = \begin{cases} 1, & m_j \le t_j, \\ \dfrac{1}{1 + e^{\,m_j - t_j - 1}}, & m_j > t_j. \end{cases} \qquad (11)$$

In (11), values $v(m_j, t_j) = 1$ signal that the value of the $j$-th structural metric $m_j$ fully complies with the respective threshold value $t_j$, while smaller values $v(m_j, t_j) < 1$ signal violations of thresholds (9) by the metric values (8).

4. Results and Discussion

Let us use the collection of BPMN diagrams created during business process modeling training sessions by the Camunda company. This collection of BPMN 2.0 diagrams includes four subsets that describe four business processes: goods dispatch, insurance recourse, credit-scoring, and self-service restaurant flows. It is freely available in Camunda’s GitHub repository for research purposes [18]. In total, this dataset includes 197 models in English:
• 67 models are alternative versions that describe the goods dispatch business process;
• 47 models are alternative versions that describe the insurance recourse business process;
• 34 models are alternative versions that describe credit-scoring business processes;
• 49 models are alternative versions that describe self-service restaurant business processes.

Hence, to perform experiments with such a collection of BPMN 2.0 files, a software tool was created. It was built using the Python programming language, for which the NLTK (Natural Language Toolkit) package provides tools for computational linguistics [19].

Fig.
6 below demonstrates the workflow and dependencies of the developed software tool, which is used to perform experiments in this study.

Figure 6: The software tool created to conduct experiments

According to Fig. 6, the developed software tool uses the following external packages:
• the “os” and “xml” packages for working with the file system and processing BPMN 2.0 models that are stored as XML files;
• the “nltk” package for tokenization of activity labels (the “word_tokenize” utility) and word tagging (the “wordnet” lexical database);
• the “math” package for calculations, e.g. exponentiation;
• the “pandas” package for the correlation analysis to study the relationship between business process models’ textual and structural quality.

Table 1 below shows the correlation analysis results obtained using the Pandas package, which allows the computation of the Pearson standard correlation coefficient [20].

Table 1
The correlation analysis results

Metrics              Textual quality   Structural quality
Textual quality      1.0000            0.0171
Structural quality   0.0171            1.0000

The correlation analysis results (Table 1) demonstrate a negligible correlation (0.0171), which means there is no linear relationship between the textual (7) and structural (10) quality coefficients calculated for each of the experimental BPMN business process models [18]. All of these business process models were designed by different persons who, as part of BPMN training sessions, used textual descriptions of the business processes they were supposed to model. Thus, we may conclude that the textual and structural quality dimensions of business process modeling using BPMN graphical notation are not connected. For example, among the obtained calculation results we can find BPMN models that are perfect from the textual quality point of view but poor from the structural quality point of view, and vice versa.
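The calculations behind these results can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions, not the authors' tool: the small `POS_LEXICON` dictionary is a hypothetical stand-in for the WordNet lookup, whitespace splitting stands in for NLTK tokenization, the hand-written `pearson` function stands in for the Pandas correlation call, and equation (11) is read as v = 1 / (1 + e^(m - t - 1)) for m > t. With a single OR gateway and all other metrics within their thresholds, this reading gives a structural quality of 0.875.

```python
import math

# Hypothetical mini-lexicon standing in for a WordNet part-of-speech lookup
POS_LEXICON = {
    "confirm": {"Verb"}, "send": {"Verb"}, "check": {"Verb", "Noun"},
    "order": {"Noun", "Verb"}, "goods": {"Noun"}, "dispatch": {"Noun", "Verb"},
    "payment": {"Noun"},
}

THRESHOLDS = (31, 2, 2, 0)  # nodes, start events, end events, OR gateways


def is_verb_object(label):
    """Predicate (6): 1 if the first word can be a verb and the second a noun."""
    words = label.lower().split()  # simplified tokenization (assumption)
    if len(words) < 2:
        return 0
    first = POS_LEXICON.get(words[0], set())
    second = POS_LEXICON.get(words[1], set())
    return int("Verb" in first and "Noun" in second)


def textual_quality(labels):
    """Equation (7): share of activity labels in verb-object style."""
    return sum(is_verb_object(label) for label in labels) / len(labels)


def v(metric, threshold):
    """Equation (11) as assumed here: 1 within the threshold, sigmoid decay above."""
    if metric <= threshold:
        return 1.0
    return 1.0 / (1.0 + math.exp(metric - threshold - 1))


def structural_quality(metrics, thresholds=THRESHOLDS):
    """Equation (10): average of v over all structural metrics."""
    return sum(v(m, t) for m, t in zip(metrics, thresholds)) / len(metrics)


def pearson(xs, ys):
    """Pearson correlation coefficient; inputs must not be constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (dev_x * dev_y)


# Hypothetical labels: two verb-object labels, one noun phrase, one single word
labels = ["Confirm order", "Send goods", "Goods dispatch", "Payment"]
print(textual_quality(labels))
# Hypothetical model: 10 nodes, 1 start event, 1 end event, 1 OR gateway
print(structural_quality((10, 1, 1, 1)))
```

Note that a lexicon-based check of this kind accepts any label whose first word can be used as a verb, so a noun phrase such as “Order confirmation” may pass if “order” is listed as a possible verb. This ambiguity is worth keeping in mind when interpreting textual quality values.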
Table 2 demonstrates such cases:
• the business process model of high textual quality (1.00) has structural issues (0.88), because an OR gateway is used (Fig. 7);
• the business process model of high structural quality (1.00) has poor textual quality (3/7 ≈ 0.43, listed as 0.42 in Table 2), because 4 of 7 activities have a labeling style that does not match the recommended verb-object style (Fig. 8).

Figure 7: The model of high textual quality but with structural issues

Figure 8: The model of high structural quality but poor textual quality

Table 2
Examples of business process models with opposite textual and structural quality indexes

Business process model                                Textual quality   Structural quality
Warenversand_035d8eef52bc4e36aac840bdd2feff21.bpmn    1.00              0.88
Exercise_1_21a36e3570ab48d59098702f4f8ad279.bpmn      0.42              1.00

Indeed, a model can be perfectly structured but have uninformative activity labels (see the 2nd row in Table 2). Conversely, a model can use the desired labeling style (e.g. the verb-object style as the recommended best practice) while the process scenario is poorly structured, so that it is barely understandable in which order activities and events follow each other (see the 1st row in Table 2).

5. Conclusion and Future Work

In this paper, we addressed the problem of the understandability evaluation of business process models using the textual analysis of activity labels. We focused on the BPMN diagramming notation since it is currently the de facto standard for business process modeling, which allows the creation of not only visual models but also machine-readable XML-like files for interchange between BPM suites and workflow automation. As the review of related work in the domain of business process model quality analysis showed, structural approaches that use metrics and thresholds are much more elaborated than approaches based on textual analysis of BPMN activity labels.
We identified this situation as a serious limitation: a business process model can have a perfect structure but poorly labeled activities, which makes such a model hard for the involved stakeholders to understand. Poor models that are not understandable can lead to errors in organizational improvement and software development projects, require extra resources to fix the arising errors, and, therefore, increase costs.

Therefore, in this paper, we proposed an approach to the analysis of business process models’ understandability that takes into account best practices of activity labeling. The proposed approach and the software tool created for experimental processing of the sample BPMN 2.0 file collection are based on NLP techniques such as tokenization and part-of-speech tagging.

The obtained results confirm that the structural quality of a business process model does not guarantee its understandability, since the correlation between these metrics is negligible (0.0171). The provided examples (Fig. 7 and 8, Table 2) show how models of high textual quality (1.00) can be of moderate structural quality (0.88) and, vice versa, how models of poor textual quality (0.42) can be of high structural quality (1.00). Therefore, understandable business process models, which are valuable for the stakeholders, should demonstrate both high textual and high structural quality.

Thus, we recommend that business process modelers pay as much attention to textual quality and proper activity labeling as they pay to the structural quality of business process scenarios. A business process model that is both structurally and textually sound will serve its initial purpose of communicating knowledge about ongoing or planned business processes.

Future work in this field may include the use of advanced NLP and machine learning methods to automatically correct poorly named activity labels and thereby ensure the understandability of business process models.
Also, more advanced metrics of structural analysis can be applied to continue the study of the relationship between the textual and structural quality of business process models. 6. References [1] M. Hammer, J. Champy, Reengineering the Corporation: A Manifesto for Business Revolution, Zondervan, 2009. [2] W. M. P. van der Aalst, Business process management: a comprehensive survey, in: International Scholarly Research Notices, volume 2013, Hindawi, 2013, pp. 1–37. doi:10.1155/2013/507984 [3] P. Harmon, The State of Business Process Management, in: The State of the BPM Market, volume 2016, BPTrends, 2016, pp. 1–50. [4] M. Dumas, M. La Rosa, J. Mendling, H. A. Reijers, Fundamentals of business process management, Springer, Heidelberg, 2013. doi:10.1007/978-3-642-33143-5 [5] J. Mendling, Managing structural and textual quality of business process models, International Symposium on Data-Driven Process Discovery and Analysis, Springer, Berlin, Heidelberg, 2012, pp. 100–111. doi:10.1007/978-3-642-40919-6_6 [6] J. Mendling, H. A. Reijers, W. M. van der Aalst, Seven process modeling guidelines (7PMG), Information and software technology 52(2) (2010) 127–136. doi:10.1016/j.infsof.2009.08.004 [7] H.G. Ceballos, V. Flores-Solorio, J. P. Garcia, A Probabilistic BPMN Normal Form to Model and Advise Human Activities, in: International Workshop on Engineering Multi-Agent Systems, Springer, Cham, 2015, pp. 51–69. doi:10.1007/978-3-319-26184-3_4 [8] F. Corradini, F. Fornari, S. Gnesi, A. Polini, B. Re, Quality assessment strategy: Applying business process modelling understandability guidelines, University of Camerino, Italy, 2015. URL: https://openportal.isti.cnr.it/data/2017/380283/2017_380283.pdf [9] L. Henrik, J. Mendling, O. Günther, Learning from quality issues of BPMN models from industry, IEEE software 4(33) (2015) 26–33. doi:10.1109/MS.2015.81 [10] W. Kbaier, S. A. 
Ghannouchi, Determining the threshold values of quality metrics in BPMN process models using data mining techniques, Procedia Computer Science 164 (2019) 113–119. doi:10.1016/j.procs.2019.12.161 [11] W. M. C. da Silva, A. P. F. Araújo, M. T. Holanda, R. T. de Sousa Jr., A Method for Quality Assurance for Business Process Modeling with BPMN, in: Developments and Advances in Intelligent Systems and Applications, Springer, Cham, 2018, pp. 169–179. doi:10.1007/978-3- 319-58965-7_12 [12] A. L. da Costa, S. A. F. Salles, R. L. Carvalho, A. S. C Morais, S. V. and Silva, BPMN and quality tools for process improvement: a case study. Gepros: Gestão da Produção, Operações e Sistemas 14(4) (2019) 156–175. doi:10.15675/gepros.v14i4.2308 [13] P. Peggy, H. Schlieter, Process-based quality management in care: adding a quality perspective to pathway modelling, in: OTM Confederated International Conferences “On the Move to Meaningful Internet Systems”, Springer, Cham, 2019, pp. 385–403. doi:10.1007/978-3-030- 33246-4_25 [14] M. T. Gómez-López, J. M. Pérez-Álvarez, A. J. Varela-Vaca, R. M. Gasca, Guiding the creation of choreographed processes with multiple instances based on data models, in: International Conference on Business Process Management, Springer, Cham, 2016, pp. 239–251. doi:10.1007/978-3-319-58457-7_18 [15] M. Kurz, F. Menge, Z. Misiak, Diagram Interchangeability in BPMN 2, 2014. URL: https://www.omg.org/oceb-2/documents/BPMN_Interchange.pdf [16] Business Process Model and Notation (BPMN), Version 2.0, 2011. URL: https://www.omg.org/spec/BPMN/2.0/PDF/changebar [17] J. Mendling, L. Sanchez-Gonzalez, F. Garcia, M. La Rosa, Thresholds for error probability measures of business process models, Journal of Systems and Software 85(5) (2012) 1188–1197. doi:10.1016/j.jss.2012.01.017 [18] BPMN for research. URL: https://github.com/camunda/bpmn-for-research [19] Natural Language Toolkit. 
URL: https://www.nltk.org/ [20] pandas.DataFrame.corr – pandas 1.5.0 documentation URL: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html