Algorithmic Classification of Layouts of BPMN Diagrams Elias Baalmann1 and Daniel Lübke1[0000−0002−1557−8804] Leibniz Universität Hannover, Germany e.baalmann@stud.uni-hannover.de,daniel.luebke@inf.uni-hannover.de Abstract. Previous research is concerned with differences in BPMN diagram layout, e.g., with regards to understandability. However, lay- outs have neither been formally described nor their classification been automated. We aim at formalizing BPMN layouts and automating di- agram layout classification for BPMN diagrams: We calculate sequence flow directions and encode them. By using regular expressions, these are clustered to diagram layouts. This results in a set of formally described BPMN layouts and a corresponding algorithm which we implemented in a tool. The results are very similar to previous work of manual layout classification on the GitHub process set. Researchers can use our defini- tion when conducting BPMN diagram analysis and industry experts can use our tool for validating models against their layout guidelines. Keywords: BPMN · Diagram Layout · Diagram Layout Formalization · Diagram Layout Detection · Flow Layout 1 Introduction BPMN is the standard modeling language for business processes [1]. 2006 it was accepted as an OMG standard [4], the current version (BPMN 2.0) specifies multiple diagram types to model processes in different levels of detail [14]. Of the three specified types, only the process or collaboration diagram is considered in this paper. The BPMN is a documentation and communication tool which should allow readers to easily comprehend complex coherences. Thus, one key requirement for a model is understandability. Much research has been concerned with this topic recently, e.g., [6, 9, 8, 10–13]. One branch of BPMN understandability research is concerned with the lay- out of BPMN processes. The underlying hypothesis states that layout has a big impact on understandability. Besides small grained metrics like number of sequence flow crossings, the overall BPMN diagram layout has come into focus. Up to now, layouts are only ‘specified’ by giving examples and appealing to the intuitive understanding of the reader (“top-down layout”, “left-right lay- out”). This makes it hard to a) fully understand findings, b) replicate research and c) compare different research results. Furthermore, industry users cannot de- cide whether their diagrams are compliant to the latest research, thus preventing the implementation of scientists’ recommendations for diagram layout. J. Manner, D. Lübke, S. Haarmann, S. Kolb, N. Herzberg, O. Kopp (Eds.): 14th ZEUS Workshop, ZEUS 2022, Bamberg, held virtually due to Covid-19 pandemic, Germany, 24–25 February 2022, published at http://ceur-ws.org Copyright © 2022 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). Algorithmic Classification of Layouts of BPMN Diagrams 43 In this paper, we want to outline a formal definition of prevalent flow layouts found in GitHub models and create an algorithm and followingly automated tool support for classifying BPMN diagram layouts. The term flow layout is chosen to clarify that the flow aspect of the Layout is considered. Other aspects, such as edge crossings or arrow lengths are not relevant here. Previous work, that investigated similar topics might use different terms such as “flow direction” [10] or “layout direction” [13]. But since ‘direction’ is to specific to describe the relatively complex layouts that are distinguished here, ‘flow layout’ seems to be a more adequate choice. By providing an implementation, that can classify diagrams based on formalized flow layouts we enable researchers to investigate statistics on large data sets such as “Does flow layout depend on the reading direction of the diagram author?” and practitioners to determine the most used layouts for example in their company and establish standards. The paper is structured as follows: In the next section, we will introduce related work in the area of BPMN layouting, followed by a clear outline of our research questions in Sect. 3. In Sect. 4, we present the formalization and classification algorithm. The identified flow directions deducted from a large set of BPMN models found on GitHub are shown in Sect. 5 after which we conclude and provide an outlook. 2 Related Work One of the first questions that arises in our context is how BPMN diagrams are laid out by practitioners. Effinger et al. [7, p. 400] state that “[i]n BPMN dia- grams the flow direction is usually top-to-bottom or left-to right.” This statement is empirically validated by Lübke & Wutke [13, p. 52], who found that 79.52% of BPMN diagrams from their GitHub data set are laid out left-to-right. They also identified other layouts, like most prominently, top-down layouts and more complex layouts like multiline and snake layouts. A more theoretical approach is taken by Figl & Strembeck. [11, p. 60] who state that “[b]asically, there are four main options for the overall direction: left- to-right, top-to-bottom, bottom-to-top, right-to-left.”, i.e., they take all four possible main directions as principal layout directions. However, they have also added that “zigzag models” should be subject to future research, thereby recog- nizing the use of more complex layouts in practice. All modeling guidelines we found recommend left-to-right layouts, e.g., [2]. Even the BPMN specification itself favors left-to-right modeling [14, p. 42]. al. [5, p. 49] define a guideline (number 43) that process modelers should make their models long and thin by aligning all edges with a general workflow direction as much as possible. However, more recently, a study by Lübke et al. [12, p. 127] has shown that the understandability of large diagrams profits from more complex layouts like snake or multiline layouts in order to avoid the penalty of scrolling these di- agrams on screen. For the case of smaller diagrams, this experiment found a slight advantage for left-to-right layouts in contrast to top-down layouts, af- 44 Elias Baalmann and Daniel Lübke firming Figl & Strembeck’s earlier experiment. However, the findings are either minimal (some understandability metrics in the former experiment) or not sig- nificant (some metrics in the former experiment and all metrics in the latter experiment). 3 Research Questions In this paper, we want to answer the following research questions. RQ1: How can diagrams be classified automatically? The automatic classification of flow layouts has many applications in research to answer questions such as “does the layout choice depend on the size of the diagram” and industry for example to enforce a style guideline. RQ2: How can flow layouts be formalized objectively? While formalizing all identified flow layouts is beyond the scope of this paper, we want to describe how such a formalization could be realized. RQ3: Which are the most commonly used flow layouts, and are they worth formalizing? By analyzing a large set of diagrams, we identify the most common layouts. Afterwards, we attempt to generalize the layouts to remove any biases that may be introduced by the data set. 4 Analyzing the Direction of Sequence Flow Fig. 1. Flow Layout Classification Algorithm To classify BPMN diagrams automatically and thus answer the first research question, a modular algorithm is designed. Figure 1 illustrates the structure of the algorithm. First, the BPMN file is parsed and some sanity checks are per- formed to determine if the diagram can be classified at all. Since some BPMN editors do not serialize the diagrams in a standard way [1, p. 12], BPMN files ex- ist, that are, e.g., missing layout data for the elements. For reference, an overview of the symbols and elements used in the BPMN is shown in the ‘BPMN-Poster’ by the BPM Offensive Berlin [3]. These diagrams cannot be classified with the Algorithmic Classification of Layouts of BPMN Diagrams 45 current implementation. To determine the flow layout of the diagram, each path along sequence flows from any start element to any end element without loops is analyzed individually. Fig. 2. BPMN Example with Vector Chain for each Layout Path in Colored Striped Arrows The first of four tasks performed on the layout path is converting it to a vector chain. This is a list of the vectors between the centers of the flow elements on the layout path from the start to the end element. Some special cases need to be considered here. This is demonstrated by the example diagram shown in Fig. 2. Every path in the diagram is directed as straight as possible left to right. Gateways and boundary events do not allow for precisely straight layouts without overlapping the different paths. To handle these cases, the (x or y) component of the vector between the centers of the elements (where the source element is a boundary event or a split, or where the target element is a join) that points in the orthogonal direction to the direction of the split, join or boundary event is set to zero. The direction of one element is determined by the following rules depending on the element type. 1. The element is a boundary event: the direction is horizontal if it is connected to its parent at the top or bottom side; otherwise it is vertical. 2. The element is a split: two cases are differentiated. Provided that the split element has an incoming sequence flow, the direction is horizontal, if the absolute value of the x component of the vector for the previous sequence flow is bigger than the absolute value of its y component. Otherwise it is vertical. The second case occurs if the split element is a start element. In this case, the direction is determined by constructing a vector that points into the average direction of the outgoing sequence flows of the split element (and comparing the x and y component as above). 3. The element is a join: here the direction is calculated similar as for a split, just in opposite order. First, the next sequence flow is considered, and, if it does not exist (the join is an end element), the average direction of the incoming sequence flows is used. Join elements pose a problem, as the vector for the next sequence flow is not determined when the 46 Elias Baalmann and Daniel Lübke direction of the join is needed. To circumvent this issue, the direction of the first sequence flow, on the path from the join to the end element that is not entering another join, is used. If no such sequence flow exists, the average direction of the outgoing sequence flows of the join is used. The colored arrows in Fig. 2 showcase the vector chains which result from this step of the algorithm for each of the four layout paths. The vertical position of the vectors illustrates how each sequence flow is converted into a vector. In reality, only the vectors (x and y components) are relevant. Due to the rules explained above, all vertical (y) components of the vectors are set to zero, resulting in four straight vector chains from left to right. In the next step, the vector chain is simplified by combining subsequent vectors with similar directions. The angle threshold is based on the number of 360◦ discrete vector directions (NODVD) and calculated by the formula NODVD . The vectors in the simplified chain which are a combination of at least two vectors get marked. Marked vectors are those that lay on a straight path in the diagram with at least one element between the start and the end of that path. This marking is important as it allows us to differentiate between otherwise indistinguishable flow layouts, e.g., Multiline and Snake (see Sect. 5). After that, each vector in the resulting simplified vector chain is mapped to a discrete direction. Currently, the NODVD used in the reference implementation is 16. This value was chosen because it felt natural, as a smaller NODVD like 8 would restrict the classification to much and a higher value like 32 would prevent many combinations of vectors and thus require diagrams to adhere very closely to a specific flow layout to be classified as that layout. The calculation of the angle threshold in the previous step guarantees that no two consecutive vectors have the same discrete direction. To determine the flow layout for the path, the list of discrete vector directions is classified using regular expressions (see Sect. 5). In the end, the flow layout for the whole diagram is defined as the flow layout that occurs for most of the paths. 5 Classifying the Diagram Flow Layout Rather than trying to describe every possible flow layout, our goal is to find commonly used layouts, formulate their distinguishing features, and build a clas- sification algorithm that can detect these layouts and is extendable to possibly handle other layouts that are deemed worthy of classification in the future. Lübke and Wutke identified six flow layouts while manually classifying 5299 diagrams: Left-Right, Top-Down, Snake Horizontal, Snake Vertical, Multiline Horizontal and Multiline Vertical [13]. For this paper, a larger data set from GitHub (53984 diagrams) was used to identify possibly relevant flow layouts. The data set is a super set of the one used by the before mentioned authors. Because of the vast quantity, manual classification of all diagrams is unfeasible. Thus, the process shown in Fig. 3 was used. First, the algorithm described in Sect. 4 was applied to all diagrams. The discrete vector directions determined by the algorithm were used to find common flow layouts. Though we established 16 distinct vector di- rections, only four distinct directions named north (N), east (E), south (S) and Algorithmic Classification of Layouts of BPMN Diagrams 47 Fig. 3. Methodology west (W) are used in the following examples to foster comprehensibility while keeping the regular expressions manageable. The marking of the vectors (see Sect. 4) is depicted by upper-case letters for marked vectors and lower-case for non-marked vectors. Grouping the diagrams by the discrete vector directions for each layout path showed that some sequences of discrete directions occurred in multiple diagrams. For instance, 55% of all diagrams had only layout paths with the direction E and 64 diagrams had only layout paths with the sequence EsW. Manual inspection of the grouped diagrams showed that multiple direction sequences exist for the same flow layout. E.g., the sequences EsW and EsWsE would both be considered Snake Horizontal. Thus, regular expressions were con- structed to classify all variations of a particular flow layout. A simplification of the regular expression for Snake Horizontal would be EsW(sEsW)*(sE)?(s)?. This allows for an arbitrary number of lines. This way of formalizing flow layouts with regular expressions is our way to approach RQ2. Fig. 4. The Seven Categories of Flow Layouts The seven categories of flow layouts that have been identified to answer RQ3, are: Straight, L, Multiline, Stairs, Snake, U and Z (Fig. 4). Multiple variants of flow layouts exist for each of these categories. Left-Right, Top-Down, Right-Left and Down-Top are the four variants of the straight category. Other categories can have more distinct flow layouts. One example is the multiline category. Eight variants can be differentiated as shown in Fig. 5. Even though not all 48 Elias Baalmann and Daniel Lübke Fig. 5. Variants of Multiline Layouts Fig. 6. Distribution of Flow Layouts in GitHub Data Set. Layout Categories (left) and Variants of Straight Flow Layout (right). these variants occur in the data set, they are all possible multiline layouts and should thus be identifiable. This extension allows us to generalize the usability of the classification by removing biases introduced by the data set as best as possible. Figure 6 illustrates the distribution of the automatic classification for the large data set. The left diagram shows the seven flow layouts, the right diagram the four variants of the Straight flow layout. Diagrams that could be analyzed (no file reading error, no missing layout information,...) but not classified are shown as Other. The Mixed category in the right chart contains diagrams where two thirds of the layout paths where classified as Straight but no single variant of the Straight category occurred for this many paths. Not analyzable Diagrams are not shown, 5315 of the 53984 .bpmn files where not analyzable as they contained some syntactic error or missed layout information etc. The charts demonstrate how strongly the Straight flow layout is favored especially in the Left-Right and Top-Down variants (note the logarithmic scale). 6 Conclusion & Outlook By analyzing a large data set of BPMN diagrams, we demonstrated that there are many flow layouts which are used for multiple diagrams. Subsequently, we identified seven categories of commonly used layouts. Formalizing flow layout with the use of regular expressions on discretized vector chains for each layout path is sensible. The formalization allowed us to create an algorithm and a tool implementation that showed promising results and can, e.g., be used by researchers to answer questions such as how diagram layouting differs between Algorithmic Classification of Layouts of BPMN Diagrams 49 less experienced users and experts of BPMN. The tool can also be used by teams in the industry to validate models against their layout guidelines. This paper provides a concise overview of our work but fails to describe every detail of the complex subject that is layout detection. Examples of aspects that were considered but not explicitly reported in this paper are: the impact of swimlanes or subprocesses on flow layouts and how an accuracy score can be determined to indicate how exact a diagram is adhering to a particular flow layout. Furthermore some parts of the parameterization of the algorithm where chosen by feel and might appear arbitrary. For example determining the optimal NODVD based on more scientific metrics than ‘does it feel natural’ could be an interesting topic for future work to investigate. References 1. Allweyer, T.: BPMN 2.0: Introduction to the standard for business process mod- eling. BOD - Books on Demand, Norderstedt, 2nd, updated and extended edition edn. (2016) 2. Birchler, A., Bosshart, E., Märki, M., Opitz, P., Pauli, J., Rigert, B., San- doz, Y., Schaffroth, M., Spöcker, N., Tanner, C., Walser, K., Widmer, T.: eCH- 0158 BPMN-Modellierungskonventionen für die öffentliche Verwaltung. WWW: https://www.ech.ch/dokument/fb5725cb-813f-47dc-8283-c04f9311a5b8 (Septem- ber 2014) 3. BPM Offensive Berlin: BPMNPoster – www.bpmb.de (2011), http://www.bpmb.de/index.php/BPMNPoster 4. Chinosi, M., Trombetta, A.: Bpmn: An introduction to the stan- dard. Computer Standards & Interfaces 34(1), 124–134 (2012). https://doi.org/10.1016/j.csi.2011.06.002 5. Corradini, F., Ferrrari, A., Fornari, F., Gnesi, S., Polini, A., Re, B., Spagnolo, G.O.: Quality assessment strategy: apply- ing business process understandability guidelines for learning, http://pumax.isti.cnr.it/dfdownloadnew.php?ident=cnr.isti/cnr.isti/2015-TR- 03&langver=i&scelta=Metadata 6. Corradini, F., Ferrari, A., Fornari, F., Gnesi, S., Polini, A., Re, B., Spagnolo, G.O.: A guidelines framework for understandable BPMN models. Data & Knowledge Engineering 113, 129–154 (2018). https://doi.org/10.1016/j.datak.2017.11.003 7. Effinger, P., Siebenhaller, M., Kaufmann, M.: An interactive layout tool for BPMN. In: 2009 IEEE Conference on Commerce and Enterprise Computing. pp. 399–406 (2009). https://doi.org/10.1109/CEC.2009.36 8. Effinger, P.: Layout patterns with BPMN semantics. In: Dijkman, R., Hofstetter, J., Koehler, J. (eds.) Business Process Model and Notation. pp. 130–135. Springer Berlin Heidelberg, Berlin, Heidelberg (2011) 9. Effinger, P., Jogsch, N., Seiz, S.: On a study of layout aesthetics for business process models using BPMN. In: Mendling, J., Weidlich, M., Weske, M. (eds.) Business Process Modeling Notation. pp. 31–45. Springer Berlin Heidelberg, Berlin, Heidel- berg (2010) 10. Figl, K., Strembeck, M.: On the importance of flow direction in business process models. In: 2014 9th International Conference on Software Engineering and Appli- cations (ICSOFT-EA). pp. 132–136. IEEE Computer Society, Los Alamitos, CA, USA (2014). https://doi.org/10.13140/2.1.3445.8247 50 Elias Baalmann and Daniel Lübke 11. Figl, K., Strembeck, M.: Findings from an experiment on flow direction of business process models. In: Kolb, J., Leopold, H., Mendling, J. (eds.) Enterprise modelling and information systems architectures. pp. 59–73. Gesellschaft für Informatik e.V, Bonn (2015) 12. Lübke, D., Ahrens, M., Schneider, K.: Influence of diagram layout and scrolling on understandability of BPMN processes: an eye tracking experiment with BPMN diagrams. Information Technology and Management 22(2), 99–131 (2021). https://doi.org/10.1007/s10799-021-00327-7 13. Lübke, D., Wutke, D.: Analysis of prevalent BPMN layout choices on GitHub. In: Manner, J., Haarmann, S., Kolb, S., Herzberg, N., Kopp, O. (eds.) Pro- ceedings of the 13th European Workshop on Services and their Composition (ZEUS 2021), Bamberg, Germany, February 25-26, 2021. CEUR Workshop Pro- ceedings, vol. 2839, pp. 46–54. CEUR-WS.org (2021), http://ceur-ws.org/Vol- 2839/paper9.pdf 14. Object Management Group: Business process model and notation (BPMN), version 2.0, https://www.omg.org/spec/BPMN/2.0/PDF