Analysis of Prevalent BPMN Layout Choices on
                        GitHub

        Daniel Lübke (1,2) [https://orcid.org/0000-0002-1557-8804] and Daniel Wutke (3)

                 (1) Digital Solution Architecture, Hanover, Germany
                 (2) Leibniz Universität Hannover, Germany
                   daniel.luebke@digital-solution-architecture.com
                    https://www.digital-solution-architecture.com
                 (3) dwutke@gmail.com


          Abstract. Layout of BPMN diagrams greatly influences their under-
          standability. The primary objective of this study is to understand prevalent
          choices of modelers for their design of BPMN diagrams. As a research
          method we use repository mining to analyze BPMN diagrams we found
          on GitHub. We found that BPMN diagrams on GitHub are mostly laid
          out from left-to-right and that layout direction choices differ by the mod-
          eling tool, process model type (pool vs. no pools) and purpose (toy vs.
          non-toy).

          Keywords: BPMN Layout · GitHub Mining · Repository Mining


   1    Introduction
   Layout is one of the influencing factors of understandability of BPMN diagrams [3].
   While some empirical research exists on this topic, we want to explore real-world
   BPMN processes and analyze the use of layouts – and influencing factors of those;
   for example, what layout direction (left-right vs. top-bottom) is dominantly used?
       While GitHub has been used in software engineering research [6], its use for
   BPMN-related research is only beginning [4, 5]. In this paper we re-use the
   dataset by Heinze et al. [4] and present a preliminary analysis of the layout
   direction choices made for the BPMN diagrams contained therein.
       The paper is structured as follows: First we present our research design
   in Sect. 2 before we explain how we obtained the models from GitHub and how
   we cleansed them in Sect. 3. Results of our statistical analysis are presented
   in Sect. 4, for which we give our interpretation in Sect. 5. Finally, we discuss
   threats to validity in Sect. 6 before we conclude and outline possible future
   research topics in Sect. 7.


   2    Research Questions
   We want to answer the following research questions related to the layout direction
   of BPMN diagrams found on GitHub:


      J. Manner, S. Haarmann, S. Kolb, N. Herzberg, O. Kopp (Eds.): 13th ZEUS Workshop,
ZEUS 2021, Bamberg, held virtually due to Covid-19 pandemic, Germany, 25-26 February 2021,
                               published at http://ceur-ws.org
Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License
                          Attribution 4.0 International (CC BY 4.0).


                      Fig. 1: The Research Process Followed


RQ1: What layout directions are used and how is their usage dis-
tributed?
Because existing research predicts better understandability for horizontal lay-
outs [2, 3] and existing modeling guidelines mandate them [1, 9], we hypothesize
that left-right diagrams are in the majority.
RQ2: What influence does the modeling tool have on layout direction?
Differences due to modeling tools have been shown for BPEL [8]. We expect that
such differences also exist with BPMN.
RQ3: What influence does project ownership have on layout direction?
We expect that (explicit or implicit) modeling guidelines and shared authorship
within a project will lead to uniformity of layouts in any given project. Thus, we
expect that the majority of diagrams within a project will have the same layout.
RQ4: Are “toy” diagrams laid out differently?
Because it has been demonstrated before that models exhibit different properties
based on different purposes [7] (e.g., productive vs. example), we expect toy
diagrams to be smaller and thus to have simpler layouts (horizontal & vertical
only).
RQ5: Are diagrams with pools laid out differently?
Because we expect that laying out pools with more complex layouts is difficult,
we expect diagrams with pools to have significantly more left-right and top-down
layouts. Because we think that pools lead to even more left-right modeling, we
expect the proportion of this layout to be even larger in diagrams with pools.
    To answer our research questions, we followed the research process
illustrated in Fig. 1: We downloaded all files in the list of [4] (not all
of which were still available online) and filtered those files according to the
steps described in Sect. 3.

    Later, both authors manually classified the layout of the diagrams and
calculated various diagram metrics with a self-written tool called BPMN Layout
Analyzer (see footnote 4). All classification and metric data were stored in
CSV files on which statistical analysis was performed with R. These steps are
described in Sect. 4.


                       Fig. 2: Names of Layout Directions


   During analysis of layout directions, we labeled BPMN diagrams as shown in
Fig. 2.


3     GitHub Mining and Data Cleansing

We started by downloading the BPMN files from GitHub that have been
identified by Heinze et al. [4]. This means that we did not mine GitHub per
se but downloaded all models from the list compiled by Heinze et al.
Although the original list contained 8904 unique BPMN files, only 8467 files
were still available as of 2020-04-06.
    For each diagram we generated a PNG file by using BPMN.io’s bpmn-to-image.
This failed for some files due to missing diagram interchange information
or other file format compliance issues, which left us with 5299 unique processes.
    In addition, in some files the BPMN DI layout information or the XML itself
was corrupt, which was ignored by BPMN.io, so that after merging in the results
from our BPMN Layout Analyzer tool only 4638 diagrams were left. To exclude
“junk diagrams”, we removed diagrams from the data set that a) did not have at
least two activities (counting neither events nor gateways) or b) were not
connected enough. Too low connectedness is found in diagrams that merely place
BPMN elements without any sequence flows, which was probably done to create
illustrations.

Footnote 4: Freely available at: https://github.com/dluebke/bpmnlayoutanalyzer/


              Table 1: Distribution of Diagram Layouts (all values in %)
          Layout                 total  non-toy    toy  no pools  pools
          left-right             79.52    73.35  86.30     78.21  86.80
          multiline-horizontal    0.55     0.67   0.42      0.41   1.32
          multiline-vertical      0.10     0.19   0.00      0.12   0.00
          other                   9.34    10.07   8.54      9.12  10.56
          snake-horizontal        1.96     2.88   0.95      2.19   0.66
          snake-vertical          0.20     0.29   0.11      0.24   0.00
          top-down                8.33    12.56   3.69      9.71   0.66

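    Therefore, we used the following threshold for the number of subgraphs sg:

\begin{align}
p &= \begin{cases}
  2 \times |\mathit{pools}^{\mathit{expanded}}| & \text{if } |\mathit{pools}^{\mathit{expanded}}| > 0 \\
  2 & \text{if } |\mathit{pools}^{\mathit{expanded}}| = 0
\end{cases} \tag{1} \\
s^{e} &= |\mathit{subprocesses}^{\mathit{expanded}}| \tag{2} \\
e^{e} &= 2 \times |\mathit{eventsubprocess}^{\mathit{expanded}}| \tag{3} \\
e^{c} &= |\mathit{eventsubprocess}^{\mathit{collapsed}}| \tag{4} \\
sg &\leq p + s^{e} + e^{e} + e^{c} \tag{5}
\end{align}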
     First we define how many subgraphs our process-flow is allowed to have
(equation 1): We want to exclude diagrams in which the main process falls apart
into more than 2 subgraphs. For diagrams with pools we allow 2 subgraphs per
expanded pool. Next, we allow an additional subgraph for an expanded subprocess
(equation 2) because a new process must be contained in it. Event subprocesses
are different because they are not connected to the main process flow. As such,
an additional subgraph must be granted for each event subprocess, if the event
subprocess is collapsed (equation 4). If the event subprocess is expanded two
additional subgraphs are allowed: one for the event subprocess and one for the
process contained in it (equation 3).
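
     The connectedness filter of equations 1 to 5 can be sketched in Python as
follows. This is a minimal illustration and not the code of the BPMN Layout
Analyzer: the class DiagramCounts, the function names, and the choice to count
subgraphs as connected components over undirected flow edges are assumptions
made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class DiagramCounts:
    """Element counts of one diagram (hypothetical container for this sketch)."""
    expanded_pools: int
    expanded_subprocesses: int
    expanded_event_subprocesses: int
    collapsed_event_subprocesses: int


def allowed_subgraphs(c: DiagramCounts) -> int:
    """Right-hand side of equation 5: p + s^e + e^e + e^c."""
    p = 2 * c.expanded_pools if c.expanded_pools > 0 else 2  # equation 1
    s_e = c.expanded_subprocesses                            # equation 2
    e_e = 2 * c.expanded_event_subprocesses                  # equation 3
    e_c = c.collapsed_event_subprocesses                     # equation 4
    return p + s_e + e_e + e_c


def count_subgraphs(nodes, flows) -> int:
    """Count connected components (assumed meaning of 'subgraph') via union-find."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for source, target in flows:
        parent[find(source)] = find(target)
    return len({find(n) for n in nodes})


def is_connected_enough(nodes, flows, counts: DiagramCounts) -> bool:
    """Keep a diagram only if sg does not exceed the threshold of equation 5."""
    return count_subgraphs(nodes, flows) <= allowed_subgraphs(counts)
```

For example, a diagram without pools and subprocesses may fall apart into at
most 2 subgraphs, so four flow nodes joined by a single sequence flow (3
subgraphs) would be removed, while the same nodes joined pairwise (2 subgraphs)
would be kept.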
     This left us with 1992 diagrams for analysis. The analysis of the influence
of project ownership on layout directions includes duplicates; it is based on a
total of 7396 processes, of which 2745 remained after determining metrics and
relevance filtering.


4    Execution & Statistics
To obtain process and layout direction counts, both authors manually and
independently classified the layout direction of each diagram. After the first
round, the classifications differed for approx. 10% of the diagrams; these
differences were resolved in a shared session to reach a unified understanding.

     Fig. 3: Comparison of Flow Node Count for different Diagram Subsets
     (two plots, y-axis: Number of Flow Nodes, 0 to 125;
     (a) Toy vs. Non-Toy, (b) Pools vs. No Pools)


    The total distribution of layout directions is shown in the first data column
of Table 1: The most common layout direction was left-to-right, followed by
“other” layouts, which cover chaotic layouts or layouts too unclean to classify,
and top-down layouts. More advanced layouts like snake or multi-line layouts are
rarely used.
    We further classified each BPMN model as a “toy” model or not by searching
for the keywords “test”, “dummy”, and “example” in the complete file name
including the path. The distribution of layout directions for toy vs. non-toy
processes is shown in the middle columns of Table 1. Left-right layouts are
used even more frequently in toy diagrams, while top-down layouts are used
more often in non-toy diagrams. We performed a simulated Fisher’s Exact
Test (100,000 rounds) to test whether the distribution of layout directions
differs significantly between the toy and non-toy subsets.
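
    The keyword heuristic for toy models can be sketched as follows. The sketch
is an illustration only: the function name is_toy and the case-insensitive
substring matching are assumptions, since the exact matching rules are not
specified here.

```python
# Keywords named in the text; matching them case-insensitively is an assumption.
TOY_KEYWORDS = ("test", "dummy", "example")


def is_toy(path: str) -> bool:
    """Return True if any toy keyword occurs in the file name including path."""
    lowered = path.lower()
    return any(keyword in lowered for keyword in TOY_KEYWORDS)
```

Note that plain substring matching also flags paths such as
“docs/latest/flow.bpmn” (“latest” contains “test”), which is one reason why
such a heuristic is only a first approximation.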
    Next, we calculated the number of pools and flow nodes in the processes
with the BPMN Layout Analyzer. The distribution of layout directions for BPMN
diagrams with or without pools is shown in the last two columns of Table 1.
Left-right layouts are used more often with pools than without, and top-down
layouts are used more often without pools than with pools. We again used a
simulated Fisher’s Exact Test (100,000 rounds) to check whether the
distribution of layout directions differs significantly between diagrams with
and without pools. This test again yields a highly significant p-value
(p = 9.9999 × 10^-6).
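
    A simulated Fisher’s Exact Test can be sketched as a Monte Carlo procedure:
the category labels are shuffled repeatedly, which keeps the row and column
totals fixed, and one counts how often the shuffled contingency table is at
most as probable as the observed one. The following Python sketch is not the
R code used for the analysis; the function names are hypothetical, and the
p-value formula (1 + hits) / (rounds + 1) is assumed to match what R’s
fisher.test computes with simulate.p.value = TRUE. Under that assumption the
reported value 9.9999 × 10^-6 equals 1 / 100,001, i.e., the smallest p-value
obtainable with 100,000 rounds: no simulated table was as extreme as the
observed one.

```python
import math
import random
from collections import Counter


def table_statistic(rows, cols):
    """-sum(log(n_ij!)); for fixed margins this is the log-probability of the
    contingency table up to an additive constant."""
    counts = Counter(zip(rows, cols))
    return -sum(math.lgamma(n + 1) for n in counts.values())


def simulated_fisher_p(rows, cols, rounds=2000, seed=0):
    """Monte Carlo p-value for independence of two categorical variables."""
    rng = random.Random(seed)
    observed = table_statistic(rows, cols)
    shuffled = list(cols)
    hits = 0
    for _ in range(rounds):
        rng.shuffle(shuffled)  # keeps both margins of the table fixed
        # Count tables that are at most as probable as the observed one;
        # the small tolerance guards against floating-point noise.
        if table_statistic(rows, shuffled) <= observed + 1e-9:
            hits += 1
    return (1 + hits) / (rounds + 1)
```

Here rows could hold the subset of each diagram (e.g., “pools” or “no pools”)
and cols its layout direction, one entry per diagram.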
    In a next step we analyzed the size of the diagrams, measured as the number
of flow nodes, for toy vs. non-toy diagrams (see Fig. 3a) and for diagrams with
or without pools (see Fig. 3b): The mean number of flow nodes is 11.24 in the
toy subset and 13.4 in the non-toy subset. A Wilcoxon hypothesis test for a
difference of means yields a p-value of 0.0131. The mean number of flow nodes
is 18.08 for diagrams with pools and 11.34 for diagrams without pools. A
Wilcoxon hypothesis test for a difference of means yields a highly significant
p-value of 1.514 × 10^-19.

Fig. 4: Absolute and Relative Numbers of Processes by Layout and BPMN Editor
(heat map; y-axis: Exporter, x-axis: Layout, color scale: Frac. layouts from
0.25 to 1.00; the absolute numbers shown in the cells are listed below)

LR = left-right, MH = multiline-horizontal, MV = multiline-vertical,
SH = snake-horizontal, SV = snake-vertical, TD = top-down, - = empty cell

Exporter                                  LR   MH   MV  other   SH   SV   TD
Activiti                                 515    4    -     79   25    4  125
ADONIS                                     3    -    -      -    -    -    -
Bizagi                                     2    -    -      -    -    -    -
BonitaSoft                                 2    -    -      3    -    -    -
BPMN Modeler                               1    -    -      -    -    -    -
BPMN Studio                                1    -    -      -    -    -    -
bpmn.io                                   34    -    -      6    -    -    -
Camunda (unknown)                         26    -    -      1    -    -    -
Camunda Modeler (new)                    234    2    -     12    5    -    -
Camunda Modeler (old)                     53    2    -      2    1    -    -
Drools Flow                              324    1    2     28    5    -   39
Eclipse BPMN                              37    1    -      2    1    -    1
Enterprise Architect                       1    -    -      -    -    -    -
Fix Flow                                  18    -    -      2    -    -    -
Flowable                                  10    -    -      4    -    -    -
ibo.NET                                    1    -    -      -    -    -    -
jBPM Designer                            103    1    -     31    -    -    -
jBPM Process Modeler                       -    -    -      1    -    -    -
Process Modeler 6 for Microsoft Visio      1    -    -      -    -    -    -
ProM                                       1    -    -      -    -    -    -
Signavio                                 105    -    -      8    -    -    1
Trisotech Workflow Modeler                13    -    -      -    -    -    -
unknown                                   17    -    -      5    -    -    -
Yaoqiang BPMN Editor                      78    -    -      2    2    -    -
Zeebe Modeler                              4    -    -      -    -    -    -

    The “BPMN Layout Analyzer” tool also extracts the exporter metadata
(which describes the BPMN tool that wrote the file) from BPMN files. When
no exporter metadata was found, heuristics, e.g., namespace names, were
used to find a potential BPMN editor. However, some ambiguities remain,
e.g., Camunda and bpmn.io appear under different names, and we do not know for
sure whether these names changed between releases or whether the exporter info
was set incorrectly by some other BPMN tool.
    We broke down the number of diagrams by layout direction and BPMN editor
as shown in Fig. 4: With small exceptions, nearly all tools have only left-right
or other layouts. However, both Activiti and Drools also have a large
number of top-down layouts. They are practically the only editors that have
been used to create top-down layouts, although the majority of diagrams created
with these tools still follow a left-right layout. We performed a simulated
Fisher’s Exact Test (100,000 rounds) to test whether the distribution of layout
directions is independent of the modeling tool used. This test yields a highly
significant p-value of p = 9.9999 × 10^-6.

    Lastly, we analyzed the layout direction “cleanliness” for the repositories.
For each repository we calculated the most dominant layout direction for all
diagrams contained therein. Then we calculated the percentage of diagrams
that have this layout direction compared to all diagrams within this repository.
Thus, cleanliness of 100% means that all diagrams in such a repository have the
same layout direction. 85.02% of all repositories had the same layout direction
for all their diagrams. Furthermore, we performed a simulated Fisher’s Exact
Test (100,000 rounds) for the different layout direction distributions against the
repositories, which yields a highly significant p-value of p = 9.9999 × 10^-6.
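
    The cleanliness measure described above can be sketched as follows. The
function names and the input format, a list of (repository, layout) pairs with
one pair per diagram, are assumptions made for this illustration.

```python
from collections import Counter, defaultdict


def cleanliness_by_repository(diagrams):
    """Map each repository to the share of its diagrams that have the
    repository's most dominant layout direction."""
    per_repo = defaultdict(Counter)
    for repository, layout in diagrams:
        per_repo[repository][layout] += 1
    return {
        repository: max(counts.values()) / sum(counts.values())
        for repository, counts in per_repo.items()
    }


def share_of_uniform_repositories(cleanliness):
    """Fraction of repositories whose diagrams all share one layout direction."""
    uniform = sum(1 for value in cleanliness.values() if value == 1.0)
    return uniform / len(cleanliness)
```

For a repository with two left-right diagrams and one top-down diagram the
cleanliness is 2/3; a repository with a single diagram always has a cleanliness
of 100%.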


5    Interpretation

RQ1: What layout directions are used and how is the usage distributed?
We found that the majority (79.52 %) of diagrams are laid out left-right. Although
we do not know what the causal reasons are, the left-right layout is predominantly
used as recommended by theory, existing guidelines, and the BPMN specification.
RQ2: What influence does the modeling tool have on layout direction?
The hypothesis test is highly significant, indicating that the modeling tool has
an impact on the diagram layout direction. Interestingly, the Activiti and Drools
modelers account for nearly all top-down layouts. However, it is unclear at
this point why these tools have been used for top-down layouts, which warrants
further investigation. Many editors have a preference for left-right layouts,
e.g., Camunda and Signavio. Thus, investigating editor preferences and linking
them to the layouts actually used may give more insights.
RQ3: What influence does project ownership have on layout direction?
Layout directions differ highly significantly depending on the project ownership,
i.e., the owning repository. While we could not dive deep into the data yet,
85.02% of the repositories followed only one layout paradigm; the others had
diagrams with different layouts. This suggests that there are forces that make
the diagrams in a project more similar. Future research can investigate what
those forces are (e.g., same developers, guidelines, ...).
RQ4: Are “toy” diagrams laid out differently?
Within the dataset, toy diagrams have a highly significantly different layout
distribution and are significantly smaller with regard to their flow node
count (p = 0.0131). As such we conclude that “toy” diagrams are not
representative of the set of “non-toy” BPMN diagrams, and future research
should consider whether to include or exclude them depending on the research
questions.
RQ5: Are diagrams with pools laid out differently?
Within the dataset diagrams with pools have a highly significantly different
layout distribution and are highly significantly larger with regards to their flow
node count. As such we conclude that for future research into BPMN models and
diagrams, pool and non-pool diagrams should be researched separately.

    Because this is an exploratory study based on existing data without any
control, all these correlations may be causal or may be due to confounding
factors. Further research is required to establish the type of relationship.


6    Threats to Validity

As in software engineering research, a general threat is the usage of GitHub data
that might not be representative and generalizable [6]. In fact, we have shown
that further analysis must take diagram types into account. Also, due to the
manual nature of the layout classification, other researchers might come to
other results. We also experienced problems with the diagram interchange
information in the BPMN files, which can skew the results towards compliant
editors. Lastly, the tool distribution found on GitHub does not match that found
in organizations (e.g., IBM, SAP, and Oracle tools are missing; Signavio is
underrepresented). Some BPMN models had more than one diagram, a case which we
did not evaluate. On a technical note, the exporter information in the BPMN
files is not as reliable as one would hope: Missing information and ambiguous
names are possible sources of error.


7    Conclusion & Outlook

Within this paper we have shown that the majority of BPMN diagrams found
on GitHub are laid out left-right. The results suggest that the type (“toy” or
“non-toy”) of a process model influences the size and the layout direction. We have
found that further research into tool usage is warranted as nearly all top-down
diagrams are laid out using only two editors. Furthermore, we have shown that
most – although not as many as expected – repositories only contain diagrams
with one layout.
    This work opens up new research angles: 1) How can “real” models be separated
from “toy” models automatically? Our heuristic of using keywords in the file
path is a first approximation, but while manually classifying the layouts we also
encountered other diagrams (empty or default labels, unfinished diagrams, ...)
that should possibly be excluded. 2) The exact causal relationships between model
properties (size, pools), modeling tooling, and the diagram layout need to be
researched. We showed correlations but no causal relationships in this work. 3)
This study should be replicated and compared to model repositories from larger
organizations created by process modelers in their respective environments. 4)
Layout direction should be formalized and its detection automated in order to
scale to larger datasets and make the classification more objective. All in all,
this explorative work has laid the foundations for answering these questions.

                              Bibliography


1. Angela Birchler, Elisabeth Bosshart, Mike Märki, Peter Opitz, Jürg
   Pauli, Beat Rigert, Yves Sandoz, Marc Schaffroth, Nicki Spöcker,
   Christian Tanner, Konrad Walser, and Thomas Widmer. eCH-0158
   BPMN-Modellierungskonventionen für die öffentliche Verwaltung. WWW:
   https://www.ech.ch/dokument/fb5725cb-813f-47dc-8283-c04f9311a5b8,
   September 2014.
2. Kathrin Figl and Mark Strembeck. On the importance of flow direction in
   business process models. In 2014 9th International Conference on Software
   Engineering and Applications (ICSOFT-EA), pages 132–136. IEEE, 2014.
3. Kathrin Figl and Mark Strembeck. Findings from an experiment on flow
   direction of business process models. In EMISA 2015, 2015.
4. Thomas Heinze, Viktor Stefanko, and Wolfram Amme. BPMN in the wild:
   BPMN on github.com. In Proceedings of the 12th ZEUS Workshop on Services
   and their Composition, pages 26–29. CEUR-WS.org, 2020.
5. Thomas S. Heinze, Viktor Stefanko, and Wolfram Amme. Mining BPMN
   processes on GitHub for tool validation and development. In Selmin Nurcan,
   Iris Reinhartz-Berger, Pnina Soffer, and Jelena Zdravkovic, editors, Enterprise,
   Business-Process and Information Systems Modeling, pages 193–208, Cham,
   2020. Springer International Publishing.
6. Eirini Kalliamvakou, Georgios Gousios, Kelly Blincoe, Leif Singer, Daniel M.
   German, and Daniela Damian. The promises and perils of mining GitHub.
   In Proceedings of the 11th Working Conference on Mining Software Reposito-
   ries, MSR 2014, pages 92–101, New York, NY, USA, 2014. Association for
   Computing Machinery.
7. Daniel Lübke, Ana Ivanchikj, and Cesare Pautasso. A template for categorizing
   empirical business process metrics. In Business Process Management Forum -
   BPM Forum 2017, 2017.
8. Daniel Lübke, Tobias Unger, and Daniel Wutke. Analysis of data-flow com-
   plexity and architectural implications. In Daniel Lübke and Cesare Pautasso,
   editors, Empirical Studies on the Development of Executable Business Pro-
   cesses, pages 59–80. Springer, 2019 (to be published).
9. Bruce Silver and Bruce Richard. BPMN Method and Style, volume 2. Cody-
   Cassidy Press Aptos, 2009.