<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>BPMN in the Wild: BPMN on GitHub.com</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Thomas S. Heinze</string-name>
          <email>thomas.heinze@dlr.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Viktor Stefanko</string-name>
          <email>viktor.stefanko@uni-jena.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wolfram Amme</string-name>
          <email>wolfram.amme@uni-jena.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Friedrich Schiller University Jena</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>German Aerospace Center</institution>
          ,
          <addr-line>DLR</addr-line>
        </aff>
      </contrib-group>
      <fpage>26</fpage>
      <lpage>29</lpage>
      <abstract>
        <p>We present our efforts in creating and analyzing a corpus of BPMN process models by mining software repositories. Systematically searching for BPMN process artifacts in 6,163,217 repositories, or 10% of all repositories hosted on GitHub.com at the time of conducting our research, resulted in a diverse corpus of 8,904 BPMN 2.0 process models.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Within the last years, an increasing number of software projects have shifted towards using platforms such as GitHub.com for their software development. Using these platforms as a source of data for empirical research allows for addressing a wide range of questions on the practice of software development and receives more and more attention, as indicated by the popularity of the flagship conference on the topic, the International Conference on Mining Software Repositories (MSR)<sup>1</sup>. Research in the domain of business process modeling can benefit from such a data-driven approach as well. Due to characteristics of the domain, i.e., “process equals product”, there is a lack of larger and commonly available datasets with real-world process models, which hinders empirical research in this area [2,11,13]. Mining software repositories, i.e., systematically retrieving, processing and analyzing process models from software repositories hosted on platforms such as GitHub.com, can help to overcome this lack and provides a complementary approach to empirical research besides existing methods like case studies, experiments, and surveys. For example, research questions on how a language such as the Business Process Model and Notation (BPMN) [1] is used in practice can be addressed, in order to differentiate between the frequently and the rarely used parts of the language, thus advancing language and tool development. Analyzing modeling styles furthermore allows for investigating best practices and guidelines to help process designers. Eventually, best practices and tools as proposed by academic research or industry can be evaluated more realistically [12].</p>
      <p>In this paper, we present our approach for mining software repositories on GitHub.com to create and analyze a corpus of BPMN process models. Due to the sheer number of repositories on GitHub.com and time constraints, we limited our approach to a randomly selected subset of 6,163,217 repositories, or 10% of all software repositories on GitHub.com at the time of conducting the research. As a result, we were able to identify and analyze 8,904 distinct process models that are defined using BPMN 2.0’s XML-based serialization format.</p>
      <p>[Fig. 1: diagram of the data mining pipeline. Steps: (1) Repository Selection, (2) Data Extraction, (3) Filtering / Cleansing, (4) Analysis. Labeled data sources: GHTorrent, GitHub, GitHub API v3.]</p>
      <p><sup>1</sup> http://www.msrconf.org</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        The Lindholmen dataset has been an inspiration for this paper [
        <xref ref-type="bibr" rid="ref12 ref5">5,12</xref>
        ]. In Hebig et
al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], the authors describe their approach to mining GitHub.com for UML models
and report on the insights gained. The dataset is considerably larger than our corpus,
counting 93,596 models [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. However, UML is a family of general-purpose modeling
languages, while BPMN is a single domain-specific modeling language. We are not
aware of other work that mines software repositories for BPMN models.
      </p>
      <p>
        There have also been community efforts to create model collections [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. The
BPM Academic Initiative provides a platform to create and share business
process models for academic teaching [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. According to Ho-Quang et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ],
the most recent number of models is 29,285, but data collection has been discontinued and
the focus is on conceptual models, as most models originate from students. A
similar platform was introduced last year under the name RePROSitory [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ],
including 174 business process models in its current database. Another initiative
is the BenchFlow project, where business process models were collected from
industrial partners. The authors claim to have collected 8,363 models, with a
share of 64% of BPMN [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Unfortunately, the collection is not publicly available.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Mining BPMN on GitHub.com</title>
      <p>Mining software repositories is a data mining task that consists of defining
a research objective, selecting and extracting appropriate data, preprocessing
and cleansing the data, analyzing it, and finally interpreting the analysis results.</p>
      <p>In the first step of our data mining pipeline (see Fig. 1), we obtained a list of all software repositories on GitHub.com by querying a
local instance of the GHTorrent<sup>2</sup> database. We then randomly selected a subset
of 6,163,217 non-forked repositories. In the second step, all 6,163,217 repositories were examined for
potential BPMN process model artifacts using the GitHub API<sup>3</sup>. To this end, the default branch and its file structure were queried for
each repository. Potential BPMN process model artifacts were then identified
by searching for the term "bpmn" in their file name and file extension. Among
the analyzed repositories, we found 1,251 repositories with at least one potential
BPMN process model artifact, and 21,306 artifacts overall. We downloaded the
identified repositories and artifacts. In the third step, since the artifacts included
a wide range of file formats, we filtered for BPMN 2.0’s XML-based serialization
format, which lowered the number of artifacts to 16,907. Additionally removing
duplicates<sup>4</sup> yielded the corpus of 8,904 distinct BPMN 2.0 models. In the fourth step, all
BPMN artifacts were finally subject to a preliminary analysis.
Information on the corpus and the analysis outcomes is available online [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]<sup>5</sup>.</p>
      <p><sup>2</sup> http://ghtorrent.org/</p>
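      <p>The second and third pipeline steps described above can be illustrated with a minimal Python sketch. It is not the crawler used in the study: the function names are hypothetical, the case-insensitive match on the file name is an assumption, and duplicates are detected here by a SHA-256 hash of the file content rather than by the duplicate finder referenced in the text. The only fixed value is the namespace of the BPMN 2.0 model schema.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import hashlib
import xml.etree.ElementTree as ET

# Namespace of the BPMN 2.0 model schema as defined by the OMG standard.
BPMN_MODEL_NS = "http://www.omg.org/spec/BPMN/20100524/MODEL"


def is_candidate(path):
    """Step 2: a file is a potential artifact if 'bpmn' occurs in its name or extension.

    Assumption: the match is case-insensitive and ignores directory names.
    """
    file_name = path.replace("\\", "/").rsplit("/", 1)[-1]
    return "bpmn" in file_name.lower()


def is_bpmn20_xml(content):
    """Step 3: keep only files whose XML root is a BPMN 2.0 definitions element."""
    try:
        root = ET.fromstring(content)
    except ET.ParseError:
        # Images, PDFs and other non-XML artifacts end up here.
        return False
    return root.tag == "{" + BPMN_MODEL_NS + "}definitions"


def build_corpus(files):
    """Filter (path, bytes) pairs and drop byte-identical duplicates.

    Assumption: two artifacts are duplicates if their content hashes are equal.
    """
    seen = set()
    corpus = []
    for path, content in files:
        if not (is_candidate(path) and is_bpmn20_xml(content)):
            continue
        digest = hashlib.sha256(content).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        corpus.append(path)
    return corpus
```
]]></preformat>
      <p>Called with a list of (path, content) pairs, build_corpus returns the paths of the distinct BPMN 2.0 files in their original order. A hash-based comparison only catches exact copies; models that differ in whitespace or element order would need a more tolerant comparison.</p>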
    </sec>
    <sec id="sec-4">
      <title>Preliminary Analysis</title>
      <p>
        In our preliminary analysis, we were mainly interested in the diversity of the
BPMN process model artifacts found. Here, we sketch some of the results. Looking
at the artifacts’ age, more than a third were modified within the year preceding
our research. However, we also found artifacts older than 8
years. Using the locations of repository contributors allowed us to reason about the
artifacts’ geographical origin, where China, the USA, and Germany played prominent
roles. The corpus spans a range of different model sizes. While half of the process
models are smaller than 20 nodes, we also identified 57 models with more than
1,000 nodes. We were also able to confirm the finding reported in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] that models
play a rather static role in software repositories: up to three quarters of all the
BPMN process model artifacts were never updated at all.
      </p>
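      <p>How a model size in nodes might be measured can be sketched as follows. The text does not define which elements count as nodes, so this Python sketch assumes that a node is any element in the BPMN 2.0 model namespace whose name ends in one of a few suffixes (tasks, events, gateways, subprocesses, call activities, transactions). The function names and the suffix list are hypothetical and may differ from the counting used in the study.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import statistics
import xml.etree.ElementTree as ET

BPMN_MODEL_NS = "http://www.omg.org/spec/BPMN/20100524/MODEL"

# Assumption: elements whose lowercased name ends in one of these suffixes are nodes.
NODE_SUFFIXES = ("task", "event", "gateway", "subprocess", "callactivity", "transaction")


def count_nodes(content):
    """Count the node elements of one serialized BPMN 2.0 model (given as bytes)."""
    root = ET.fromstring(content)
    prefix = "{" + BPMN_MODEL_NS + "}"
    count = 0
    for element in root.iter():
        if not element.tag.startswith(prefix):
            continue  # skip diagram interchange and vendor extension elements
        local_name = element.tag[len(prefix):].lower()
        if local_name.endswith(NODE_SUFFIXES):
            count += 1
    return count


def size_summary(models):
    """Return the median model size and the number of models above 1,000 nodes."""
    sizes = [count_nodes(model) for model in models]
    return {
        "median": statistics.median(sizes),
        "over_1000": sum(1 for size in sizes if size > 1000),
    }
```
]]></preformat>
      <p>Under this definition, sequence flows, event definitions and diagram elements are not counted, so a model with a start event, one task, one gateway and an end event has four nodes.</p>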
      <p>
        Since the design of BPMN process models is known to be error-prone, we
were also interested in the number of errors found in the models and the need
for analysis tools to help process designers in avoiding those. Various analysis
tools have been developed in recent years, ranging from simple linters [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], through
tools based on data flow analysis [
        <xref ref-type="bibr" rid="ref6 ref8">6,8</xref>
        ], to full-fledged model checkers [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Note
that most of the tools are evaluated using case studies or artificial process models.
Therefore, evaluating analysis tools using our corpus of 8,904 BPMN process
models allows verifying existing tool evaluations by complementary
empirical means. We have chosen the linting tool BPMNspector<sup>6</sup> [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] for checking
process models with respect to their compliance with the BPMN 2.0 standard [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
Running the linter revealed violations of the standard’s rules for almost all of the
process models in the corpus. Only 1,471 models were identified as valid BPMN
process models, thus confirming the results for the case study used to evaluate
BPMNspector in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], which found 42 invalid models among 66 BPMN models overall.
      </p>
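      <p>The proportions behind this comparison can be made explicit. The short Python sketch below is not part of the original study; it only derives percentages from the counts reported above, and the helper name share is hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def share(part, total):
    """Return part/total as a percentage, rounded to one decimal place."""
    return round(100 * part / total, 1)


# Counts as reported in the text.
corpus_total, corpus_valid = 8904, 1471
case_total, case_invalid = 66, 42

# Models with at least one reported violation in the corpus.
corpus_invalid_share = share(corpus_total - corpus_valid, corpus_total)
# Invalid models in the case study used to evaluate the linter.
case_invalid_share = share(case_invalid, case_total)

print(corpus_invalid_share, case_invalid_share)
```
]]></preformat>
      <p>With these counts, the script prints 83.5 and 63.6, i.e., roughly five in six models of the corpus and roughly two in three models of the case study violate at least one rule.</p>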
      <sec id="sec-4-1">
        <title>Conclusion</title>
        <p>
          In this paper, we introduced our approach of systematically extracting a corpus of
BPMN business process models from software repositories hosted on GitHub.com.
Mining 10% of all software repositories, at the time of conducting our
research, resulted in 8,904 distinct serialized BPMN 2.0 process models. We believe
that our corpus of BPMN models provides a starting point for understanding
more about the practice of BPMN. Note, though, the general limitations of
repository mining [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. In future work, besides increasing the coverage of
analyzed software repositories, we want to investigate questions about BPMN’s
use on GitHub.com, e.g., which constructs are used frequently and which rarely, or whether
there are certain characteristics that can be used to predict modeling errors [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
        </p>
        <p><sup>3</sup> https://developer.github.com/v3; <sup>4</sup> http://doubles.sourceforge.net; <sup>5</sup> https://github.com/ViktorStefanko/BPMN_Crawler; <sup>6</sup> https://github.com/uniba-dsg/BPMNspector</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. Business Process Model and Notation (BPMN), Version 2.0. Object Management Group (OMG) Standard
          (
          <year>2011</year>
          ), https://www.omg.org/spec/BPMN/2.0/PDF
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Corradini</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fornari</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Polini</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Re</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tiezzi</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>RePROSitory: a Repository Platform for Sharing Business PROcess modelS</article-title>
          .
          <source>In: BPM PhD/Demos</source>
          <year>2019</year>
          . pp.
          <fpage>149</fpage>
          -
          <lpage>153</lpage>
          . CEUR (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Fahland</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Favre</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jobstmann</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koehler</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lohmann</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Völzer</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wolf</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Instantaneous Soundness Checking of Industrial Business Process Models</article-title>
          .
          <source>In: BPM 2009</source>
          . pp.
          <fpage>278</fpage>
          -
          <lpage>293</lpage>
          . Springer (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Geiger</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neugebauer</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vorndran</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Automatic Standard Compliance Assessment of BPMN 2.0 Process Models</article-title>
          .
          <source>In: ZEUS 2017</source>
          . pp.
          <fpage>4</fpage>
          -
          <lpage>10</lpage>
          . CEUR (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Hebig</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Quang</surname>
            ,
            <given-names>T.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chaudron</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Robles</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fernandez</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          :
          <article-title>The Quest for Open Source Projects that use UML: Mining GitHub</article-title>
          .
          <source>In: MODELS 2016</source>
          . pp.
          <fpage>173</fpage>
          -
          <lpage>183</lpage>
          . ACM (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Heinze</surname>
            ,
            <given-names>T.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Amme</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moser</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Static analysis and process model transformation for an advanced business process to Petri net mapping</article-title>
          .
          <source>Softw.: Pract. &amp; Exp</source>
          .
          <volume>48</volume>
          (
          <issue>1</issue>
          ),
          <fpage>161</fpage>
          -
          <lpage>195</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Heinze</surname>
            ,
            <given-names>T.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stefanko</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Amme</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Mining von BPMN-Prozessartefakten auf GitHub</article-title>
          .
          <source>In: KPS 2019</source>
          . pp.
          <fpage>111</fpage>
          -
          <lpage>120</lpage>
          . DHBW Stuttgart (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Heinze</surname>
            ,
            <given-names>T.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Türker</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Certified Information Flow Analysis of Service Implementations</article-title>
          .
          <source>In: SOCA 2018</source>
          . pp.
          <fpage>177</fpage>
          -
          <lpage>184</lpage>
          . IEEE (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Ho-Quang</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chaudron</surname>
            ,
            <given-names>M.R.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Robles</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Herwanto</surname>
            ,
            <given-names>G.B.</given-names>
          </string-name>
          :
          <article-title>Towards an Infrastructure for Empirical Research into Software Architecture: Challenges and Directions</article-title>
          . In: ECASE@ICSE
          <year>2019</year>
          . pp.
          <fpage>34</fpage>
          -
          <lpage>41</lpage>
          . IEEE (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Kalliamvakou</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gousios</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blincoe</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Singer</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>German</surname>
            ,
            <given-names>D.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Damian</surname>
            ,
            <given-names>D.E.</given-names>
          </string-name>
          :
          <article-title>The Promises and Perils of Mining GitHub</article-title>
          .
          <source>In: MSR 2014</source>
          . pp.
          <fpage>92</fpage>
          -
          <lpage>101</lpage>
          . ACM (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Kunze</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Luebbe</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weidlich</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weske</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Towards Understanding Process Modeling - The Case of the BPM Academic Initiative</article-title>
          . In: BPMN 2011 Workshops. pp.
          <fpage>44</fpage>
          -
          <lpage>58</lpage>
          . Springer (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Robles</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ho-Quang</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hebig</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chaudron</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fernandez</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          :
          <article-title>An extensive dataset of UML models in GitHub</article-title>
          .
          <source>In: MSR 2017</source>
          . pp.
          <fpage>519</fpage>
          -
          <lpage>522</lpage>
          . IEEE (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Skouradaki</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roller</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leymann</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferme</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pautasso</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>On the Road to Benchmarking BPMN Workflow Engines</article-title>
          .
          <source>In: ICPE 2015</source>
          . pp.
          <fpage>301</fpage>
          -
          <lpage>304</lpage>
          . ACM (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>