Social-Collaborative Inductive Reference Model Mining in a Knowledge-Based Organization
Andreas Sonntag
Saarland University, 66123 Saarbrücken, Germany
Abstract. The aim of reference model mining is to support the efficient execution of process instances. Social network analysis has great potential for reference model mining as it reveals social and functional relations, which are critical for the efficiency of collaborative business processes. This study demonstrates and evaluates an approach to applying social network analysis to a human aspect of reference model mining. The approach is based on a dynamic performer network, which is an evolving social collaboration network in a knowledge-based organization. For this purpose, agent-based simulation is applied to a longitudinal dataset concerning researcher collaboration in an internationally renowned 'center of excellence' for industry-oriented research in the field of artificial intelligence. The resulting performer network can be used as a reference model for efficient researcher collaboration, and it is reusable for the future process execution of similar organizations.
Keywords: Reference Model Mining, Human Aspects of BPM, Researcher Collaboration.
1 Introduction
One important requirement for individuals who participate in a multitude of business processes is the availability of business process models which can be executed efficiently. A reference model should represent an efficient and reusable implementation of processes in an organization, simplifying internal structures to reduce the complexity and resources needed for business process mining [1], [2]. Our research community has largely ignored the influence of social collaboration between the people working on processes, called “performers” in the following. Since business processes entail social processes, it becomes essential to employ social network analysis for reference model mining in the knowledge work domain.
Previous research on collaboration around knowledge-based processes in the field of business process management includes the following. Tomasello and colleagues [3], [4] investigate the formation and performance of collaboration networks by evaluating, over time, the link formation events that involve a knowledge flow between the collaborating parties. Kim and Busch [5] introduce a method to interpret workflows using social networks inferred from interviews and questionnaires with employees. The authors seek to identify gaps between the management view of processes and their actual execution. Sonntag et al. [6] introduce a concept for the derivation of a reference process model from knowledge-based process models. There, a performer network with topological success structures that hold for a corpus of given process models denotes a reference process model. Sonntag and Fettke [7] introduce the concept of "generated performer networks" for optimizing the efficiency of process models, determining "topological success properties" by applying agent-based simulation to a given event-driven process chain.
Building on the static performer network concept of [6], we introduce a procedure to derive a reference model from a process execution observed in a knowledge-based organization. This reference model consists of a reference social collaboration topology that should be similar to that of a real organization, independent of an organization-specific assignment of performers to process functions. The reference model is able to explain the past evolvement of the organization and shows its reusability by predicting the organization’s future evolvement concerning social collaboration topology and efficiency.
Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
The paper is structured as follows: first, we describe fundamental terms and methods; then we explain our approach. This is followed by the experimental design, its evaluation and the discussion of our findings. A conclusion closes the paper.
2 Performer Networks
Social network analysis (SNA) investigates similar social structures in organizations, especially in the communication behavior between individuals, mainly by applying methods of graph analysis [8]. SNA should be applied to aggregated data when the observation scale increases [9]. In addition, structural network properties matter more for the outcome than the actual human relationships [11]. Hence, the social topology of people traditionally has a greater impact on the result of their collaboration than particular aspects of their collaboration such as e-mailing frequency, sympathy or location. Many SNA theories and studies are based on simplified but plausible and broadly examined social structures, particularly the formation of clusters/groups, the emergence of hierarchy, sparsity and short paths [9], [12]–[15].
A process model (PM for short) is represented as an event-driven process chain. This is a directed graph structure consisting of a set of edges that indicates an order between a set of process functions and operators. Each process function has a label, which is an expression in natural language and describes the action(s) to be done at this point in the process flow. Events and other (meta) information that might be provided are not considered. A performer network (PN for short) extends a PM by a social network of collaborating performers. Performers are agents who work on process functions with a set of capabilities in a PN. For the purposes of this study, capabilities are simplified as a set of mappings between a process function and a number indicating the extent to which the performer is capable/efficient at working on this function.
Every mapping value in [0;1] is possible, even mappings to nonexistent functions in order to represent ineffective performer/capability combinations. PNs are formalized as social networks that contain performers as nodes and two kinds of edges. Each performer has a unique ID, social connections, capabilities and the ability to work on the process functions to which the performer is assigned. Social edges connect performers, and functional edges connect performers with process functions (assigned_nodes). The remaining edges are process edges, which connect process nodes.
Additional definitions: A path is a sequence of edges that connects two nodes. The degree of a node is the number of directly connected neighbors. The mean degree of a PN is the degree averaged over all performers. The average clustering coefficient is the actual number of edges between an agent’s neighbors divided by the possible number of such edges, averaged over all performers. A PN’s density is defined as $\frac{2m}{n(n-1)}$, with n the number of performers and m the number of social edges in the PN.
McKinty and Mottier [16] see process efficiency as the potential to save time and money in process execution. They require all processes to follow a sequence of tasks over time. From a resource-oriented perspective, efficiency can be seen as the minimal use of resources for a certain goal. A comparable definition for the efficiency of an organization is summarized by Schäfermeyer et al. [17], who define the term "organizational effectiveness" as more than financial profit, namely also as the efficient work of employees and managers for the outcome of the organization. Social collaboration is a resource in the scope of organizational structure. In the context of process execution in an organization, a goal is the completion of a sequence of tasks that is based on the process design. The time needed to reach the goal depends on the coordination of the actors participating in the effort to complete the tasks.
Other resources that cannot (inter)act autonomously, such as inanimate goods (steel, paper etc.), can be assumed to have a constant influence on goal achievement, as they can only be used and cannot contribute an effort. This means that only actors can influence their utilization. Actors in turn need collaboration to utilize resources, especially if different resources require different capabilities to deal with them.
As an indicator of the impact of hierarchy on a PN, we measure the number of key performers for each year. Adopting the hub definition of Goldenberg et al. [18], a key performer is defined as an individual with a degree more than three standard deviations above the mean degree. Hierarchy is constituted by a very small minority of key performers, standing against a vast majority of performers with a degree around the mean degree. This structure is often observed in real social networks [13].
A formal and quantitative effort/efficiency definition of a PN is introduced and explained in detail in [7] and [6] as an algorithm based on social network analysis and agent-based simulation. The efficiency of a PN can be computed and optimized only in combination with a PM or a set of PMs. Efficiency in this formal context means how little effort the performers need to undertake to complete one or several tasks that are simulated simultaneously through a given PM. The effort for the performers to complete the tasks is the sum of capabilities the performers need to cover the tasks' effort, function by function, following the process control flow. Neighboring performers help each other with half their common capability. PN effort describes the sum of effort for the whole PN to complete all tasks.
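As a minimal sketch of the topological measures used in this section, the following Python snippet computes node degrees, the density 2m/(n(n-1)) and the key performers of a small network. The function names, the edge-list representation and the use of the population standard deviation are assumptions made for illustration; they are not taken from the prototype or from [7], [18].

```python
from statistics import mean, pstdev


def degrees(social_edges, performers):
    """Return the number of directly connected neighbors per performer."""
    neighbors = {p: set() for p in performers}
    for a, b in social_edges:
        if a != b:  # ignore self-loops
            neighbors[a].add(b)
            neighbors[b].add(a)
    return {p: len(ns) for p, ns in neighbors.items()}


def density(social_edges, performers):
    """Density 2m / (n(n-1)) with n performers and m distinct social edges."""
    n = len(performers)
    m = len({frozenset(e) for e in social_edges if e[0] != e[1]})
    return 2 * m / (n * (n - 1)) if n > 1 else 0.0


def key_performers(social_edges, performers):
    """Performers whose degree exceeds the mean degree by more than 3 sigma.

    Assumption: the population standard deviation is used; the text does
    not say whether the sample or population variant is meant.
    """
    deg = degrees(social_edges, performers)
    threshold = mean(deg.values()) + 3 * pstdev(deg.values())
    return sorted(p for p, d in deg.items() if d > threshold)


if __name__ == "__main__":
    # Hypothetical star-shaped team: performer 0 collaborates with 20 others.
    team = list(range(21))
    links = [(0, i) for i in range(1, 21)]
    print(density(links, team), key_performers(links, team))
```

In the star example only the central performer qualifies as a key performer, which mirrors the hierarchy pattern of a small minority of highly connected individuals.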
The efficiency of a PN in combination with a PM, called “PN efficiency”, is defined on the basis of [7] as $1 - \frac{\text{PN effort}}{\text{max\_effort}}$, where max_effort is the maximal possible PN effort, which refers to a PN consisting of only one incompetent performer who has a universal capability of 0.1 and is assigned to all process functions. The PN efficiency definition punishes a PN with many capable performers. A different definition of efficiency, which we refer to as activity intensity, comes from [3], [4]: the number of collaboration events (tasks in our context) on which an agent worked in a time window in relation to the total number of collaboration events involving all agents in the network in the same time window, averaged over all agents.
Finding an efficient PN around PM(s) requires finding an efficient combination of social topology, functional topology and capability distribution. We establish a social topology using the power cluster network generator of Holme and Kim [19], an extended version of the Barabási-Albert model [12] that replicates all of the plausible social structures mentioned above. This network generator was tested in [7] and found to be very effective for generating efficient PNs around a variety of process models.
3 Approach
Our goal is to optimize the efficiency of the social collaboration topology, measured by means of PN efficiency [7], so that a set of performers collaborates to accomplish all given tasks over a period of time. Our approach is based on agent-based simulation [20] and social network analysis [21], [22]. For the subsequent experimental design, we make the following assumptions for the approach, which are also limitations:
1. All performers work 100 percent on all of their tasks. They have a constant minimal competence for other tasks.
2. An organization is reduced to performers, tasks, and social and functional assignments.
3. All tasks follow the same process model; the model topology implies serial and parallel work.
4. All tasks have the same effort of 1.0, simulated or not.
5. An organization grows at a linear pace in terms of the number of tasks and performers.
6. The efficiency of social collaboration is measured with activity intensity [4] and PN efficiency [7].
7. Time periods are discrete and equidistant, independent of the exact effort needed.
Algorithm 1. Simulates a Timeline of Performer Networks (PNT[t : timepoint]) to Fulfill a Given Set of Tasks
With simulateTimeline(initial_Pnumber, tasks), the evolvement of the social links between the performers, the distribution of their capabilities and their process assignments are simulated. initial_Pnumber is the number of performers at initial_timepoint. tasks is a set of tasks, where a task is a process instance or, in other words, one complete execution of a process model. As many performers are created as there were in the observed organization at initial_timepoint. The tasks to do here are those that were done by the observed organization at initial_timepoint (tasks[1988]). The procedure for simulating the evolvement of social edges is based on the extension of the BA model by [19]. Next, the particular tasks are simulated through the generated PN following the approach of [7]. To this end, the necessary capabilities are distributed over the performers (distributeCapabilities). Every performer’s capability for a process function is normally distributed with N(0.5,1), with the co-domain constrained to [0;1]. This gives every performer a chance of 65.5 percent of being more competent than the minimum competence of 10 percent. The assumption here is that every employee always has a minimal knowledge of all process functions. Thus, the flexibility of employees in organizations to substitute for colleagues is taken into account. Next, the performers are assigned to the process functions (assignPerformers).
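The capability distribution described above can be checked with a short Python sketch. It assumes that "constrained to [0;1]" means clipping a draw from N(0.5, 1) to the unit interval (clipping does not change the probability of exceeding 0.1); the function names are hypothetical and do not reproduce distributeCapabilities from the prototype.

```python
import math
import random

MIN_COMPETENCE = 0.1  # minimum competence of 10 percent, as stated in the text


def normal_cdf(x, mu=0.5, sigma=1.0):
    """Cumulative distribution function of N(mu, sigma) evaluated at x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def prob_above_minimum(mu=0.5, sigma=1.0, minimum=MIN_COMPETENCE):
    """Probability that a drawn capability exceeds the minimum competence."""
    return 1.0 - normal_cdf(minimum, mu, sigma)


def draw_capability(rng, mu=0.5, sigma=1.0):
    """Draw from N(mu, sigma) and clip to [0, 1] (assumed reading of the text)."""
    return min(1.0, max(0.0, rng.gauss(mu, sigma)))


def distribute_capabilities(performers, functions, seed=0):
    """Give every performer one capability value per process function."""
    rng = random.Random(seed)
    return {p: {f: draw_capability(rng) for f in functions} for p in performers}


if __name__ == "__main__":
    # Evaluates 1 - Phi((0.1 - 0.5) / 1), which is about 0.655.
    print(round(prob_above_minimum(), 4))
    print(distribute_capabilities(["p1", "p2"], ["write", "review"], seed=42))
```

The computed probability of about 0.655 agrees with the 65.5 percent stated in the text.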
During assignment, a performer picks their process assignment according to their best capability, and other assignments with a probability proportional to the extent of the corresponding capability. With distributeTasks(tasks, performers), the tasks are then uniformly distributed over the performers, so that each task is assigned to $\frac{|tasks|}{|performers|}$ performers. Then, optimizeEfficiency determines and minimizes the PN effort of the PN at that time point with the implementation of [7] and [6]. The optimization reestablishes the social edges by tuning the network generation parameters from [19]. In the next step, the algorithm continues with the next time point (initial_timepoint + 1), in which a new performer network is generated.
4 Experimental Design
We implemented our approach in a Java-based research prototype. The hardware configuration for the execution of the evaluation scenario comprises 64 AMD Opteron(TM) 6272 processors @ 1.40 GHz and 16 GB of RAM. As an evaluation scenario, we employ all scientific publications of the German Research Center for Artificial Intelligence from its foundation with 12 authors in 1988 until 31 December 2017. The organization is the biggest research center in the area of artificial intelligence worldwide, both in terms of number of employees and of external funds; it belongs to Germany’s prestigious "Centers of Excellence". According to the most recent data, it has 480 highly qualified researchers and 376 graduate students from more than 60 countries; they are working on 180 research projects. Many of those research projects, which produced a total of 7520 scientific publications written by a total of 6704 authors, are co-operations with industry.
Fig. 1. Process Model for Research Publication
The number of publications rose from 8 in 1988 to 370 in 2017, with an average annual increase of 12.48 (standard deviation: 41.82).
The PM in figure 1 describes the minimal requirements for the typical process of publishing research papers, which are explicitly documented in the organization. Only authors are performers. Each publication is a task for the performers to fulfill. With simulateTimeline(12, tasks), for each publication, a PN is derived from the available co-authorship data (not with the social network generator), consisting of the authors involved and the year of publication. The set tasks corresponds to all 6904 publications mentioned above. The first PN has 12 performers as there were 12 authors in 1988. For each year, performers are connected if they are co-authors. As a result, we have a timeline of PNs consisting of one PN of all co-authorships for each year. In the following, this timeline is called “observed PNT”. simulateTimeline(12, tasks) is also executed with the same parameterization but with generated social edges. The result is referred to as the reference PNT. To restate the design implication, both PNTs are simulated with simulateTimeline, each under the same assumptions.
To be valid, the results of our experimental evaluation scenario must be reproducible (reliability), must explain the model quality (internal validity) and must be generalizable/transferable (ecological validity) [23]. Reliability is established by test-retest, that is, the repetition of all PN efficiency simulations within the approach. The model quality is quantified by the statistical effect between the starting parameters and the evolvement of the model parameters. To demonstrate the generalizability of the approach, a significant correlation between both efficiency measures (see section 2) of the observed PNT and the generated reference PNT has to be shown. We also predict the organization’s future evolvement by the same procedure with which the reference PNT is developed. For the prediction, the observed PNT from 2010 until 2017 forms the basis from which the reference PNT is evolved.
The prediction is limited to 2022 because our memory capacity is exhausted at this point of the computation.
5 Evaluation
The scenario described above yields two simulation results, the observed and the reference PNT, both covering the timespan in which the organization existed. As stated in section 1, we want to explain the evolvement of efficiency in the observed organization with a reference PNT generated by the simulation described in section 3. Figure 2 compares activity intensity, PN effort and PN efficiency between the reference (ref) and the observed (obs) PNT over time. Topological properties (average clustering, density and mean degree) of the reference and the observed PNT are also compared. The average slopes of all efficiency values and topological properties have, over all years, the same sign.
Fig. 2. Reference (ref) vs Observed (obs) Performer Network Timeline with Fit Line
The PN effort for the task execution correlates with the number of tasks over time (r = 0.97, p < 0.01) for both PNTs. The PN effort increases for both PNTs but is much greater for the observed PNT. In the observed PNT, density and PN efficiency correlate with r = −0.99 (p < 0.01). This correlation for the reference PNT amounts to r = −0.91 (p < 0.01). The correlation between activity intensity and PN efficiency in the observed PNT amounts to r = 0.49 (p < 0.01). The same correlation for the reference PNT amounts to r = 0.35 (p < 0.05).
The independence of the PN efficiency from individual time windows has to be ensured because we want to show that the efficiency of performer collaboration depends on the social collaboration structure and not on the arrangement of single time windows. Therefore, we tested the hypothesis that the mean of all PN efficiency values for each time point is significantly different from the rolling mean over all possible time windows of 3 years' width. The hypothesis was rejected with p < 0.05.
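The two quantities compared in this window-independence check can be sketched in Python as follows. This is only an illustration under assumptions: the function names are hypothetical, the input values are made up, and the significance test itself is not specified in the text and is therefore not reproduced.

```python
from statistics import mean


def rolling_means(values, width=3):
    """Mean of every contiguous window of `width` consecutive time points."""
    if width < 1 or width > len(values):
        raise ValueError("width must be between 1 and len(values)")
    return [mean(values[i:i + width]) for i in range(len(values) - width + 1)]


def mean_gap(values, width=3):
    """Absolute gap between the overall mean and the mean of rolling means."""
    return abs(mean(values) - mean(rolling_means(values, width)))


if __name__ == "__main__":
    # Hypothetical yearly PN efficiency values, not taken from the paper.
    efficiency = [0.90, 0.92, 0.95, 0.97, 0.98, 0.99]
    print(rolling_means(efficiency, width=3))
    print(mean_gap(efficiency, width=3))
```

A small gap between the two means is consistent with the efficiency not depending on how single time windows are arranged; a statistical test on real data would still be needed to support that claim.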
The number of key performers evolves almost linearly over time in the observed and the reference PNT. On average per year, the number of key performers grows by 0.21 (standard deviation: 9.6) in the observed PNT. In the reference PNT, 0.1 key performers are added per year on average (standard deviation: 1.74). The more tasks are to be done, the more key performers appear in both PNTs. For both PNTs, the following variables have an impact on the number of key performers (in descending order): tasks, PN effort, activity intensity. All influences are strongly positive with r > 0.7 (p < 0.05). The PN efficiency has no significant influence (r = 0.25, p < 0.1) on the number of key performers for the observed and the reference PNT. This insignificance is caused by the PN efficiency increasing suddenly and more than linearly once there are more than 2 key performers.
We predict the organization’s future evolvement from 2010 until 2022 and compare it to the observed evolvement in the same time window. The observed PN efficiency stays almost constant at 0.9999 between 2010 and 2017. Our predicted PNT becomes 0.0003 percent more efficient over time. The activity intensity increases linearly over time for the observed and the predicted PNTs. The absolute values, however, are different. In our prediction, the activity intensity is always greater than 0.93, whereas the observed activity intensity lies between 0.0003 and 0.0007. The PN effort decreases from 6384 to 4905 in our prediction. In the observed PNT, the PN effort ranges from 25096 to 36756. The average clustering coefficient in the predicted PNT is on average 123 percent higher than in the observed PNT between 2010 and 2017. Density falls from 0.0061 to 0.0057 in the predicted PNT and increases from 0.02 to 0.03 in the observed PNT. The average mean degree increases by 0.002 percent for the predicted PNT and by 33 percent for the observed PNT.
6 Discussion
Activity intensity and effort increase almost linearly for the observed and the reference PNT, which means that the reference PNT reaches, for each point in time, an equal or even higher efficiency with a lower effort than the observed PNT. All topology parameters, as plotted in figure 2, show slopes of the same sign over time for the observed and the reference PNT. The mean degree of the observed PNT increases linearly over time, while its reference counterpart reaches its maximum already after 10 years. For both PNTs, the density decreases over time and correlates with the increasing PN efficiency. The density in the observed PNT is much greater than in the reference PNT for all time points, while the density within the observed clusters is much higher than in the reference clusters. The reference PNs have sparse clusters, which are, however, densely inter-connected. Because of the significant correlation between density and PN efficiency, the density of inter-connection between clusters drives the collaboration efficiency more than the intra-connectivity of teams. The significant correlation between activity intensity and PN efficiency speaks for the generalizability of our approach, as both efficiency measures indicate an increase of effective collaboration effort. That means that processes in a knowledge-based organization can be modelled efficiently based on the connection of social and functional reference topologies found by this approach.
We predict the organization’s future evolvement until 2022. The positive evolvement of activity intensity and PN efficiency over time indicates the evolvement of an efficient social topology around the publication process based on the observed PN in 2017. During the same period, the PN effort decreases in the predicted PNT, in contrast to the observed PNT. Our explanation for this contrast is that the PNs in the predicted PNT have many more social edges between clusters than the observed organization has.
Meanwhile, in the observed organization, more key performers appear on average than in our reference PNT. This implies that the observed performers completed their tasks with a similar efficiency but with more PN effort and more densely connected key performers. Our reference PNT, including the predicted PNT, reaches a smaller PN effort through more social edges between clusters. That way, less collaboration effort, that is, fewer social edges, is needed between the performers at a common process function to be efficient. The key performers seem to play an important role in the organization’s collaboration coordination. Most key performers are managers, for example research group leaders, which means that the team-overarching cooperation between managers is a critical structure for efficiency. This means that the social link generation from our PNT simulation procedure (see algorithm 1) can be used to reproduce an efficient collaboration topology in an evolving knowledge-work organization. Our PN simulation can thus be seen as a reference for the efficient placement of personnel around a process to be executed in an evolving organization. Translated into a recommendation for modelling efficient collaboration, a performer network should attract more team members around managers and become less dense over time. This evolvement is supposed to lower the costs of social communication by shortening paths in a growing organization.
Our assumptions for the PN efficiency simulation, the simplification of the co-authorship process and the constrained generalizability of co-authorship towards knowledge-based business processes in general entail limitations of our PN model and our findings. The social environment of the authors and their resource/knowledge allocation and transfer are not taken into account. In addition, our approach has no explicit time limit for the end of the PNT.
This means that the simulated organization can take a longer or a shorter time to complete all given tasks.
7 Conclusion
In this paper, the performer network concept of [7] is applied to a set of tasks executed by real collaborative knowledge workers in order to generate a dynamic performer network that completes the given tasks efficiently. By comparing the performer network efficiency to a different measure of collaboration efficiency, the activity intensity [4], topological structures of collaboration similar to those observed over time were replicated. We tend to regard the performer networks replicating this topology as generalizable references, which may be used by practitioners as guidelines for inferring efficient performer networks around process models of other knowledge-working organizations. Furthermore, our approach can quantify the trade-off between team size, density, hierarchy and efficiency-critical social links for the (re)design of processes in such organizations. In particular, this includes the practical issue of determining the number and position of team/division leaders and knowledge/information transfer hubs necessary for a certain process. By applying our "reference topology", this issue can be resolved before the process is even established. Thus, the risks and costs of process execution become more transparent and controllable. In future work, we aim to compare our results to further real-world performer networks by using a larger and more diverse data set, for example by evaluating event logs of the execution of business processes. In particular, we want to understand how exactly process model topologies affect the performance of the assigned performers.
References
1. vom Brocke, J.: Design principles for reference modelling: Reusing information models by means of aggregation, specialisation, instantiation and analogy. In: Innovations in Information Systems Modeling: Methods and Best Practices, University of Muenster, Germany (2009).
2. Scheer, AW., Nüttgens, M.: ARIS Architecture and Reference Models for Business Process Management. In: Business Process Management - Models, Techniques, and Empirical Studies, W. van der Aalst, J. Desel, and A. Oberweis (eds), pp. 376–389, Springer, Berlin (2000).
3. Tomasello, MV., Tessone, CJ., Schweitzer, F.: The effect of R&D collaborations on firms’ technological positions, Proc. 10th Int. Forum IFKAD, no. June, pp. 260–276 (2015).
4. Tomasello, MV.: Collaboration networks: their formation and evolution, ETH Zurich (2015).
5. Kim, ED., Busch, P.: Workflow Interpretation via Social Networks, Springer, Cham, pp. 241–250 (2016).
6. Sonntag, A., Fettke, P., Loos, P.: Inductive Reference Modelling Based on Simulated Social Collaboration (2017).
7. Sonntag, A., Fettke, P.: Efficiency Of Generated Performer Networks In Collaborative Business Process Models. In: 18th IEEE Conference on Business Informatics 1, pp. 26–34 (2016).
8. Adamic, L., Adar, E.: How to search a social network, Soc. Networks, vol. 27, no. 3, pp. 187–203 (2005).
9. Easley, D., Kleinberg, J.: Networks, crowds, and markets: Reasoning about a highly connected world, Cambridge University Press (2010).
10. Watts, DJ., Dodds, PS., Newman, MEJ.: Identity and Search in Social Networks, Science, vol. 296, no. 5571, pp. 1302–1305 (2002).
11. Johnson-Cramer, ME., Parise, S., Cross, RL.: Managing Change through Networks and Values, Calif. Manage. Rev., vol. 49, no. 3, pp. 85–109 (2007).
12. Barabási, AL., Albert, R.: Emergence of scaling in random networks. In: The Structure and Dynamics of Networks, Princeton University Press, pp. 349–352 (2011).
13. Leskovec, J., Kleinberg, J., Faloutsos, C.: Graphs over time: densification laws, shrinking diameters and possible explanations. In: Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, pp. 177–187 (2005).
14. Nadel, SF.: The Theory of Social Structure, vol. 23, no. 2 (1957).
15. Li, M., Fan, Y., Chen, J., Gao, L., Di, Z., Wu, J.: Weighted networks of scientific communication: The measurement and topological role of weight, Phys. A Stat. Mech. its Appl., vol. 350, no. 2–4, pp. 643–656 (2005).
16. McKinty, C., Mottier, A.: Designing efficient BPM applications: a process-based guide for beginners, 1st ed. O’Reilly Media, Inc, USA (2016).
17. Schäfermeyer, M., Grgecic, D., Rosenkranz, C.: Factors influencing business process standardization: A multiple case study (2010).
18. Goldenberg, J., Han, S., Lehmann, DR., Hong, JW.: The role of hubs in the adoption process, J. Mark., vol. 73, no. 2, pp. 1–13 (2009).
19. Holme, P., Kim, BJ.: Growing scale-free networks with tunable clustering, Phys. Rev. E, vol. 65, no. 2, p. 26107 (2002).
20. Bonabeau, E.: Agent-based modeling: Methods and techniques for simulating human systems, Proc. Natl. Acad. Sci., vol. 99, no. 3, pp. 7280–7287 (2002).
21. Otte, E., Rousseau, R.: Social network analysis: a powerful strategy, also for the information sciences, J. Inf. Sci., vol. 28, no. 6, pp. 441–453 (2002).
22. Wasserman, S., Faust, K.: Social network analysis: Methods and applications, vol. 8. Cambridge University Press (1994).
23. Campbell, D., Stanley, J.: Experimental and quasi-experimental designs for research. London: Ravenio Books (2015).