Clustering Software Developer Repository Accesses with the Cophenetic Distance David Sánchez-Charles1 , Josep Carmona2 , Victor Muntés-Mulero1 , Marc Solè Simo1 1 CA Strategic Research, CA Technologies, Spain {David.Sanchez,Victor.Muntes,Marc.Sole}@ca.com 2 Universitat Politècnica de Catalunya, Spain jcarmona@cs.upc.edu Abstract. In this paper we report a case study on the use of the cophenetic dis- tance for clustering software developer behavior accessing to a repository. This distance was recently proposed for the comparison of tree-based process models. We show how hierarchical clustering techniques over the cophenetic distance are capable of detecting homogeneous clusters corresponding to well-defined soft- ware developer roles. The techniques have been evaluated on real-life data from a software development project, and the quality of the clusters over the cophenetic distance is compared with similar techniques in the literature. 1 Introduction Globalization of large enterprises is encouraging the industry to move towards Software as a Service (SaaS) solutions and communicate in digital platforms. One of the key ben- efits of these SaaS solutions is that any employee can access it whenever and wherever they can, and with limited resources such as personal laptops or mobile phones. Besides, these technologies enables for decentralized collaboration with peers in completely dif- ferent time zones, providing diversity and flexibility in the workplace. In the context of software development, Apache Subversion (SVN) and GitHub are two software versioning and revision control systems that allows software developers to collaborate in the maintenance and development of software, by monitoring changes in files such as source code, web pages and documentation. Those system have a record of actions performed by users, in order to increase visibility of developers or in case some changes need to be reverted. The objective of this paper is to test if we are able to recover the original function (or role) of some workers of an organization, based on the actions performed in the software repository. Instead of directly analyzing sequences of actions, we propose to summarize them in the form of a process model. I.e. we will assume that a process model is capable of representing the behavior of the individuals in the platform. On the long-run, this is the first step to assess if one is able to gain some knowledge about individuals by analyzing, and comparing, how they behave in a digital platform. In case of success, the techniques of this paper open the door for interesting appli- cations, ranging from profiling of user behavior, detection of outlier behavior that may be suspicious of malware or fraud, user interfaces and processes that adapt to the be- haviour of the user, and, in general, other scenarios in which we want to predict some 100 attribute of the process owner. It will be of particular interest in crowdsourcing projects, in which organizations outsource particular internal processes to large corpora of vir- tual workers across the globe, for allowing organization to understand which type of workers are contributing to their problem resolution. 2 Methodology In this paper we will try to group users with a similar role by analyzing the actions that they performed in a digital platform. The following diagram summarizes the approach we followed in this paper: Process Model Process Discovery Clustering Comparison First, we preprocess the actions performed by the users and apply a process dis- covery technique over them. Following the process discovery, we will apply a process model comparison for measuring the pairwise dissimilarity of all process models and, implicitly, the dissimilarity of the human behavior that they represent. Finally, a cluster- ing technique will be applied in order to discover groups of users with similar process models. Section 2.1 briefly discusses the discovery of the process models that represents the behavior of the users. Section 2.2 covers the considered similarity metrics on this case study, and then Section 2.3 briefly mentions some clustering techniques appropriated for this scenario. Later in Section 3, we will evaluate such metrics and clustering techniques by applying them on a real-scenario in which the roles of the individuals are already known. 2.1 Process Discovery The first step would be to generate a representation of the behavior of users in the digital platform. For this we have chosen to discover process models that summarizes the sequences of actions performed by users. Analyzing behavior over process models, instead of directly on the sequences, allow us to not focus on the specifics (such as, for instance, activity A has been repeated 5 times) but the general overview of how the user behaved (following the example, a process model simply states that activity A is usually repeated several times, whilst the number of times is not relevant). Some sort of preprocessing may be needed in order to generate a process model. In particular, we may need to clean event names or ensure that traces represent independent runs of the underlying process. In our case, traces must compromise actions in a single session, with a unique purpose. In case this information is not provided, some heuristics must be applied in order to artificially split those sessions. For example, a login event may clearly separate two different sessions. As for the process discovery itself, we will use the infrequent Inductive Miner [7]. This algorithm provides a good generalization of process models, can be efficiently ap- plied to large event logs, and its output is a process tree – as required by some distances used in Section 2.2. 101 2.2 Process Model Comparison The state of the art techniques for comparing process models can be naively split into structural and behavioral. In the former, process models are considered as labeled graphs and the comparison is regarded as edit operations over their edges, nodes or both. In contrast, behavioral techniques focus at the comparison of the execution semantics of the compared models. If we focus on the case of structural comparison of process models, techniques based on graph edit operations have been defined [5,3,4,12]. In particular, we will consider the graph edit distance as defined in [5] for representing the group of structural similarity metrics. This distance counts the number of modifications (addition or removal of nodes and edges) must be performed in order to transform one process model into the other. Analogously to structural techniques, behavioral ones compare how activities are related to each other. A behavioral profile of a process is a representation of the process model as an n × n matrix, where n is the number of tasks in the process [11]. An el- ement wi,j in the matrix states the behavioral relation between the activity represented by the row i and the column j, which can be causality (typically depicted by >), reverse causality (<), mutually exclusive (∨) or co-occurrence (∧) in the case of causal behav- ioral profiles. Behavioral profiles provides one mechanism for comparing the behavior of two process models, by counting the number of cells in which the two processes do not have the same behavioral relation. A B C D A ∧ ∧ > B ∧ ∨ > C ∧ ∨ > D<<< Fig. 1. Example of a Casual Behavioral Profile for a process model. Activity A is concurrently executed (∧) with activities B and C, whilst activities B and C cannot be executed both in the same run as they are mutually exclusive (∨). Finally, activity D is a consequence (>) of activities A, B and C. In [9], we showed that the boundary between structural and behavioral metrics is not very clear, and some metrics capture behavioral differences even though they are purely defined in an structured manner. The Cophenetic Distance fall in this fuzzy boundary, and we will also consider it in this study. 102 In the rest of this section, we provide an informal explanation of the Cophenetic dis- tance, whilst the rest of distances are well-known in the literature. The use of this metric limits the scope of the study to process trees in which activities are not duplicated. Nev- ertheless, the inductive miner algorithm already provides such type of process models, and such constraint also makes feasible the computation time of the aforementioned techniques. Cophenetic Distance over Process Trees A process tree is a labeled rooted tree T in which activities are represented as leaves of the tree and internal nodes describe the control-flow of the process. For the sake of simplicity, we will label internal labels as OR3 , AND, SEQ and LOOP to represent the usual behavioural structures in a process model. We will also denote by gateways to these internal nodes, following the BPMN nomenclature. Children of a SEQ gateway are ordered in order to represent the sequen- tial ordering of the subprocesses they represent. We allow silent activities by labeling them as ∅. Figure 2 depicts an example of a sequential process with a optional branch and two concurrent activities. On the right, the same process is represented as a process tree. SEQ A OR D AND ∅ C B Fig. 2. Example of a structured process and its translation into a process tree. Activities B and C are concurrently executed, but are completely optional. The process always starts with activity A and ends with activity D. In any process tree T , the deepest common ancestor of two nodes u and v–denoted by [u, v]T –holds the direct causal relationship of the two nodes, and the depth of the common ancestor denotes the complexity of the process structure up to this behavioural decision. The depth of [u, v]T is known as Cophenetic value, simply denoted by DepthT 3 Following the semantics of block-structured models in [1], only exclusive ORs are modeled. 103 ([u, v]T ), and the Cophenetic vector is the collection of such Cophenetic values for ev- ery possible pair of nodes u and v. Authors in [2] show that the Cophenetic vectors are enough to discern the structure of a certain class of labeled trees, which includes process trees without activity repetitions. And therefore, we could define a structural distance between process trees based on their cophenetic values. Definition 1. Let T and T 0 be two process trees, and S the set of activities of the two trees. Their cophenetic distance is X dϕ (T, T 0 ) = |DepthT ([i, j]T ) − DepthT 0 ([i, j]T 0 )| i,j∈S For the sake of simplicity Depth([i, j]T ) is zero if either i or j are not present in the process tree T . SEQ1 ABCDE ABCDE SEQ1 A 3 2 1 1 1 A 3 2 1 1 1 OR2 AND2 B 3 1 1 1 B 3 1 1 1 OR2 OR2 C 4 3 2 C 4 3 2 A3 B3 OR3 E3 A3 B3 AND3 E3 D 4 2 D 4 2 C4 D4 E 3 E 3 C4 D4 Fig. 3. Example of process trees and their cophenetic vector (in matrix representation), assuming the depth of the root is 1. For simplicity, we included node’s depth as a subscript of the label. Unfortunately, cophenetic vectors are not enough for determining behavioral sim- ilarity: for instance, Figure 3 depicts two similar processes where two internal nodes have been interchanged, but share the same cophenetic vector due to havin identical structures. In [9], we presented a new approach to compare process trees using the cophenetic distance that, by modifying the notion of tree depth, enables us to overcome such issue and leverages the traditional structural comparison with the behavioural in- formation hard-coded into the new depth. Let’s take the left model of Figure 4 to illus- trate some of the rules defined in [9]. Activities A and B originally had depth 3, but 2.5 considering the proposed depth definition: the depth of their parent, which is 2, plus 0.5 given the dichotomy of the exclusive choice they are representing. The AND gate- way was originally at depth 2 since it was a direct child of the root, whilst we are now positioning it at depth 3.5 as if it was a direct consequence of the two activities A and B. These consecutive depths are triggered by the behavioral function of its parent node – a sequential construct. One can check [9] for more details on how this new depth is defined. 2.3 Clustering of Process Models At this point, we already have a process model summarizing the behavior of each indi- vidual as explained in section 2.1. We have computed a matrix D of distances between 104 SEQ1 A B C D E A B CD E SEQ1 A 2.5 2 1 1 1 A 2.5 2 1 1 1 OR2 AND3.5 B 2.5 1 1 1 B 2.5 1 1 1 OR2 OR3.5 C 5 4.5 3.5 C 5 4 3.5 A2.5 B2.5 OR4.5 E4.5 D 5 3.5 D 5 3.5 A2.5 AND4 E4 B2.5 E 4.5 E 4 C5 D5 C5 D5 Fig. 4. Example of process trees and their cophenetic vector (in matrix representation), using the depth described in [9]. For simplicity, we included node’s depth as a subscript of the label. all process models, i.e. cell Di,j includes the distance between the process models of users i and j. This distance may be difned by the graph edit distance [5], differences in their causal behavioral profiles [11] or the cophenetic distance [9] as briefly discussed in section 2.2. In this final step, our objective is to create groups of individuals that are close to each other with respect to the distance matrix D. The standard de-facto technique for clustering based on a distance matrix is Hierarchical clustering. Hierarchical clustering Hierarchical clustering [8] follows a bottom-up approach, in which every individual starts within their own cluster and then iterates by merging the two most similar clusters into a bigger cluster. One tackles the issue of finding the two most similar clusters by averaging the distances of the individuals within the clusters. The only drawback of this technique is that the number of groups must be manually fixed. In our experiments, we will run the hierarchical clustering for all the possible number of groups. 3 Evaluation In this section, we test the methodology explained in Section 2 with a real industrial dataset. First, we describe the provided behavioural data of 200 individuals and their organizational roles in the company. Process discovery is used to summarize the be- havioural data from each individual, and dissimilarities between those processes are meant to measure differences in the behaviour of the individuals. Then, we test some clustering approaches with three different similarity metrics to measure how good they approximate the original organizational rules. 3.1 Framework of the evaluation Apache Subversion (SVN) is a software versioning and revision control system. Soft- ware developers use SVN software to collaborate in the maintenance and development of software, by monitoring changes in files such as source code, web pages and docu- mentation. All accesses to a SVN repositroy are done through HTTP/S, as specified in the WebDAV/DeltaV protocol. It turns out [10] that those read and write requests over HTTP/S can be translated to human-friendly SVN commands such as svn update or svn 105 commit. Continuining the work done by Li Sun et.al. [10], we model the behaviour of developers by using process discovery techniques. First, SVN commands are retrieved from the system and considered as events of a system that represents the developer, and then a trace is defined as all commands executed during a complete business day. As already mentioned in Section 2.1, we discover Process Trees using the default settings of the Inductive Miner Plugin (ProM 6.5.1). This industrial dataset contains all the accesses of more than 200 individuals to one repository of CA Technologies in production for three years. After pruning users with few accesses to the repository, 83 individuals were kept in the study and their organizational roles were retrieved at the end of the monitoring phase. In particular, 37 Forward Engineering, 19 Quality Assurance Engineers, 16 Sustaining Engineer, 5 Support, 2 Services, 1 SWAT Engineer, 1 Infrastructure, 1 Technical Writers. The following list summarizes the responsibilities for each role. – Forward Engineers (R1) are in charge of the implementation of new features. – Quality Assurance Engineers (R2) plan, run and design use cases or tests. – Sustaining Engineers (R3) are in charge of solving defects, as well as ensuring that software successfully passes all tests. – SWAT engineers (R4) are in charge of implementing custom integrations. – Support (R5), Services (R6) and Infrastructure Engineers (R7) interact with inter- nal and external customers with respect to defect detection and solution, software installation and configuration, and maintenance of the infrastructure of Software as a Service solutions provided by the company. Support Engineers might push some quick fixes into products. – Technical Writers (R8) collaborate with Forward, Sustaining and Quality Assur- ance Engineers for creating helpful Knowledge Base and User Guides. Technical Writers are asked to use the source code repository to maintain different versions of the documentation. Among all the engineers, and fairly distributed among roles, 9 individuals are Man- agers of a team. Besides, one agent is labeled as a bot, although the purpose of such bot is unknown to the authors of this paper. Notice that one has the possibility of advanc- ing in their career and change to another department, and, therefore, some individuals might have been misclassified as their latest role. Infrastructure and Service Engineers are not supposed to access the repository in their usual pipeline and, therefore, might have been promoted during the project. Nevertheless, clustering may help us to deduce their original roles in the organization. During the rest of the evaluation we plan to answer the following question in regard of this scenario: – How good is clustering of process models for approximating the original role of the individuals? – Which is the expected role of the bot? And what about the role of other anomalies? 3.2 Homogeneity of roles in process-based Clustering In order to measure the quality of the clustering, we will use the purity as the metric for measuring the homogeneity of the discovered groups. Let C = {C1 , . . . , Cm } be a 106 clustering of the process models, and Ri (Cj ) be the number of individual in cluster Cj with the role Ri , then the purity is defined as 1 X Purity(C, R) = max Ri (Cj ) Number of processes Ri j∈C In other words, the purity computes accuracy as if we label all individuals inside a group with the most popular role inside it. In particular, very heterogeneous groups of individuals will lead to a poor purity. Role Manager Fig. 5. The solid line depicts the evolution in the purity of the Cophenetic-based hierarchical clustering as the number of clusters increase, whilst the dashed (resp. dotted) line depicts the graph edit distance (resp. Behavioural Profiles). Two different experiments were performed for detecting individuals’ role and their status as managers. Figure 5 depicts the purity obtained from the SVN-repository dataset with respect to the number of clusters using hierarchical clustering. The X-axis represents the number of clusters considered, and its respective purity is represented by the Y -axis. The solid line depicts the purity of the clustering obtained based on the Cophenetic distance, and one can notice that its purity if significantly bigger when the number of clusters is low. On the right graphic, only the status as a Manager has been considered. Due to the low number of Managers in the dataset, these results denotes the ability of efficiently filter out abnormal behaviour. Starting from 50 clusters, the Cophenetic distance starts to classify worse than the Graph Edit Distance. Nevertheless, at that stage, clusters have 1 or 2 individuals and, therefore, are not representative of the use of clustering techniques. Table 1 summarizes the precision and recall of the hierarchical clustering of Figure 5 when we set the number of clusters to 6. Again, we consider the classification as if the predicted label of all elements inside a cluster is the role of the more popular role inside the cluster. These results show again the capabilities of the Cophenetic distance to highlight the Manager role and provides, in general, better results for all roles, except the Sustaining Engineer. On the other hand, GED and Behavioral Profiles tend to group all individuals in a very big, and heterogeneous, cluster and, therefore, we obtain those roles with high recall but low precision (and those with high precision but very low recall). 107 Cophenetic GED Behavioral Profiles Precision Recall Precision Recall Precision Recall Manager 0.71 0.55 No individual was labelled Forward Engineer 0.64 0.76 0.46 0.97 0.45 1.00 Sustaining Engineer 0.00 0.00 0.57 0.26 0.00 0.00 Quality Assurance 0.34 0.57 No indiv. 1.00 0.05 Support 0.50 0.40 No individual was labelled Others No individual was labelled Table 1. Precision and Recall for each of the roles in the organization by considering a hierarchi- cal clustering with 6 groups. In some cases, none of the groups had enough representation of a role. As for the bad performance with respect to Sustaining Engineers, notice that re- sponsibilities of the Sustaining (R2), Quality Assurance (R3), SWAT (R4) and Support engineers (R5) are all related to defects and bug fixing, and, therefore, they may share some common behaviour and practices. Besides, the number of Sustaining Engineers is slightly below the number of Quality Assurance Engineers, and, hence, it is more likely to label users as Quality Assurance Engineers in case of grouping them together.Table 2 summarizes the precision and recall for a clustering of 6 groups. Notice that precision and recall of the Forward Engineer category are not significantly affected in the case of the cophenetic distance, indicating the existence of groups with a strong presence of Forward Engineers. On the other hand, precision and recall are very affected in both GED and Behavioral Profiles cases. The results provided by the GED are an indication of one or more small groups groups of Forward Engineers (perfect precision, but low recall), and a big group in which half of the developers have a role in R2345 and the rest are Forward Engineers or other minor roles. As for behavioral profiles, results are slightly worse than the cophenetic distance, but still incapable of detecting the group of Managers. We have run the same experiments using DBSCAN [6] as the clustering method. The key benefit of DBSCAN is that the number of clusters is not fixed prior to the clustering, as it defines clusters as groups of individuals that are densely together4 . Un- fortunately, results are significantly worse than the provided by the hierarchical clus- tering – with purity not surpassing 0.5 across several hyperparameter of the DBSCAN algorithm. Inducing the real role of outliers Some role anomalies were present in the dataset. For instance, two individuals were classified as Service Engineers (R6) although ac- cessing to the source code repository is not part of their responsibilities. As we have already mentioned, the role data was obtained during the finalization of the project and, hence, the worker may have changed from one department to another. In this case, one service engineer (R6) is more close to Quality Assurance Engineers (R3), and the other is close to a group of Forward Engineers (R1). The three distances are consistent with 4 I.e. for every process model in the cluster, there must be at least k process models at distance less or equal than d. Both k and d are manually fixed. 108 Cophenetic GED Behavioral Profiles Precision Recall Precision Recall Precision Recall Manager 0.71 0.55 No individual was labelled Forward Engineer (R1) 0.60 0.78 1.00 0.05 0.65 0.35 R2345 0.72 0.63 0.50 1.00 0.56 0.85 Others No individual was labelled Table 2. Precision and Recall for each of the roles in the organization by considering a hierar- chical clustering with 6 groups after merging Sustaining, Quality Assurance, SWAT and Support Engineers into a unique role R2345. In some cases, none of the groups had enough representation of a role. these results. With respect to the Infrastructure Engineer (R7), the cophenetic distance and behavioural profiles map this user close to Sustaining Engineers (R2) whilst the graph edit distance relate him to Forward engineers (R1). Finally, with respect to the agent labeled as a BOT, the cophenetic and the graph edit distance group it with other Quality Assurance Engineers (R3). This might be a hint that the bot is indeed an auto- matic testing system. Nevertheless, Behavioural Profiles are less accurate and relate this agent close to a mixed group of Forward (R1), Sustaining (R2) and Quality Assurance engineers (R3). 4 Conclusions and Future Work In this paper, we have applied the cophenetic distance for process models to recover groups of individuals with similar behaviours, and hence similar roles and responsibil- ities. Our approach is based on comparing the behaviour of process models discovered on real logs, instead of comparing them directly on the logs, allowing us to compare a generalization of the behaviour instead of falling into the specificness of traces. For instance, our approach allowed us to realize that a bot was working for a specific team, as this bot behaved as the other team members. We compared our cophenetic approach with three other process similarity metrics and we have seen that our approach consis- tently provides better role retrieval, as well as detecting a small group of individuals acting as Managers of a team. As future work, we would like to investigate the possibility of discovering process models for each cluster such that the trace-fitness within the individuals in the cluster is high, whilst significantly lower when applied to individuals outside the cluster. That would help in understanding the behavior of new users into the system. Acknowledgements. This work is partially funded by Secretaria de Universitats i Re- cerca of Generalitat de Catalunya, under the Industrial Doctorate Program 2013DI062, and the Spanish Ministry for Economy and Competitiveness (MINECO), the European Union (FEDER funds) under grant COMMAS (ref. TIN2013-46181-C2-1-R). 109 References 1. J. C. A. M. Buijs, B. F. van Dongen, and W. M. P. van der Aalst. A genetic algorithm for discovering process trees. In 2012 IEEE Congress on Evolutionary Computation, 2012. 2. Gabriel Cardona, Arnau Mir, Francesc Rosselló, Lucı́a Rotger, and David Sánchez. Cophe- netic metrics for phylogenetic trees, after sokal and rohlf. BMC Bioinformatics, 14(1):1–13, 2013. 3. Remco M. Dijkman, Marlon Dumas, and Luciano Garcı́a-Bañuelos. Graph matching algo- rithms for business process model similarity search. In BPM 2009, Ulm, Germany, Septem- ber 8-10, pages 48–63, 2009. 4. Remco M. Dijkman, Marlon Dumas, Luciano Garcı́a-Bañuelos, and Reina Käärik. Aligning business process models. In EDOC 2009, 1-4 September 2009, Auckland, New Zealand, pages 45–53, 2009. 5. Remco M. Dijkman, Marlon Dumas, Boudewijn F. van Dongen, Reina Käärik, and Jan Mendling. Similarity of business process models: Metrics and evaluation. Inf. Syst., 36(2):498–516, 2011. 6. Martin Ester, Hans-Peter Kriegel, Jrg Sander, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. pages 226–231. AAAI Press, 1996. 7. Sander J. J. Leemans, Dirk Fahland, and Wil M. P. van der Aalst. Discovering block- structured process models from event logs containing infrequent behaviour. In Business Process Management Workshops 2013, pages 66–78, 2013. 8. Fionn Murtagh and Pierre Legendre. Ward’s hierarchical agglomerative clustering method: Which algorithms implement ward’s criterion? J. Classif., 31(3):274–295, October 2014. 9. David Sánchez-Charles, Victor Muntés-Mulero, Josep Carmona, and Marc Solé. Process model comparison based on cophenetic distance. In Business Process Management Forum - BPM Forum 2016, Rio de Janeiro, Brazil, September 18-22, 2016, Proceedings, pages 141– 158, 2016. 10. Li Sun, Serdar Boztas, Kathy Horadam, Asha Rao, and Steven Versteeg. Analysis of user behaviour in accessing a source code repository. Technical report, RMIT University and CA Technologies, 2013. 11. Matthias Weidlich, Jan Mendling, and Mathias Weske. Efficient consistency measurement based on behavioral profiles of process models. IEEE Tr. Soft. Eng., 37(3):410–429, 2011. 12. Zhiqiang Yan, Remco M. Dijkman, and Paul W. P. J. Grefen. Fast business process similarity search. Distributed and Parallel Databases, 30(2):105–144, 2012. 110