A Performance Comparison of Graph Analytic Methods for Supporting Enterprise Architecture Model Maintenance

Nikhitha Rajashekar1, Simon Hacks2, and Nuno Silva3

1 Research Group Software Construction, RWTH Aachen University, Germany
nikhitha93@gmail.com hacks@swc.rwth-aachen.de
2 Division of Network and Systems Engineering, KTH Royal Institute of Technology, Sweden
shacks@kth.se
3 Department of Computer Science and Engineering, Technical University of Lisbon, Lisbon, Portugal
nuno.silva@tecnico.ulisboa.pt

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. Enterprise Architecture (EA) models help enterprise architects to make business decisions and support organisations in understanding and analysing their structure. Creating and maintaining such an EA model is expensive and time-consuming in large organisations. In this paper, the authors provide a review of how graph analytic methods can be used to support EA model maintenance. To achieve the best results for duplicate identification, it is necessary to evaluate and optimise different graph analytic and Machine Learning (ML) algorithms to support the maintenance process.

Keywords: Enterprise Architecture · Model · Maintenance · Graph · Machine Learning

1 Introduction

The practice of Enterprise Architecture (EA) is, inter alia, about providing a blueprint and roadmap for aligning business strategy with IT [14]. In large organisations, it is usually an effortful task for a single individual to understand how the various enterprise components connect and work together. Thus, EA can be used to create a visual representation of the enterprise system.

In today's organisations, it is necessary to deliver insights quickly and to analyse large amounts of enterprise data. In this context, it is essential to understand and model the EA components and their relationships. EA models are composed of a set of components, representing the enterprise's assets, and the links between them. This interpretation allows EA models to be understood as graphs. Graph-based tools give enterprise architects an advantage by offering a unified view of the EA that helps in identifying and understanding it [27, 21].

To provide an effective analysis of the enterprise, the underlying EA repository needs to be sound and up-to-date. Maintaining an EA model that is up-to-date and consistent with the enterprise can be an effortful task due to the size and complexity of EA models, frequent changes in the architecture, and the challenges of collecting and managing EA information from different stakeholders [10]. Due to the increase in information systems, as a manifestation of the digital transformation age, the documentation of enterprise information as an inherent part of EA model design is often regarded as time-consuming, cost-intensive, and error-prone [9, 8, 15].

Previously, we have elaborated on different Machine Learning (ML) techniques to support enterprise architects by identifying possible duplicates within EA models, which cause an unnecessary expansion of the model [4]. In that research, we identified potential algorithms that can support enterprise architects, but we did not focus on the run time of the algorithms. Long computation times prevent our approach from being adopted in working environments.
To cope with this challenge, we set up an ML study addressing the following research question: Which graph analytic method performs best to identify duplicate components in EA models?

In line with the above, we compare different approaches that address a specific kind of challenge in EA model maintenance: the identification of duplicate components in the EA repository. In the following sections, we first discuss the related literature. We then present the case study, including the conducted research methodology, and introduce the different graph analytic approaches used to investigate the EA models. Finally, we illustrate the experimental results, point out the identified limitations, and conclude our work.

2 Related Work

Previous research [7, 27, 29] has analysed EA models as graphs and applied different ML concepts to EA models in order to provide decision support for enterprise architects. The work of Dreyfus and Iyer [7] focuses on representing the complexity of information systems architecture in social network terms and then capturing insights from the graph representation.

Different similarity approaches have been proposed in the literature to identify the similarities and differences between models and their components to be matched. The work of Dijkman et al. [6] presents three similarity metrics to investigate the matching of similar business process models in a given repository, namely (i) structural similarity, which compares element labels by considering the topological structure of business process models; (ii) behavioural similarity, which compares element labels by considering the behavioural semantics of process models; and (iii) label matching similarity, which compares elements based on the words in the labels of business process model elements (string edit distance).

The work of Aier and Schönherr [2] presents a clustering approach for determining the structure of Service Oriented Architectures (SOA). The paper shows the application of clustering algorithms in supporting the design of a SOA. However, the publication does not present the evaluation criteria for comparing different clustering methods. Discovering communities or clusters has been extensively studied in various research areas [11, 25]. Several research papers discuss the comparison of community detection algorithms [22, 23, 32].

Our research is also linked to the field of model reuse for UML diagrams. Ganser et al. [12] developed an approach to encourage the reuse of models or rather parts of models. To this end, they transform the diagrams into graphs and store them in a library to recommend them to the modeller if necessary.

3 Research Approach

As already stated, we would like to identify the best performing algorithm for detecting duplicate components in EA models. Therefore, we compare different ML algorithms inspired by the ML life-cycle model provided by [28]. The main idea is to improve the performance of a model by comparing different algorithms and metrics:

1. Raw data: Any EA model can be expressed in XML, which is mainly used for storage and communication purposes. Thus, we first acquire our EA model in XML format, which serves as input for our approach.
2. Data preprocessing: This step includes transforming raw data into a relevant format and cleaning the data set. We parse the raw data obtained from the elements and relations files to a graph data object using the igraph library [5] (see the sketch after this list). Next, we normalise the data to standardise the range of the independent variables and their features.
3. Feature engineering: The next step is to extract or compute relevant graph feature statistics. The features extracted from the data directly influence the predictive models and, consequently, the results. Extracting the right feature set can boost the performance of a model.
4. Data modelling: After extracting the relevant data set and feature set, the next step is to perform the core ML task. Making use of supervised or unsupervised learning algorithms, we can evaluate similarity and investigate the clusters or communities formed to analyse EA models in different contexts.
5. Model evaluation: The model evaluation mainly supports finding the best performing model based on suitable performance metrics that represent the data and how well the chosen algorithm will work in the future.
6. Performance improvement: Iterating over different graph analytic or ML algorithms and feature sets can lead to optimised results.
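The following sketch illustrates how the preprocessing step could look with the Python interface of the igraph library. The file names and column layout (an elements file with component identifiers, names, and types, and a relations file with source and target identifiers) are assumptions made for illustration and do not reflect the exact export format of our tooling.

```python
# Minimal preprocessing sketch (assumed CSV layout, not the exact format of our tooling):
# elements.csv:  id,name,type        relations.csv:  source,target,type
import csv
from igraph import Graph

def load_ea_graph(elements_path="elements.csv", relations_path="relations.csv"):
    with open(elements_path, newline="", encoding="utf-8") as f:
        elements = list(csv.DictReader(f))
    with open(relations_path, newline="", encoding="utf-8") as f:
        relations = list(csv.DictReader(f))

    g = Graph(directed=False)
    g.add_vertices(len(elements))
    g.vs["id"] = [e["id"] for e in elements]
    g.vs["name"] = [e["name"].strip().lower() for e in elements]  # simple name normalisation
    g.vs["type"] = [e["type"] for e in elements]                  # e.g. the ArchiMate element type

    index = {eid: i for i, eid in enumerate(g.vs["id"])}
    g.add_edges([(index[r["source"]], index[r["target"]]) for r in relations])
    return g

if __name__ == "__main__":
    graph = load_ea_graph()
    print(graph.summary())  # for the airport case this would report 171 vertices and 250 edges
```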
It is essential to know what the optimal solution for a particular scenario looks like in order to evaluate different algorithms. Therefore, we sketched an airport departure system, which depicts the activities of the passengers before boarding an aircraft. This case study strategy provides a richer understanding of the context for the people and the researcher involved. Further, it is used for the evaluation of all algorithms presented in the following. The example is modelled as an EA model based on ArchiMate 3.0.1. The model incorporates the ArchiMate layers business, application, and technology. It consists of 171 different elements and 250 relations. An excerpt of the model is presented in Figure 1.

Fig. 1: An Excerpt of the Immigration Process' Application Layer.

4 Graph Analytic Approaches

We depict the EA model as a graph, representing ArchiMate components as nodes and the relationships between the components as edges. Analysing the EA model as a graph thus provides a deeper understanding of it. As there are several graph analytic techniques, we focus on graph-based similarity and community detection (unsupervised learning).

4.1 Similarity Between Enterprise Architecture Models

Considering an EA model as a graph-like structure, we apply graph similarity techniques to assess the match between EA models. Graph similarity involves determining a degree of similarity between two graphs [16], quantified by a similarity score between 0 (no similarity) and 1 (complete similarity).

Computing node similarity Node matching similarity is based on examining the content of the components' names. Therefore, we compare EA models to each other in which the naming of the components differs only slightly, for example due to typing errors, while both models express the same thing.

Syntactic Similarity. The syntactic approach is related to both the occurrence of terms and the number of words in the text. In this paper, we use the Cosine similarity index [3] and the Jaccard similarity index [20] as syntactic approaches to detect the match between components of two EA models. These two algorithms serve as a starting point for our research and will be extended in the future by other algorithms such as Levenshtein, N-Gram, Q-Gram, or the Sørensen-Dice coefficient.

To find the Cosine and Jaccard similarity score between the phrases (component names) present in two EA models, the phrases are split into words, and the words are then converted to binary vectors.
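As a rough illustration of this step, the sketch below compares two component names by turning them into word sets and binary word-occurrence vectors. The helper names and the example strings are our own assumptions and serve only as a minimal rendering of the two measures, not as our exact implementation.

```python
# Minimal sketch of syntactic name similarity (assumed helpers, not our exact implementation).
from math import sqrt

def cosine_similarity(a: str, b: str) -> float:
    words_a, words_b = a.lower().split(), b.lower().split()
    vocab = sorted(set(words_a) | set(words_b))
    va = [1 if w in words_a else 0 for w in vocab]  # binary occurrence vectors
    vb = [1 if w in words_b else 0 for w in vocab]
    dot = sum(x * y for x, y in zip(va, vb))
    return dot / (sqrt(sum(va)) * sqrt(sum(vb)))

def jaccard_similarity(a: str, b: str) -> float:
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)              # |intersection| / |union|

# Example: two slightly different component names from two EA models.
print(cosine_similarity("Boarding control system", "Boarding control application"))   # ~0.667
print(jaccard_similarity("Boarding control system", "Boarding control application"))  # 0.5
```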
Apart from the cosine similarity, another well-known measure for determining the degree of similarity is the Jaccard similarity index. The Jaccard similarity index measures the similarity between finite sample sets and is defined as the cardinality of the intersection of the sets divided by the cardinality of the union of the sample sets [13].

Latent Semantic Analysis. An alternative method to compare the similarity between EA models is based on fetching the relations between words in the texts. LSA [17] is a corpus-based method that does not use a semantic network, grammar, syntax, or dictionaries. LSA applies a statistical analysis of the latent structures of the documents by finding the underlying meaning or concepts between the documents. It maps the words in a document into a concept space and then compares them within that space.

Computing edge similarity The second similarity metric uses the structure of the EA model. This method is applicable when the node correspondence is known and the edges do not vary significantly. The main idea behind this approach is that a node in one graph is said to be similar to a node in another graph if they share a common set of neighbours. This works recursively [26].

Architect 1 models the application layer as shown in Figure 2a, which shows the presence of the realisation relationship between Airline administration support (Application Component) and Identifying boarding pass (Application Service) and between Boarding control (Application Component) and Security (Application Service). Architect 2 models a similar kind of application layer, as shown in Figure 2b. Both models have an equal number of components with varying edge connections. Since both application components collaborate within the Boarding & departure control system (Application Collaboration), there is a similarity between the models created by the two architects irrespective of the edge connections.

Motivated by the work of Liben-Nowell and Kleinberg [18], we selected the most widely used structural similarity measures based on edges, namely the Jaccard [18], Dice [30], and Adamic-Adar [1] similarity indexes.

4.2 Nearest Neighbour

The nearest neighbour approach [26] can provide a graph-based recommendation for enterprise architects in order to avoid modelling already existing elements. Early work in this domain was attempted by Aier and Schönherr [2]. Thus, finding communities in a network helps to investigate the roles in an EA model, where components in a single cluster have similar roles compared to components in other clusters [25].

Fig. 2: Two simple instances illustrating structural similarity: (a) application view of Model 1; (b) application view of Model 2.

We present a technique based on unsupervised learning to determine the communities [11]. Considering an EA model as a graph [27], we apply different community detection algorithms and analyse the results [24]. The community detection method tries to group a set of vertices having a higher probability of being interconnected than being connected to the members of other groups. In this way, it is possible to investigate similar kinds of grouped components. Here, similarity refers to the components having a common subset of neighbours. Identifying communities in an EA model provides decision support for the enterprise architects to place new components into the model repository based on the identified roles and positions of the components.
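To make this idea concrete, the sketch below runs one of igraph's community detection algorithms on the EA graph and lists the components grouped into each community. Walktrap is used here purely as an illustrative choice (Section 5.2 compares further algorithms), and load_ea_graph is the assumed loader sketched in Section 3, not part of our published tooling.

```python
# Illustrative community detection on the EA graph (walktrap chosen only as an example).
from ea_graph import load_ea_graph  # hypothetical module holding the Section 3 sketch

graph = load_ea_graph()
clustering = graph.community_walktrap().as_clustering()

print(f"communities: {len(clustering)}, modularity: {clustering.modularity:.3f}")
for community_id, members in enumerate(clustering):
    names = [graph.vs[v]["name"] for v in members]
    print(f"community {community_id}: {', '.join(names)}")

# A new component could then be placed next to the community whose members it is
# most similar to, e.g. the one containing its already-modelled neighbours.
```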
When the EA model becomes large and complex to analyse, the community detection technique remains robust regarding execution speed and outperforms the SimRank approach discussed by Borozanov et al. [4]. Once a community detection algorithm is applied and the network is partitioned into communities, it is important to interpret the results to know which algorithm performs best and detects meaningful communities. Since we do not know the gold standard to which the communities should belong, an effective approach to evaluate the formed communities is to use internal criteria such as "modularity" [19].

5 Results

In this section, we outline the evaluation results to select the best performing similarity method and the best performing community detection algorithm that provides automatic view generation of EA models.

5.1 Similarity Between Enterprise Architecture Models

Node Similarity Evaluation

Syntactic Similarity. Concerning our example from Section 4.1, the Cosine and Jaccard similarity measures are applied to assess the syntactic similarity between the component names of the EA models. The comparison results for our case show that cosine similarity (0.700) yields a better similarity score, i.e. closer to 1, than the Jaccard index (0.537). Thus, we conclude that cosine similarity performs better than the Jaccard index in capturing the similarity.

Latent Semantic Analysis. Knowing that the models from Section 4.1 convey a similar meaning (and from the results presented above), we conclude that LSA (0.832) yields the highest score, i.e. the one closest to 1. Thus, LSA outperforms the Cosine and Jaccard similarity measures in capturing the relatedness between EA models. Therefore, either one of the models can be added to the EA repository. The LSA technique is thus more effective in finding similarity between EA models than simple cosine similarity (0.700), depending on the type of content present in the component names. Given a certain threshold (similarity score) close to 1, enterprise architects can decide upon adding a new model to the repository.

Structural similarity evaluation In order to assess the structural similarity between two graphs, we first calculate the similarity between pairs of vertices by constructing similarity matrices for the graphs using the Jaccard, Dice, and Adamic-Adar indexes. The association or correlation between two similarity matrices is then computed with a correlation test to assess whether two graphs are similar based on their structure. The correlation coefficient can range in value from −1 to +1; the larger the absolute value of the coefficient, the stronger the relationship between the variables. We used the Pearson correlation coefficient to compute the association between two similarity matrices. Our results show that the best association was obtained using the Jaccard similarity coefficient (0.968), whose correlation score is closer to 1 than that of the Dice similarity coefficient (0.956) and the Adamic-Adar similarity coefficient (0.853). Thus, we choose the Jaccard similarity index as the best performing metric for finding the structural similarity.

5.2 Comparison Between Community Detection Algorithms

In this section, we discuss the approach to find the best-performing community detection algorithm. Since we do not know the gold standard to which the communities should belong, we use the modularity score (an internal measure) as a general criterion to find the best-performing algorithm.
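A minimal sketch of such a comparison with igraph is shown below: it loops over several of the community detection algorithms considered here and records their modularity and wall-clock running time. The helper from the Section 3 sketch and the exact set of calls are illustrative assumptions, not the measurement harness behind Table 1.

```python
# Illustrative comparison of community detection algorithms by modularity and run time
# (assumed setup; not the exact measurement harness behind Table 1).
import time
from ea_graph import load_ea_graph  # hypothetical helper from the Section 3 sketch

graph = load_ea_graph()

algorithms = {
    "Edge-betweenness": graph.community_edge_betweenness,
    "Infomap": graph.community_infomap,
    "Label propagation": graph.community_label_propagation,
    "Leading eigenvector": graph.community_leading_eigenvector,
    "Spinglass": graph.community_spinglass,  # requires a connected graph without isolated nodes
    "Walktrap": graph.community_walktrap,
}

for name, detect in algorithms.items():
    start = time.perf_counter()
    result = detect()
    if hasattr(result, "as_clustering"):     # dendrogram-based methods need to be cut first
        result = result.as_clustering()
    elapsed = time.perf_counter() - start
    print(f"{name:22s} communities={len(result):3d} "
          f"modularity={result.modularity:.3f} time={elapsed:.3f}s")
```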
We consider two different cases in determining the communities of similar kinds of EA components. Thus, for each case, we evaluate the performance of the different community detection algorithms. In the following, we consider that the graph does not contain isolated nodes. Table 1 summarises the resulting community size, modularity score, execution time, and ranking of the different community detection algorithms.

Algorithm            Size  Modularity  Running time (in sec)
Edge-betweenness      27     0.579       ≈0.05
Infomap               28     0.694       ≈0.08
Label propagation     26     0.601       ≈50
Leading eigenvector   14     0.749       ≈0.11
Spinglass             14     0.786       ≈7.82
Walktrap              18     0.746       ≈0.009

Table 1: Modularity evaluation result

As shown in Table 1, spinglass takes the top position. The higher the modularity score, the better the network partition. Leading eigenvector and walktrap return almost equal results with a modularity score of about 0.75. In this situation, the running time t is considered as a tie-break criterion. Since walktrap computes faster than leading eigenvector, the user can decide between the algorithms based on the running time of the specific algorithm. The edge-betweenness algorithm performed worst in this case, with the lowest modularity score of 0.579. A lower modularity score indicates a larger number of nodes in a single community, which makes it non-trivial to understand the distribution of the communities formed within the network.

6 Conclusion

With the goal of supporting EA model maintenance, this paper presented a performance-based comparison between different graph similarity and community detection algorithms as candidates for providing a better user experience and suitability in a working environment. The presented approaches enable enterprise architects to track changes in EA models and to avoid EA repository pollution caused by adding duplicate components.

One limitation of the similarity methods is that the computation time is linear in the product of the graph sizes due to the size of the similarity matrix. A further challenge is to determine the similarity between networks and to define a measure of similarity. For instance, it would be difficult for a domain expert to choose a threshold on the similarity measures to decide whether two models are similar. Another limitation is that the igraph community detection algorithms fail to handle overlapping communities. Moreover, abstracting EA models as graphs leads to a loss of information, as it focuses on the structural facet of the EA model. However, enterprise architects also encode information by visually grouping elements close to each other. This information gets lost in our transformation into a graph. Lastly, we focused on finding similarities based on the component names. As future work, additional similarity approaches can be applied to find similarity between EA models based on the description information of an EA component. Furthermore, applying other graph similarity measures may be useful to identify duplicate components in EA models (e.g., [31]).

References

1. Adamic, L.A., Adar, E.: Friends and neighbors on the web. Social Networks 25(3), 211–230 (2003)
2. Aier, S., Schönherr, M.: Integrating an enterprise architecture using domain clustering (2007)
3. Alhadi, A.C., Deraman, A., Yussof, W.N.J.W., Mohamed, A.A., et al.: An ensemble similarity model for short text retrieval. In: International Conference on Computational Science and Its Applications. pp. 20–29. Springer (2017)
4. Borozanov, V., Hacks, S., Silva, N.: Using machine learning techniques for evaluating the similarity of enterprise architecture models - technical paper. In: Advanced Information Systems Engineering - 31st International Conference, CAiSE. pp. 563–578 (2019)
5. Csardi, G., Nepusz, T.: The igraph software package for complex network research. InterJournal, Complex Systems 1695(5), 1–9 (2006)
6. Dijkman, R., Dumas, M., Van Dongen, B., Käärik, R., Mendling, J.: Similarity of business process models: Metrics and evaluation. Information Systems 36(2), 498–516 (2011)
7. Dreyfus, D., Iyer, B.: Enterprise architecture: A social network perspective. In: HICSS'06. Proceedings of the 39th Annual Hawaii International Conference on System Sciences, 2006. vol. 8. IEEE (2006)
8. Farwick, M., Agreiter, B., Breu, R., Ryll, S., Voges, K., Hanschke, I.: Automation processes for enterprise architecture management. In: 2011 IEEE 15th International Enterprise Distributed Object Computing Conference Workshops. pp. 340–349. IEEE (2011)
9. Farwick, M., Agreiter, B., Breu, R., Ryll, S., Voges, K., Hanschke, I.: Requirements for automated enterprise architecture model maintenance. In: 13th International Conference on Enterprise Information Systems (ICEIS), Beijing (2011)
10. Farwick, M., Pasquazzo, W., Breu, R., Schweda, C.M., Voges, K., Hanschke, I.: A meta-model for automated enterprise architecture model maintenance. In: 2012 IEEE 16th International Enterprise Distributed Object Computing Conference. pp. 1–10. IEEE (2012)
11. Fortunato, S.: Community detection in graphs. Physics Reports 486(3-5), 75–174 (2010)
12. Ganser, A., Lichter, H., Roth, A., Rumpe, B.: Staged model evolution and proactive quality guidance for model libraries. Software Quality Journal 24(3), 675–708 (Sep 2016)
13. Jain, A., Jain, A., Chauhan, N., Singh, V., Thakur, N.: Information retrieval using cosine and jaccard similarity measures in vector space model. International Journal of Computer Applications 164(6) (2017)
14. Jonkers, H., Lankhorst, M.M., ter Doest, H.W., Arbab, F., Bosma, H., Wieringa, R.J.: Enterprise architecture: Management tool and blueprint for the organisation. Information Systems Frontiers 8(2), 63–66 (2006)
15. Kaisler, S.H., Armour, F., Valivullah, M.: Enterprise architecting: Critical problems. In: Proceedings of the 38th Annual Hawaii International Conference on System Sciences. pp. 224b–224b. IEEE (2005)
16. Koutra, D., Parikh, A., Ramdas, A., Xiang, J.: Algorithms for graph similarity and subgraph matching. In: Proc. Ecol. Inference Conf. (2011)
17. Landauer, T.K., Foltz, P.W., Laham, D.: An introduction to latent semantic analysis. Discourse Processes 25(2-3), 259–284 (1998)
18. Liben-Nowell, D., Kleinberg, J.: The link-prediction problem for social networks. Journal of the Association for Information Science and Technology 58(7), 1019–1031 (2007)
19. Newman, M.E.: Modularity and community structure in networks. Proceedings of the National Academy of Sciences 103(23), 8577–8582 (2006)
20. Niwattanakul, S., Singthongchai, J., Naenudorn, E., Wanapu, S.: Using of Jaccard coefficient for keywords similarity. In: Proceedings of the International MultiConference of Engineers and Computer Scientists. vol. 1 (2013)
21. O'Neill, T.: Extreme connectivity: EAs need enterprise graphs. https://www.infoworld.com/article/3045210/enterprise-architecture/extreme-connectivity-eas-need-enterprise-graphs.html/ (2016)
22. Orman, G.K., Labatut, V.: A comparison of community detection algorithms on artificial networks. In: International Conference on Discovery Science. pp. 242–256. Springer (2009)
23. Orman, G.K., Labatut, V., Cherifi, H.: Qualitative comparison of community detection algorithms. In: International Conference on Digital Information and Communication Technology and Its Applications. pp. 265–279. Springer (2011)
24. Orman, G.K., Labatut, V., Cherifi, H.: Comparative evaluation of community detection algorithms: a topological approach. Journal of Statistical Mechanics: Theory and Experiment 2012(08) (2012)
25. Porter, M.A., Onnela, J.P., Mucha, P.J.: Communities in networks. Notices of the AMS 56(9), 1082–1097 (2009)
26. Saha, S., Ghrera, S.: Nearest neighbor search in complex network for community detection. arXiv preprint arXiv:1511.07210 (2015)
27. Santana, A., Souza, A., Simon, D., Fischbach, K., De Moura, H.: Network science applied to enterprise architecture analysis: Towards the foundational concepts. In: 2017 IEEE 21st International Enterprise Distributed Object Computing Conference (EDOC). pp. 10–19. IEEE (2017)
28. Sapp, C.E.: Preparing and Architecting for Machine Learning. https://www.gartner.com/binaries/content/assets/events/keywords/catalyst/catus8/preparing_and_architecting_for_machine_learning.pdf/ (2017)
29. Simon, D., Fischbach, K.: IT landscape management using network analysis. In: Enterprise Information Systems of the Future, pp. 18–34. Springer (2013)
30. Sørensen, T.: A method of establishing groups of equal amplitude in plant sociology based on similarity of species and its application to analyses of the vegetation on Danish commons. Biol. Skr. 5, 1–34 (1948)
31. Yan, X., Yu, P.S., Han, J.: Substructure similarity search in graph databases. In: Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data. pp. 766–777. ACM (2005)
32. Yang, Z., Algesheimer, R., Tessone, C.J.: A comparative analysis of community detection algorithms on artificial networks. Scientific Reports 6 (2016)