Capitalizing on Hierarchical Graph Decomposition for Scalable Network Analysis Rakhi Saxena Supervised by Sharanjit Kaur and Vasudha Bhatnagar University of Delhi at New Delhi, India rsaxena@db.du.ac.in ABSTRACT [13]. Hierarchy has been recognized as a critical organi- Processing large graphs has become commonplace across zational property for many complex systems, ranging from many academic and industrial applications. We address the biological networks, societies, to road networks and the In- computational challenge of analyzing large networks on a ternet [7]. Massive networks of the current technology era single consumer-grade machine. Our strategy involves ar- have been analyzed, visualized and understood better after ranging networks into layers of smaller, increasingly cohe- hierarchical decomposition [1, 10]. sive subgraphs, which is motivated by the observation that My Ph.D. dissertation focuses on developing effective and real-world networks exhibit a hierarchical organization. We fast algorithms for network analysis using a single PC by decompose large networks to reveal the underlying hierar- leveraging signals acquired from hierarchically decomposed chy and extract signals from this hierarchical topology to networks. The approach is to decompose the given graph solve three network analysis problems viz., network compar- into a nested hierarchy of increasingly cohesive subgraphs. ison, determining influential spreaders, and centrality com- The hierarchy imparts a natural ordering to the vertices, and putation. Empirical investigation reveals that our approach the cohesive regions revealed by the decomposition mimic is effective and faster than state-of-the-art competing algo- community structures. rithms. We leverage hierarchy and approximated community struc- ture to develop algorithms for three common problems in network analysis, namely, network comparison, finding in- 1. INTRODUCTION fluential spreaders and computing node centrality. The ef- Complex networks have attracted immense attention be- fectiveness of these algorithms establishes that i) hierarchy cause of their ability to model a wide variety of associations in networks is a potent property for network discrimination between entities in social networks, power-grids, transporta- ii) links between levels of the hierarchy are a fair approxi- tion, biological systems etc. Many of these networks contain mation of intra- and inter-community ties iii) vertices at the millions of nodes and billions of edges, and the data is easily same level have similar importance and similar capability available on the world wide web. Several distributed as well to diffuse information. In summary, we find that hierarchi- as disk-based frameworks [3] have been specifically designed cal decomposition of networks is a meritorious approach for to analyze such large networks. However, developing algo- three network analysis tasks. rithms for these frameworks requires expertise in the use of Organization: Sec. 2 introduces two graph decomposition highly specialized programming paradigms. methods. Sec. 3 presents three algorithms for network anal- In this study, we explore the question: Is graph decompo- ysis. Sec. 4 delineates directions for future research. sition a viable strategy for effective network analysis using a single consumer-grade machine? We address this ques- 2. HIERARCHICAL GRAPH DECOMPOSI- tion by decomposing graphs into increasingly cohesive parts using two well-known hierarchical graph decomposition al- TION METHODS gorithms [6, 17]. The choice of this strategy is emboldened In this section, we introduce k-core [17] and k-truss [6] by the fact that it is viable to compute graph decomposi- decomposition methods that we use for eliciting network hi- tion for networks of billions of edges on a consumer-grade erarchy. PC [9]. Our approach is simple, intuitive and motivated by Let G = (V, E) be a simple, connected, undirected graph, the observation that real-world networks from a wide vari- where V represents the set of vertices and E ⊆ V × V rep- ety of domains have an inherently hierarchical organization resents the set of edges1 . An edge eij ∈ E iff it connects vertices vi , vj ∈ V . Set Ni = {vj ∈ V |eij ∈ E} denotes the set of neighbours of vertex vi . The degree of a vertex δi = |Ni | denotes the number of neighbours of vi . 2.1 k-core Decomposition The k-core decomposition organizes the graph into a hier- Proceedings of the VLDB 2018 Ph.D. Workshop, August 27, 2018. Rio de archy of subgraphs (called k-cores) such that the degree of Janeiro, Brazil. 1 Copyright (C) 2018 for this paper by its authors. Copying permitted for We use terms network/graph, node/vertex, and edge/link private and academic purposes. interchangeably. 1 Algorithm Purity Precision Recall Accuracy NMI NSD-C 1.0 1.0 1.0 1.0 1.0 NSD-T 1.0 1.0 1.0 1.0 1.0 NCKD 0.67 0.51 0.87 0.73 0.67 Table 1: Quality metrics for hierarchical clustering of 15 large real-world networks. 3.1 Network Comparison The task of network comparison is encountered in several domains such as pattern recognition, analyses of the func- (a) k-core Decomposition (b) k-truss Decomposition tionality of biological networks, and study of the temporal evolution of networks. Graph comparison entails computa- Figure 1: Hierarchical decomposition of a toy network. tion of distance between a pair of networks to measure the Nodes with the same coreness/trussness have the same color. extent of similarity between them. State-of-the-art network Edges colored red connect nodes in different hierarchy layers. comparison methods extract network features and the dis- tance between features quantifies network similarity. Existing algorithms make use of features extracted from either local neighborhood of vertices [4, 20] or global net- every vertex in a k-core is at least k. A vertex that belongs work topology [12]. Both k-core and k-truss algorithms pro- to a k-core but not to a k+1 core has coreness k. Definitions mote local (node/edge level) features to obtain global (graph adapted from [17] follow. level) feature (i.e. hierarchy) of the network, which forms the basis of characterizing networks. In this way our ap- Definition 2.1. A subgraph, Ck = (Vk , Ek |Vk ) of G is proach plugs the gap between the use of myopic local fea- a k-core iff ∀vi ∈ Vk : δi >= k and Ck is the maximal tures, and non-scalable global features. subgraph with this property.  We first experimented with network signatures extracted Definition 2.2. Coreness (κi ) of vertex vi is k i.e. κi = from simple but efficient features from k-core decomposition k iff vi ∈ Ck ∧ vi ∈ / Ck+1 .  [16]. Encouraged by the results, we extended the study to improve effectiveness by using sophisticated aggregation of Fig. 1a illustrates the k-core decomposition of a toy net- features derived from k-core and k-truss decompositions [14]. work with 16 nodes and 34 edges. We use an efficient O(|E|) Algorithm NCKD uses probability distribution of nodes k-core decomposition algorithm as proposed in [2]. at each level of the hierarchy, and the distribution of edges within the level and between different levels [16]. Jensen- 2.2 k-truss Decomposition Shannon distance between two probability distributions ex- The k-truss decomposition organizes a given graph into tracted from a given pair of networks quantifies structural a hierarchy of subgraphs (called k-trusses) such that every differences between them. Empirical investigations estab- edge in a k-truss is part of at least (k − 2) triangles [6]. An lish that these two distributions are capable of capturing edge that belongs to a k-truss but not to a (k+1)-truss has structural differences between networks and discriminating trussness k. Definitions adapted from [18] follow. between different genres of networks reasonably well. Next, we use more expressive methods of distribution ag- Definition 2.3. Support (σij ) of edge eij is |Ni ∩ Nj |  gregation and additionally examine truss based decompo- sition for extracting network signatures [14]. NSD-C and Definition 2.4. Subgraph, Tk = (Vk , Ek |Vk ) of G is a k- NSD-T algorithms examine core-based and truss-based de- truss iff ∀eij ∈ Ek : σij >= (k − 2), and Tk is the maximal composition respectively, to asses node-level assortativity subgraph with this property.  (propensity to connect with other nodes at the same level) in addition to hierarchy levels of the nodes. Quantiles of the Definition 2.5. Trussness (>ij ) of edge eij is k i.e. >ij = distributions of the two features are used as network signa- k iff eij ∈ Tk ∧ eij ∈ / Tk+1  ture and Canberra distance between signatures is computed. We define trussness of nodes in the graph. A node that In order to evaluate, fifteen large public real-world net- belongs to the k-truss but not to the (k+1)-truss has truss- works2 from three genres were clustered with the assump- ness k. Formally, tion that graphs belonging to the same genre are structurally more similar and hence should be grouped together by an ef- Definition 2.6. Trussness (τi ) of node vi is k iff vi ∈ fective network comparison algorithm. Table 1 reports qual- Tk ∧ vi ∈ / Tk+1  ity metrics of the resultant clustering scheme delivered by our algorithms. Having found the performance of the three Fig. 1b illustrates the k-truss decomposition of a toy net- algorithms better than the state-of-the-art algorithms, we work. We use the elegant in-memory O(m1.5 ) k-truss de- compared NSD-C, NSD-T and NCKD for accuracy, speed composition algorithm proposed in [18] so that large network and sensitivity to noise and missing data. Further experi- decomposition is feasible on a consumer-grade machine. ments lead to the following conclusions. Network compari- son using i) network signatures based on simple probability 3. NETWORK ANALYSIS ALGORITHMS distributions of hierarchy levels of nodes and edges is the Next, we outline the solutions to the three network anal- 2 ysis tasks mentioned in Sec. 1. Refer to paper [16] for details of the networks. 2 fastest, but relatively least accurate, ii) quantile-based ag- compared with four measures - degree centrality (DC), k- gregation of core levels of nodes and their assortativity is the core (KC), k-truss (KT), Trust-Oriented Social Influencers best of the three measures, but is less sensitive to noise and (TOSI). For each competing measure, top 20% nodes are missing data, iii) quantile-based aggregation of truss levels taken as initial spreaders and 100 simulations of SIR model of nodes and their assortativity is the slowest of the three are run to capture the average spreading ability (SA) of top- methods, but most sensitive to noise and missing data. rankers. SA of the initial set of infected nodes is quantified as the percentage of nodes infected during spreading pro- 3.2 Influential Spreader cess. Figure 2a shows that average SA of IPRI algorithm Predicting individuals who influence the spread of infor- is higher than that of competing measures for all networks mation is another important task in social network analysis. afffirming its better effectiveness. Prerequisite for understanding the spreading dynamics in online social networks, the task also finds applications in product marketing, promotion of innovative ideas, restrict- ing negative information etc.. State-of-the-art methods for predicting influential actors use facets such as - strength of interaction with neighbors [1], community structure [19], or hierarchy [10] in the network. These methods for finding influential spreaders miss out on the advantage of the interplay of the three facets. We address the research gap by exploiting the synergy between the three facets, and demonstrate significant improvement over existing methods for prediction of influential spreaders. The proposed influence scoring method IPRI (Influence scoring using Position, Reachability, and Interaction) [8] (a) Spreading Ability of IPRI (b) SC execution time averaged over 10 runs uses i) position of the actor in the network hierarchy, ii) intensity of his interactions with neighbors and iii) extent of actor’s connectivity in different communities. The algo- Figure 2: Results of evaluation of IPRI and SC. rithm uses k-truss decomposition method, which confers the dual advantage of revealing hierarchy and homophilic groups (approximate communities) in the network. IPRI algorithm 3.3 Social Centrality computes the following three indices for every vertex: Centrality is widely-used for identifying important nodes i) Positional Index (τ ): Trussness of a node obtained in a network [5]. Existing methods for discovering impor- by decomposition of the network proxies for its relative po- tant nodes do not take cognizance of inherent hierarchy and sition in network hierarchy. A higher level indicates larger community structure in human-centric networks for deter- neighborhood span that aids wider spread of information. mining centrality of actors. Since humans derive benefits Positional Index of node vi is same as its trussness τi . concomitant with their position in the network hierarchy, ii) Reachability Index (ρ): A node having connections and with the strength of their intra– and inter–community with more truss levels has higher reachability in terms of connections, we posit that a centrality measure that takes information propagation, compared to a node having con- these aspects into account gauges the importance of individ- nections with fewer truss levels [19]. We quantify a node’s uals more realistically in human-centric networks. reachability to diverse communities as the entropy of truss- Based on the mature theory of social capital, the proposed ness of its neighbors. Social Centrality score (SC) emulates the real-life behavior iii) Interaction Index (µ): The propagation of infor- of social actors to bond within community and bridge be- mation is governed by the strength of interaction not only tween communities [15]. SC score of a node quantifies its with neighbors but also with 2-steps neighbors [11]3 . Based ability to mobilize resources in the network based on its lo- on this observation, the interaction index of a node is com- cation in the hierarchy, embeddedness in community and puted as the sum of the sum of weights of edges incident on intensity of relations with neighbors. Application of k-truss neighbors scaled by their respective positional index. decomposition elicits network hierarchy and approximated IPRI algorithm computes the influence score by integrat- community structure. We posit that if two nodes and the ing positional index, reachability index and interaction index connecting edge all have the same trussness, they are part of a node using a multiplicative function. The score is indi- of the same community, and the connecting edge is an intra- cator of the power to influence other users in the network, community edge. SC score is computed by extracting fol- with higher score indicating more influence. lowing three nodal properties from the hierarchical decom- Experimentation with large real-world social networks es- position of a graph. tablishes the validity and accuracy of the scoring method. i) Sociability Index (ω): Sociability quantifies the ex- We evaluate effectiveness of competing methods by using tent to which an actor can leverage the resources controlled SIR epidemic model on three large real-world networks (Col- by its immediate neighbors. It takes into consideration size legeMsg, WikiVote, and Epinions)4 . Performance of IPRI is of the immediate neighborhood (ego-network) and intensity of relationships (edge-weights). Sociability index of a node 3 Liu et al. [11] establish that the 2-step neighborhood of is defined as sum of weights of edges incident on it. nodes is a good choice that balances cost and performance ii) Bonding Potential (β): Bonding potential of an when identifying influencers. individual is determined by his hierarchical position in G as 4 Refer to paper [8] for details of the networks. well as by the sociability of his intra-community neighbors. 3 Bonding potential of an actor is quantified as the sum of Experimental Evaluation. Cluster Computing, sociability index of its intra-community neighbors weighted 18(3):1189–1213, Sep 2015. by his position in the hierarchy. [4] M. Berlingerio, D. Koutra, T. Eliassi-Rad, and iii) Bridging Potential (γ): An actor with links in di- C. Faloutsos. Network Similarity via Multiple Social verse communities can draw advantages that are not avail- Theories. In Proceedings of IEEE/ACM ASONAM, able within his community. The bridging potential of an pages 1439–1440, 2013. actor is the sum of the intensity of relationship with the [5] F. Bloch, M. O. Jackson, and P. Tebaldi. Centrality inter-community neighbors scaled by their respective posi- Measures in Networks. CoRR, abs/1608.05845, 2016. tions in the hierarchy. [6] J. Cohen. Trusses: Cohesive Subgraphs for Social SC score of a node is computed by aggregating its socia- Network Analysis. NSA:Technical report, 2008. bility index, bonding and bridging potential using a multi- [7] K. D. Farnsworth, L. Albantakis, and T. Caruso. plicative function. The empirical study based on diverse, Unifying Concepts of Biological Function from human-centric networks vindicates the propositional basis Molecules to Ecosystems. Oikos, 126(10):1367–1376, of the model and demonstrates its validity, effectiveness, 2017. and superior performance compared to prevailing centrality [8] S. Kaur, R. Saxena, and V. Bhatnagar. Leveraging measures. We also performed scalability tests on synthetic Hierarchy and Community Structure for Determining networks generated from Erdös-Rényi (ER), Watts-Strogatz Influencers in Networks. In Proceedings of DaWaK, (WS) and Forest Fire (FF) models with the number of nodes pages 383–390, 2017. varying from 1 million to 10 million and edges ranging from 2 million to ≈60 million. The results (Figure 2b) endorse our [9] W. Khaouid, M. Barsky, V. Srinivasan, and claim of the ability of hierarchy-based algorithms to process A. Thomo. K-Core Decomposition of Large Networks large graphs on consumer-grade PC. on a Single PC. PVLDB, 9(1):13–23, 2015. [10] M. Kitsak, L. K. Gallos, S. Havlin, F. Liljeros, L. Muchnik, H. E. Stanley, and H. A. Makse. 4. CONCLUSION AND FUTURE WORK Identification of Influential Spreaders in Complex In this doctoral study, we find that large networks can Networks. Nature Physics, 6(11):888–893, Aug 2010. be analyzed effectively on a single consumer-grade machine [11] Y. Liu, M. Tang, T. Zhou, and Y. Do. Identify after hierarchical decomposition. Specifically, we explored Influential Spreaders in Complex Networks, the Role three network analysis problems - network comparison, iden- of Neighborhood. Physica A: Statistical Mechanics tifying influence spreaders and computing centrality in human- and its Applications, 452:289–298, 2016. centric social networks. Existing efficient algorithms for k- [12] S. Lu, J. Kang, W. Gong, and D. Towsley. Complex core and k-truss decomposition are fundamental to our so- Network Comparison using Random Walks. In lutions. The hierarchy of nodes and links between levels of Proceedings on WWW, pages 727–730, 2014. the hierarchy are reasonably effective signals to quantify net- [13] H. Mengistu, J. Huizinga, J. Mouret, and J. Clune. work similarity. We also observe that the cohesive regions The Evolutionary Origins of Hierarchy. PLoS of the network revealed by the decomposition make a good Computational Biology, 12(6):e1004829, 2016. approximation of community structure. Integrating hier- [14] R. Saxena, S. Kaur, and V. Bhatnagar. Identifying archy, and the resulting approximate community structure Similar Networks using Structural Hierarchy. Ready is superior to the state-of-the-art algorithms for computing for Submission; Avaialable on Request. centrality and identifying influential spreaders. [15] R. Saxena, S. Kaur, and V. Bhatnagar. Social The study opens few new questions - What other net- Centrality using Network Hierarchy and Community work analysis tasks can be solved effectively and efficiently Structure. Data Mining and Knowledge Discovery, for massive graphs using hierarchical graph decomposition Accepted for publicaton, 2018. methods? Is the divide-and-conquer approach in general, practical for analysis of large graphs? Can horizontal par- [16] R. Saxena, S. Kaur, D. Dash, and V. Bhatnagar. titioning methods be leveraged for designing scalable algo- Leveraging Structural Hierarchy for Scalable Network rithms? Answering these questions demands intense involve- Comparison. In Proceedings of DEXA, pages 287–302, ment in the theoretical underpinnings of graph decomposi- 2016. tion methods. [17] S. B. Seidman. Network Structure and Minimum Degree. Social Networks, 5:269–287, 1983. [18] J. Wang and J. Cheng. Truss Decomposition in 5. REFERENCES Massive Networks. Proceedings of the VLDB [1] M. A. Al-garadi, K. D. Varathan, and S. D. Ravana. Endowment, 5(9):812–823, 2012. Identification of Influential Spreaders in Online Social [19] S. Wang, F. Wang, Y. Chen, C. Liu, Z. Li, and Networks using Interaction Weighted K-core X. Zhang. Exploiting Social Circle Broadness for Decomposition Method. Physica A, 468(C):278–288, Influential Spreaders Identification in Social Networks. 2017. World Wide Web, 18(3):681–705, 2015. [2] V. Batagelj and M. Zaveršnik. Fast Algorithms for [20] A. E. Wegner, L. Ospina-Forero, R. E. Gaunt, C. M. Determining (Generalized) Core Groups in Social Deane, and G. Reinert. Identifying Networks with Networks. Advances in Data Analysis and Common Organizational Principles. Journal of Classification, 5(2):129–145, Jul 2011. Complex Networks, 2018. [3] O. Batarfi, R. E. Shawi, A. G. Fayoumi, R. Nouri, S. Beheshti, A. Barnawi, and S. Sakr. Large Scale Graph Processing Systems: Survey and an 4