Capitalizing on Hierarchical Graph Decomposition for
                    Scalable Network Analysis

                                                               Rakhi Saxena
                                       Supervised by Sharanjit Kaur and Vasudha Bhatnagar
                                              University of Delhi at New Delhi, India
                                                         rsaxena@db.du.ac.in

ABSTRACT                                                                      [13]. Hierarchy has been recognized as a critical organi-
Processing large graphs has become commonplace across                         zational property for many complex systems, ranging from
many academic and industrial applications. We address the                     biological networks, societies, to road networks and the In-
computational challenge of analyzing large networks on a                      ternet [7]. Massive networks of the current technology era
single consumer-grade machine. Our strategy involves ar-                      have been analyzed, visualized and understood better after
ranging networks into layers of smaller, increasingly cohe-                   hierarchical decomposition [1, 10].
sive subgraphs, which is motivated by the observation that                       My Ph.D. dissertation focuses on developing effective and
real-world networks exhibit a hierarchical organization. We                   fast algorithms for network analysis using a single PC by
decompose large networks to reveal the underlying hierar-                     leveraging signals acquired from hierarchically decomposed
chy and extract signals from this hierarchical topology to                    networks. The approach is to decompose the given graph
solve three network analysis problems viz., network compar-                   into a nested hierarchy of increasingly cohesive subgraphs.
ison, determining influential spreaders, and centrality com-                  The hierarchy imparts a natural ordering to the vertices, and
putation. Empirical investigation reveals that our approach                   the cohesive regions revealed by the decomposition mimic
is effective and faster than state-of-the-art competing algo-                 community structures.
rithms.                                                                          We leverage hierarchy and approximated community struc-
                                                                              ture to develop algorithms for three common problems in
                                                                              network analysis, namely, network comparison, finding in-
1.    INTRODUCTION                                                            fluential spreaders and computing node centrality. The ef-
   Complex networks have attracted immense attention be-                      fectiveness of these algorithms establishes that i) hierarchy
cause of their ability to model a wide variety of associations                in networks is a potent property for network discrimination
between entities in social networks, power-grids, transporta-                 ii) links between levels of the hierarchy are a fair approxi-
tion, biological systems etc. Many of these networks contain                  mation of intra- and inter-community ties iii) vertices at the
millions of nodes and billions of edges, and the data is easily               same level have similar importance and similar capability
available on the world wide web. Several distributed as well                  to diffuse information. In summary, we find that hierarchi-
as disk-based frameworks [3] have been specifically designed                  cal decomposition of networks is a meritorious approach for
to analyze such large networks. However, developing algo-                     three network analysis tasks.
rithms for these frameworks requires expertise in the use of                  Organization: Sec. 2 introduces two graph decomposition
highly specialized programming paradigms.                                     methods. Sec. 3 presents three algorithms for network anal-
   In this study, we explore the question: Is graph decompo-                  ysis. Sec. 4 delineates directions for future research.
sition a viable strategy for effective network analysis using
a single consumer-grade machine? We address this ques-                        2.    HIERARCHICAL GRAPH DECOMPOSI-
tion by decomposing graphs into increasingly cohesive parts
using two well-known hierarchical graph decomposition al-                           TION METHODS
gorithms [6, 17]. The choice of this strategy is emboldened                      In this section, we introduce k-core [17] and k-truss [6]
by the fact that it is viable to compute graph decomposi-                     decomposition methods that we use for eliciting network hi-
tion for networks of billions of edges on a consumer-grade                    erarchy.
PC [9]. Our approach is simple, intuitive and motivated by                       Let G = (V, E) be a simple, connected, undirected graph,
the observation that real-world networks from a wide vari-                    where V represents the set of vertices and E ⊆ V × V rep-
ety of domains have an inherently hierarchical organization                   resents the set of edges1 . An edge eij ∈ E iff it connects
                                                                              vertices vi , vj ∈ V . Set Ni = {vj ∈ V |eij ∈ E} denotes
                                                                              the set of neighbours of vertex vi . The degree of a vertex
                                                                              δi = |Ni | denotes the number of neighbours of vi .

                                                                              2.1   k-core Decomposition
                                                                                The k-core decomposition organizes the graph into a hier-
Proceedings of the VLDB 2018 Ph.D. Workshop, August 27, 2018. Rio de          archy of subgraphs (called k-cores) such that the degree of
Janeiro, Brazil.                                                              1
Copyright (C) 2018 for this paper by its authors. Copying permitted for         We use terms network/graph, node/vertex, and edge/link
private and academic purposes.                                                interchangeably.


                                                                          1
                                                                          Algorithm    Purity   Precision   Recall   Accuracy   NMI
                                                                          NSD-C        1.0      1.0         1.0      1.0        1.0
                                                                          NSD-T        1.0      1.0         1.0      1.0        1.0
                                                                          NCKD         0.67     0.51        0.87     0.73       0.67


                                                                      Table 1:      Quality metrics for hierarchical clustering of 15
                                                                      large real-world networks.


                                                                      3.1      Network Comparison
                                                                         The task of network comparison is encountered in several
                                                                      domains such as pattern recognition, analyses of the func-
     (a) k-core Decomposition        (b) k-truss Decomposition        tionality of biological networks, and study of the temporal
                                                                      evolution of networks. Graph comparison entails computa-
Figure 1:      Hierarchical decomposition of a toy network.           tion of distance between a pair of networks to measure the
Nodes with the same coreness/trussness have the same color.           extent of similarity between them. State-of-the-art network
Edges colored red connect nodes in different hierarchy layers.        comparison methods extract network features and the dis-
                                                                      tance between features quantifies network similarity.
                                                                         Existing algorithms make use of features extracted from
                                                                      either local neighborhood of vertices [4, 20] or global net-
every vertex in a k-core is at least k. A vertex that belongs         work topology [12]. Both k-core and k-truss algorithms pro-
to a k-core but not to a k+1 core has coreness k. Definitions         mote local (node/edge level) features to obtain global (graph
adapted from [17] follow.                                             level) feature (i.e. hierarchy) of the network, which forms
                                                                      the basis of characterizing networks. In this way our ap-
  Definition 2.1. A subgraph, Ck = (Vk , Ek |Vk ) of G is
                                                                      proach plugs the gap between the use of myopic local fea-
a k-core iff ∀vi ∈ Vk : δi >= k and Ck is the maximal
                                                                      tures, and non-scalable global features.
subgraph with this property.                           
                                                                         We first experimented with network signatures extracted
   Definition 2.2. Coreness (κi ) of vertex vi is k i.e. κi =         from simple but efficient features from k-core decomposition
k iff vi ∈ Ck ∧ vi ∈
                   / Ck+1 .                                          [16]. Encouraged by the results, we extended the study to
                                                                      improve effectiveness by using sophisticated aggregation of
  Fig. 1a illustrates the k-core decomposition of a toy net-          features derived from k-core and k-truss decompositions [14].
work with 16 nodes and 34 edges. We use an efficient O(|E|)              Algorithm NCKD uses probability distribution of nodes
k-core decomposition algorithm as proposed in [2].                    at each level of the hierarchy, and the distribution of edges
                                                                      within the level and between different levels [16]. Jensen-
2.2      k-truss Decomposition                                        Shannon distance between two probability distributions ex-
  The k-truss decomposition organizes a given graph into              tracted from a given pair of networks quantifies structural
a hierarchy of subgraphs (called k-trusses) such that every           differences between them. Empirical investigations estab-
edge in a k-truss is part of at least (k − 2) triangles [6]. An       lish that these two distributions are capable of capturing
edge that belongs to a k-truss but not to a (k+1)-truss has           structural differences between networks and discriminating
trussness k. Definitions adapted from [18] follow.                    between different genres of networks reasonably well.
                                                                         Next, we use more expressive methods of distribution ag-
     Definition 2.3. Support (σij ) of edge eij is |Ni ∩ Nj |        gregation and additionally examine truss based decompo-
                                                                      sition for extracting network signatures [14]. NSD-C and
   Definition 2.4. Subgraph, Tk = (Vk , Ek |Vk ) of G is a k-         NSD-T algorithms examine core-based and truss-based de-
truss iff ∀eij ∈ Ek : σij >= (k − 2), and Tk is the maximal           composition respectively, to asses node-level assortativity
subgraph with this property.                                         (propensity to connect with other nodes at the same level)
                                                                      in addition to hierarchy levels of the nodes. Quantiles of the
   Definition 2.5. Trussness (>ij ) of edge eij is k i.e. >ij =
                                                                      distributions of the two features are used as network signa-
k iff eij ∈ Tk ∧ eij ∈
                     / Tk+1                                 
                                                                      ture and Canberra distance between signatures is computed.
  We define trussness of nodes in the graph. A node that                 In order to evaluate, fifteen large public real-world net-
belongs to the k-truss but not to the (k+1)-truss has truss-          works2 from three genres were clustered with the assump-
ness k. Formally,                                                     tion that graphs belonging to the same genre are structurally
                                                                      more similar and hence should be grouped together by an ef-
  Definition 2.6. Trussness (τi ) of node vi is k iff vi ∈            fective network comparison algorithm. Table 1 reports qual-
Tk ∧ vi ∈
        / Tk+1                                                       ity metrics of the resultant clustering scheme delivered by
                                                                      our algorithms. Having found the performance of the three
  Fig. 1b illustrates the k-truss decomposition of a toy net-
                                                                      algorithms better than the state-of-the-art algorithms, we
work. We use the elegant in-memory O(m1.5 ) k-truss de-
                                                                      compared NSD-C, NSD-T and NCKD for accuracy, speed
composition algorithm proposed in [18] so that large network
                                                                      and sensitivity to noise and missing data. Further experi-
decomposition is feasible on a consumer-grade machine.
                                                                      ments lead to the following conclusions. Network compari-
                                                                      son using i) network signatures based on simple probability
3.     NETWORK ANALYSIS ALGORITHMS                                    distributions of hierarchy levels of nodes and edges is the
  Next, we outline the solutions to the three network anal-
                                                                      2
ysis tasks mentioned in Sec. 1.                                           Refer to paper [16] for details of the networks.


                                                                  2
fastest, but relatively least accurate, ii) quantile-based ag-         compared with four measures - degree centrality (DC), k-
gregation of core levels of nodes and their assortativity is the       core (KC), k-truss (KT), Trust-Oriented Social Influencers
best of the three measures, but is less sensitive to noise and         (TOSI). For each competing measure, top 20% nodes are
missing data, iii) quantile-based aggregation of truss levels          taken as initial spreaders and 100 simulations of SIR model
of nodes and their assortativity is the slowest of the three           are run to capture the average spreading ability (SA) of top-
methods, but most sensitive to noise and missing data.                 rankers. SA of the initial set of infected nodes is quantified
                                                                       as the percentage of nodes infected during spreading pro-
3.2    Influential Spreader                                            cess. Figure 2a shows that average SA of IPRI algorithm
   Predicting individuals who influence the spread of infor-           is higher than that of competing measures for all networks
mation is another important task in social network analysis.           afffirming its better effectiveness.
Prerequisite for understanding the spreading dynamics in
online social networks, the task also finds applications in
product marketing, promotion of innovative ideas, restrict-
ing negative information etc.. State-of-the-art methods for
predicting influential actors use facets such as - strength of
interaction with neighbors [1], community structure [19], or
hierarchy [10] in the network.
   These methods for finding influential spreaders miss out
on the advantage of the interplay of the three facets. We
address the research gap by exploiting the synergy between
the three facets, and demonstrate significant improvement
over existing methods for prediction of influential spreaders.
   The proposed influence scoring method IPRI (Influence
scoring using Position, Reachability, and Interaction) [8]               (a) Spreading Ability of IPRI   (b) SC execution time averaged
                                                                                                         over 10 runs
uses i) position of the actor in the network hierarchy, ii)
intensity of his interactions with neighbors and iii) extent
of actor’s connectivity in different communities. The algo-                  Figure 2: Results of evaluation of IPRI and SC.
rithm uses k-truss decomposition method, which confers the
dual advantage of revealing hierarchy and homophilic groups
(approximate communities) in the network. IPRI algorithm               3.3    Social Centrality
computes the following three indices for every vertex:                    Centrality is widely-used for identifying important nodes
   i) Positional Index (τ ): Trussness of a node obtained              in a network [5]. Existing methods for discovering impor-
by decomposition of the network proxies for its relative po-           tant nodes do not take cognizance of inherent hierarchy and
sition in network hierarchy. A higher level indicates larger           community structure in human-centric networks for deter-
neighborhood span that aids wider spread of information.               mining centrality of actors. Since humans derive benefits
Positional Index of node vi is same as its trussness τi .              concomitant with their position in the network hierarchy,
   ii) Reachability Index (ρ): A node having connections               and with the strength of their intra– and inter–community
with more truss levels has higher reachability in terms of             connections, we posit that a centrality measure that takes
information propagation, compared to a node having con-                these aspects into account gauges the importance of individ-
nections with fewer truss levels [19]. We quantify a node’s            uals more realistically in human-centric networks.
reachability to diverse communities as the entropy of truss-              Based on the mature theory of social capital, the proposed
ness of its neighbors.                                                 Social Centrality score (SC) emulates the real-life behavior
   iii) Interaction Index (µ): The propagation of infor-               of social actors to bond within community and bridge be-
mation is governed by the strength of interaction not only             tween communities [15]. SC score of a node quantifies its
with neighbors but also with 2-steps neighbors [11]3 . Based           ability to mobilize resources in the network based on its lo-
on this observation, the interaction index of a node is com-           cation in the hierarchy, embeddedness in community and
puted as the sum of the sum of weights of edges incident on            intensity of relations with neighbors. Application of k-truss
neighbors scaled by their respective positional index.                 decomposition elicits network hierarchy and approximated
   IPRI algorithm computes the influence score by integrat-            community structure. We posit that if two nodes and the
ing positional index, reachability index and interaction index         connecting edge all have the same trussness, they are part
of a node using a multiplicative function. The score is indi-          of the same community, and the connecting edge is an intra-
cator of the power to influence other users in the network,            community edge. SC score is computed by extracting fol-
with higher score indicating more influence.                           lowing three nodal properties from the hierarchical decom-
   Experimentation with large real-world social networks es-           position of a graph.
tablishes the validity and accuracy of the scoring method.                i) Sociability Index (ω): Sociability quantifies the ex-
We evaluate effectiveness of competing methods by using                tent to which an actor can leverage the resources controlled
SIR epidemic model on three large real-world networks (Col-            by its immediate neighbors. It takes into consideration size
legeMsg, WikiVote, and Epinions)4 . Performance of IPRI is             of the immediate neighborhood (ego-network) and intensity
                                                                       of relationships (edge-weights). Sociability index of a node
3
  Liu et al. [11] establish that the 2-step neighborhood of            is defined as sum of weights of edges incident on it.
nodes is a good choice that balances cost and performance                 ii) Bonding Potential (β): Bonding potential of an
when identifying influencers.                                          individual is determined by his hierarchical position in G as
4
  Refer to paper [8] for details of the networks.                      well as by the sociability of his intra-community neighbors.


                                                                   3
Bonding potential of an actor is quantified as the sum of                  Experimental Evaluation. Cluster Computing,
sociability index of its intra-community neighbors weighted                18(3):1189–1213, Sep 2015.
by his position in the hierarchy.                                      [4] M. Berlingerio, D. Koutra, T. Eliassi-Rad, and
   iii) Bridging Potential (γ): An actor with links in di-                 C. Faloutsos. Network Similarity via Multiple Social
verse communities can draw advantages that are not avail-                  Theories. In Proceedings of IEEE/ACM ASONAM,
able within his community. The bridging potential of an                    pages 1439–1440, 2013.
actor is the sum of the intensity of relationship with the             [5] F. Bloch, M. O. Jackson, and P. Tebaldi. Centrality
inter-community neighbors scaled by their respective posi-                 Measures in Networks. CoRR, abs/1608.05845, 2016.
tions in the hierarchy.                                                [6] J. Cohen. Trusses: Cohesive Subgraphs for Social
   SC score of a node is computed by aggregating its socia-                Network Analysis. NSA:Technical report, 2008.
bility index, bonding and bridging potential using a multi-            [7] K. D. Farnsworth, L. Albantakis, and T. Caruso.
plicative function. The empirical study based on diverse,                  Unifying Concepts of Biological Function from
human-centric networks vindicates the propositional basis                  Molecules to Ecosystems. Oikos, 126(10):1367–1376,
of the model and demonstrates its validity, effectiveness,                 2017.
and superior performance compared to prevailing centrality
                                                                       [8] S. Kaur, R. Saxena, and V. Bhatnagar. Leveraging
measures. We also performed scalability tests on synthetic
                                                                           Hierarchy and Community Structure for Determining
networks generated from Erdös-Rényi (ER), Watts-Strogatz
                                                                           Influencers in Networks. In Proceedings of DaWaK,
(WS) and Forest Fire (FF) models with the number of nodes
                                                                           pages 383–390, 2017.
varying from 1 million to 10 million and edges ranging from
2 million to ≈60 million. The results (Figure 2b) endorse our          [9] W. Khaouid, M. Barsky, V. Srinivasan, and
claim of the ability of hierarchy-based algorithms to process              A. Thomo. K-Core Decomposition of Large Networks
large graphs on consumer-grade PC.                                         on a Single PC. PVLDB, 9(1):13–23, 2015.
                                                                      [10] M. Kitsak, L. K. Gallos, S. Havlin, F. Liljeros,
                                                                           L. Muchnik, H. E. Stanley, and H. A. Makse.
4.   CONCLUSION AND FUTURE WORK                                            Identification of Influential Spreaders in Complex
   In this doctoral study, we find that large networks can                 Networks. Nature Physics, 6(11):888–893, Aug 2010.
be analyzed effectively on a single consumer-grade machine            [11] Y. Liu, M. Tang, T. Zhou, and Y. Do. Identify
after hierarchical decomposition. Specifically, we explored                Influential Spreaders in Complex Networks, the Role
three network analysis problems - network comparison, iden-                of Neighborhood. Physica A: Statistical Mechanics
tifying influence spreaders and computing centrality in human-             and its Applications, 452:289–298, 2016.
centric social networks. Existing efficient algorithms for k-         [12] S. Lu, J. Kang, W. Gong, and D. Towsley. Complex
core and k-truss decomposition are fundamental to our so-                  Network Comparison using Random Walks. In
lutions. The hierarchy of nodes and links between levels of                Proceedings on WWW, pages 727–730, 2014.
the hierarchy are reasonably effective signals to quantify net-       [13] H. Mengistu, J. Huizinga, J. Mouret, and J. Clune.
work similarity. We also observe that the cohesive regions                 The Evolutionary Origins of Hierarchy. PLoS
of the network revealed by the decomposition make a good                   Computational Biology, 12(6):e1004829, 2016.
approximation of community structure. Integrating hier-
                                                                      [14] R. Saxena, S. Kaur, and V. Bhatnagar. Identifying
archy, and the resulting approximate community structure
                                                                           Similar Networks using Structural Hierarchy. Ready
is superior to the state-of-the-art algorithms for computing
                                                                           for Submission; Avaialable on Request.
centrality and identifying influential spreaders.
                                                                      [15] R. Saxena, S. Kaur, and V. Bhatnagar. Social
   The study opens few new questions - What other net-
                                                                           Centrality using Network Hierarchy and Community
work analysis tasks can be solved effectively and efficiently
                                                                           Structure. Data Mining and Knowledge Discovery,
for massive graphs using hierarchical graph decomposition
                                                                           Accepted for publicaton, 2018.
methods? Is the divide-and-conquer approach in general,
practical for analysis of large graphs? Can horizontal par-           [16] R. Saxena, S. Kaur, D. Dash, and V. Bhatnagar.
titioning methods be leveraged for designing scalable algo-                Leveraging Structural Hierarchy for Scalable Network
rithms? Answering these questions demands intense involve-                 Comparison. In Proceedings of DEXA, pages 287–302,
ment in the theoretical underpinnings of graph decomposi-                  2016.
tion methods.                                                         [17] S. B. Seidman. Network Structure and Minimum
                                                                           Degree. Social Networks, 5:269–287, 1983.
                                                                      [18] J. Wang and J. Cheng. Truss Decomposition in
5.   REFERENCES                                                            Massive Networks. Proceedings of the VLDB
 [1] M. A. Al-garadi, K. D. Varathan, and S. D. Ravana.                    Endowment, 5(9):812–823, 2012.
     Identification of Influential Spreaders in Online Social         [19] S. Wang, F. Wang, Y. Chen, C. Liu, Z. Li, and
     Networks using Interaction Weighted K-core                            X. Zhang. Exploiting Social Circle Broadness for
     Decomposition Method. Physica A, 468(C):278–288,                      Influential Spreaders Identification in Social Networks.
     2017.                                                                 World Wide Web, 18(3):681–705, 2015.
 [2] V. Batagelj and M. Zaveršnik. Fast Algorithms for               [20] A. E. Wegner, L. Ospina-Forero, R. E. Gaunt, C. M.
     Determining (Generalized) Core Groups in Social                       Deane, and G. Reinert. Identifying Networks with
     Networks. Advances in Data Analysis and                               Common Organizational Principles. Journal of
     Classification, 5(2):129–145, Jul 2011.                               Complex Networks, 2018.
 [3] O. Batarfi, R. E. Shawi, A. G. Fayoumi, R. Nouri,
     S. Beheshti, A. Barnawi, and S. Sakr. Large Scale
     Graph Processing Systems: Survey and an


                                                                  4