=Paper=
{{Paper
|id=Vol-2203/171
|storemode=property
|title=Probabilistic Analysis of Graphlets in Sparse E-R Random Graphs
|pdfUrl=https://ceur-ws.org/Vol-2203/171.pdf
|volume=Vol-2203
|authors=Martin Nehez
|dblpUrl=https://dblp.org/rec/conf/itat/Nehez18
}}
==Probabilistic Analysis of Graphlets in Sparse E-R Random Graphs==
S. Krajči (ed.): ITAT 2018 Proceedings, pp. 171–175 CEUR Workshop Proceedings Vol. 2203, ISSN 1613-0073, c 2018 Martin Nehéz Probabilistic Analysis of Graphlet Frequency Distribution in Sparse E-R Random Graphs Martin Nehéz Institute of Information Engineering, Automation and Mathematics, Faculty of Chemical and Food Technology Slovak University of Technology in Bratislava Slovak Republic martin.nehez@stuba.sk WWW home page: https://www.kirp.chtf.stuba.sk/index.php?menu=2& show_id=3& person_id=300011 Abstract: Frequency counting of graphlets (i.e. small Some previous results regarding random graph theory connected induced subgraphs) is a prominent experimen- [1, 3, 10, 11, 17] are relied on this purpose. Our re- tal approach to network analysis which became favorable sult is significant especially from the complexity point of in bioinformatics. In this paper, we use the probabilistic view because it might lead, in some cases, to elimination method for graphlets counting. We show that it is possible of high requirements on computational resources needed to enumerate the graphlets (or isolated graphlets) in sparse for graphlet frequency analysis. It means that instead of Erdős-Rényi graph model analytically. Obtained frequen- computer-based graphlet frequency enumeration in Erdős- cies are exploited to estimate bounds on domination num- Rényi random graphs one may use analytical formulas. ber in the above mentioned random graph model. The organization of the paper is as follows. Sect. 2 contains definitions and preliminary facts. The threshold functions for presence of graphlets in random graphs are 1 Introduction derived in Sect. 3. Average counts of graphlets are ex- pressed in the same section as well. Average counts of Network science which is focused on modeling and analy- isolated graphlets and corresponding estimations in ran- sis of real-world networks became a significant research dom graphs with edge probability p = c/n are determined area in last two decades. Instances of networks un- in Sect. 4. Application of these results to estimation of der study come from many fields of human activities, the domination number in sparse random graphs is out- e.g. electrical engineering, transportation, social sci- lined in Sect. 5. Possible direction for future research are ence, biology, medicine, etc. Currently, a frequently used discussed in the last section. algorithmic-experimental method for similarity detection and comparison of protein-protein interaction networks (shortly PPI networks) was invented in bioinformatics by 2 Definitions and Preliminaries N. Pržulj et al. [16]. It is based on frequency analysis of small connected induced subgraphs (called graphlets) oc- 2.1 Fundamentals curring in networks to be compared. Such an approach was used to show e.g., that yeast PPI networks are struc- The asymptotic notation such as o, O, Θ is used in the turally closer to geometric random graphs than to scale- usual way [10], nevertheless, the most important notions free or Erdős-Rényi random graphs [16]. are listed below. Frequency counting of graphlets is a non-trivial algo- O(g(n)) = { f (n) | ∃c, n0 > 0 ∀n ≥ n0 | f (n)| ≤ c|g(n)| } rithmic problem which is intensively studied both from theoretical and application point of view. One of the most Ω(g(n)) = { f (n) | ∃c, n0 > 0 ∀n ≥ n0 | f (n)| ≥ c|g(n)| } powerful softwares used for this purpose is currently the Orbit Counting Algorithm - ORCA [8, 9]. Roughly speak- o(g(n)) = { f (n) | ∀ε > 0 ∃n0 > 0 ∀n ≥ n0 | f (n)| ≤ ε |g(n)|} ing, ORCA writes numbers of all graphlets (with 2 − 5 f (n) = Θ(g(n))) ⇔ f (n) = O(g(n)) ∧ f (n) = Ω(g(n)) nodes) which occurred in a given input network on out- put. This part of the graphlet-based analysis is the most Moreover, for two sequences (or equivalently functions) difficult since doing it without a computer program is un- a = (an )∞ ∞ n=0 and b = (bn )n=0 , we will write an ≪ bn if realistic even for relatively small networks. Due to high an ≥ 0 and an = o(bn ). computational complexity of the graphlet enumeration the Throughout this paper, all graphs are simple, undirected number of their nodes is restricted to at most 4 in some and without weights. Standard notions of the graph the- current softwares. ory are used without definitions or further comments. We In this paper, we show that the problem of frequency address [6, 7] for references however, the usage of some counting of graphlets can be solved analytically in Erdős- symbols is mentioned bellow. Let G = (V, E) be a graph Rényi random graph model by probabilistic methods. with a nonempty finite set of vertices V (G) (or nodes) and 172 Martin Nehéz Figure 1: Graphlets g0 , g1 , . . . , g9 . a finite set of edges E(G). Usually |V (G)| = n, where The probability space (Ω, IF, Pr) is denoted by G(n, p) or |.| stands for the cardinality of a given set. For a vertex Gn,p and called the probability space of all random graphs v ∈ V (G), its degree (the number of adjacent vertices of v) of order n or E-R random graph model. is denoted by deg(v). The maximum (minimum) degree A statement about a random graph from G(n, p) is said of a graph G is denoted by ∆ (δ ). Let G = (V, E) be a to hold asymptotically almost surely (a.a.s.) if it holds graph and let S ⊆ V (G), an induced subgraph G[S] is the with probability approaching 1 as n → ∞. A graph G with graph whose vertex set is S and the edge set of G[S] con- n vertices is said to be dense if it has Θ(n2 ) edges (i.e., sists of all edges whose endpoints are both in S. A graph asymptotically equal to n2 ) and G is said to be sparse if it is said to be connected if there is a path joining every pair has o(n2 ) edges (i.e., asymptotically less than n2 ). of its vertices. A disconnected graph is not connected. A If one considers p to be a non-zero constant, then ran- component is a maximal connected subgraph. It is also re- dom graphs have pn(n − 1)/2 = Θ(n2 ) edges, hence they ferred to as connected component or isolated component. are dense a.a.s. On the other hand, there are such choices Note that each disconnected graph consists of at least two of p that random graphs are sparse. E.g. if p = p(n) is different components. a decreasing function on n such as p = n−ε for any con- Let G = (V, E) be a graph, a graphlet is a connected stant ε > 0 then random graphs have Θ(n2−ε ) edges and induced subgraph of G with at most 5 vertices. Two they are sparse a.a.s. In particular, if p = c/n = Θ(n−1 ) for graphlets are same if there exists an isomorphism such that any constant c > 0, then random graphs have a linear num- it maps one graphlet to the other one. (Two different oc- ber of edges (observe that ε = 1, hence random graphs are currences of the same graphlet are usually referred to as sparse a.a.s.) and, in terms of edge set cardinality, they are its copies.) In this paper, only graphlets with at most 4 similar to real-world networks. On the other hand, some vertices are considered. Their ordering (and indexing) is structural properties of sparse random graphs are usually shown in Fig. 1. It means that the empty graph (a single far from real-world networks [2, 16]. Many works deal vertex) is denoted by g0 , etc. The last graphlet in Fig. 1 with structural properties of real networks and quite re- (i.e. g9 ) is the clique K4 . alistic models are currently represented by e.g. scale-free or random geometric networks [2, 8, 16]. 2.2 Random Graphs 2.3 Monotonicity and Threshold Functions The Erdős-Rényi random graph model (shortly E-R In physics, a phase transition is the transformation of a model) [3] can be introduced as follows. thermodynamic system from one phase to another. Dur- Let n be a positive integer and let p ∈ IR be a constant ing such a transformation, some physical properties of the such that 0 < p < 1. Consider that for n labeled vertices of system change discontinuously. An example is freezing V (G), each unordered pair of vertices introduces one slot (or boiling) water or the emergence of superconductivity available () for an edge. Clearly, the total number of slots in certain metals when cooled below the critical temper- is n2 . Each edge exists in G independently and with the ature. In all phase transitions, there exists a value of a probability p, thus Pr[{u, v} ∈ E(G)] = p, for all u, v ∈ certain quantity (often temperature) in which the physical V (G). properties in question change.1 Such a value is said to be Given n as above, let (Ω, IF, Pr) be a probability space a critical point. where the sample space Ω consists of all (labeled) graphs Similar behavior was also observed in random graphs G of order n and let IF ⊆ 2Ω (be)a set of events. If G has (for the first time in [3]). The critical points in thermody- |E(G)| edges, 0 ≤ |E(G)| ≤ n2 , then the probability of namics have their counterparts in random graphs: they are obtaining G as a result of random edge generation process thresholds functions. is given by: 1 The character of such a change is similar to a jump discontinuity n Pr[G] = p|E(G)| (1 − p)(2)−|E(G)| . (1) function. Probabilistic Analysis of Graphlets in Sparse E-R Random Graphs 173 Let Kn be a clique with n > 0 vertices. Let 2Kn denote section. We shall examine the phase transition behavior of the power set of all spanning subgraphs of Kn . Let G1 , G2 graphlets at first. The occurrence of a given graphlet is not be spanning subgraphs of Kn and let G1 ⊆ G2 denote that a monotone property [10]. Nevertheless, such a property E(G1 ) ⊆ E(G2 ). Let Q ⊆ 2Kn be a family of subsets and has two threshold functions, one in sparse random graphs note that Q contains spanning subgraphs of Kn . A family and the second in very dense random graphs [10]. We are of graphs Q is said to be increasing if G1 ⊆ G2 and G1 ∈ Q interested only in the first case since almost all real net- imply that G2 ∈ Q. A family of graphs Q′ is decreasing if works are sparse. 2Kn \ Q′ is increasing. A family which is either increasing For a given graph G, let τ (G) denote the ratio of the or decreasing is called monotone. A family of graphs G number of edges to the number of vertices in the densest is said to be closed under isomorphism if for all G ∈ G subgraph of G, i.e. and H ∼= G implies that H ∈ G . If a family of graphs from { } 2Kn is closed under isomorphism, it can be identified with |E(H)| τ (G) = max ; H ⊆ G, |V (H)| > 0 . a graph property. |V (H)| Threshold functions are usually defined for monotone The following statement determines threshold functions properties. In this paper, we need to introduce them only for graphlets. for increasing properties. For further details see [10]. Let Q be an increasing property of graphs from 2Kn . Let Theorem 1 ([10]). For an arbitrary graphlet g with at p = p(n) be a function. A function pb = pb(n) is called a least one edge, it holds threshold function for Q iff the following condition holds: { 0 if p ≪ n−1/τ (g) , { lim Pr[ g occurs in G(n, p) ] = 0 if p ≪ pb , n→∞ 1 if p ≫ n−1/τ (g) . Pr[ G(n, p) has Q ] → 1 if p ≫ pb . As a consequence, we obtain Tab. 1 in which the thre- Threshold functions play a crucial role in examina- sholds functions for all nontrivial graphlets are listed. De- tion of the phase transition phenomena in random graphs termination of threshold functions is straightforward be- [3, 10, 11, 15]. A typical property is the connectivity of cause none of graphlets (for i = 1, . . . , 9) contains a denser random graphs and more specifically, size of connected subgraph than itself. Clearly, τ (g1 ) = 1/2 and τ (g2 ) = components. The threshold function for the existence of a τ (P3 ) = 2/3. By the same argument, τ (g3 ) = τ (△) = giant component is pb = 1/n. As stated in [15], Sect. 4, a 3/3 = 1, etc. random graph contains a.a.s. the largest component with Θ(n2/3 ) vertices in its critical phase, i.e. if p = 1/n. The Table 1: Threshold functions of nontrivial graphlets. subcritical phase is for p = c/n where 0 < c < 1. It repre- sents such a state that all components are trees or unicyclic Graphlet Description Threshold and the largest component is of size Θ(log n) a.a.s. Other- gi function wise, if p = c/n where c > 1, then there is a unique giant g1 Edge n−2 component of order Θ(n) a.a.s. in the supercritical phase. g2 Path P3 n−3/2 The rest of the random graph consists of "small" trees and g3 Triangle △ n−1 as c increases, the giant component grows by absorbing g4 Path P4 n−4/3 the trees. g5 3-star n−4/3 g6 Cycle C4 n−1 3 Graphlets g7 △+edge n−1 g8 Chordal-C4 n−4/5 Let Xgi be the random variable on G(n, p) denoting the g9 Clique K4 n−2/3 number of copies of graphlet gi in a random graph for i = 0, . . . , 9. In this paper, we consider 10 graphlets with at most 4 vertices (see Fig. 1) that is why i = 0, . . . , 9. If we consider sparse random graphs with p = c/n, then it is possible to deduce from Tab. 1 which graphlets occur Lemma 1 ([10]). For i = 0, . . . , 9, let ni (mi ) denote the more or less frequently. We can see that the presence of number of vertices (edges) of gi . The expectation of the trees (i.e. graphlets g1 , g2 , g4 and g5 ) is the most probable random variable Xgi is given by of all graphlets in G(n, c/n). On the other hand, dense ( ) graphlets (such as the clique g9 ) rarely occur in G(n, c/n). ni ! n mi ni IE(Xgi ) = p (1 − p)( 2 )−mi , (2) The expected number (or "average counts") of graphlets aut(gi ) ni can be derived from Lemma 1. The probabilities of where aut(gi ) denotes the number of automorphisms of gi . graphlets and corresponding expectations of the random variable Xgi are listed in Tab. 2. The numbers of automor- Vertices (i.e. graphlets g0 ) represent a trivial case, thus phisms aut(gi ) for these small graphs are well-known (see we will consider only random variables Xgi for i > 0 in this [19] for the details). 174 Martin Nehéz By Lemma 2 and 3, we derive the following statement. Table 2: Graphlets, numbers of their automorphisms, It determines the asymptotic estimation for expected num- probabilities of graphlets and values of IE(Xgi ). ber of isolated graphlets in random graphs with p = c/n. Graphlet aut(gi ) Probability IE(Xgi ) Lemma 4. Let c > 1 be constant. There exists a function gi of gi (n) ψc (n) = O(n−1 ) such that for each gi (with ni vertices and g1 2 p ( ) 2 p mi edges) in G(n, c/n) it holds 3! n 2 g2 2 p2 (1 − p) p 2 3 ( (1 ) − p) n −1 −cn 3! n 3 nc i e i /aut(gi ) if mi = ni − 1, g3 6 p3 ( )6 3 p 4! n 3 g4 2 p (1 − p)3 3 2 (4) p (1 − p) 3 g5 6 p3 (1 − p)3 4! n 3 3 IE(Ygi ) ∼ cni e−cni /aut(gi ) if mi = ni , 6 (4) p (1 − p) 4! n 4 g6 8 p4 (1 − p)2 8 (4) p (1 − p) 2 g7 2 p4 (1 − p)2 4! n 4 2 ψc (n) if mi ≥ ni + 1. 2 (4 )p (1 − p) 4! n 5 g8 4 p5 (1 − p) 4 4 p( (1 ) − p) 4! n 6 This lemma allows for asymptotic estimation of ex- g9 24 p6 24 4 p pected numbers of isolated graphlets in random graphs G(n, c/n). Corresponding estimations are listed in Tab. 3. One may observe that the frequency of isolated trees 4 Isolated Graphlets (i.e. graphlets g0 , g1 , g2 , g4 , g5 ) growth linearly with n, the frequency of isolated graphlets g3 , g6 , g7 (i.e. trees with Given a graph G = (V, E), a graphlet is said to be an iso- one additional edge) is constant with respect to n and the lated graphlet if it is a component in G. Let Ygi be the frequency of other isolated graphlets (g8 and g9 ) is neg- random variable on G(n, p) denoting the number of copies ligible. The intuition behind this result is, similarly as in of isolated graphlet gi in a random graph for i = 0, . . . , 9. the previous section, that the contribution of sparse iso- It will be seen later (in Tab. 3) that it is meaningful to take lated graphlets is more significant (even in magnitude) into account the isolated graphlet g0 as well. than of denser ones. Unless as in the previous section, the frequency distribution of isolated vertices g0 can be Lemma 2 ([10]). For i = 0, . . . , 9, let ni (mi ) denote the expressed by the asymptotic formula which depends on n number of vertices (edges) of gi . The expectation of the and c. (Note that the count of graphlets g0 is trivially n.) random variable Ygi is given by IE(Ygi ) = IE(Xgi ) · (1 − p)(n−ni )ni . (3) Table 3: Asymptotic estimations of expected number for isolated graphlets in G(n, c/n) as n is large enough. In order to express an asymptotic estimation for "a- verage counts" of isolated graphlets in random graphs, the Gra- IE(Ygi ) Estimation of following lemma is necessary. phlet IE(Ygi ) for gi p = c/n Lemma 3. Let α , β , c be constants (i.e. α , β , c ≪ n) such g0 n−1 ne−c (n)n(1 − p) 2(n−2) that α , c > 0 and let p = c/n. It holds g1 1 −2c ( 2) p(1 − p) 3(n−3)+1 3! n 2 2 nce 1 2 −3c g2 2 3( ) p (1 − p) 2 nc e (1 − p)α n+β ∼ e−α c as n → ∞ . g3 3! n 3 3(n−3) 1 3 −3c 6( )3 p (1 − p) 6c e 4! n 3 4(n−4)+3 1 3 −4c g4 2 (4) p (1 − p) 2 nc e Proof. By assumptions of Lemma, 4! n 3 4(n−4)+3 1 3 −4c g5 6 (4) p (1 − p) 6 nc e ( c )α n ( c )β g6 4! n 4 4(n−4)+2 1 4 −4c (1 − p)α n+β = 1 − · 1− . 8 (4) p (1 − p) 8c e 4! n 4 4(n−4)+2 1 4 −4c n n g7 2 (4) p (1 − p) 2c e 4! n 5 4(n−4)+1 1 −1 5 −4c g8 4 4( p ) (1 − p) 4(n−4) 4 n c e Thus 4! n 6 1 −2 6 −4c g9 24 4 p (1 − p) 24 n c e ( ) −α nc ( 1 −c c )β (1 − p)α n+β = 1 + n · 1− ∼ e−α c −c n since ( )− n 5 Domination Number in Sparse Random 1 c lim 1 + n =e Graphs n→∞ −c and ( A dominating set of a graph G = (V, E) is a set D ⊆ V (G) c )β such that every vertex not in D is adjacent to at least one lim 1 − =1. n→∞ n vertex of D. The domination number, denoted by γ (G), is the minimum cardinality of a dominating set of G. Probabilistic Analysis of Graphlets in Sparse E-R Random Graphs 175 Results regarding domination problems and domination [2] J. Dreier, P. Kuinke, B. Le Xuan, P. Rossmanith: Local number are surveyed in [7]. Due to significant applica- Structure Theorems for Erdős-Rényi Graphs and Their Al- tions in social, engineering and PPI networks, the area gorithmic Applications. In Proc. of the 44th Int. Confer- of domination became recently attractive and fast growing ence on Current Trends in Theory and Practice of Comp. [5, 12, 13, 14, 18]. Science, SOFSEM 2018, LNCS 10706, Springer Int. Publi- shing, 2018, 125–136 It was shown in [18] that the domination number of ran- [3] P. Erdős, A. Rényi: On the evolution of random graphs. Publ. dom graphs for a constant p may attain one of only two Math. Inst. Hungar. Acad. Sci. 5 (1960) 17–61 possible values. Such a property is called the two-point [4] M. R. Garey, D. S. Johnson: Computers and Intractability. concentration. The two-point concentration result was re- 2 Freeman, New York (1979) cently extended for random graphs G(n, p) with p ≫ ln√nn [5] R. Glebov, A. Liebenau, T. Szabó: On the Concentration (in this case, p = p(n) is assumed to be a function). How- of the Domination Number of the Random Graph. SIAM J. ever, it does not hold if p = O(log n/n) [5]. As mentioned Discrete Math., 29:3 (2015) 1186–1206 in [5], the detailed analysis of the domination number be- [6] J.L. Gross, J. Yellen: Handbook of Graph Theory. CRC havior for random graphs with p ≪ √1n is still an interest- Press (2003) ing open problem. [7] T. W. Haynes, S. T. Hedetniemi, P. J. Slater: Fundamentals In order to pursuit this problem, we suggest to use a of Domination in Graphs. Marcel Dekker, Inc., New York method based on isolated graphlet counting. Recall that a (1998) random graph consists of a single giant component and [8] T. Hočevar: Graphlet Counting. In Proc. of the 17th Con- small isolated trees a.a.s. in its critical and supercriti- ference on Applied Mathematics, APLIMAT 2018, SUT in cal phase if p = Θ(n−1 ). Roughly speaking, our idea re- Bratislava, 2018, 442–449 sides in exact counting of domination numbers for isolated [9] T. Hočevar, J. Demšar: Computation of Graphlet Orbits for trees and its estimation for the giant component. Resulting Nodes and Edges in Sparse Graphs. Journal of Statistical Software 71:10 (2016) bounds could be obtained by a combination of all partic- ular estimations. Our preliminary results exploiting the [10] S. Janson, T. Łuczak, A. Rucinski: Random Graphs. John Wiley & Sons, New York (2000) early version of this idea have been published in [14]. [11] M. Karoński: Random Graphs. Handbook of Combina- torics (R.L. Graham, M. Grötschel, L. Lovász eds.), Vol I, Elsevier, 1995, 351–380 6 Concluding Remarks [12] F. Molnár, S. Sreenivasan, B. K. Szymanski, G. Korniss: Minimum Dominating Sets in Scale-Free Network Ensem- One possible extension of this work may involve analy- bles. Scientific Reports 3, Article number: 1736 (2013) sis for graphlets with 5 vertices. Although the idea is the [13] J. C. Nacher, T. Akutsu: Dominating scale-free networks same as for smaller graphlets, detailed calculations need with variable scaling exponent: heterogeneous networks are an additional effort because there are 21 graphlets with 5 not difficult to control. New Journal of Physics 14, 073005 vertices. (2012) Comparing actual networks to random graphs might be [14] M. Nehéz, D. Bernát, M. Lelovský: Estimation of the dom- a meaningful goal for future research. A resulting know- ination number in sparse random graphs and applications. ledge might be helpful to better understanding of real- In Proc. of the 5th European Conference on the Engineering of Computer-Based Systems, ECBS 2017, ACM New York, world networks structure. 2017, 7:1–7:9 As regards the problem of the domination number esti- [15] E. M. Palmer: Graphical Evolution. John Wiley & Sons, mation, the author is currently working on more accurate Inc., New York (1985) formulas of results published in [14]. The corresponding [16] N. Przulj, D.G. Corneil, I. Jurisica: Modeling interactome: analysis is based on the idea which was explained in the scale-free or geometric?. Bioinformatics 20:18 (2004) 3508– previous section. 3515 [17] A. Rucinski: Small subgraphs of random graphs. Acknowledgement. The support from the Slovak Scientific Presentation of the talk, Joint Mathematics Meet- Grant Agency under the grant VEGA 1/0026/16 is grate- ing #1003, MAA 2005, Atlanta (2005) URL: fully acknowledged. https://www.math.cmu.edu/ãf1p/MAA2005/L4.pdf The author thanks the anonymous referees for their valu- [18] B. Wieland, A.P. Godbole: On the Domination Number of able comments on the manuscript. a Random Graph. Electronic Journal of Combinatorics 8:1 (2001) #R37 [19] Wolfram MathWorld: Graph Automorphism. URL: http: //mathworld.wolfram.com/GraphAutomorphism.html References (2018) [1] B. Bollobás: Random Graphs. (2nd edition) Cambridge Studies in Advanced Mathematics 73, (2001)