Introduction

In uence Maximization with an Unknown Network by Exploiting Community Structure

Bryan Wilder

Nicole Immorlica

Eric Rice

Milind Tambe

tambeg@usc.edu 1 0 Microsoft Research , New England, Cambridge, MA , USA 1 University of Southern California, Center for Arti cial Intelligence in Society Los Angeles , CA , USA

2017

2 7

In many real world applications of in uence maximization, practitioners intervene in a population whose social structure is initially unknown. We formalize this problem by introducing exploratory in uence maximization, in which an algorithm queries individual network nodes to learn their links. The goal is to locate a seed set nearly as in uential as the global optimum using very few queries. We show that this problem is intractable for general graphs. However, real world networks typically have community structure, in which nodes are arranged in densely connected subgroups. We present the ARISEN algorithm, which leverages community structure to nd an in uential seed set by querying only a fraction of the network. Experiments on real world networks of homeless youth, village populations in India, and others validate ARISEN's performance.

Introduction

In contexts ranging from health, to international development, to education, practitioners have used the social network of their target population to rapidly spread information and to change behavior in socially desirable ways. The challenge is to identify the in uential members of the population. While previous work has delivered many computationally e cient algorithms for this in uence maximization problem [7, 21, 12], this work assumes that the social network is given explicitly as input. However, in many real-world domains, the network is not initially known and must be gathered via laborious eld observations. For example, collecting network data from vulnerable populations such as homeless youth, while crucial for health interventions, requires signi cant time spent gathering eld observations [19]. Social media data is often unavailable when access Copyright c 2017 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. to technology is limited, for instance in developing countries or with vulnerable populations. Even when such data is available, it often includes many weak links which are not e ective at spreading in uence [2]. For instance, a person may have hundreds of Facebook friends who they barely know. In principle, the entire network could be reconstructed via surveys, and then existing in uence maximization algorithms applied. However, exhaustively reconstructing the network is very labor-intensive and considered impractical in many situations [22]. For in uence maximization to be relevant to many real-world problems, it must contend with limited information about the network, not just limited computation.

The major informational restriction is the number of nodes which may be surveyed to explore the network. Thus, a key question is: how can we nd in uential nodes with a small number of queries? Existing eld work uses heuristics, such as sampling some percentage of the nodes and asking them to nominate inuencers [22]. We formalize this problem as exploratory in uence maximization and seek a principled algorithmic solution, i.e., an algorithm which makes a small number of queries and returns a set of seed nodes which are approximately as in uential as the the globally optimal seed set. To the best of our knowledge, no previous work directly addresses this question from an algorithmic perspective (we survey the closest work in Section 3).

We show that for general graphs, any algorithm for exploratory in uence maximization may perform arbitrarily badly unless it examines almost the entire network. However, real world networks have useful structure. In particular, social networks often have strong community structure, where nodes are arranged into groups which are connected tightly internally, but only weakly to the rest of the network [10, 16]. Consequently, in uence mostly propagates in a local fashion. Community structure has been used to develop more computationally e cient in uence maximization algorithms [23, 8]. Here, we use it to design a highly information-e cient algorithm. We make three main contributions. First, we introduce exploratory in uence maximization and show that it is intractable for general graphs. Second, we present the ARISEN algorithm, which exploits community structure to nd an in uential seed set. Third, we show experimental results on a variety of networks(both synthetic and real) that verify ARISEN's performance. Our focus here is on introducing the algorithm and showing experimental results; theoretical analysis of ARISEN's performance will be presented in future work. In this paper, we focus on the description of the problem and survey related work. We then brie y present the high-level idea of our algorithm and give an example of experimental results. 2

Exploratory in uence maximization

As a motivating example, consider a homeless youth shelter which wishes to spread HIV prevention information [19]. It would try to harness the youths' social network and select the most in uential peer leaders to spread information, but this network is not initially known. Constructing the network requires a laborious survey [19]. Our motivation is to mitigate this e ort by querying only a few youth. Such queries require much less time than the day-long training peer leaders receive. We now formalize this problem.

In uence maximization: The in uence maximization problem [13], starts with a graph G = (V; E), where jV j = n and jEj = m. We assume throughout that G is undirected; social links are typically reciprocal [20]. An in uencer selects K seed nodes with the aim of maximizing the expected size of the resulting in uence cascade. We assume that in uence propagates according to the independent cascade model (ICM), which is the most prevalent model in the literature. Initially, all nodes are inactive except for the seed set. When a node becomes active, it makes one attempt to activate each neighbor. Each attempt succeeds independently with probability q, where q is typically assumed to be the same for all edges [7, 13, 25]. Let f (S) denote the expected number of activated nodes with seed set S V . The objective is to compute arg maxjSj K f (S).

Local information: The edge set E is not initially known. Instead, the algorithm explores portions of the graph using local operations. We use the popular \Jump-Crawl" model [5], where the algorithm may either jump to a uniformly random node, or crawl along an edge from an already surveyed node to one of its neighbors. When visited, a node reveals all of its edges. We say that the query cost of an algorithm is the total number of nodes visited using either operation. Our goal is to nd in uential nodes with a query cost that is much less than n, the total number of nodes.

Stochastic Block Model: In our formal analysis, we assume that the graph is drawn from the SBM. The SBM originated in sociology [9] and lately has been intensively studied in computer science and statistics (see e.g. [1, 15, 18]). In the SBM, the network is partitioned into disjoint communities C1::::CL. Each within-community edge is present independently with probability pw and each between-community edge is present independently with probability pb. Notice that each community is an Erd}os-Renyi random graph with additional random edges to other communities. We assume that pw lojgCjiCjij for all Ci, since this is necessary for Ci to be internally connected [11]. While the SBM is a simpli ed model, our experimental results show that ARISEN performs well on real-world graphs. ARISEN takes as input the parameters n; pw, and pb, but is not given any prior information about the realized draw of the network. It is reasonable to assume that the model parameters are known since they can be estimated using existing network data from a similar population (in our experiments, we show that this approach works well).

Objective: We compare to the globally optimal solution, i.e, the best performance if the entire network structure were known in advance. Let fE (S) give the expected number of nodes in uenced by seed set S when the set of realized edges are E. Let A(E) be the (possibly random) seed set containing our algorithm's selections given edge set E. Let OP T be the expected value of the globally optimal solution which seeds K nodes. We measure the algorithm's performance by the ratio OP T = E[fE (A(E))], where the expectation is over both the randomness in the graph and the algorithm's choices. 3

Related work

First, Yadav et al. [25] and Wilder et al. [24], studied dynamic in uence maximization over a series of rounds. Some edges are \uncertain" and are only present with some probability; the algorithm can gain information about these edges in each round. However, the majority of potential edges are known in advance. By contrast, our work does not require any known edges. Mihara et al. [17] also consider in uence maximization over a series of rounds, but in their work the network is initially unknown. In each round, the algorithm makes some queries, selects some seed nodes, and observes all of the nodes which are activated by its chosen seeds. The ability to observe activated nodes makes our problem incomparable with theirs because these activations can reveal a great deal about the network and give the algorithm information that even the global optimizer does not have (in their work, the benchmark does not use the activations). Thus, we emphasize that our algorithm uses strictly less information. Further, in many applications, activations correspond to private behaviors (e.g. getting tested for HIV) and cannot be observed for practical or legal reasons.

Another line of work concerns local graph algorithms, where a local algorithm only uses the neighborhoods around individual nodes. Borgs et al. [3] study local algorithms for nding the root node in a preferential attachment graph and for constructing a minimum dominating set. Other work, including Bressen et al. [6] and Borgs et al. [4], aims to nd nodes with high PageRank using local queries. These algorithms are not suitable for our problem since a great deal of previous work has observed that picking high PageRank nodes as seeds can prove highly suboptimal for in uence maximization [14, 7, 12]. Essentially, PageRank identi es a set of nodes that are individually central, while in uence maximization aims to nd a set of nodes which are collectively best at di using information. We also emphasize that our technical approach is entirely distinct from work on PageRank. 4

Proposed algorithm and results

We now provide a brief overview of our algorithm for exploratory in uence maximization. The main idea is to sample a set of T random nodes fv1:::vT g from G and explore a small subgraph Hi around each vi by taking R steps of a random walk. We discard the rst B steps of each walk as burn-in. R; T and B are inputs set by the user according to the size of the network and the number of seeds they wish to select. Intuitively, T should be greater than K so we can be sure of sampling each of the largest K communities. The subgraphs Hi are used to construct a weight vector w where wi gives the weight associated with vi. The algorithm then independently samples each seed from fv1:::vT g with probability proportional to w. Further details will appear in an extended version of the paper.

Figure 1 shows experimental results on a network of households in a village in rural india. We compare our algorithm (in blue with diagonal hatches) to a series of baselines. From left to right, the baselines are ( 1 ) selecting K random node and seeding their highest degree neighbor ( 2 ) starting at a random node and iteratively seeding the highest degree neighbor of the previous node ( 3 ) selecting K seeds at random. The y axis plots the fraction of the optimal value (assuming that the true network were known) attained by each algorithm. We see that the proposed algorithm outperforms all baselines.

T1.00 P O0.75 We introduced exploratory in uence maximization to study in uence maximization when the network is initially unknown. We presented a novel algorithm, which exploits the community structure present in many real world social networks. Experimental results on a real world network show that our algorithm is competitive with the global optimum and outperforms several baselines. 5. Brautbar, M., Kearns, M.J.: Local algorithms for nding interesting individuals in large networks. In: Innovations in Theoretical Computer Science. pp. 188{199 (2010) 6. Bressan, M., Peserico, E., Pretto, L.: The power of local information in pagerank.

In: WWW. pp. 179{180. ACM (2013) 7. Chen, W., Wang, C., Wang, Y.: Scalable in uence maximization for prevalent viral marketing in large-scale social networks. In: KDD. pp. 1029{1038. ACM (2010) 8. Chen, Y.C., Zhu, W.Y., Peng, W.C., Lee, W.C., Lee, S.Y.: Cim: community-based in uence maximization in social networks. ACM Transactions on Intelligent Systems and Technology (TIST) 5( 2 ), 25 (2014) 9. Fienberg, S.E., Wasserman, S.S.: Categorical data analysis of single sociometric relations. Sociological methodology 12, 156{192 (1981) 10. Girvan, M., Newman, M.E.: Community structure in social and biological networks.

PNAS 99(12), 7821{7826 (2002) 11. Janson, S., Luczak, T., Rucinski, A.: Random graphs, vol. 45. John Wiley & Sons (2011) 12. Jung, K., Heo, W., Chen, W.: Irie: Scalable and robust in uence maximization in social networks. In: ICDM. pp. 918{923. IEEE (2012) 13. Kempe, D., Kleinberg, J., Tardos, E.: Maximizing the spread of in uence through a social network. In: KDD. pp. 137{146. ACM (2003) 14. Kimura, M., Saito, K., Nakano, R., Motoda, H.: Finding in uential nodes in a social network from information di usion data. In: Social Computing and Behavioral Modeling, pp. 1{8. Springer (2009) 15. Krzakala, F., Moore, C., Mossel, E., Neeman, J., Sly, A., Zdeborova, L., Zhang, P.: Spectral redemption in clustering sparse networks. PNAS 110(52), 20935{20940 (2013) 16. Leskovec, J., Lang, K.J., Dasgupta, A., Mahoney, M.W.: Community structure in large networks: Natural cluster sizes and the absence of large well-de ned clusters.

Internet Mathematics 6( 1 ), 29{123 (2009) 17. Mihara, S., Tsugawa, S., Ohsaki, H.: In uence maximization problem for unknown social networks. In: ASONAM. pp. 1539{1546. ACM (2015) 18. Mossel, E., Neeman, J., Sly, A.: Reconstruction and estimation in the planted partition model. Probability Theory and Related Fields 162( 3-4 ), 431{461 (2015) 19. Rice, E., Tulbert, E., Cederbaum, J., Adhikari, A.B., Milburn, N.G.: Mobilizing homeless youth for HIV prevention. Health education research 27( 2 ), 226{236 (2012) 20. Squartini, T., Picciolo, F., Ruzzenenti, F., Garlaschelli, D.: Reciprocity of weighted networks. Scienti c Reports (2012) 21. Tang, Y., Xiao, X., Shi, Y.: In uence maximization: Near-optimal time complexity meets practical e ciency. In: KDD. ACM (2014) 22. Valente, T.W., Pumpuang, P.: Identifying opinion leaders to promote behavior change. Health Education & Behavior (2007) 23. Wang, Y., Cong, G., Song, G., Xie, K.: Community-based greedy algorithm for mining top-k in uential nodes in mobile social networks. In: KDD. pp. 1039{1048.

ACM (2010) 24. Wilder, B., Yadav, A., Immorlica, N., Rice, E., Tambe, M.: Uncharted but not unin uenced: In uence maximization with an uncertain network. In: AAMAS. pp. 740{748 (2017) 25. Yadav, A., Chan, H., Xin Jiang, A., Xu, H., Rice, E., Tambe, M.: Using social networks to aid homeless shelters: Dynamic in uence maximization under uncertainty. In: AAMAS. pp. 740{748 (2016)

1. Abbe , E. , Sandon , C. : Community detection in general stochastic block models: Fundamental limits and e cient algorithms for recovery . In: FOCS . pp. 670 { 688 . IEEE ( 2015 )

2. Bond , R.M. , Fariss , C.J. , Jones , J.J. , Kramer , A.D. , Marlow , C. , Settle , J.E. , Fowler , J.H.: A 61-million-person experiment in social in uence and political mobilization . Nature 489 ( 7415 ), 295 { 298 ( 2012 )

3. Borgs , C. , Brautbar , M. , Chayes , J. , Khanna , S. , Lucier , B. : The power of local information in social networks . In: WINE . pp. 406 { 419 . Springer ( 2012 )

4. Borgs , C. , Brautbar , M. , Chayes , J. , Teng , S.H. : Multiscale matrix sampling and sublinear-time pagerank computation . Internet Mathematics 10 ( 1-2 ), 20 { 48 ( 2014 )